Interpretable Prediction of Hydraulic Fracture Asymmetry in Shale Reservoirs Under Small-Sample Conditions

Zuo, Hanke; Peng, Yanhong

doi:10.3390/pr14121900


by Hanke Zuo ¹,* and Yanhong Peng ²,*

¹ School of Ocean and Civil Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
² Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
* Authors to whom correspondence should be addressed.

Processes 2026, 14(12), 1900; https://doi.org/10.3390/pr14121900

Submission received: 8 May 2026 / Revised: 4 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026

(This article belongs to the Section Petroleum and Low-Carbon Energy Process Engineering)


Abstract

To address the issues of strong inter-well interference during multi-well fracturing in shale reservoirs, low efficiency of conventional numerical simulation, and the tendency of machine learning models to overfit and lack interpretability under small-sample conditions, this paper constructs an explainable ensemble learning framework for predicting hydraulic fracture asymmetry. A geology–engineering integrated numerical simulation is adopted to quantify the fracture asymmetry index η as an interference metric, and an initial dataset is constructed comprising natural fracture orientation, well spacing, and injection rate. Subsequently, Jensen–Shannon (JS) divergence-constrained Gaussian data augmentation and second-order interaction features are introduced, and the GBRT model parameters are optimized using particle swarm optimization (PSO). Furthermore, random forest and ridge regression are incorporated, and ensemble weights are determined via cross-validation to build a weighted ensemble prediction model. The results show that the proposed model achieves good predictive performance in repeated validation, with an average coefficient of determination R² of 0.8484 and a 95% confidence interval of 0.8179–0.8790, while also demonstrating favorable overall accuracy in multiple baseline model comparisons and regularization-controlled experiments. Through leave-one-simulation-scenario validation, prediction interval analysis, and interpretability robustness testing, the model’s generalization boundary, prediction uncertainty, and explanation reliability under small-sample conditions are further evaluated. 
SHAP analysis and grouped permutation importance results indicate that the natural fracture angle is the dominant factor controlling asymmetric fracture response, while the interaction between well spacing and the natural fracture angle also significantly affects the predictions, suggesting that asymmetric fracture propagation is primarily governed by the combined effects of natural fracture steering and inter-well stress interference. The proposed framework can serve as a fast surrogate model for evaluating inter-well interference and screening fracturing designs within a given simulation parameter space, providing an interpretable data-driven approach for fracturing design optimization in shale reservoirs under small-sample conditions.

Keywords: data augmentation; geological–engineering integration; inter-well interference; interpretability analysis; SHAP

1. Introduction

Unconventional shale oil and gas resources, such as tight oil (shale oil) and shale gas, are characterized by abundant reserves and significant development potential. However, their typical low-porosity and low-permeability characteristics (porosity < 5%, permeability < 0.1 mD) result in high extraction difficulty and cost [1,2]. To enhance recovery, development using horizontal-well water-flooding patterns is commonly adopted. Nevertheless, poor reservoir physical properties and inadequate sand body connectivity often lead to rapid production decline in individual wells and low overall recovery. To mobilize remaining reserves, infill well placement and re-fracturing have become key measures, but these significantly increase the risk of inter-well interference [3,4,5,6]. Compounding this, the direction of interference exhibits spatiotemporal heterogeneity—child wells initially exert negative interference (production suppression) on parent wells during early production, which later transitions to positive interference (pressure maintenance effect) [7,8]. The essence of inter-well interference is the overlapping of fracture networks, which also leads to a decrease in the utilization efficiency of the stimulated reservoir volume. Studies on tight oil reservoirs indicate that when fracture half-length exceeds 100 m and fracture spacing is less than 40 m, the overlapping area of adjacent well fracture networks surpasses 40%, significantly reducing the effective drainage area [9,10,11,12]. This reduction in effective drainage area is a key factor leading to the decline in reservoir utilization efficiency in tight oil reservoirs.

Current research on inter-well interference primarily focuses on two directions: field diagnosis and numerical simulation. Regarding field diagnosis, microseismic monitoring and pressure response analysis are common methods for identifying fracture connectivity, but their accuracy is susceptible to reservoir heterogeneity [13,14,15,16]. To further improve diagnostic reliability, Jacobs proposed a quantitative analysis method based on Mechanical Specific Energy [17]; Kalinec et al. [18] developed improved algorithms building upon this foundation; and Alfataierge et al. [19] evaluated inter-well interference by inverting fracture network propagation based on microseismic data. However, field diagnosis is often affected by reservoir heterogeneity, monitoring uncertainty, incomplete data acquisition, and high operational cost. As a complementary approach, numerical simulation can explicitly describe fracture propagation, reservoir pressure evolution, and well-to-well communication under controlled geological and engineering conditions. Recently, Zhou et al. [20] analyzed inter-well interference based on transient-flow analysis using an improved embedded discrete fracture model, providing an effective numerical framework for representing complex fracture–reservoir interactions. Nevertheless, high-fidelity numerical simulation usually requires detailed geological characterization, complex parameter calibration, and considerable computational cost, which limits its efficiency for rapid design optimization and uncertainty analysis.

To overcome the limitations of purely field-based diagnosis and computationally expensive numerical simulation, machine learning has been increasingly applied to well-interference evaluation and hydraulic-fracturing analysis in unconventional reservoirs. For example, Liu et al. [21] developed a supervised machine-learning framework to detect fracture-hit events by integrating parent-well pressure, production data, and fracturing-operation records, while Zhang et al. [22] used machine-learning methods to evaluate and predict the interference degree of shale gas wells. Beyond direct interference diagnosis, Hui et al. [23] incorporated geological and operational parameters into machine-learning models for shale gas production forecasting, indicating that nonlinear data-driven methods can capture the combined effects of reservoir properties and stimulation design. In addition, several studies have constructed surrogate models to reduce the computational cost of numerical simulation: Sarkar et al. [24] combined hydraulic fracturing, reservoir flow, geomechanics, and machine learning for unconventional shale reservoir development; Chen et al. [25] proposed a deep-learning surrogate model for pressure-transient responses in shale wells with heterogeneous fractures; and Zhu et al. [26] developed a multiscale neural-network model to predict the equivalent permeability of discrete fracture networks. These studies demonstrate the potential of machine learning in accelerating prediction and supporting engineering decisions; however, relatively limited attention has been paid to interpretable surrogate prediction of hydraulic-fracture asymmetry under multi-well zipper-fracturing conditions, especially under limited-sample scenarios. 
More broadly, recent studies on Kolmogorov–Arnold Networks and transfer learning further indicate that intelligent algorithms are increasingly used to model complex nonlinear engineering systems under limited data or high simulation cost conditions [27,28,29,30,31].

Despite these advances, machine-learning-based studies on inter-well interference still face several challenges. First, interference samples obtained from field monitoring or high-fidelity numerical simulations are usually limited, making data-driven models prone to overfitting. Second, many models are used as black-box predictors, and their results are difficult to interpret in terms of physical mechanisms such as stress shadow, natural fracture activation, and fracture deflection. Third, previous studies mainly focus on direct interference classification or production prediction, whereas relatively few works construct an interpretable quantitative proxy to characterize hydraulic-fracture asymmetry under multi-well zipper-fracturing conditions.

To address the high computational cost of numerical simulation for evaluating inter-well interference in multi-well fracturing, the tendency of models to overfit under small-sample conditions, and the lack of interpretability in prediction results, this paper constructs an explainable ensemble learning framework for predicting hydraulic fracture asymmetry. The overall workflow of the proposed study is illustrated in Figure 1. First, a geology–engineering integrated numerical simulation model is established based on a horizontal-well pad. The fracture asymmetry index η is used to characterize the deviation of multi-well fracture propagation from an ideal symmetric state, and an initial dataset is constructed comprising natural fracture orientation, well spacing, and injection rate. Second, under small-sample constraints, a Gaussian data augmentation method constrained by Jensen–Shannon (JS) divergence is introduced, and second-order interaction features are constructed to enhance the model’s ability to capture nonlinear coupling relationships. On this basis, particle swarm optimization (PSO) is applied to optimize the hyperparameters of Gradient Boosted Regression Trees (GBRT), and the optimized GBRT is combined with Random Forest (RF) and Ridge Regression (Ridge) in a weighted ensemble whose weights are determined via cross-validation, thereby establishing a fast prediction model for fracture asymmetry. Finally, through systematic comparisons with linear models, regularized models, probabilistic models, and nonlinear machine learning models, combined with repeated validation, leave-one-simulation-scenario validation, prediction interval analysis, and SHAP and grouped permutation importance analysis, the model’s predictive performance, generalization boundary, uncertainty, and interpretability robustness are evaluated within a given simulation parameter space. The main contributions of this paper are as follows:

(1): A method is proposed that uses the fracture asymmetry index η to characterize the degree of deviation in multi-well fracture propagation. A dataset suitable for small-sample machine learning modeling is constructed based on geology–engineering integrated numerical simulation.
(2): A small-sample modeling approach combining JS divergence-constrained Gaussian data augmentation and second-order interaction features is developed, with particle swarm optimization used to tune the GBRT hyperparameters and cross-validation used to determine the ensemble weights. Meanwhile, linear regression, regularized regression, Gaussian process regression, support vector regression, and various tree-based models are introduced as benchmarks to validate the comprehensive performance of the proposed method for small-sample fracture asymmetry prediction.
(3): Using the SHAP (SHapley Additive exPlanations) method, grouped permutation importance, and prediction interval analysis, the effects of natural fracture orientation, well spacing, and their interaction on fracture asymmetry are identified. Furthermore, the applicable boundary of the established model as a simulation-trained surrogate is clarified: within a given parameter space, it is suited to trend analysis, sensitivity evaluation, and preliminary scheme screening.

2. Materials and Methods

2.1. Development of an Integrated Geomechanical Model for a Representative Well Pad

This study takes a typical shale oil block in Shandong Province, China, as the research object. The target reservoir is located in the Jiyang Depression of the Bohai Bay Basin, where the Paleogene shale oil reservoirs are generally characterized by deep burial depth, strong heterogeneity, high carbonate content, well-developed natural fractures, and moderate fracturability. To improve computational efficiency while retaining the main geological and engineering characteristics of the target reservoir, a simplified numerical simulation model was established based on the following assumptions: (1) thermal effects are neglected; (2) matrix properties within each layer are represented by equivalent homogeneous isotropic parameters; (3) the water-based fracturing fluid is assumed to be incompressible; and (4) natural fractures are represented using a discrete fracture network (DFN). Numerical simulations were performed using the Mangrove hydraulic-fracturing simulation module embedded in Petrel 2018.2 (Schlumberger, Houston, TX, USA). This layer-scale equivalent treatment preserves the layered reservoir framework and the DFN-controlled fracture system while simplifying the matrix properties within each layer.

The geological and geomechanical parameters used in the numerical model were constrained according to reported field and experimental characteristics of the shale oil reservoir in the Jiyang Depression. Previous studies have shown that the Paleogene Shahejie Formation shale oil reservoir in this area is generally characterized by low porosity, strong heterogeneity, well-developed natural fractures, and complex in situ stress states [32,33]. Engineering studies on the development of multi-layer shale oil in the Jiyang Depression further provide practical bases for constructing a representative multi-layer simulation model [34]. The target reservoir depth was set to 4960 m, with its upper and lower adjacent layers located at 4930 m and 5040 m, respectively. This depth setting is consistent with the reported deep burial conditions of the Shengli–Jiyang shale oil reservoir, where burial depths generally range from about 3000 to 5000 m, locally reaching approximately 5500 m [32,35]. Therefore, the selected vertical depth represents a deep reservoir simulation case within the reported geological context of the Jiyang Depression.

The matrix permeability of the reservoir and adjacent layers was set to 0.1 mD. This value is consistent with the lower limit range of physical properties (approximately 0.03–0.12 mD) of the Paleogene Shahejie Formation shale reservoir in the Jiyang Depression [36]. Thus, the adopted permeability serves as an equivalent low-permeability parameter and an upper-bound representative value for tight shale oil reservoirs. On this basis, the numerical model was constructed as a simulation model constrained by field parameters. The porosity, permeability, and in situ stress values used for the target reservoir all fall within the reported ranges of geological and geomechanical parameters for shale oil reservoirs in the Jiyang Depression. Specifically, the maximum horizontal principal stress of the target layer is 98.92 MPa, which is close to the reported value of 96.9 MPa for the lower third member of the Shahejie Formation in well BYP5; the minimum horizontal principal stress is 89.42 MPa, which is of the same order of magnitude as the reported value of approximately 82 MPa [37]. Furthermore, well BYP5 is a high-yield shale oil well in the lower third member of the Shahejie Formation in the Jiyang Depression, with a peak daily oil production of 160 tonnes of oil equivalent [32]. Therefore, the horizontal stress levels adopted in this paper can serve as equivalent simulation parameters under high in situ stress conditions for the deep shale oil reservoir in the Jiyang Depression.

Based on the above settings, a three-layer geological model centered on the target reservoir depth of 4960 m was established. First, well trajectory data were imported to construct three horizontal wells, and then the layered stratigraphic structure was generated. Subsequently, experimentally obtained parameters (e.g., porosity, Young’s modulus) were assigned to the corresponding geological layers and coupled with the stratigraphic model [38,39]. The Mangrove module (a hydraulic fracturing design and simulation tool) was used to define the completion design for the three horizontal wells, including fracturing stages and perforation clusters. The stress shadow effect was enabled, the interference radius was defined, and a sequential fracturing scheme from Well 1 to Well 2 and then to Well 3 was adopted. The spatial arrangement of the three wells is shown in Figure 2b, and the geological platform model is shown in Figure 2. The geomechanical parameters used in the model (including porosity, permeability, Young’s modulus, and in situ stress) are listed in Table 1, and the natural fracture settings are listed in Table 2.

2.2. Feature Selection and Data Generation

2.2.1. Physical Meaning of Selected Features

Previous studies have shown that inter-well interference during multi-well hydraulic fracturing is strongly affected by well spacing, injection rate, and natural fracture orientation, because these parameters directly control stress-shadow interaction, fracture extension capacity, and the tendency of hydraulic fractures to intersect or divert along pre-existing weakness planes [4,40,41,42]. Therefore, these three variables were selected as the principal input features of the surrogate model.

Well spacing controls the degree of mechanical interaction among adjacent wells. A smaller spacing generally strengthens stress-shadow interference and increases the possibility of asymmetric fracture growth, whereas a larger spacing weakens the interaction between neighboring fracture systems. The injection rate affects the hydraulic driving force and fracture propagation capacity. A higher injection rate may promote fracture extension but can also intensify stress redistribution around the stimulated region. The natural fracture angle determines the geometric relationship between hydraulic fractures and pre-existing fractures, thereby influencing fracture deflection, branching, and fracture-network complexity.

Based on these considerations, a simulation dataset was constructed by varying well spacing, injection rate, and natural fracture angle within the prescribed engineering ranges. The resulting simulation program is listed in Table 3, and the complete dataset is provided in Appendix A.

2.2.2. Fracture Asymmetry Index for Inter-Well Interference

Previous studies have shown that during parent–child well interactions, hydraulic fractures often exhibit significantly asymmetric propagation, a phenomenon closely related to inter-well interference [43,44]. Under the influence of stress shadowing and fracture-induced interactions, fractures formed in subsequent fracturing wells may deviate from the reference propagation pattern of the first fractured well, leading to uneven fracture area distribution among different wells [45,46]. Therefore, this paper adopts fracture area as a quantitative geometric descriptor to evaluate the overall asymmetry of fracture propagation in a three-well system.

In the sequential fracturing scheme adopted in this study, Well 1 is fractured first and is not affected by stress shadows generated from prior adjacent wells; thus, it serves as a reference well under conditions without pre-existing inter-well interference. If Wells 2 and 3 are not significantly interfered with by the previously fractured wells under the same reservoir and treatment parameters, and given the large model scale and homogeneous layer properties, the fracture areas of Well 2 and Well 3 under interference-free conditions would be equal to that of Well 1, denoted as $S_{\mathrm{well1}}$. Consequently, the reference total fracture area for the three-well system under ideal non-interfering conditions can be expressed as $3S_{\mathrm{well1}}$. Based on this, the deviation coefficient of fracture area relative to the reference well fracture area is defined as

$$\eta = \frac{S_{\mathrm{well1}} + S_{\mathrm{well2}} + S_{\mathrm{well3}} - 3S_{\mathrm{well1}}}{3S_{\mathrm{well1}}} \tag{1}$$

where η represents the fracture asymmetry index in the three-well model; $S_{\mathrm{well1}}$ denotes the average fracture area per fracturing stage in Well 1; and $S_{\mathrm{well2}}$ and $S_{\mathrm{well3}}$ denote the average fracture areas per stage in Well 2 and Well 3, respectively. The larger the absolute value of this index, the stronger the deviation of the three-well fracture propagation from the reference symmetric state, and the more pronounced the inter-well interference. A value of η > 0 indicates that the total fracture area of the three wells increases relative to the reference state, while η < 0 indicates that the total fracture area decreases relative to the reference state.
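The definition in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and the fracture-area values are hypothetical and were chosen only to show the sign convention.

```python
def fracture_asymmetry_index(s_well1, s_well2, s_well3):
    """Eq. (1): relative deviation of the total fracture area from 3 * S_well1."""
    if s_well1 <= 0:
        raise ValueError("the reference area S_well1 must be positive")
    reference_total = 3.0 * s_well1
    return (s_well1 + s_well2 + s_well3 - reference_total) / reference_total


# Hypothetical average fracture areas per stage (arbitrary but consistent units)
eta_symmetric = fracture_asymmetry_index(1000.0, 1000.0, 1000.0)   # no deviation
eta_suppressed = fracture_asymmetry_index(1000.0, 850.0, 700.0)    # total area shrinks
eta_enhanced = fracture_asymmetry_index(1000.0, 1100.0, 1200.0)    # total area grows
```

With these illustrative inputs the three cases give zero, a negative value, and a positive value, matching the interpretation of η = 0, η < 0, and η > 0 in the text.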

In this paper, $3S_{\mathrm{well1}}$ is adopted as the reference rather than the average fracture area of the three wells. The main reason is that the physical meanings of the two are different. $S_{\mathrm{well1}}$ represents the reference fracture area of a single well without pre-existing inter-well interference or under the weakest influence of prior stress shadowing; therefore, $3S_{\mathrm{well1}}$ represents the total reference fracture area for three wells under an ideal symmetric propagation state. This reference value is not calculated from the current average fracture area of the three wells but is directly based on the fracture area of the first fractured well (Well 1) as the single-well reference. It can be used to determine the degree of deviation of the actual three-well system’s total fracture area from the ideal non-interfering state.

If the average fracture area of the current three wells were used as the reference, that average value would itself be derived from the current simulation results. In that case, the total fracture area of the three wells would necessarily equal three times the average, and the deviation term would be weakened or even vanish by definition, making it difficult to effectively identify the fracture area imbalance among the three wells. Therefore, this paper adopts $3S_{\mathrm{well1}}$ as the reference baseline instead of the current average fracture area of the three wells.

Furthermore, η is a dimensionless index and possesses scale invariance. When all fracture areas are multiplied by a constant k, the following holds:

$$\eta' = \frac{kS_{\mathrm{well1}} + kS_{\mathrm{well2}} + kS_{\mathrm{well3}} - 3kS_{\mathrm{well1}}}{3kS_{\mathrm{well1}}} = \eta \tag{2}$$

Therefore, this index is insensitive to the absolute magnitude or unit scaling of the fracture areas and primarily reflects the relative deviation of the actual total fracture area of the three wells from the ideal symmetric reference state. Based on this characteristic, η can be used to compare the fracture propagation asymmetry and the intensity of inter-well interference across different simulation scenarios.
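As a small worked check, added here for illustration and not part of the original derivation, Equation (1) can be simplified by cancelling one $S_{\mathrm{well1}}$ term. Only the area ratios relative to Well 1 remain, which makes the scale invariance explicit. The ratios 0.85 and 0.70 in the second line are hypothetical values.

```latex
% Simplification of Eq. (1): only ratios to the reference well remain
\eta = \frac{S_{\mathrm{well2}} + S_{\mathrm{well3}} - 2S_{\mathrm{well1}}}{3S_{\mathrm{well1}}}
     = \frac{1}{3}\left(\frac{S_{\mathrm{well2}}}{S_{\mathrm{well1}}}
     + \frac{S_{\mathrm{well3}}}{S_{\mathrm{well1}}} - 2\right)

% Hypothetical example: S_well2 / S_well1 = 0.85 and S_well3 / S_well1 = 0.70
\eta = \frac{1}{3}\left(0.85 + 0.70 - 2\right) = -0.15
```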

2.3. Machine Learning Model Development

2.3.1. Data Augmentation and Preprocessing

Fracture area and fracture asymmetry for the three wells were extracted from the Petrel simulations. Because each numerical simulation required substantial computation time, only 57 samples were available for model training and 7 for validation. The full dataset therefore contained only 64 samples and three original features (well spacing, injection rate, and natural fracture angle), reflecting the small-sample, low-dimensional conditions that commonly arise in geoscience and reservoir-engineering applications [47,48,49,50,51,52].

To compensate for the limited feature set, feature engineering was performed using the PolynomialFeatures generator to create second-order polynomial terms. Only interaction terms were retained, while squared terms were excluded. This expanded the feature space from 3 to 6 variables, comprising the original features plus all pairwise interactions. The aim was to capture the nonlinear coupling among injection rate, well spacing, and natural fracture angle through their interactions rather than through isolated nonlinear terms. This choice is consistent with the mechanics of hydraulic fracturing, in which fracture propagation depends on coupled effects, for example, the joint influence of injection rate and well spacing on stress interference, and the combined influence of injection rate and fracture orientation on natural fracture activation. Excluding squared terms also avoids problematic treatment of angular variables, reduces multicollinearity, and improves the model’s ability to identify key coupling mechanisms, as reported in related data-driven studies [53,54,55,56]. In this way, engineering complexity can be represented more effectively with only modest added model complexity.
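The interaction-only expansion described above can be sketched without any external library. The column names and sample values below are hypothetical. In scikit-learn, `PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)` produces the same six columns in the same order; the exact options used in the study are not stated, so that configuration is an assumption.

```python
from itertools import combinations


def interaction_only_features(row, names):
    """Keep the original features and append every pairwise product.

    Squared terms and a bias column are deliberately left out.
    """
    values = list(row)
    labels = list(names)
    for i, j in combinations(range(len(row)), 2):
        values.append(row[i] * row[j])
        labels.append(f"{names[i]} x {names[j]}")
    return values, labels


# Hypothetical column names and one hypothetical sample
names = ["well_spacing", "injection_rate", "nf_angle"]
sample = [300.0, 12.0, 45.0]
values, labels = interaction_only_features(sample, names)
```

Three inputs give three pairwise products, so the expanded vector has six entries, which is the 3-to-6 expansion quoted in the text.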

Due to the limited number of original simulation samples, directly training machine learning models may lead to overfitting and unstable predictions. To improve the local sample coverage within the given simulation parameter space, this paper first adds Gaussian perturbations to the original input features. The perturbation standard deviation for the j-th input feature is defined as

$$\sigma_j = 0.05\, s_j \tag{3}$$

where $\sigma_j$ is the Gaussian perturbation standard deviation for the j-th input feature, and $s_j$ is the standard deviation of that feature in the training data. This setting means that the perturbation amplitude is 5% of the original feature’s standard deviation.
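A minimal sketch of the perturbation step in Equation (3) is given below. It rests on assumptions the text does not settle: the sample (n − 1) standard deviation is used for $s_j$, the noise is drawn independently per feature, and the function name, seed, and training rows are hypothetical.

```python
import random
import statistics


def gaussian_perturb(samples, scale=0.05, copies=2, seed=0):
    """Return perturbed copies of each sample with sigma_j = scale * s_j (Eq. 3)."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    # s_j: standard deviation of feature j in the training data (sample std assumed)
    s = [statistics.stdev([row[j] for row in samples]) for j in range(n_features)]
    sigma = [scale * s_j for s_j in s]
    augmented = []
    for row in samples:
        for _ in range(copies):
            augmented.append([x + rng.gauss(0.0, sg) for x, sg in zip(row, sigma)])
    return augmented, sigma


# Hypothetical rows: [well spacing, injection rate, natural fracture angle]
training = [
    [200.0, 8.0, 0.0],
    [250.0, 10.0, 30.0],
    [300.0, 12.0, 60.0],
    [350.0, 14.0, 90.0],
]
augmented, sigma = gaussian_perturb(training)
```

Scaling the noise by each feature's own spread keeps the perturbation comparable across variables with very different units.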

To prevent the augmented samples from deviating excessively from the statistical distribution of the original simulation samples, this paper employs the Jensen–Shannon divergence (JSD) as a distribution consistency constraint. JSD is a measure of distribution difference based on Shannon entropy. Compared with the Kullback–Leibler divergence (KLD), JSD is symmetric and bounded, making it more suitable for comparing the feature distribution differences between original and augmented samples [57,58].

For each input feature, this paper constructs probability histograms for the original and augmented samples using the same number of bins and the same value range and then computes the JSD as follows:

$$D_{\mathrm{JS}}(P \,\|\, Q) = \frac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M) \tag{4}$$

$$M = \frac{1}{2}(P + Q) \tag{5}$$

where $D_{\mathrm{JS}}(P \,\|\, Q)$ denotes the Jensen–Shannon divergence between the original distribution P and the augmented distribution Q for a given input feature; P and Q are the discrete probability distributions obtained by normalizing the original and augmented samples, respectively, under the same binning intervals; M is the average distribution of P and Q; and $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence.

The generated augmented samples were accepted only when the maximum JSD among all input features was less than 0.05. This threshold is independent of the perturbation coefficient in Equation (3): the perturbation coefficient controls the magnitude of Gaussian noise, whereas the JSD threshold controls the allowable distributional deviation between the augmented and original samples. Following previous synthetic-data and data-augmentation studies [59,60], JSD < 0.05 was adopted as a conservative empirical criterion to allow limited local perturbations while avoiding excessive distributional shifts.
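The histogram-based check in Equations (4) and (5) and the acceptance rule can be sketched as follows. The number of bins (10) and the logarithm base (2, which bounds the divergence to [0, 1]) are assumptions, since the text specifies neither. The shared bin range is taken as the combined minimum and maximum of both samples, which is also an assumption. All function names are hypothetical.

```python
import math


def histogram(values, lo, hi, bins):
    """Normalized histogram on [lo, hi]; the right edge falls in the last bin."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Discrete KL divergence in bits; empty bins of p contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Eqs. (4)-(5): symmetric, bounded divergence via the mixture M."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def accept_augmented(original, augmented, bins=10, threshold=0.05):
    """Accept only if the largest per-feature JSD stays below the threshold."""
    worst = 0.0
    for j in range(len(original[0])):
        col_o = [row[j] for row in original]
        col_a = [row[j] for row in augmented]
        lo, hi = min(col_o + col_a), max(col_o + col_a)
        if hi == lo:  # constant feature: nothing to compare
            continue
        p = histogram(col_o, lo, hi, bins)
        q = histogram(col_a, lo, hi, bins)
        worst = max(worst, js_divergence(p, q))
    return worst < threshold, worst
```

Because the mixture M is positive wherever P or Q is positive, the divergence stays finite even when a bin is empty in one of the two histograms, which is one practical reason to prefer it over the plain Kullback–Leibler divergence here.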

After satisfying the JSD constraint, this paper applies bootstrap resampling with replacement of the accepted pool of augmented samples to further increase training sample diversity and improve statistical robustness under small-sample conditions [61,62]. Specifically, two Gaussian-perturbed copies are generated from each original sample, and a bootstrap expansion factor of 1.5 is used for resampling, expanding the 57 original modeling samples to approximately 256 training samples. This process increases the number of training samples while maintaining essential consistency with the original simulation distribution. It should be emphasized that this augmentation strategy does not introduce new physical information and cannot replace independent field observations; its role is to improve local sample coverage and model training stability within the given simulation parameter space.
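The text does not spell out how the figure of approximately 256 arises. One accounting that reproduces it, offered purely as an assumption, is that the resampling pool holds the 57 original samples plus their two perturbed copies (171 rows) and that the bootstrap draws 1.5 times that many rows with replacement. The function name and seed below are hypothetical.

```python
import random


def bootstrap_expand(pool, factor=1.5, seed=0):
    """Draw int(len(pool) * factor) rows from the pool with replacement."""
    rng = random.Random(seed)
    target = int(len(pool) * factor)
    return rng.choices(pool, k=target)


n_original = 57          # original modeling samples (from the text)
copies_per_sample = 2    # Gaussian-perturbed copies per sample (from the text)

# Assumed pool: originals plus their perturbed copies, represented by row indices
pool_size = n_original * (1 + copies_per_sample)
pool = list(range(pool_size))
train_rows = bootstrap_expand(pool)
```

Under this reading, 171 × 1.5 = 256.5, which truncates to 256 training rows. Other accountings are possible, so the sketch should be read as a consistency check on the quoted numbers and not as the authors' exact procedure.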

The data augmentation procedure under the JSD constraint is illustrated in Figure 3.

After augmentation, the three original engineering variables were expanded into six input features by introducing only the aforementioned second-order interaction terms. The specific final input features are shown in Table 4.

2.3.2. Model Development and Optimization

Gradient Boosted Regression Trees (GBRT) was selected as the primary nonlinear learner because it can capture complex nonlinear relationships through sequential residual fitting and has shown good adaptability in small- to medium-sized regression problems [63,64,65]. To reduce model-specific bias and improve prediction stability, Random Forest (RF) and Ridge regression were further introduced. RF reduces prediction variance through bootstrap aggregating, while Ridge regression provides a stable regularized linear component when the input features are partially correlated [66,67].

The hyperparameters of GBRT were optimized using Particle Swarm Optimization (PSO). PSO searches the predefined hyperparameter space through swarm-based iterations and is suitable for nonlinear and nonconvex optimization problems [68]. The optimized GBRT hyperparameters included the number of trees, maximum tree depth, learning rate, and subsampling ratio. The search ranges were set as follows: the number of trees ranged from 50 to 100, the maximum tree depth ranged from 1 to 10, the learning rate ranged from 0.01 to 0.30, and the subsampling ratio ranged from 0.50 to 1.00. The PSO swarm size and maximum number of iterations were set to 30 and 100, respectively. The cognitive and social acceleration coefficients, corresponding to the personal-best and global-best search components, were both set to 0.5.
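The search described above can be sketched as a small particle swarm over the four quoted ranges. Several parts are assumptions: the inertia weight (0.7) is not given in the text, positions are simply clipped to the bounds, integer hyperparameters are obtained by rounding, and `toy_objective` is a hypothetical stand-in for the five-fold cross-validated error so that the example runs without training any model.

```python
import random

# Search ranges quoted in the text: (lower, upper) per GBRT hyperparameter
BOUNDS = {
    "n_estimators": (50, 100),
    "max_depth": (1, 10),
    "learning_rate": (0.01, 0.30),
    "subsample": (0.50, 1.00),
}
INTEGER_KEYS = {"n_estimators", "max_depth"}


def decode(position):
    """Clip a particle position to the bounds and round the integer parameters."""
    params = {}
    for (key, (lo, hi)), x in zip(BOUNDS.items(), position):
        x = min(max(x, lo), hi)
        params[key] = int(round(x)) if key in INTEGER_KEYS else x
    return params


def toy_objective(params):
    """Hypothetical stand-in for the CV error; NOT the objective used in the study."""
    return (((params["n_estimators"] - 80) / 50) ** 2
            + ((params["max_depth"] - 3) / 9) ** 2
            + (params["learning_rate"] - 0.10) ** 2
            + (params["subsample"] - 0.80) ** 2)


def pso(objective, swarm_size=30, iterations=100, c1=0.5, c2=0.5, inertia=0.7, seed=0):
    rng = random.Random(seed)
    lows = [lo for lo, _ in BOUNDS.values()]
    highs = [hi for _, hi in BOUNDS.values()]
    dim = len(lows)
    pos = [[rng.uniform(lows[d], highs[d]) for d in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(decode(p)) for p in pos]
    g = min(range(swarm_size), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]
    for _ in range(iterations):
        for i in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])   # personal-best pull
                             + c2 * r2 * (gbest[d] - pos[i][d]))     # global-best pull
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lows[d]), highs[d])
            val = objective(decode(pos[i]))
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return decode(gbest), gbest_val, history


best_params, best_value, history = pso(toy_objective)
```

In a real run, `toy_objective` would be replaced by a function that fits GBRT with the decoded hyperparameters and returns the cross-validated error, which makes each of the 30 × 100 evaluations far more expensive than in this sketch.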

The PSO objective was defined as the average normalized root mean square error over the five validation folds of the GBRT model. For the k-th validation fold, the normalized root mean square error is defined as

{N R M S E}^{k} (θ) = \frac{\sqrt{\frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} {(y_{i}^{(k)} - {\hat{y}}_{i}^{(k)} (θ))}^{2}}}{y_{m a x}^{(k)} - y_{m i n}^{(k)}}

(6)

where

y_{m a x}^{(k)}

and

y_{m i n}^{(k)}

are the maximum and minimum target values in the validation set of the k-th fold, respectively;

θ

denotes a candidate GBRT hyperparameter combination (including number of trees, maximum depth, learning rate, and subsampling ratio);

n_{k}

is the number of validation samples in the k-th fold;

y_{i}^{(k)}

is the true value of the i-th validation sample in the k-th fold; and

{\hat{y}}_{i}^{(k)} (θ)

is the corresponding prediction of the GBRT model with hyperparameter combination

θ

.

The optimal GBRT hyperparameter combination is obtained by minimizing the five-fold CV-NRMSE:

θ^{*} = \arg\min_{θ} \frac{1}{K} \sum_{k=1}^{K} \frac{\sqrt{\frac{1}{n_{k}} \sum_{i=1}^{n_{k}} \left(y_{i}^{(k)} - \hat{y}_{i}^{(k)}(θ)\right)^{2}}}{y_{\max}^{(k)} - y_{\min}^{(k)}}

(7)

where K is the number of cross-validation folds, and K = 5 was used in this study. Thus, θ^{*} is not an error metric but the hyperparameter combination that yields the lowest average normalized prediction error across the five validation folds.
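The objective in Equations (6) and (7) can be written compactly as follows. This sketch assumes that the per-fold true values and predictions are already available as arrays; the two folds below are made-up numbers used only to exercise the functions. A fold whose targets are all equal would give a zero range and is not handled here.

```python
import numpy as np

def fold_nrmse(y_true, y_pred):
    """Eq. (6): RMSE on one validation fold divided by that fold's target range."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())

def cv_nrmse(folds):
    """Eq. (7) objective: mean of the per-fold NRMSE over the K folds."""
    return float(np.mean([fold_nrmse(y, y_hat) for y, y_hat in folds]))

# Two hypothetical folds (illustrative values, not data from the study).
folds = [
    (np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 2.0])),   # perfect fit
    (np.array([0.0, 2.0, 4.0]), np.array([1.0, 3.0, 5.0])),   # constant offset of 1
]
score = cv_nrmse(folds)
```

In the second fold the RMSE is 1 and the target range is 4, so the fold NRMSE is 0.25 and the two-fold average is 0.125.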

After optimizing the GBRT hyperparameters, a hybrid ensemble model was constructed using GBRT, RF, and Ridge regression. The ensemble prediction is expressed as

\hat{y}_{\mathrm{ensemble}} = \sum_{m=1}^{M} ω_{m} \hat{y}_{m}

(8)

where \hat{y}_{\mathrm{ensemble}} is the ensemble prediction; M is the number of base learners (here M = 3); \hat{y}_{m} is the prediction of the m-th base learner; and ω_{m} is the ensemble weight of the m-th base learner, with the weights satisfying \sum_{m=1}^{M} ω_{m} = 1 and ω_{m} \geq 0.

This ensemble design was guided by the No Free Lunch theorem and the bias–variance decomposition principle. From the perspective of bias–variance decomposition, the expected squared prediction error of the ensemble model can be written as

E[(y - \hat{y}_{\mathrm{ensemble}})^{2}] = \mathrm{Var}(\hat{y}_{\mathrm{ensemble}}) + \mathrm{Bias}^{2}(\hat{y}_{\mathrm{ensemble}}) + σ_{ε}^{2}

(9)

Here, E[\cdot] denotes the expectation over different training sets drawn from the same underlying distribution; \mathrm{Bias}^{2}(\hat{y}_{\mathrm{ensemble}}) represents the squared bias of the ensemble prediction (systematic error); \mathrm{Var}(\hat{y}_{\mathrm{ensemble}}) represents the prediction variance of the ensemble model (sensitivity to training data fluctuations); and σ_{ε}^{2} represents the irreducible noise inherent in the data.

Since the ensemble prediction is a weighted combination of the base learners, its variance can be further expressed as

\mathrm{Var}(\hat{y}_{\mathrm{ensemble}}) = \sum_{m=1}^{M} ω_{m}^{2} \mathrm{Var}(\hat{y}_{m}) + 2 \sum_{m<l} ω_{m} ω_{l} \mathrm{Cov}(\hat{y}_{m}, \hat{y}_{l})

(10)

where \mathrm{Var}(\hat{y}_{m}) is the prediction variance of the m-th base learner, and \mathrm{Cov}(\hat{y}_{m}, \hat{y}_{l}) is the covariance between the predictions of the m-th and l-th base learners.

This formulation indicates that the ensemble error is affected not only by the bias and variance of individual learners but also by the correlation among their prediction errors. Therefore, combining learners with different model structures and error characteristics can improve robustness when no single algorithm is optimal for all problems.
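The role of the covariance term in Equation (10) can be illustrated numerically. The weights, variances, and the correlation of 0.9 below are hypothetical values chosen for illustration; they are not estimates from the study.

```python
import numpy as np

def ensemble_variance(weights, cov):
    """Eq. (10) term by term: weighted variances plus twice the weighted covariances."""
    w = np.asarray(weights, dtype=float)
    cov = np.asarray(cov, dtype=float)
    var_part = np.sum(w ** 2 * np.diag(cov))
    cov_part = 0.0
    for m in range(len(w)):
        for l in range(m + 1, len(w)):
            cov_part += 2.0 * w[m] * w[l] * cov[m, l]
    return float(var_part + cov_part)

# Hypothetical weights and base-learner variances (illustrative numbers only).
w = np.array([0.6, 0.3, 0.1])
variances = np.array([4.0, 2.0, 1.0])
uncorrelated = np.diag(variances)

# Same variances, but every pair of learners has correlation 0.9.
sd = np.sqrt(variances)
correlated = 0.9 * np.outer(sd, sd)
np.fill_diagonal(correlated, variances)

var_uncorrelated = ensemble_variance(w, uncorrelated)
var_correlated = ensemble_variance(w, correlated)
```

With uncorrelated learners the ensemble variance is 0.36 × 4 + 0.09 × 2 + 0.01 × 1 = 1.63, and it increases once the learners' predictions are strongly correlated, which is the motivation for combining structurally different models.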

The ensemble weights were determined adaptively through grid search combined with five-fold cross-validation. A feasible weight space was first defined: GBRT weight in [0.50, 0.70], RF weight in [0.20, 0.40], and Ridge weight in [0.005, 0.15], with the sum constrained to 1. This search space was designed to allow GBRT to dominate nonlinear fitting, while RF and Ridge regression contributed to variance reduction and prediction stability.

For each candidate weight vector ϖ = (ω_{1}, ω_{2}, ω_{3}), the ensemble prediction for the i-th validation sample in the k-th fold is

\hat{y}^{(k)}_{\mathrm{ensemble}, i}(ϖ) = \sum_{m=1}^{M} ω_{m} \hat{y}^{(k)}_{m,i}

(11)

where \hat{y}^{(k)}_{m,i} is the prediction of the m-th base learner for the i-th validation sample in the k-th fold. The validation MSE of the ensemble model is

\mathrm{MSE}_{\mathrm{val}}(ϖ) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_{k}} \sum_{i=1}^{n_{k}} \left[y_{i}^{(k)} - \hat{y}^{(k)}_{\mathrm{ensemble}, i}(ϖ)\right]^{2}

(12)

Writing the weight vector as ϖ = (ω_GBRT, ω_RF, ω_Ridge) for the GBRT, RF, and Ridge base learners, the optimal weights were obtained by minimizing the five-fold validation MSE over the constrained grid:

ϖ^{*} = \arg\min_{ϖ} \mathrm{MSE}_{\mathrm{val}}(ϖ)

(13)

subject to ω_GBRT ∈ [0.50, 0.70], ω_RF ∈ [0.20, 0.40], ω_Ridge ∈ [0.005, 0.15], and ω_GBRT + ω_RF + ω_Ridge = 1.
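A minimal sketch of this constrained search is given below. The grid step of 0.005 is an assumption, since the resolution of the grid is not stated here, and the fold predictions are synthetic: the targets are built from known weights so that the search can be checked against them.

```python
import numpy as np

def weight_grid(step=0.005):
    """Enumerate (w_gbrt, w_rf, w_ridge) inside the stated bounds, summing to 1."""
    candidates = []
    for w_gbrt in np.arange(0.50, 0.70 + 1e-9, step):
        for w_rf in np.arange(0.20, 0.40 + 1e-9, step):
            w_ridge = 1.0 - w_gbrt - w_rf          # fixed by the sum-to-one constraint
            if 0.005 - 1e-9 <= w_ridge <= 0.15 + 1e-9:
                candidates.append((w_gbrt, w_rf, w_ridge))
    return np.array(candidates)

def validation_mse(weights, fold_preds, fold_targets):
    """Eq. (12): average over folds of the per-fold MSE of the weighted prediction."""
    per_fold = []
    for preds, y in zip(fold_preds, fold_targets):
        y_ens = preds @ weights                    # preds has shape (n_k, 3)
        per_fold.append(np.mean((y - y_ens) ** 2))
    return float(np.mean(per_fold))

def best_weights(fold_preds, fold_targets, step=0.005):
    """Eq. (13): pick the grid point with the lowest validation MSE."""
    grid = weight_grid(step)
    scores = [validation_mse(w, fold_preds, fold_targets) for w in grid]
    return grid[int(np.argmin(scores))]

# Synthetic check: targets are generated from known weights (illustrative only).
rng = np.random.default_rng(0)
true_w = np.array([0.60, 0.30, 0.10])
fold_preds = [rng.normal(size=(20, 3)) for _ in range(5)]
fold_targets = [p @ true_w for p in fold_preds]
w_star = best_weights(fold_preds, fold_targets)
```

Because the three weights must sum to one, only two of them are free, so the search is effectively two-dimensional.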

This constrained optimization strategy allows GBRT to play a dominant role in nonlinear fitting, while Random Forest and Ridge regression contribute to reducing prediction variance and improving model stability, respectively. The bias–variance trade-off is thus reflected in both the choice of base learners and the constraints imposed during weight optimization.

Although Ridge regression has limited capability in nonlinear modeling, it provides high stability in representing linear relationships and helps control the overall model variance through regularization, especially when some features are linearly correlated. Therefore, the final ensemble model is dominated by GBRT for nonlinear fitting and bias reduction, while RF and Ridge regression jointly contribute to variance reduction and prediction stability. The resulting integrated model combines the strong nonlinear fitting capability of GBRT, the robustness of RF against perturbations in the training data, and the stability of Ridge regression through regularization.

3. Prediction Performance and Interpretability Analysis

3.1. Overall Prediction Performance of the Proposed Model

To evaluate the overall predictive capability of the proposed ensemble model on the test set, this section first analyzes the model performance from two aspects: prediction consistency and residual distribution. The actual-versus-predicted plot is used to determine whether the model can capture the overall variation trend of the target variable, while the residual plot is further employed to identify whether the model exhibits obvious systematic overestimation, underestimation, or structural errors that vary with the predicted values.

The overall predictive behavior of the proposed ensemble model was examined by comparing the actual and predicted values of the fracture asymmetry index η. As shown in Figure 4a, most sample points lie close to the 1:1 reference line, indicating that the ensemble model can capture the general variation trend of fracture asymmetry within the investigated simulation space. The samples are mainly concentrated in the near-zero region, where the predicted values are generally consistent with the actual values; the coefficient of determination R² reaches 0.8484 (test set, n = 52), indicating that the model provides a stable approximation within the main range of the target variable.

However, several sample points deviate from the 1:1 reference line, especially near the margins of the sample distribution. These deviations indicate that local prediction errors still exist for some individual simulation cases. Under the current small-sample condition, this is reasonable, as the available training data may not fully cover all local response patterns of fracture asymmetry.

The residual distribution is further shown in Figure 4b. Most residuals are scattered around the zero-error line, and no obvious monotonic trend or systematic pattern is observed as the predicted values increase. This indicates that the model does not exhibit a clear overall tendency toward overestimation or underestimation. Although a few residuals are relatively large, they are limited in number and do not dominate the overall error distribution.

Overall, the actual-versus-predicted plot and the residual plot indicate that the proposed ensemble model can reasonably reproduce the simulated fracture asymmetry index η and maintain acceptable prediction stability within the predefined parameter space.

3.2. Expanded Model Comparison

To further evaluate the effectiveness of the proposed ensemble model, several representative baseline models and advanced regression models were selected for extended comparison. The compared models cover simple linear models, regularized linear models, kernel-based models, probabilistic models, and tree-based nonlinear models. By including models with different assumptions and function approximation capabilities, this comparison provides a more comprehensive evaluation of the proposed model in terms of nonlinear relationship capture, interaction-feature utilization, and prediction stability under small-sample conditions. Figure 5 presents the predicted-versus-actual scatter plots of the different models, where “interactions” denotes the second-order interaction features constructed from well spacing, injection rate, and natural fracture angle. LR, RF, GPR, GBRT, and SVR represent linear regression, random forest, Gaussian process regression, gradient boosted regression trees, and support vector regression, respectively. SVR-RBF denotes support vector regression using the radial basis function kernel. The proposed ensemble model consists of GBRT, RF, and Ridge regression.
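The construction of the interaction features can be sketched as follows. It is assumed here that “second-order interaction features” means the pairwise products of the three inputs; whether squared terms are also included is not stated in this section. The column names and sample values are hypothetical.

```python
from itertools import combinations
import numpy as np

def add_pairwise_interactions(X, names):
    """Append the product of every pair of columns to X and name the new columns."""
    X = np.asarray(X, dtype=float)
    cols, out_names = [X], list(names)
    for i, j in combinations(range(X.shape[1]), 2):
        cols.append((X[:, i] * X[:, j])[:, None])
        out_names.append(f"{names[i]}*{names[j]}")
    return np.hstack(cols), out_names

# Two hypothetical samples: well spacing, injection rate, natural fracture angle.
names = ["spacing", "rate", "angle"]
X = np.array([[200.0, 10.0, 30.0],
              [300.0, 12.0, 60.0]])
X_aug, aug_names = add_pairwise_interactions(X, names)
```

Three inputs yield three pairwise products, so each sample grows from three to six features.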

As shown in Figure 5, the proposed ensemble model achieves the best overall performance, with the highest R² value of 0.874. Table 5 summarizes the predictive performance metrics of the compared models.

As shown in Table 5, the proposed ensemble model achieved the best overall performance among all compared models, with an R² of 0.874, RMSE of 0.0079, NRMSE of 0.0508, and MAE of 0.0044. These results indicate that the proposed ensemble model provides the most accurate approximation of the fracture asymmetry index. The GBRT model also achieved competitive performance, with an R² of 0.866, RMSE of 0.0082, NRMSE of 0.0524, and MAE of 0.0037, confirming its ability to capture nonlinear relationships in the current dataset. By integrating the complementary strengths of GBRT, RF, and Ridge regression, the ensemble model further improved the overall R², RMSE, and NRMSE, although the MAE of GBRT alone was slightly lower. GPR also showed strong predictive performance, with an R² of 0.842, RMSE of 0.0089, NRMSE of 0.0570, and MAE of 0.0044. This result is consistent with the suitability of Gaussian process regression for small-sample regression problems. Nevertheless, its R², RMSE, and NRMSE were still inferior to those of the proposed ensemble model, indicating that combining multiple learners can provide additional robustness. RF achieved an R² of 0.821, showing that tree-based ensemble learning can capture part of the nonlinear coupling among well spacing, injection rate, and natural fracture angle. SVR-RBF achieved an R² of 0.780, lower than that of the proposed ensemble model, GBRT, GPR, and RF.

In contrast, the linear and regularized linear models produced substantially lower prediction accuracy. To compare the effect of interaction features, a basic LR model without interaction features was designed, achieving an R² of 0.421; an LR model with interaction features was also implemented, which slightly improved the R² to 0.446. Ridge, Elastic Net, and Lasso showed similar performance, with R² values of 0.437, 0.429, and 0.429, respectively. These results indicate that, although second-order interaction features can provide additional information, linear models are still insufficient to describe the nonlinear relationship between engineering and geological parameters and fracture asymmetry. Therefore, nonlinear learners are necessary for this prediction task.

The performance of LightGBM and XGBoost was also lower than that of the proposed ensemble model. XGBoost achieved an R² of 0.738, while LightGBM achieved an R² of 0.722. This may be related to the limited sample size, under which highly flexible boosting models may not fully exploit their advantages and can become sensitive to the training data distribution. Overall, the comparison demonstrates that the proposed ensemble model achieves a better balance among nonlinear fitting capability, prediction robustness, and model stability. The results also confirm the necessity of combining nonlinear learners with regularized components under small-sample conditions.

3.3. Robustness and Generalization Analysis

Although the above results indicate that the proposed ensemble learning framework can predict fracture asymmetry reasonably well, the reliability of a small-sample surrogate model cannot be judged solely by the accuracy metrics obtained from a single training–testing split. For limited sample data, prediction results may be jointly affected by the data splitting method, model complexity, coverage of the training samples, and prediction uncertainty. In particular, after introducing data augmentation and nonlinear ensemble learning, a high test accuracy may arise either from effective feature learning or from local sample distribution artifacts and model overfitting. Therefore, further examination of the model is required from the perspectives of robustness, complexity control, generalization to different operating conditions, and uncertainty characterization.

Based on the above considerations, this section presents the analysis from four aspects. First, the sensitivity of the model to random data splitting is evaluated through repeated validation to assess whether the prediction performance is statistically stable. Second, a regularization-control experiment is conducted to examine the influence of model complexity on prediction results, thereby distinguishing effective fitting from potential overfitting. Third, a leave-one-simulation-condition-out validation is performed to test the model under unseen operating conditions, evaluating its generalization potential with respect to changes in engineering parameter combinations. Finally, uncertainty assessment is introduced to analyze the error distribution and prediction interval characteristics beyond point predictions, providing supplementary evidence for evaluating model reliability under small-sample conditions.

3.3.1. Repeated Validation

To examine the sensitivity of model performance to random data splitting, a repeated validation analysis was conducted. In small-sample regression tasks, different combinations of training and test samples may lead to noticeable fluctuations in evaluation metrics; therefore, the results from a single training–test split are insufficient to fully characterize the statistical reliability of the model. To reduce the uncertainty caused by such data splitting, multiple random splits of the original samples were performed. In each repetition, the processes of feature construction, training sample augmentation, and model training were all re-executed independently. Meanwhile, the test samples were consistently kept as the original unaugmented data to avoid information leakage.

Based on the results of 50 repeated validations, the prediction performance was statistically summarized using R², RMSE, NRMSE, and MAE. The mean, standard deviation, and 95% confidence interval of each evaluation metric were further calculated to simultaneously quantify the average prediction accuracy of the model and its degree of fluctuation under different random splits. If these metrics remain stable across multiple splits, it indicates that the model performance is not dominated by any single specific sample split and that the model has good robustness under repeated data splitting conditions.

As shown in Table 6, the model achieved an average R² of 0.8484 over 50 repeated validations, with a 95% confidence interval of [0.8179, 0.8790]. Meanwhile, the mean values of NRMSE, RMSE, and MAE were 0.0687, 0.0079, and 0.0043, respectively, indicating that the model maintained a low prediction error overall under repeated random splits. Furthermore, the standard deviation of R² was 0.1075, reflecting that the model performance is still inevitably affected by data splitting under the small-sample condition. Nevertheless, the confidence interval remains at a consistently high level, suggesting that the proposed prediction framework exhibits good robustness to data splitting.
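The summary statistics can be reproduced with a few lines of code. The interval construction is not specified in this section; the sketch below assumes a Student-t interval for the mean, which is an assumption made here. With the reported mean and standard deviation it gives approximately [0.8178, 0.8790], matching the reported interval to within rounding.

```python
import math
import statistics

def mean_ci(scores, t_crit):
    """Mean, sample standard deviation, and mean +/- t_crit * s / sqrt(n)."""
    n = len(scores)
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)          # sample (n - 1) standard deviation
    half = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half, mean + half)

# Consistency check against the reported summary (n = 50, mean 0.8484, s = 0.1075).
# 2.0096 is the two-sided 95% Student-t critical value for 49 degrees of freedom;
# scipy.stats.t.ppf(0.975, 49) returns the same value if SciPy is available.
half_width = 2.0096 * 0.1075 / math.sqrt(50)
reported_like = (0.8484 - half_width, 0.8484 + half_width)
```

The half-width of about 0.031 is small relative to the standard deviation of 0.1075 because it describes the uncertainty of the mean over 50 splits, not the spread of individual splits.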

3.3.2. Regularization-Control Experiment

For small-sample surrogate models, a performance improvement brought by data augmentation does not necessarily equate to enhanced generalization ability. Because Gaussian perturbations mainly generate samples in the neighborhood of the original samples, their effect is closer to local densification of the existing sample distribution than to the introduction of new physical information or additional engineering scenarios. It is therefore necessary to determine whether the accuracy improvement of the final model can be reproduced by conventional complexity-control strategies on the original data. To this end, a regularization-control experiment was set up in which no Gaussian data augmentation or interaction-feature expansion was applied; instead, shrinkage-regularized models (RidgeCV, LassoCV, and ElasticNetCV) were fitted solely to the original data as the control group. This design separates the regularization effect within the original sample space from the effects of JS-divergence-constrained data augmentation and nonlinear interaction features, thereby clarifying the sources of the performance improvement.

Among these cases, Case A is the baseline model using only the original data. Cases B1–B3 introduce RidgeCV, LassoCV, and ElasticNetCV, respectively, on the basis of the original data. RidgeCV shrinks the model coefficients via an L2 penalty to reduce coefficient instability; LassoCV imposes sparsity constraints and variable selection through an L1 penalty; ElasticNetCV combines L1 and L2 penalties to balance feature selection and coefficient shrinkage. Thus, Cases B1–B3 serve as the regularization control group under purely original data conditions. In contrast, Case E represents the modeling strategy adopted in this paper, namely JS-divergence-constrained Gaussian data augmentation combined with second-order interaction features. Since Case E does not introduce new physical variables or additional simulation scenarios, the focus of this comparison is on determining whether the final accuracy improvement can be reproduced solely by conventional regularization methods on the original data.
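The shrinkage effect of the L2 penalty can be illustrated with the closed-form ridge solution. This sketch uses synthetic data and fixed penalty values; it does not reproduce the cross-validated penalty selection performed by RidgeCV, and the L1-based models have no closed form and are not shown.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centred data: (X'X + alpha*I)^-1 X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic data with two nearly collinear columns (illustrative only).
rng = np.random.default_rng(1)
x1 = rng.normal(size=30)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=30), rng.normal(size=30)])
y = X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=30)

# Coefficient norm for increasing penalty strength.
alphas = (0.0, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge_coefficients(X, y, a))) for a in alphas]
```

The coefficient norm does not increase as the penalty grows, which is the mechanism by which the L2 penalty reduces coefficient instability when features are correlated.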

As shown in Table 7, regularization processing of the original data improved the model’s predictive performance to some extent. Compared with Case A, Case B1 (Original data + RidgeCV) increased the average R² from 0.6125 to 0.6782 while reducing the RMSE from 0.0186 to 0.0168. LassoCV and ElasticNetCV yielded moderate improvements, with average R² values of 0.6416 and 0.6539, respectively. This indicates that shrinkage regularization can partially alleviate the instability inherent in small-sample fitting, but performance gains remain limited when modeling solely with the original data.

Compared with the original-data regularization control group, Case E exhibits superior predictive performance across all evaluation metrics. Its average R² reaches 0.8484, with a 95% confidence interval of 0.8179–0.8790; the RMSE, NRMSE, and MAE are 0.0079, 0.0687, and 0.0043, respectively. Relative to Case B1, the best-performing original-data regularization model, Case E achieves an average R² increase of about 25.1%, while RMSE and MAE are reduced by approximately 53.0% and 65.6%, respectively. This suggests that the performance improvement does not stem solely from coefficient shrinkage or conventional regularization on the original data but rather benefits further from local sample augmentation under JS-divergence constraints, second-order interaction feature representation, and the nonlinear mapping capability of ensemble learning.
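As a check on the quoted relative changes, the R² and RMSE figures follow directly from the values reported for Case B1 and Case E (the MAE of Case B1 is not quoted in this section, so the MAE reduction cannot be recomputed here):

```latex
\frac{0.8484 - 0.6782}{0.6782} \approx 0.251,
\qquad
\frac{0.0168 - 0.0079}{0.0168} \approx 0.530
```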

From a mechanistic perspective, the JS-divergence-constrained Gaussian data augmentation does not introduce new physical variables, additional simulation scenarios, or external engineering information. Its function is to generate controlled perturbed samples in the neighborhood of the original samples and to use distributional similarity constraints to prevent the augmented samples from deviating significantly from the original data distribution. Therefore, this augmentation process does not create new physical information but rather increases the local coverage density within the limited sample space. Meanwhile, second-order interaction features explicitly represent the coupling among well spacing, injection rate, and natural fracture angle, while ensemble learning reduces the sensitivity of a single model to local sample noise by blending diverse base learners. If the accuracy improvement mainly came from an interpolation structure made to more easily fit the augmented data, then the regularization models based on the original data (Cases B1–B3) should have achieved comparable performance. However, their improvements are clearly limited, indicating that simple shrinkage regularization cannot reproduce the predictive performance of Case E.

Thus, this comparison demonstrates that the performance gain of the final model does not merely arise from conventional regularization or the convenience of fitting due to altered data morphology, but more likely from the combined effect of distribution-controlled data augmentation, expression of parametric interaction relationships, and the stabilization mechanism of ensemble learning. Subsequent leave-one-simulation-scenario validation will further test whether this performance improvement extends to unseen simulation conditions, thereby providing a more rigorous assessment of the model’s cross-scenario generalization ability.

3.3.3. Leave-One-Simulation-Condition-Out Validation

In Section 3.3.1, repeated random partitioning validation was used to evaluate the sensitivity of model performance to different splits of training and test sets. The results showed that the proposed model maintains relatively stable predictive accuracy under different random splits. However, the training and test sets under random partitioning may still contain similar combinations of simulation conditions, so the results mainly reflect the model’s prediction stability within the distribution of the existing samples. To further examine the model’s generalization ability to unseen original simulation conditions, this section adopts a leave-one-simulation-condition-out validation.

In this validation, one specific simulation condition is held out from the original simulation database each time as an independent test sample. This condition consists of a particular combination of well spacing, injection rate, and natural fracture angle. The remaining original conditions are used as the training basis, and only under these training conditions are JS-divergence-constrained data augmentation and feature construction performed. The held-out test condition does not participate in data augmentation, feature construction, or model training and remains as an unaugmented original sample. If the original database contains N simulation conditions, the process is repeated N times so that each original condition sequentially serves as an independent test sample. This procedure prevents augmented samples generated from the same original condition from entering both the training and test sets, thereby reducing the risk of information leakage due to data augmentation.
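The leakage-free ordering of the steps described above can be sketched as follows. The data are synthetic, and `gaussian_jitter` is a hypothetical stand-in for the augmentation step: it adds Gaussian noise to the training inputs only and does not implement the JS-divergence constraint. Leaving the targets unperturbed and using a noise scale of 0.05 are also assumptions.

```python
import numpy as np

def leave_one_condition_out(condition_ids):
    """Yield (train_idx, test_idx), holding out all rows of one condition at a time."""
    condition_ids = np.asarray(condition_ids)
    for cond in np.unique(condition_ids):
        test = np.flatnonzero(condition_ids == cond)
        train = np.flatnonzero(condition_ids != cond)
        yield train, test

def gaussian_jitter(X, y, n_copies, scale, rng):
    """Hypothetical stand-in for augmentation: noisy copies of the training rows."""
    X_rep = np.repeat(X, n_copies, axis=0)
    y_rep = np.repeat(y, n_copies)
    noise = rng.normal(0.0, scale * X.std(axis=0), size=X_rep.shape)
    return np.vstack([X, X_rep + noise]), np.concatenate([y, y_rep])

rng = np.random.default_rng(0)
X = rng.uniform(size=(6, 3))           # six made-up simulation conditions
y = rng.uniform(size=6)
ids = np.arange(6)                     # one row per original condition

splits = []
for train_idx, test_idx in leave_one_condition_out(ids):
    # Augment the training conditions only; the held-out row stays untouched.
    X_train, y_train = gaussian_jitter(X[train_idx], y[train_idx], 3, 0.05, rng)
    X_test, y_test = X[test_idx], y[test_idx]
    splits.append((X_train.shape[0], X_test.shape[0]))
```

Because augmentation happens inside the loop, after the split, no perturbed copy of the held-out condition can appear in the training set.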

This validation method holds out one specific combination of conditions at a time. This setup better matches the structure of the current small-sample database and, while preserving the size of the training set, allows testing of the model’s predictive ability under original conditions not involved in training. Finally, R², RMSE, NRMSE, and MAE are used to evaluate the overall predictive performance of the model across all held-out conditions.

As shown in Table 8, under this leave-one-simulation-condition-out validation, the Ensemble model achieves the best overall performance, with R², RMSE, NRMSE, and MAE of 0.58, 0.0142, 0.1235, and 0.0100, respectively. The prediction accuracies of RF and GBRT are slightly lower than that of the ensemble model, indicating that nonlinear tree models can still capture some of the asymmetric fracture response patterns. In contrast, the Ridge model yields an R² of 0.47, with RMSE and MAE of 0.0158 and 0.0112, respectively, suggesting that while the linear regularized model exhibits some stability, its ability to characterize nonlinear responses under varying complex conditions is relatively limited.

These results further indicate that the ensemble model maintains reasonably good predictive capability under strictly unseen condition validation, but its accuracy is notably lower than the previously reported predictive performance. This shows that JS-divergence-constrained data augmentation and interaction features can improve local learning stability within the sampled parameter space but cannot fully replace the physical response information provided by adding new real simulation conditions. Therefore, for boundary conditions or parameter combinations with strong response variations, further calibration using additional numerical simulations or uncertainty analysis is still necessary.

3.3.4. Uncertainty Assessment

The repeated validation described above provides the mean, standard deviation, and 95% confidence intervals of evaluation metrics such as R², RMSE, NRMSE, and MAE over 50 random splits, which are used to characterize the statistical stability of the overall model predictive performance. However, the object of such confidence intervals is the evaluation metrics themselves, not the possible range of variation in individual sample predictions. Therefore, even if the model exhibits a high average R² and low error levels under repeated validation, it is still necessary to further analyze the uncertainty of individual prediction results.

To this end, this section introduces prediction intervals as a supplementary evaluation of the reliability of model outputs. A prediction interval indicates the possible range of values that the model gives for a prediction on a given sample. A wider interval generally implies greater uncertainty for that sample’s prediction, whereas a narrower interval indicates a more concentrated prediction range, though an overly narrow interval may fail to cover the true response. PI coverage denotes the proportion of samples for which the true value falls within the prediction interval and is used to assess the coverage capability of the prediction interval. Mean PI width is the average width of the prediction intervals, and PINAW is the ratio of the average prediction interval width to the range of the target variable, serving as a measure of the relative width of the prediction intervals. In general, a higher PI coverage indicates that the prediction interval more adequately encloses the true response, while a lower PINAW indicates a relatively narrower interval; however, the two need to be analyzed together, because an overly narrow interval, despite having a low PINAW, may lead to insufficient coverage.
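The three interval metrics defined above can be computed as in the following sketch. It assumes that the range in the PINAW denominator is taken over the evaluated target values; the example numbers are made up.

```python
import numpy as np

def interval_metrics(y_true, lower, upper):
    """Return PI coverage, mean PI width, and PINAW for one set of intervals."""
    y_true, lower, upper = (np.asarray(a, dtype=float) for a in (y_true, lower, upper))
    inside = (y_true >= lower) & (y_true <= upper)
    coverage = float(np.mean(inside))
    mean_width = float(np.mean(upper - lower))
    pinaw = mean_width / float(y_true.max() - y_true.min())
    return coverage, mean_width, pinaw

# Made-up values: three of the four targets fall inside their intervals.
y = np.array([0.0, 1.0, 2.0, 4.0])
lo = np.array([-0.5, 0.5, 2.5, 3.0])
hi = np.array([0.5, 1.5, 3.5, 5.0])
coverage, mean_width, pinaw = interval_metrics(y, lo, hi)
```

Here the coverage is 0.75, the mean width is 1.25, and the PINAW is 1.25 / 4 = 0.3125.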

This paper adopts two methods for prediction interval estimation: Bootstrap ensemble and Quantile GBRT. The Bootstrap ensemble repeatedly resamples the training data with replacement and retrains the model on each resampled dataset, thereby obtaining a set of predictions for the same sample; the dispersion of this set of predictions can be used to construct empirical prediction intervals. Thus, the Bootstrap ensemble primarily reflects the prediction fluctuation of the model under sample perturbations. Quantile GBRT, on the other hand, employs quantile regression to estimate the lower and upper quantiles of the target response, directly yielding a prediction interval at a given confidence level. Unlike the Bootstrap ensemble, which relies on the prediction distribution from resampling, Quantile GBRT focuses more on characterizing the upper and lower bounds of the conditional response distribution. Therefore, these two methods provide complementary perspectives for evaluating the credible range of model outputs under small-sample conditions. Table 9 presents a comparison of point prediction and prediction-interval performance.
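The bootstrap procedure can be sketched as below. A least-squares line on synthetic data is used as a stand-in for the ensemble model, and the number of resamples and the percentile construction are assumptions made for illustration.

```python
import numpy as np

def bootstrap_interval(X, y, X_new, n_boot=200, level=0.95, seed=0):
    """Percentile interval from models refitted on bootstrap resamples.

    A least-squares line is the stand-in model; the study's ensemble
    would take its place.
    """
    rng = np.random.default_rng(seed)
    A = np.column_stack([np.ones(len(X)), X])
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    preds = np.empty((n_boot, len(X_new)))
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))      # sample rows with replacement
        coef, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        preds[b] = A_new @ coef
    tail = (1.0 - level) / 2.0
    lower = np.quantile(preds, tail, axis=0)
    upper = np.quantile(preds, 1.0 - tail, axis=0)
    return lower, upper

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=40)
y = 2.0 * X + rng.normal(0.0, 0.5, size=40)             # noisy synthetic response
lower, upper = bootstrap_interval(X, y, X)
coverage = float(np.mean((y >= lower) & (y <= upper)))
```

Intervals built this way capture only the variability of the fitted model across resamples and ignore the noise in individual responses, so their coverage of observed values typically falls well below the nominal level.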

Table 9 summarizes the point prediction accuracy and prediction interval reliability under five-fold cross-validation on the augmented data. Point ensemble achieves the highest deterministic prediction accuracy, with an R² of 0.8637, RMSE of 0.0082, NRMSE of 0.0528, and MAE of 0.0058. These numerical values of the evaluation metrics lie within the intervals obtained from repeated validation, lending credibility to the results. This indicates that the constructed ensemble model can approximate the values of the fracture asymmetry indicator reasonably well. However, Point ensemble provides only a single predicted value for each sample and cannot directly reflect the uncertainty range of individual predictions; therefore, its reliability needs to be further evaluated using prediction intervals.

Bootstrap ensemble also maintains high point prediction accuracy, with an R² of 0.8290 and an NRMSE of 0.0592. However, its 95% prediction interval coverage is only 0.3594, while the mean PI width and PINAW are 0.0123 and 0.0792, respectively. This indicates that the prediction intervals given by Bootstrap ensemble are rather narrow, but the true values of many validation samples fall outside these intervals. Therefore, under the current small-sample conditions, constructing prediction intervals solely from the dispersion among Bootstrap resampling models may underestimate the uncertainty of model predictions. In contrast, the point prediction accuracy of the Quantile GBRT 95% prediction interval is lower than that of Point ensemble, with an R² of 0.6168, RMSE of 0.0138, NRMSE of 0.0886, and MAE of 0.0078, but its prediction interval coverage reaches 0.9102. The corresponding mean PI width and PINAW are 0.0515 and 0.3306, respectively, indicating that this method provides wider and more conservative uncertainty ranges. This result suggests that, under conditions of limited sample size and nonlinear fluctuations in the local response, wider prediction intervals can more adequately encompass the true responses of the validation samples, thereby avoiding overly optimistic interpretations of the model predictions.

Figure 6 further shows the 95% prediction intervals obtained by the Quantile GBRT method. The validation samples are sorted by the actual values of the fracture asymmetry indicator η, where the blue curve represents the actual values, the orange curve represents the predicted values, and the light blue shaded area represents the 95% prediction intervals. It can be seen that most of the actual values fall within the prediction intervals, which is consistent with the actual coverage (95% PI coverage = 0.9102) of Quantile GBRT in Table 9. The mean prediction interval width with this method is 0.0515 (PINAW = 0.3306), indicating that it covers about 91.02% of the true responses of the validation samples through a relatively conservative interval range. In contrast, the PINAW of the Bootstrap ensemble is only 0.0792, but its coverage is only 0.3594, indicating that its prediction intervals are too narrow and underestimate the prediction uncertainty under small-sample conditions.

Overall, Point ensemble provides high-accuracy point predictions, while Quantile GBRT can provide more reliable sample-level uncertainty ranges. Therefore, this paper not only uses R², RMSE, NRMSE, and MAE to evaluate the deterministic prediction accuracy of the model but also incorporates prediction interval information to further assess the reliability of the model in predicting fracture asymmetry.

3.4. Interpretability and Feature-Importance Robustness Analysis

3.4.1. Correlation and Multicollinearity Diagnostics

After evaluating the model’s prediction accuracy and prediction uncertainty, it is still necessary to further analyze the mechanism of feature influence underlying the model’s predictions. The previous results show that the proposed model can predict the fracture asymmetry indicator (η) well, but the error metrics and prediction intervals alone cannot explain how different input variables affect the model output. Therefore, this section proceeds to analyze feature relationships and model interpretability to reveal the contributions of well spacing, injection rate, natural fracture angle, and their interactions to fracture asymmetry. To avoid interference from input variable correlation or multicollinearity in the subsequent feature contribution analysis, a correlation and multicollinearity analysis of the input variables is first conducted.

Figure 7 below presents the Pearson correlation coefficients and Spearman rank correlation coefficients among well spacing, injection rate, and natural fracture angle. Pearson correlation mainly reflects the degree of linear correlation between variables, while Spearman rank correlation further determines whether a monotonic relationship exists between variables.

As shown in Figure 7a, the Pearson correlation coefficients among the three original input variables are all close to zero. Specifically, the correlation coefficient between well spacing and injection rate is 0.00256, that between well spacing and natural fracture angle is −0.00334, and that between injection rate and natural fracture angle is 0.041. The Spearman rank correlation coefficients in Figure 7b also remain at low levels, with a maximum absolute value of only 0.0825. These results indicate that there is no significant linear or monotonic correlation among the original input variables, suggesting that the main engineering control parameters in the dataset are essentially uncorrelated with one another.
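The two coefficients can be computed as in the sketch below, which is offered only as an illustration. Spearman correlation is taken here as the Pearson correlation of average ranks; the helper names and the small factorial design are hypothetical and are not the study's dataset.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical balanced design: every spacing level is paired with every rate.
spacing = [200, 200, 350, 350, 550, 550]
rate = [5, 20, 5, 20, 5, 20]
r_design = pearson(spacing, rate)

# A monotonic but nonlinear relation: Spearman is 1, Pearson is below 1.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
r_linear, r_rank = pearson(x, y), spearman(x, y)
```

The second example shows why both coefficients are reported: a rank-based measure detects monotonic dependence that a linear measure understates.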

Furthermore, multicollinearity is diagnosed using the variance inflation factor (VIF). The definition of the variance inflation factor is given by the following formula:

{VIF}_{i} = \frac{1}{1 - R_{i}^{2}}

(14)

where R_{i}^{2} is the coefficient of determination obtained from a linear regression using the i-th feature as the dependent variable and the remaining features as independent variables.

According to commonly used empirical criteria in regression diagnostics, VIF < 5 generally indicates no significant harmful multicollinearity; 5 ≤ VIF < 10 suggests that some degree of multicollinearity may exist but is still acceptable; and VIF ≥ 10 typically indicates strong multicollinearity, requiring further consideration of variable selection, combination, or model re-specification. Figure 8 below shows the VIF plot for the original data in this paper.

As can be seen from the results in Figure 8, the VIF values of the three original input variables are all close to 1, far below the commonly used threshold of 5, indicating that there is no serious multicollinearity problem in the original feature space. Therefore, the high contributions exhibited by natural fracture angle, well spacing, and their interaction terms in the subsequent model interpretation are not caused by strong correlations or multicollinearity among the original input variables but more likely reflect the actual controlling effects of these engineering parameters on the fracture asymmetry response.
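A minimal sketch of the VIF calculation in Equation (14) is given below. It regresses each centered feature on the remaining centered features by solving the normal equations, then applies 1/(1 − R²). The helper names `solve` and `vif` and the balanced toy design are assumptions made for illustration; they do not reproduce the study's data or code.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def vif(columns):
    """Return one VIF per feature column (columns must have equal length)."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    result = []
    for i, target in enumerate(centered):
        others = [c for j, c in enumerate(centered) if j != i]
        # Least squares on centered data; the intercept is absorbed by centering.
        gram = [[sum(p * q for p, q in zip(u, v)) for v in others] for u in others]
        rhs = [sum(p * q for p, q in zip(u, target)) for u in others]
        beta = solve(gram, rhs)
        fitted = [sum(b * c[k] for b, c in zip(beta, others)) for k in range(n)]
        ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
        ss_tot = sum(t * t for t in target)
        r2 = 1.0 - ss_res / ss_tot
        result.append(1.0 / (1.0 - r2))
    return result


# Hypothetical balanced design over three inputs (not the study's samples).
spacing, rate, angle = [], [], []
for s in (200.0, 350.0, 550.0):
    for q in (5.0, 20.0):
        for a in (30.0, 90.0):
            spacing.append(s)
            rate.append(q)
            angle.append(a)
vifs = vif([spacing, rate, angle])
```

In a balanced design such as this one, every auxiliary regression has R² near zero, so each VIF is close to 1, which is the pattern reported for the original variables in Figure 8.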

At the level of the original input variables, the Pearson and Spearman correlation analyses as well as the VIF results all indicate that there is no significant correlation or severe multicollinearity among well spacing, injection rate, and natural fracture angle. These findings suggest that the original features are essentially uncorrelated, and that the subsequent feature contribution analysis is unlikely to be directly disturbed by strong correlations among the original variables. However, because this paper further introduces second-order interaction terms to characterize the coupling effects among engineering parameters, and these interaction terms are constructed by multiplying the original engineering parameters, they may introduce new variable correlations and feature redundancy while enhancing the nonlinear representation capability of the model. Therefore, this paper further conducts a correlation analysis between the original features and the interaction features to evaluate the impact of feature construction on model stability and interpretability. Figure 9 below shows the correlation diagnostics of second-order interaction features.

On this basis, the correlation structure after constructing the second-order interaction features is further analyzed. Figure 9 shows the Pearson and Spearman correlation matrices for the original variables together with their second-order interaction terms. Unlike the near-zero correlations among the original variables, moderate to strong correlations appear between some product features and their corresponding original variables after introducing the interaction terms. The Pearson correlation results show that the correlation coefficients between well spacing and “well spacing × injection rate” and between well spacing and “well spacing × natural fracture angle” are 0.721 and 0.641, respectively; between injection rate and “well spacing × injection rate” and between injection rate and “injection rate × natural fracture angle” they are 0.750 and 0.663, respectively; and between natural fracture angle and “well spacing × natural fracture angle” and between natural fracture angle and “injection rate × natural fracture angle” they are 0.719 and 0.781, respectively. The Spearman rank correlations show similar patterns: for example, the rank correlation coefficients between well spacing and its related interaction terms are approximately 0.710–0.714; between injection rate and its related interaction terms, approximately 0.615–0.696; and between the natural fracture angle and its related interaction terms, approximately 0.550–0.568.

Such correlations do not imply severe collinearity among the original variables but rather reflect structural correlations introduced by the construction of second-order interaction features. Since a product term inherently contains information about its constituent original variables, it will necessarily show high correlations with those variables. “Well spacing × natural fracture angle” carries information about both well spacing and natural fracture angle and therefore exhibits strong correlations with both variables. Similarly, the Pearson correlation coefficient between “injection rate × natural fracture angle” and natural fracture angle reaches 0.781, indicating that this interaction term contains strong information about the fracture angle. These results also demonstrate from a statistical perspective that the second-order interaction terms do not simply increase the number of variables but rather explicitly embed coupling information among engineering parameters into the feature space.
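The structural nature of this correlation can be illustrated with a short sketch. For independent variables X and Y, Cov(X, XY) = Var(X)·E[Y] and Var(XY) = Var(X)Var(Y) + Var(X)E[Y]² + Var(Y)E[X]², so corr(X, XY) is nonzero whenever E[Y] ≠ 0. The simulation below assumes, purely for illustration, that well spacing and injection rate are drawn independently and uniformly from 200–550 m and 5–20 m³/min; the study's actual sampling is not uniform, so the value obtained here (about 0.60) is not expected to match the reported 0.721.

```python
import math
import random


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def product_corr_theory(mean_x, var_x, mean_y, var_y):
    """corr(X, X*Y) for independent X and Y, from the moments given above."""
    var_xy = var_x * var_y + var_x * mean_y ** 2 + var_y * mean_x ** 2
    return mean_y * math.sqrt(var_x) / math.sqrt(var_xy)


# Assumed (hypothetical) sampling: independent uniform draws over the ranges.
rng = random.Random(0)
n = 20000
spacing = [rng.uniform(200.0, 550.0) for _ in range(n)]
rate = [rng.uniform(5.0, 20.0) for _ in range(n)]
product = [s * q for s, q in zip(spacing, rate)]

r_inputs = pearson(spacing, rate)      # near zero: inputs are independent
r_product = pearson(spacing, product)  # clearly nonzero: structural correlation
# A uniform variable on [a, b] has mean (a + b) / 2 and variance (b - a)^2 / 12.
r_theory = product_corr_theory(375.0, 350.0 ** 2 / 12, 12.5, 15.0 ** 2 / 12)
```

Even with uncorrelated inputs, the product term correlates with its factors, which supports reading the correlations in Figure 9 as a by-product of feature construction rather than as collinearity among the original variables.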

3.4.2. SHAP-Based Feature Contribution Analysis

Based on the correlation and multicollinearity diagnostics, to investigate the mechanisms controlling fracture asymmetry in greater detail, this section analyses SHAP-based feature importance, the interaction between natural fractures and well spacing, and feature dependence patterns.

Figure 10 presents the global SHAP feature importance ranking. SHAP (SHapley Additive exPlanations) explains black-box models by quantifying the contribution of each input feature to individual predictions, thereby providing both global and local interpretability [69,70,71].

Each feature in Figure 10 is ranked by its mean absolute SHAP value (horizontal axis, ranging from 0 to 0.012). This analysis covers all core input parameters considered in this study (see Table 5). The mean absolute SHAP value represents the average magnitude of a feature's contribution to the model's predictions; therefore, a higher value indicates a stronger influence on the predicted fracture asymmetry and provides a direct ranking of feature importance.

The results show that the natural fracture angle is the most influential feature, with a mean absolute SHAP value of approximately 0.012, markedly higher than those of the other input variables. The natural fracture angle therefore contributes the most and is the dominant controlling factor.
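To make the ranking quantity concrete, the sketch below computes exact Shapley values for a small toy model by enumerating feature coalitions, then averages their absolute values over samples. It is an illustration under stated assumptions: `toy_model` is a hypothetical function and not the trained ensemble, features absent from a coalition are replaced by a single baseline point (the column means), and the study itself presumably relied on a SHAP implementation rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to one baseline point."""
    n = len(x)

    def value(coalition):
        # Features outside the coalition are set to their baseline value.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def mean_abs_shap(predict, rows, baseline):
    """Global importance: mean absolute Shapley value per feature."""
    totals = [0.0] * len(baseline)
    for x in rows:
        for j, p in enumerate(shapley_values(predict, x, baseline)):
            totals[j] += abs(p)
    return [t / len(rows) for t in totals]


def toy_model(z):
    """Hypothetical response with an angle effect and a spacing-angle coupling."""
    spacing, rate, angle = z
    return (0.02 - 0.0005 * (angle - 60.0)
            + 2e-6 * (spacing - 375.0) * (angle - 60.0)
            + 1e-4 * (rate - 12.5))


# Hypothetical samples: (well spacing, injection rate, natural fracture angle).
rows = [(200.0, 5.0, 30.0), (350.0, 20.0, 60.0), (550.0, 5.0, 90.0),
        (200.0, 20.0, 90.0), (550.0, 20.0, 30.0)]
baseline = [sum(col) / len(rows) for col in zip(*rows)]
importance = mean_abs_shap(toy_model, rows, baseline)
phi_first = shapley_values(toy_model, rows[0], baseline)
```

For any sample, the Shapley values sum to the difference between the prediction and the baseline prediction, which is why their mean absolute values can be compared across features on a common scale.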

Figure 11 presents the SHAP summary plot, which is used to analyze the contribution direction of different feature values to the prediction of the fracture asymmetry index. The horizontal axis represents the SHAP value, where positive values indicate that the feature increases the predicted value η̂, and negative values indicate that it decreases η̂. The color scale from blue to purple represents feature values from low to high. It can be observed that the natural fracture angle exhibits the widest SHAP value distribution, indicating its strongest impact on the model predictions. Low natural fracture angles mainly correspond to positive SHAP values, while high angles are more distributed near zero or in the negative SHAP region, suggesting that, under the current signed index definition, a smaller natural fracture angle tends to increase η̂, whereas a larger angle tends to decrease η̂ or weaken its positive contribution. The interaction term between well spacing and the natural fracture angle ranks second in importance, indicating that well spacing modulates the influence of natural fracture orientation on the fracture asymmetry response. It should be noted that, since η is a signed index, “decrease” here refers to a reduction in the algebraic value of the prediction and does not necessarily imply a reduction in the degree of fracture asymmetry.

The above SHAP results are consistent with the interaction mechanism between hydraulic fractures and natural fractures. The natural fracture angle determines the intersection relationship between hydraulic fractures and pre-existing weak planes, thereby affecting whether the hydraulic fracture crosses the natural fracture, deflects along the weak plane, or forms local branches. Renshaw and Pollard [72] proposed an experimentally verified criterion for fracture propagation across unbounded frictional interfaces in brittle linear elastic materials, providing a mechanical basis for evaluating whether hydraulic fractures can cross or be arrested by pre-existing weak interfaces. Gu et al. [73] further extended this criterion to non-orthogonal intersections between hydraulic fractures and natural fractures, pointing out that the intersection angle is a key factor controlling whether the hydraulic fracture crosses or deflects. The multi-branch hydraulic fracture model by Dahi-Taleghani and Olson [74] shows that interactions between induced fractures and natural fractures lead to complex fracture geometries and asymmetric propagation behavior. Therefore, the natural fracture angle emerges as the dominant feature in the SHAP analysis, indicating that the model captures the controlling effect of natural fracture guidance on fracture propagation paths and asymmetric responses.

Well spacing mainly affects fracture asymmetry by modulating the stress shadow effect between fractures from adjacent wells. A smaller well spacing enhances the superposition of stress fields induced by fractures from neighboring wells, alters the local principal stress direction and the stress state near the fracture tip, and thus further influences the effective intersection relationship between hydraulic fractures and natural fractures. Existing complex fracture network models and three-dimensional hydraulic fracture stress shadow studies have shown that mechanical interactions between fractures significantly modify the local stress perturbation range, fracture propagation direction, and competitive fracture growth behavior [75,76]. Consequently, the effect of well spacing on fracture asymmetry is not primarily an independent control of fracture paths but rather a modulating role by changing the local stress environment in which natural-fracture-guided propagation occurs. This also explains why the “well spacing × natural fracture angle” interaction term emerges as a secondary but significant interactive feature in the SHAP analysis: it reflects the coupled modulation mechanism between inter-well stress interference and the guiding effect of natural fractures.

To further test whether the SHAP feature ranking is affected by a single data split or small-sample fluctuations, this paper employs a Bootstrap resampling method to analyze the stability of the mean absolute SHAP values. This method constructs multiple resampled datasets by sampling the training data with replacement, retrains the model on each resampled dataset, and computes the SHAP values. The resulting distribution of mean absolute SHAP values can be used to characterize the range of fluctuation of feature contributions under sample perturbations, thereby determining whether the feature importance ranking is stable. Figure 12 shows the Bootstrap mean absolute SHAP values with 95% confidence intervals.
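The resampling procedure described above can be sketched as follows. The function `bootstrap_ci` resamples rows with replacement, recomputes a per-feature importance on each resample, and reports percentile intervals. The importance function used here is a deliberately simple stand-in: it fits a separate one-feature linear model per input and returns its mean absolute contribution, whereas the study retrains the full model and computes SHAP values. All names and the synthetic data are hypothetical.

```python
import random


def bootstrap_ci(rows, importance_fn, n_boot=200, alpha=0.05, seed=0):
    """Percentile confidence intervals for per-feature importances."""
    rng = random.Random(seed)
    n = len(rows)
    draws = []
    for _ in range(n_boot):
        # Resample the training rows with replacement, then recompute.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        draws.append(importance_fn(sample))
    intervals = []
    for j in range(len(draws[0])):
        values = sorted(d[j] for d in draws)
        lo = values[int((alpha / 2) * (n_boot - 1))]
        hi = values[int((1 - alpha / 2) * (n_boot - 1))]
        intervals.append((lo, hi))
    return intervals


def toy_importance(sample):
    """Stand-in for 'retrain the model and compute mean |SHAP|' per feature.

    Each row is (feature_1, ..., feature_k, target). For a one-feature linear
    fit, the contribution of a sample is slope * (x - mean(x)).
    """
    n = len(sample)
    ys = [row[-1] for row in sample]
    my = sum(ys) / n
    out = []
    for j in range(len(sample[0]) - 1):
        xs = [row[j] for row in sample]
        mx = sum(xs) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        slope = 0.0 if sxx == 0 else sxy / sxx
        out.append(sum(abs(slope * (x - mx)) for x in xs) / n)
    return out


# Synthetic data (hypothetical): strong angle effect, weak rate effect, noise.
data_rng = random.Random(1)
rows = []
for _ in range(60):
    angle = data_rng.uniform(30.0, 90.0)
    rate = data_rng.uniform(5.0, 20.0)
    eta = -0.0005 * angle + 0.0001 * rate + data_rng.gauss(0.0, 0.002)
    rows.append((angle, rate, eta))

intervals = bootstrap_ci(rows, toy_importance)
```

When the interval of the top-ranked feature does not overlap those of the others, the ranking can be regarded as stable under sample perturbation, which is the criterion applied to Figure 12.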

The results show that the natural fracture angle consistently has the highest mean absolute SHAP value under bootstrap resampling, and its 95% confidence interval is clearly separated from those of the other features. This indicates that its dominant contribution is not driven by a single training run or a specific resampled dataset. The interaction term between well spacing and natural fracture angle ranks second in importance, and its confidence interval is also separated from those of the remaining lower-ranked features, suggesting that this interaction term provides a stable explanatory contribution under different sample perturbations. In contrast, the mean absolute SHAP values of the well spacing–injection rate interaction term, injection rate alone, well spacing alone, and the injection rate–natural fracture angle interaction term are relatively small and show overlapping confidence intervals, indicating that these features mainly act as secondary moderating factors.

Overall, the bootstrap SHAP analysis provides additional evidence for the stability of the above interpretation from the perspective of constructed features. The prediction of fracture asymmetry is primarily controlled by the natural fracture angle and is further influenced by its interaction with well spacing. However, SHAP analysis evaluates the contribution of individual constructed features rather than the grouped effect of the original physical variables. Therefore, group permutation importance is further employed in the next section in a supplementary analysis to examine the robustness of this interpretation at the original-variable level.

3.4.3. Grouped Permutation Importance

To further examine the feature interpretation results from the perspective of the original physical variables, Bootstrap group permutation importance analysis is adopted. Unlike SHAP analysis, which evaluates the contribution of each constructed feature individually, the group permutation method perturbs all features related to each original input variable (i.e., the variable itself and its associated interaction terms) as a whole. If perturbing a variable group leads to a significant increase in model prediction error, it indicates that the original variable and its associated coupling information make a high contribution to the model predictions.
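A minimal sketch of grouped permutation importance is shown below. It assumes that all columns of a group are shuffled jointly, using the same donor row for each column, and that importance is measured as the mean increase in RMSE over repeated shuffles. The feature layout, the group definitions, and `toy_predict` are hypothetical; the study's exact permutation scheme is not specified in this section.

```python
import math
import random


def rmse(predict, rows, targets):
    """Root mean square error of predict over the given rows."""
    sq = sum((predict(r) - t) ** 2 for r, t in zip(rows, targets))
    return math.sqrt(sq / len(rows))


def grouped_permutation_importance(predict, rows, targets, groups,
                                   n_repeats=20, seed=0):
    """Mean RMSE increase when all columns of a group are shuffled together."""
    rng = random.Random(seed)
    base = rmse(predict, rows, targets)
    result = {}
    for name, cols in groups.items():
        increases = []
        for _ in range(n_repeats):
            perm = list(range(len(rows)))
            rng.shuffle(perm)
            shuffled = []
            for i, row in enumerate(rows):
                new_row = list(row)
                for c in cols:
                    # Same donor row for every column in the group.
                    new_row[c] = rows[perm[i]][c]
                shuffled.append(new_row)
            increases.append(rmse(predict, shuffled, targets) - base)
        result[name] = sum(increases) / n_repeats
    return result


# Assumed column layout: spacing, rate, angle, and the three product terms.
rows = []
for s in (200.0, 350.0, 550.0):
    for q in (5.0, 20.0):
        for a in (30.0, 60.0, 90.0):
            rows.append([s, q, a, s * q, s * a, q * a])

groups = {
    "well spacing": [0, 3, 4],
    "injection rate": [1, 3, 5],
    "natural fracture angle": [2, 4, 5],
}


def toy_predict(row):
    """Hypothetical model that uses only the angle and spacing x angle."""
    return -0.0005 * row[2] + 2e-7 * row[4]


targets = [toy_predict(r) for r in rows]
importance = grouped_permutation_importance(toy_predict, rows, targets, groups)
```

Because each interaction term belongs to two groups, the group scores are not additive. An alternative design would permute only the original variable and recompute its product terms, which keeps every shuffled row internally consistent.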

Figure 13 shows that permuting the natural fracture angle group causes the largest increase in RMSE, approximately 0.021, indicating that destroying information related to the natural fracture angle significantly reduces model prediction accuracy. Therefore, the natural fracture angle is the most important original controlling variable for fracture asymmetry prediction. Permuting the well spacing group leads to an RMSE increase of about 0.011, which is lower than that of the natural fracture angle but significantly higher than that of the injection rate, indicating that although well spacing is not the dominant independent controlling factor, it still makes an important contribution to fracture asymmetry prediction through its influence on inter-well stress shadow intensity and its coupling with the natural fracture angle. This result is consistent with the SHAP analysis conclusion: fracture asymmetry prediction is primarily controlled by the natural fracture angle and modulated by the coupling effects related to well spacing, while the contribution of injection rate is relatively weak.

Combining the correlation analysis, SHAP interpretation results, and group permutation importance results, it can be seen that fracture asymmetry prediction does not simply arise from statistical correlations among input variables but is closely related to natural fracture orientation and inter-well stress interference conditions. The natural fracture angle exhibits the most significant influence in both the SHAP analysis and the group permutation analysis, indicating that it plays a dominant role in hydraulic fracture crossing, deflection, and propagation along natural fractures. The influence of well spacing is mainly reflected in altering the stress shadow range and interference intensity between adjacent well fractures and further affects the asymmetric fracture response through its coupling with the natural fracture angle. In contrast, within the current parameter range, the injection rate and its associated interaction terms have a relatively weak impact on prediction results, suggesting that injection rate is not a primary factor controlling fracture asymmetry. These results demonstrate that the proposed interpretable learning framework not only achieves good predictive accuracy but also reveals the key influencing factors and their interactions that are consistent with the physical mechanisms of multi-well fracturing.

4. Discussion

Section 3 validates the effectiveness of the proposed framework in terms of prediction accuracy, robustness, generalization ability, and feature interpretability. To further clarify the engineering implications and applicable scope of the model results, this section discusses the physical mechanisms of fracturing underlying the predictions, the applicability boundaries imposed by the model assumptions, and potential engineering applications.

4.1. Engineering Interpretation of Model Results

Based on the comprehensive prediction performance, robustness tests, and feature interpretation results, it can be seen that fracture asymmetry is not determined by a single operational parameter but is jointly influenced by geological structural controls and inter-well mechanical interference. Within the current parameter range, the natural fracture angle has the most prominent effect on the asymmetric fracture response, indicating that the natural fracture angle is an important geological factor governing the difference in fracture propagation among multiple wells. Although well spacing is not the most important factor in the individual feature ranking, its interaction term with the natural fracture angle exhibits a high contribution, suggesting that inter-well distance primarily affects the imbalance of fracture propagation under natural-fracture guidance by altering the stress interference environment between adjacent fractures.

The influence of the natural fracture angle on fracture asymmetry is not a simple linear relationship but is closely related to the interaction mechanism between hydraulic fractures and natural fractures. When the natural fracture orientation is relatively aligned with the direction of the maximum principal stress or the dominant propagation direction of the hydraulic fracture, the natural fractures are more easily activated. They may provide a low-resistance propagation path for the hydraulic fracture, causing the fracture to deflect or extend along local weak planes, thereby enhancing the imbalance in fracture area distribution among different wells. As the natural fracture angle increases, the manner in which hydraulic fractures encounter natural fractures changes. The fracture propagation may gradually shift from deflecting along natural fractures to crossing, local stagnation, or branching. This transition in propagation mode typically exhibits threshold characteristics; therefore, the effect of the natural fracture angle on fracture asymmetry manifests as a nonlinear relationship rather than a monotonic linear change.

Furthermore, the role of the natural fracture angle is also modulated by the stress shadow effect controlled by well spacing. Under smaller well spacing conditions, the stress interference between fractures from adjacent wells is stronger, and the local principal stress direction and the stress field at the fracture tip are more easily disturbed, thereby altering the effective approach angle between the hydraulic fracture and natural fractures. In this case, even if the initial geometric angle of the natural fractures is the same, their influence on fracture deflection, branching, and asymmetric propagation may vary depending on the intensity of inter-well stress interference. Consequently, the interaction term between well spacing and the natural fracture angle shows a high contribution in the SHAP analysis, and the well spacing group, which includes this interaction term, ranks second in the grouped permutation importance. Together, these results indicate that fracture asymmetry is controlled not only by the natural fracture orientation alone but also by the coupled effect between inter-well stress interference and the guiding effect of natural fractures.

This result provides direct guidance for the design of multi-well fracturing parameters. For reservoirs with well-developed natural fractures, asymmetric fracture propagation cannot be controlled solely by adjusting the injection rate; the match between the natural fracture orientation and the well spacing arrangement must also be considered. It should be emphasized that the natural fracture angle itself is not a directly adjustable operational parameter but rather a constraint variable determined by geological conditions. In engineering applications, the dominant natural fracture orientation can be determined using imaging logs, core interpretations, seismic attributes, or existing geological models and used as an input condition, together with well spacing, wellbore azimuth, and injection schedule, for scheme screening. When a particular natural fracture orientation coincides with strong inter-well stress interference conditions, the difference in fracture propagation paths may be further amplified, thereby increasing the risk of asymmetric fracture propagation. Therefore, the natural fracture angle should be transformed from mere geological descriptive information into a constraint parameter in fracturing scheme evaluation, to identify high-risk well spacing combinations and optimize multi-well fracturing deployment.

In addition, the proposed model can provide rapid decision support. Once the basic geological and operational parameters are given, the model can quickly estimate the fracture asymmetry index and identify parameter combinations with a high potential for well-to-well interference. The SHAP interpretation and grouped permutation importance analysis can help engineers identify the key geological and operational factors governing asymmetric fracture propagation, rather than merely providing black-box predictions. The present model is mainly applicable to layer-scale equivalent homogeneous reservoir conditions within the current sample parameter range. It can serve as a baseline model for analyzing the influencing factors of hydraulic fracturing in more complex formations and provide a reference for future model extensions that further consider strong heterogeneity, complex natural-fracture connectivity, and dynamic field-monitoring data.

4.2. Applicability Boundaries of the Simulation-Trained Surrogate Model

Although the proposed ensemble learning framework shows good predictive stability in repeated validation, regularized control experiments, leave-one-simulation-condition-out validation, and uncertainty analysis, its applicability is still jointly limited by the data source, parameter space, and numerical simulation assumptions. The proposed model should be interpreted as a simulator-based surrogate model rather than a direct field-scale predictive model. In other words, the model learns the response relationship generated by the numerical simulator under the prescribed assumptions, rather than the full physical behavior of a real reservoir system.

The applicability of the model is limited to the numerical simulation space defined in this study. Specifically, the model is mainly applicable to the three-well sequential fracturing configuration considered here, where the well spacing ranges from 200 to 550 m, the injection rate ranges from 5 to 20 m³/min, and the natural fracture angle ranges from 30° to 90°. The prediction target is the fracture-asymmetry index η calculated from simulated fracture areas, rather than field production, microseismic responses, or directly measured fracture geometry.
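In practice, these bounds can be enforced with a simple guard before the surrogate is queried. The sketch below uses the parameter ranges quoted above; the dictionary keys and the function name `out_of_range` are hypothetical.

```python
# Parameter ranges of the simulated three-well configuration, as quoted above.
RANGES = {
    "well_spacing_m": (200.0, 550.0),
    "injection_rate_m3_per_min": (5.0, 20.0),
    "natural_fracture_angle_deg": (30.0, 90.0),
}


def out_of_range(inputs):
    """Return the names of inputs that fall outside the simulated ranges."""
    return [name for name, (lo, hi) in RANGES.items()
            if not lo <= inputs[name] <= hi]


inside = out_of_range({
    "well_spacing_m": 375.0,
    "injection_rate_m3_per_min": 12.5,
    "natural_fracture_angle_deg": 60.0,
})
outside = out_of_range({
    "well_spacing_m": 600.0,
    "injection_rate_m3_per_min": 12.5,
    "natural_fracture_angle_deg": 60.0,
})
```

A non-empty result marks an extrapolation request, for which the surrogate's prediction should not be relied upon without additional simulation or field evidence.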

It should also be noted that the augmented samples used in this study are only local perturbations and densifications around the original simulation samples. Therefore, JS-divergence-constrained data augmentation alleviates the sparsity of the training space under small-sample conditions but does not generate new geological information, field observation data, or additional physical mechanisms. The resulting predictions are therefore mainly meaningful within the range of the original simulation parameters.

The simplified reservoir assumptions adopted in the numerical simulations further affect the applicability boundary of the surrogate model. The current numerical database uses homogeneous and isotropic rock mechanical and flow parameters and simplified natural-fracture characterization and does not include thermal effects. These assumptions help reduce the dimensionality of the simulation variables and isolate the effects of well spacing, injection rate, and natural fracture angle on fracture asymmetry under controlled conditions. However, actual shale reservoirs often contain bedding structures, mechanical anisotropy, heterogeneous stress fields, and complex natural-fracture networks, all of which can introduce directional effects into fracture propagation.

First, the homogeneous and isotropic assumption may underestimate the controlling effect of bedding planes and weak interfaces on fracture paths. In strongly bedded shales, hydraulic fractures may deflect, stagnate, cross, or propagate along bedding planes or weak interfaces, rather than following the propagation pattern predicted by a homogeneous medium. Therefore, the current model may underestimate bedding-induced fracture deflection and branching while overestimating the ability of hydraulic fractures to cross layers and extend effectively.

Second, the effect of the natural fracture angle on fracture asymmetry in real reservoirs is not only a geometric-angle effect. Whether a hydraulic fracture crosses, deflects along, or activates a natural fracture after intersection is also affected by horizontal stress difference, interface friction, natural-fracture aperture, filling state, and local pore pressure. Therefore, the conclusion that the natural fracture angle is the dominant factor should be understood as a result within the preset simulation parameter space. In strongly anisotropic shale reservoirs or reservoirs with complex natural-fracture networks, this role may be overestimated or underestimated, and feature importance may shift toward other factors such as bedding orientation, stress anisotropy, or natural-fracture connectivity.

Third, anisotropy may alter the spatial pattern of inter-well interference. In the present homogeneous and isotropic model, stress shadows and pressure disturbances vary relatively regularly with well spacing. In strongly anisotropic reservoirs, however, fractures may preferentially propagate along weak planes or high-permeability directions, and stress or pressure transmission may also become direction-dependent. As a result, wells aligned with bedding or highly connected natural fractures may experience stronger interference, whereas wells arranged perpendicular to weak planes or in low-permeability directions may show weaker interactions. Thus, direct application of the current model to such reservoirs may introduce directional bias in the evaluation of well-spacing effects and inter-well interference intensity.

Furthermore, statistical uncertainty under small-sample conditions still exists. Although repeated random splits and leave-one-simulation-condition-out validation reduce the chance effects of a single data split, the limited training dataset still restricts the model’s ability to fully characterize complex nonlinear responses. Prediction errors may increase, especially near the boundaries of the parameter space or in regions with strong response variations. Therefore, the present model is more suitable for trend identification, scenario comparison, and initial risk screening within the established simulation parameter range than for unconditional prediction of field fracturing responses. Further engineering application requires validation and recalibration using richer simulation samples and field observations.

4.3. Future Improvements and Field Validation

To further enhance the engineering applicability of the model, subsequent research can be carried out in three aspects: expansion of the numerical simulation database, enrichment of input variables, and field data constraints. First, the scale of the numerical simulation sample set should be further expanded, especially by supplementing boundary conditions, regions with large prediction errors, and parameter combinations exhibiting strong response variations. Since data augmentation can only increase the training space density within the neighborhood of the original samples and cannot replace truly new simulation samples, active learning or adaptive sampling strategies can be introduced in future work so that new simulation conditions are preferentially concentrated in areas with high prediction uncertainty or more pronounced nonlinear responses.

Second, the model input features should be further enriched. The current model mainly considers well spacing, injection rate, and natural fracture angle. However, in actual multi-well fracturing processes, asymmetric fracture propagation may also be affected by factors such as horizontal principal stress difference, rock mechanical anisotropy, bedding orientation, natural fracture density, fracture aperture, fracture toughness, fracturing fluid viscosity, and completion parameters. Introducing these geological and engineering variables will help improve the model’s ability to represent complex reservoir conditions and reduce interpretation bias caused by oversimplified input variables.

Moreover, field validation remains necessary for improving model reliability. In practical applications, the surrogate model can be constrained and calibrated by integrating microseismic monitoring, offset well pressure responses, tracer responses, production performance data, and image log interpretation results. These field data do not need to completely replace numerical simulations but can serve as independent evidence to determine whether the high-risk parameter combinations identified by the model are consistent with actual fracture propagation behavior. As new numerical simulations and field observations are continuously supplemented, the proposed framework can be further developed from its current numerical-simulation-based surrogate model into a decision-support tool for multi-well fracturing scheme screening and risk assessment.

5. Conclusions

This paper constructs an interpretable ensemble learning framework for predicting fracture asymmetry in multi-well fracturing. Based on small-sample numerical simulation data, combined with JS-divergence-constrained data augmentation, second-order interaction feature construction, and a PSO-optimized ensemble learning model, the effects of well spacing, injection rate, and natural fracture angle on the asymmetric fracture response are predicted and interpreted. The main conclusions are as follows:

(1) The proposed JS-divergence-constrained data augmentation and PSO-optimized ensemble learning framework improves the prediction accuracy of fracture asymmetry under small-sample conditions. Over 50 random splits, the model achieves stable predictive performance, with mean R², NRMSE, RMSE, and MAE of 0.8484, 0.0687, 0.0079, and 0.0043, respectively, and corresponding 95% confidence intervals of 0.8179–0.8790, 0.0606–0.0768, 0.0072–0.0087, and 0.0040–0.0047. Compared with models trained only on the original data, with or without conventional regularization, the proposed method achieves higher accuracy and lower error (mean R² of 0.8484 versus 0.6125–0.6782 in Table 7), indicating that the improvement is not merely due to model complexity or regularization effects.
(2) Robustness and uncertainty analyses further demonstrate that the model has good predictive stability, but uncertainties remain near the boundaries of the sampled conditions and in regions of abrupt local response. The leave-one-simulation-condition-out validation shows that when a particular condition is completely unseen during training, the prediction error increases (ensemble R² of 0.58 in Table 8), but the model still maintains acceptable cross-condition predictive capability. Uncertainty analysis indicates that the point ensemble achieves the highest deterministic prediction accuracy, while the 95% prediction interval of Quantile GBRT covers 91.02% of the validation samples, suggesting that introducing prediction intervals under small-sample conditions helps avoid overly optimistic interpretations of model outputs. In contrast, although the bootstrap ensemble maintains high point prediction accuracy, its prediction interval coverage is only 35.94%, indicating that relying solely on the dispersion of resampled models may underestimate prediction uncertainty.
(3) Feature correlation analysis, SHAP attribution, and grouped permutation importance analysis collectively show that the natural fracture angle is the dominant factor influencing fracture asymmetry prediction, and that the interaction between well spacing and the natural fracture angle also contributes substantially. The natural fracture angle mainly controls the intersection, deflection, and propagation path of hydraulic fractures relative to pre-existing weak planes; well spacing, by altering the stress shadow intensity between adjacent well fractures, modulates the local stress environment under which the natural-fracture-guided effect operates. In contrast, the influence of injection rate and its associated interaction terms is relatively weak, indicating that within the current parameter range, the asymmetric fracture response is jointly controlled by geological structural constraints and inter-well mechanical interference.
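The JS-divergence-constrained Gaussian augmentation summarized in conclusion (1) can be sketched for a single feature as follows. This is one plausible reading, not the authors' implementation: the histogram binning, the noise scale `rel_sigma`, the threshold `js_max`, and the noise-halving retry rule are all assumptions introduced for illustration.

```python
import math
import random

def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the value lies in [0, 1]."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def histogram(values, lo, hi, bins):
    """Normalized histogram on a fixed grid; values outside [lo, hi] go to the edge bins."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1
    return [c / len(values) for c in counts]

def augment(values, copies=5, rel_sigma=0.02, js_max=0.05, bins=10, seed=0, max_tries=60):
    """Add Gaussian-perturbed copies; halve the noise until the JS constraint holds.

    Assumes the feature is not constant (max(values) > min(values)).
    """
    rng = random.Random(seed)
    span = max(values) - min(values)
    lo, hi = min(values) - 0.1 * span, max(values) + 0.1 * span
    base = histogram(values, lo, hi, bins)
    sigma = rel_sigma * span
    for attempt in range(max_tries):
        noisy = [v + rng.gauss(0.0, sigma) for v in values for _ in range(copies)]
        combined = list(values) + noisy
        if js_divergence(base, histogram(combined, lo, hi, bins)) <= js_max:
            return combined
        sigma *= 0.5  # too much distortion: retry with smaller perturbations
    raise RuntimeError("JS constraint not met; relax js_max or reduce rel_sigma")

# Well-spacing levels from Table A1, used here only as example input.
spacing = [200, 250, 300, 350, 400, 450, 500, 550]
augmented = augment(spacing)
print(len(augmented))
```

The constraint keeps the augmented distribution close to the original one, which is also why augmentation of this kind densifies the sampled neighborhood without adding information outside it.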

This study still has certain limitations. The current model is based on a limited number of numerical simulation samples; data augmentation only densifies the training space in the neighborhood of the original sample distribution and does not generate new geological information or field observations. Therefore, the model is best suited to scheme comparison within the ranges of well spacing, injection rate, and natural fracture angle defined in this paper. For conditions significantly outside the coverage of the original samples, its predictions should be interpreted cautiously and in conjunction with physical simulations and engineering experience. Future research should expand the numerical sample size, introduce additional controlling factors such as in situ stress difference, natural fracture density, reservoir heterogeneity, and treatment schedule, and calibrate the model with field data, including microseismic monitoring, offset well pressure responses, tracer responses, and production performance data.
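The applicability limit described above can be made operational with a simple range check before a prediction is used. The sketch below is illustrative: the dictionary keys and the function name are hypothetical, while the bounds are the minimum and maximum of each column in Table A1.

```python
# Ranges read from the simulation design in Table A1 (min and max of each column).
DESIGN_RANGES = {
    "well_spacing_m": (200.0, 550.0),
    "injection_rate_m3_min": (5.0, 20.0),
    "nf_angle_deg": (30.0, 90.0),
}

def out_of_range(condition):
    """Return the names of inputs that fall outside the sampled design ranges."""
    return [
        name
        for name, (lo, hi) in DESIGN_RANGES.items()
        if not lo <= condition[name] <= hi
    ]

inside = {"well_spacing_m": 300.0, "injection_rate_m3_min": 12.0, "nf_angle_deg": 50.0}
outside = {"well_spacing_m": 600.0, "injection_rate_m3_min": 12.0, "nf_angle_deg": 20.0}
print(out_of_range(inside))
print(out_of_range(outside))
```

Passing this check is necessary but not sufficient: the design in Table A1 varies the natural fracture angle mainly at an injection rate of 15 m³/min, so some combinations inside the box are still far from any simulated case.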

Author Contributions

Conceptualization, H.Z. and Y.P.; methodology, H.Z.; software, H.Z.; validation, H.Z. and Y.P.; formal analysis, H.Z.; investigation, H.Z.; data curation, H.Z.; writing—original draft, H.Z.; writing—review and editing, H.Z. and Y.P.; visualization, H.Z.; supervision, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Yanhong Peng was employed by Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Table A1 presents the complete data corresponding to Table 3 above.

Table A1. Numerical simulation design.

Scheme	Well Spacing/m	Injection Flow Rate/(m³/min)	Natural Fracture Angle/°
1	200	5	50
2	200	10	50
3	200	15	50
4	200	20	50
5	250	5	50
6	250	10	50
7	200	15	90
8	200	5	50
9	250	10	50
10	250	15	50
11	250	20	50
12	250	15	30
13	250	15	45
14	250	15	90
15	250	5	50
16	300	10	50
17	300	15	50
18	300	20	50
19	300	15	30
20	300	15	45
21	300	15	90
22	300	5	50
23	350	10	50
24	350	15	50
25	350	20	50
26	350	15	30
27	350	15	45
28	350	15	90
29	350	5	50
30	400	10	50
31	400	15	50
32	400	20	50
33	400	15	30
34	400	15	45
35	400	15	90
36	400	5	50
37	450	10	50
38	450	15	50
39	450	20	50
40	450	15	30
41	450	15	45
42	450	15	90
43	450	5	50
44	500	10	50
45	500	15	50
46	500	20	50
47	500	15	30
48	500	15	45
49	500	15	90
50	500	5	50
51	550	10	50
52	550	15	50
53	550	20	50
54	550	15	30
55	550	15	45
56	550	15	90
57	500	9	55

Figure 1. Workflow of the proposed study.

Figure 2. Hydraulic fracturing model: (a) schematic of the wells and natural fracture (NF) distribution; (b) simulated hydraulic fracture geometry.

Figure 3. Flowchart of the data augmentation procedure.

Figure 4. Machine-learning model performance: (a) measured versus predicted values for the test set; (b) test-set residuals.

Figure 5. Summary plot of predicted versus actual values for different models. (a) Ensemble, (b) GBRT, (c) ElasticNet, (d) GPR, (e) Lasso, (f) LightGBM, (g) Linear + interactions, (h) Linear, (i) RF, (j) Ridge, (k) SVR, (l) XGBoost.

Figure 6. Actual and predicted fracture-asymmetry indices with the 95% prediction interval obtained by Quantile GBRT.

Figure 7. Correlation analysis of original input variables. (a) Pearson correlation; (b) Spearman rank correlation.

Figure 8. Variance inflation factor (VIF) results for the original input variables.

Figure 9. Correlation diagnostics of second-order interaction features. (a) Pearson correlation of interaction features; (b) Spearman correlation of interaction features.

Figure 10. SHAP-based global feature importance ranking for fracture-asymmetry prediction.

Figure 11. SHAP summary plot of input features for fracture-asymmetry index prediction.

Figure 12. Bootstrap mean absolute SHAP values with 95% confidence intervals.

Figure 13. Bootstrap grouped permutation importance of original input variables.

Table 1. Measured basic model parameters.

Reservoir	True Vertical Depth (TVD)/m	Reservoir Thickness/m	Porosity/%	Permeability/mD	Minimum Horizontal Principal Stress/MPa	Maximum Horizontal Principal Stress/MPa	Vertical Stress/MPa
Upper caprock	4930	30	6	0.1	88.5	98	115
Target reservoir	4960	80	3	0.1	89.42	98.92	116.617
Lower caprock	5040	30	6	0.1	90.341	99.841	118.235

Table 2. Natural fracture parameters.

Property	Length (m)	Azimuth (deg)	Spacing (m)
Average Value	40	50	10
Standard Deviation	20	20	5

Table 3. Simplified numerical simulation design.

Scheme	Well Spacing/m	Injection Flow Rate/(m³/min)	Natural Fracture Angle/°
1	200	5	50
2	200	10	50
3	200	15	50
4	200	20	50
5	250	5	50
6	250	10	50
…	…	…	…
56	550	15	45
57	550	15	90

Table 4. List of input features for SHAP analysis.

Serial Number	Feature	Unit	Description
1	Natural fracture angle	°	Angle between the natural fractures and the direction of maximum principal stress, affecting hydraulic-fracture propagation path and morphology.
2	Well spacing × natural fracture angle	Dimensionless	Interaction term reflecting the combined effect of well spacing and natural fracture orientation on inter-well interference and fracture propagation.
3	Well spacing × injection rate	Dimensionless	Interaction term reflecting the combined effect of well layout and injection rate on reservoir stimulation and inter-well interference.
4	Well spacing	m	Distance between the centers of adjacent horizontal wells. This parameter controls interference intensity and production efficiency and is central to well-pattern optimization.
5	Injection rate	m³/min	Instantaneous fracturing-fluid injection rate, which controls the scale and conductivity of hydraulic fractures and therefore influences stimulation effectiveness.
6	Injection rate × natural fracture angle	Dimensionless	Interaction term reflecting the combined effect of injection rate and natural fracture orientation on hydraulic-fracture morphology.
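As a minimal sketch of how the second-order interaction terms in Table 4 can be built, the pairwise products of the three base inputs can be appended to each sample. The key names and the function are hypothetical. Table 4 lists the interaction terms as dimensionless, which would be consistent with scaling the inputs before multiplying; the sketch uses raw values only for readability.

```python
from itertools import combinations

def add_interactions(sample):
    """Append all pairwise products of the base inputs to a copy of the feature dict."""
    features = dict(sample)
    for a, b in combinations(sample, 2):
        features[f"{a} x {b}"] = sample[a] * sample[b]
    return features

# One illustrative condition (well spacing in m, rate in m³/min, angle in degrees).
row = {"well_spacing": 300.0, "injection_rate": 15.0, "nf_angle": 45.0}
features = add_interactions(row)
for name, value in features.items():
    print(name, value)
```

Three base inputs yield three pairwise products, giving the six features listed in Table 4.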

Table 5. Performance comparison of different regression models for fracture-asymmetry prediction.

Model	R²	RMSE	NRMSE	MAE
Ensemble model	0.874	0.0079	0.0508	0.0044
GBRT	0.866	0.0082	0.0524	0.0037
GPR	0.842	0.0089	0.0570	0.0044
RF	0.821	0.0094	0.0606	0.0064
SVR-RBF	0.780	0.0105	0.0672	0.0085
LightGBM	0.722	0.0117	0.0755	0.0079
XGBoost	0.738	0.0114	0.0733	0.0066
LR + interactions	0.446	0.0166	0.1066	0.0124
Ridge	0.437	0.0167	0.1074	0.0124
Elastic Net	0.429	0.0168	0.1081	0.0122
Lasso	0.429	0.0168	0.1082	0.0122
LR (no interactions)	0.421	0.0169	0.1089	0.0127
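The four metrics in Table 5 can be computed as in the sketch below. The normalization of NRMSE by the range of the observed values is an assumption. It is suggested by the fact that RMSE divided by NRMSE is close to 0.156 for every row of the table, which points to a fixed normalizer, but the table does not identify it.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, range-normalized RMSE, and MAE for paired values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": rmse,
        "NRMSE": rmse / (max(y_true) - min(y_true)),  # range normalization (assumed)
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

# Illustrative values only, not data from the study.
y_true = [0.0, 0.1, 0.2, 0.3]
y_pred = [0.0, 0.1, 0.2, 0.2]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```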

Table 6. Statistical summary of model performance over 50 repeated runs.

Metric	N	Mean	Std	95% CI
R²	50	0.8484	0.1075	0.8179–0.8790
NRMSE	50	0.0687	0.0285	0.0606–0.0768
RMSE	50	0.0079	0.0027	0.0072–0.0087
MAE	50	0.0043	0.0013	0.0040–0.0047
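The confidence intervals in Table 6 are consistent, up to rounding of the reported mean and standard deviation, with a Student-t interval for the mean over N = 50 runs. That the intervals were computed this way is an assumption. The sketch below reproduces the calculation from the tabulated values, with the critical value hard-coded.

```python
import math

# Two-sided 95% critical value of Student's t with 49 degrees of freedom.
T_975_DF49 = 2.0096

def mean_ci(mean, std, n, t_crit):
    """Confidence interval for a mean from its sample standard deviation."""
    half_width = t_crit * std / math.sqrt(n)
    return mean - half_width, mean + half_width

r2_lo, r2_hi = mean_ci(0.8484, 0.1075, 50, T_975_DF49)
nrmse_lo, nrmse_hi = mean_ci(0.0687, 0.0285, 50, T_975_DF49)
print(r2_lo, r2_hi)
print(nrmse_lo, nrmse_hi)
```

A normal critical value of 1.96 would give a slightly narrower R² interval (about 0.8186 to 0.8782), so the t-based form matches the table more closely.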

Table 7. Repeated-validation comparison between original-data regularization controls and the augmented interaction model.

Case	Method	R² Mean 95% CI	RMSE Mean 95% CI	NRMSE Mean 95% CI	MAE Mean 95% CI
A	Original data	0.6125 (0.5748–0.6502)	0.0186 (0.0174–0.0198)	0.1617 (0.1513–0.1722)	0.0142 (0.0132–0.0152)
B1	Original data + RidgeCV	0.6782 (0.6454–0.7110)	0.0168 (0.0157–0.0179)	0.1461 (0.1365–0.1557)	0.0125 (0.0116–0.0134)
B2	Original data + LassoCV	0.6416 (0.6069–0.6763)	0.0175 (0.0163–0.0187)	0.1522 (0.1417–0.1626)	0.0132 (0.0122–0.0142)
B3	Original data + ElasticNetCV	0.6539 (0.6202–0.6876)	0.0172 (0.0160–0.0184)	0.1496 (0.1391–0.1600)	0.0129 (0.0119–0.0139)
E	Ensemble model	0.8484 (0.8179–0.8790)	0.0079 (0.0072–0.0087)	0.0687 (0.0606–0.0768)	0.0043 (0.0040–0.0047)

Table 8. Leave-one-simulation-out validation performance of individual base learners and the ensemble model.

Model	R²	RMSE	NRMSE	MAE
GBRT	0.52	0.0152	0.1322	0.0108
RF	0.55	0.0148	0.1287	0.0105
Ridge	0.47	0.0158	0.1374	0.0112
Ensemble	0.58	0.0142	0.1235	0.01
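The leave-one-simulation-out protocol behind Table 8 can be sketched as a grouped split in which all rows derived from one simulation scheme are held out together. The group labels below are hypothetical. Whether augmented copies were grouped with their parent simulation in the study is an assumption, although it is the grouping needed for the held-out condition to be completely unseen during training.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for g in sorted(set(groups)):
        test = [i for i, label in enumerate(groups) if label == g]
        train = [i for i, label in enumerate(groups) if label != g]
        yield g, train, test

# Hypothetical labels: each row carries the ID of the simulation scheme it came from,
# so augmented copies share the label of their parent simulation.
groups = [1, 1, 1, 2, 2, 3, 3, 3]
splits = list(leave_one_group_out(groups))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Splitting rows at random instead would let perturbed copies of a held-out simulation leak into training and make the cross-condition error look smaller than it is.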

Table 9. Comparison of point prediction and prediction-interval performance.

Method	R²	RMSE	NRMSE	MAE	95%PI Coverage	Mean PI Width	PINAW
Point ensemble	0.8637	0.0082	0.0528	0.0058	—	—	—
Quantile GBRT 95% PI	0.6168	0.0138	0.0886	0.0078	0.9102	0.0515	0.3306
Bootstrap ensemble 95% PI	0.8290	0.0092	0.0592	0.0064	0.3594	0.0123	0.0792
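The interval metrics in Table 9 can be computed as in the sketch below, using the usual definition of PINAW as the mean interval width divided by the range of the observed values. In Table 9 the ratio of mean PI width to PINAW is about 0.156 for both interval methods, which is consistent with a single normalizing range. The sample values are illustrative only.

```python
def interval_metrics(y_true, lower, upper):
    """Return empirical coverage, mean interval width, and PINAW."""
    n = len(y_true)
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return {
        "coverage": covered / n,
        "mean_width": mean_width,
        "PINAW": mean_width / (max(y_true) - min(y_true)),
    }

# Illustrative values only, not data from the study.
y_true = [0.00, 0.05, 0.10, 0.20]
lower = [-0.01, 0.02, 0.11, 0.15]
upper = [0.03, 0.06, 0.15, 0.25]
pi_metrics = interval_metrics(y_true, lower, upper)
print(pi_metrics)
```

Coverage and width should be read together: a narrow interval with coverage far below the nominal 95%, as for the bootstrap ensemble in Table 9, signals underestimated uncertainty and not a sharper model.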

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Scheme	Well Spacing/m	Injection Flow Rate/(m³/min)	Natural Fracture Angle/°
1	200	5	50
2	200	10	50
3	200	15	50
4	200	20	50
5	250	5	50
6	250	10	50
7	200	15	90
8	200	5	50
9	250	10	50
10	250	15	50
11	250	20	50
12	250	15	30
13	250	15	45
14	250	15	90
15	250	5	50
16	300	10	50
17	300	15	50
18	300	20	50
19	300	15	30
20	300	15	45
21	300	15	90
22	300	5	50
23	350	10	50
24	350	15	50
25	350	20	50
26	350	15	30
27	350	15	45
28	350	15	90
29	350	5	50
30	400	10	50
31	400	15	50
32	400	20	50
33	400	15	30
34	400	15	45
35	400	15	90
36	400	5	50
37	450	10	50
38	450	15	50
39	450	20	50
40	450	15	30
41	450	15	45
42	450	15	90
43	450	5	50
44	500	10	50
45	500	15	50
46	500	20	50
47	500	15	30
48	500	15	45
49	500	15	90
50	500	5	50
51	550	10	50
52	550	15	50
53	550	20	50
54	550	15	30
55	550	15	45
56	550	15	90
57	500	9	55

Scheme	Well Spacing/m	Injection Flow Rate/(m³/min)	Natural Fracture Angle/°
1	200	5	50
2	200	10	50
3	200	15	50
4	200	20	50
5	250	5	50
6	250	10	50
…	…	…	…
56	550	15	45
57	550	15	90




