3.2. Data Analysis
The dataset comprises 200 hydraulically fractured horizontal wells from a tight-sandstone formation in the Ordos Basin. Field records (petrophysics, rock mechanics, in situ stresses, and treatment schedules) were standardized and quality-checked to form the analysis sample.
Numerous engineering variables (e.g., the upper tail of permeability, total fluid volume, and total proppant mass) display outliers and heavy tails. Such features are expected in tight-gas operations and arise from operational disturbances (screen-out management, rate escalation), measurement noise, and design variability between pads.
Given the skewed, heavy-tailed character of the field data and the presence of outliers, all continuous predictors are standardized with RobustScaler (median–IQR scaling) throughout the modeling pipeline. This choice limits the leverage of extreme values and improves numerical conditioning for scale-sensitive learners such as SVR, while remaining consistent with the largely scale-insensitive tree-based models.
3.3. Data Preprocessing
We assessed five normalization methods frequently employed in petroleum production datasets: MinMaxScaler, MaxAbsScaler, StandardScaler (z-score), RobustScaler (median–IQR), and median normalization. Their effects were quantified using divergence measures. The corresponding transformations of a feature $x$ are

$$x'_{\text{minmax}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad x'_{\text{maxabs}} = \frac{x}{|x|_{\max}}, \qquad x'_{z} = \frac{x - \mu}{\sigma},$$

$$x'_{\text{robust}} = \frac{x - x_{\text{med}}}{Q_{3} - Q_{1}}, \qquad x'_{\text{median}} = \frac{x}{x_{\text{med}}},$$

where $x_{\text{med}}$ is the median, $x_{\min}$ the minimum value, $x_{\max}$ the maximum value, $Q_{3}$ the upper quartile, $Q_{1}$ the lower quartile, $|x|_{\max}$ the maximum absolute value, $\mu$ the arithmetic mean, and $\sigma$ the standard deviation.
We calculated Kullback–Leibler (KL) and Jensen–Shannon (JS) divergences to evaluate distributional dissimilarity before and after scaling.
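As a minimal illustration of this diagnostic, the sketch below computes both divergences with SciPy; the histogram density estimate and the Gaussian fitted to the same feature as the reference are illustrative choices, not necessarily the exact procedure used in this study.

```python
import numpy as np
from scipy.stats import entropy, norm
from scipy.spatial.distance import jensenshannon

def divergence_from_normal(x, bins=50):
    """KL and JS divergence between a feature's empirical distribution and a
    Gaussian fitted to the same data (a simple non-normality diagnostic)."""
    x = np.asarray(x, dtype=float)
    p, edges = np.histogram(x, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    q = norm.pdf(centers, loc=x.mean(), scale=x.std())
    # Convert densities to discrete probabilities; a small offset avoids empty bins
    p = (p + 1e-12) / (p + 1e-12).sum()
    q = (q + 1e-12) / (q + 1e-12).sum()
    kl = entropy(p, q)               # KL(P || Q), natural logarithm
    js = jensenshannon(p, q) ** 2    # SciPy returns the JS distance; square it for the divergence
    return kl, js
```

Applying such a function to each engineering variable, before and after each candidate scaler, yields the kind of divergence comparison summarized in Figure 7.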
Tight-gas variables (rates, volumes, pressures) are generally skewed and heavy-tailed, with outliers resulting from operational disturbances and measurement noise; the KL/JS diagnostics (Figure 7) indicate significant deviations from normality. In these circumstances, z-score and min–max scaling can amplify the influence of outliers, whereas RobustScaler preserves rank order and diminishes the impact of extremes through median centering and interquartile-range scaling. Summary statistics of the dataset's distributional characteristics are given in Table 3.
Following prior petroleum ML studies that adopt median–IQR robust scaling to mitigate outliers in field measurements [13,14], we apply RobustScaler to all continuous features before model training. Our model suite combines scale-sensitive learners (support vector regression, back-propagation neural networks) with predominantly scale-insensitive tree-based models; a single RobustScaler step therefore suppresses field outliers and heavy tails (Figure 8) while improving numerical conditioning for the scale-sensitive components.
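As a minimal sketch of this preprocessing choice (scikit-learn assumed; the SVR hyperparameters shown are placeholders rather than the tuned values used in this study), the scaler can be fit inside the model pipeline so that it only sees training wells:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

# RobustScaler centers each feature on its median and divides by its IQR,
# so extreme treatment volumes or permeability values gain limited leverage.
svr_pipeline = make_pipeline(
    RobustScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.1),  # placeholder hyperparameters
)
# Fitting inside the pipeline keeps the scaler's statistics restricted to the
# training wells, avoiding leakage into validation or test folds:
# svr_pipeline.fit(X_train, y_train); y_hat = svr_pipeline.predict(X_test)
```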
3.4. Accuracy Evaluation Results of Stacking Integrated Model
To further assess how base learner choice affects the generalization of the stacking ensemble, we added a configuration that uses all candidate algorithms as base learners and constructed ten stacking models (
Table 4).
In the stacking ensembles that used SVR as the meta-learner, Stacking #3 achieved the best performance.
In the ensembles that used Random Forest as the meta-learner, Stacking #6 was the strongest, but it still underperformed the best SVR-meta configuration.
We adopt grouped nine-fold cross-validation at the pad/well level to avoid leakage across spatially correlated wells. In each outer fold, base- and meta-learner hyperparameters are tuned via PSO, the fold is evaluated, and the metrics are logged. After completing the nine folds, the models are refit on the full training set and evaluated on a held-out test set (N = 20 wells). For cross-validation, we report the fold-wise mean ± 95% CI using a t-interval based on the t(0.975, 8) quantile. On the held-out test set, Stacking #3 achieved MAE = 38.06 [34.6, 41.9], RMSE = 64.61 [58.5, 71.5], MAPE = 15.13% [13.4, 16.9], and R² = 0.923 [0.901, 0.942] [15]. The fold-wise results of Stacking #3 under nine-fold cross-validation are shown in Table 5.
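A minimal sketch of this protocol follows (scikit-learn's GroupKFold is assumed as the grouped splitter; the PSO hyperparameter search is omitted for brevity, and the helper names are illustrative):

```python
import numpy as np
from scipy.stats import t
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_absolute_error, mean_squared_error

def grouped_cv(model, X, y, groups, n_splits=9):
    """Pad/well-grouped CV: wells sharing a group id never appear in both
    the training and validation sides of a fold."""
    maes, rmses = [], []
    for tr, va in GroupKFold(n_splits=n_splits).split(X, y, groups):
        model.fit(X[tr], y[tr])          # hyperparameter tuning (PSO) omitted here
        pred = model.predict(X[va])
        maes.append(mean_absolute_error(y[va], pred))
        rmses.append(np.sqrt(mean_squared_error(y[va], pred)))
    return np.array(maes), np.array(rmses)

def t_interval(scores, alpha=0.05):
    """Fold-wise 95% CI: mean ± t(1 - alpha/2, n - 1) * s / sqrt(n)."""
    n = len(scores)
    half = t.ppf(1.0 - alpha / 2.0, n - 1) * scores.std(ddof=1) / np.sqrt(n)
    return scores.mean() - half, scores.mean() + half
```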
We substantiate these points by conducting statistical comparisons on the held-out test set using the Friedman test followed by the Nemenyi post hoc procedure. Specifically, we applied the Friedman test to the per-well ranks of the 10 ensembles on the 20-well test set (k = 10, N = 20) and then used the Nemenyi test to identify pairwise performance differences [16]. The prediction-error rankings of the stacking models are listed in Table 6 (N: number of wells; k: number of models; R̄_j: average rank of the j-th model over the N wells).
Both the Friedman statistic (χ²_F = 42.2018, with k − 1 = 9 degrees of freedom) and the Iman–Davenport adjustment (F_F = 5.819, with 9 and (k − 1)(N − 1) = 171 degrees of freedom) exceed their 0.05 critical values; accordingly, we reject the null hypothesis of equal model performance (p ≪ 0.001). Kendall's W ≈ 0.234 indicates modest rank concordance across the 20 wells (systematic but not strong); we therefore applied the Nemenyi post hoc test to identify statistically significant pairwise differences.
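For reference, these figures are consistent with the standard definitions of the test statistics for k = 10 models ranked over N = 20 wells:

$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k}\bar{R}_j^{\,2} - \frac{k(k+1)^2}{4}\right] = 42.2018, \qquad F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} = \frac{19 \times 42.2018}{180 - 42.2018} \approx 5.82,$$

$$W = \frac{\chi_F^2}{N(k-1)} = \frac{42.2018}{180} \approx 0.234.$$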
At α = 0.05, the Nemenyi critical difference is CD = 3.173; the post hoc analysis indicates that the top performer (Stacking #3; average rank 2.95) is statistically indistinguishable from Stacking #9, #2, #6, and #10, whereas the remaining ensembles are significantly worse.
Comparing the two representative ensembles, Stacking #3 and Stacking #9, we observe that the smaller, more complementary set in Stacking #3 yields lower test errors than Stacking #9. On per-well ranks, the Nemenyi test indicates that #3 vs. #9 is not significant at α = 0.05, yet the direction consistently favors #3.
This pattern reflects the bias-variance trade-off in stacking: as the number of base learners grows, the level-two feature vector (OOF predictions) expands. When added learners are highly correlated or weak, they contribute little new signal but increase collinearity and variance at the meta level, raising overfitting risk. Conversely, a compact, diverse subset (as in #3) supplies complementary errors and improves generalization.
SVR builds linear or nonlinear regression models by maximizing the margin; it handles nonlinear problems and low-noise datasets well and generalizes strongly. Random Forest, by contrast, constructs decision trees through random sampling and feature selection, performs well on high-dimensional, large-scale datasets, and is robust to missing data and noise. When used as a meta-learner, however, its inputs are not high-dimensional raw features but the base learners' predictions, so these advantages do not apply and Random Forest underperforms SVR in that role.
In addition, the results indicate that a stacking ensemble built from all available base learners performs worse than one built from a selectively chosen subset. Taking Stacking #9 as an example, its error and goodness of fit are MAE = 44.425, RMSE = 64.703, MAPE = 15.538%, and R² = 0.889, all inferior to Stacking #3. The difference arises because algorithmically similar base learners that perform relatively poorly contribute correlated, low-quality predictions to the meta-learner, which inflates the error of the final forecast.
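The construction can be sketched with scikit-learn's StackingRegressor; the base learners and hyperparameters below are illustrative placeholders, since the exact composition of Stacking #3 is defined in Table 4 and is not restated here.

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

# A compact, diverse base-learner set feeding an SVR meta-learner.
base_learners = [
    ("rf", RandomForestRegressor(n_estimators=300, random_state=0)),
    ("xgb", XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=0)),
    ("bpnn", MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)),
]
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=SVR(kernel="rbf"),  # level-two learner trained on out-of-fold predictions
    cv=5,                               # internal CV that generates the OOF meta-features
)
# stack.fit(X_train, y_train); y_hat = stack.predict(X_test)
```

Adding further, highly correlated estimators to this list widens the meta-feature vector without contributing independent signal, which is the collinearity effect discussed above.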
3.5. Comparing the Accuracy of Integrated Models and Individual Models
To further analyze the performance of stacking ensemble models relative to individual machine learning models, this paper compares the accuracy and error of the Stacking #3 and Stacking #9 ensembles against the individual models; the results are presented in Figure 9. Both ensembles use SVR as the meta-learner; Stacking #3 uses a selected subset of base learners, whereas Stacking #9 includes all base learners considered in this paper. As shown in the figure, the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) of Stacking #3 and Stacking #9 are markedly lower than those of the individual machine learning models. In addition, the R² values of the two ensembles are 0.923 and 0.889, respectively, which are 26.09% and 21.44% higher than that of the best individual base learner, XGBoost. Therefore, for predicting post-fracturing productivity of tight-gas wells, both ensembles substantially outperform the individual machine learning models.
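For reference, a minimal helper for the four error measures reported here (scikit-learn metrics assumed; the MAPE line requires strictly nonzero actual values, since it divides by the observations):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report_errors(y_true, y_pred):
    """MAE, RMSE, MAPE (%), and R² as used in the ensemble-vs-individual comparison."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0  # assumes y_true != 0
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, mape, r2
```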
These results indicate that the stacking ensemble reduces bias and improves fit relative to the individual learners.
Figure 10 compares the predicted and actual values from the different models. The predictions of Stacking #3 and Stacking #9 on the test set lie close to the y = x line, whereas the fitted lines of the individual machine learning models deviate considerably from y = x. This suggests that stacking ensembles can effectively correct the prediction biases introduced by individual machine learning models.
We conceptually assess our technique against two predominant families of industry-standard predictors.
History-oriented benchmarks (DCA/RTA): Empirical decline-curve analysis (including Arps-type models and their contemporary extensions) and rate-transient analysis are widely used once post-fracturing production data become available. These approaches are effective for surveillance and reserves estimation; however, they are inapplicable during the pre-job design phase and are sensitive to the duration and quality of early-time data.
Physics-based simulations (analytical/numerical): Pseudo-3D and 3D fracture-reservoir simulators, along with analytical post-fracture proxies, are the standard design tools. In our study area their projections deviate from data-driven results mainly because of their simplifying assumptions (homogeneous reservoir properties, uniform cluster efficiency, and idealized leakoff), whereas the analyzed interbedded sandstone–shale displays significant heterogeneity and varied cluster performance.
Our model is positioned as a design-stage, data-driven complement: given pre-job geological/geomechanical descriptors and treatment parameters, it produces rapid forecasts of early-time productivity for screening and design optimization. A dedicated interpretability study presented later further examines the learned importance pattern (water saturation, number of clusters, total fracturing fluid volume) and shows its consistency with physics-based expectations while reflecting operational variability. A comparison of these approaches is summarized in
Table 7.