1. Introduction
With growing energy demand and declining conventional oil and gas production, shale gas, an unconventional and comparatively clean fossil energy resource, has become increasingly important to the global energy sector because of its vast potential reserves [
1]. The economic viability of shale gas extraction relies in part on accurately simulating shale gas reservoirs, especially following horizontal drilling and hydraulic fracturing. Accurately evaluating the volume of recoverable reserves is critical for making business decisions, particularly in fluctuating oil price contexts [
2]. To maximize economic returns, accurate reservoir characterization and reliable production forecasting are essential. However, the presence of multiscale pore structures in the matrix, along with intricate fracture networks, complicates the simulation of gas flow and production forecasting [
3].
Accurate forecasting of shale gas production provides benefits in both economic assessments and decision-making, helping to optimize resource development and promote the sustainable growth of shale gas projects [
4]. From a resource optimization perspective, during the early production stage, a reliable production forecasting method enables companies to optimize drilling and hydraulic fracturing plans, reducing unnecessary costs and improving output productivity [
5]. From a management or decision-making perspective, accurate prediction methods give investors a basis for evaluating return on investment and help in adjusting long-term production planning and risk assessment strategies. This reduces uncertainty and errors and supports both the reliability of production plans and the feasibility of long-term projects [
6]. However, accurately predicting shale gas production remains a challenge because of the complex relationships among drilling factors, geological factors, and the gas reservoir [
7].
In recent years, researchers have developed various methods to analyze and predict the production performance of shale gas wells [
8,
9], including (1) physics-based approaches [
10,
11,
12], (2) machine learning algorithms [
13,
14,
15] and (3) the decline curve analysis (DCA) method [
16,
17,
18]. Physics-based methods involve the use of fundamental physical principles to simulate the complex processes occurring within the reservoir; machine learning models utilize geological and engineering data to capture underlying factors influencing gas production; DCA methods focus on historical production data to fit parameters that simulate production decline curves, offering simplicity and practicality, but struggle to capture complex reservoir behaviors, like fracture networks or porosity [
19,
20,
21]. Complex reservoir parameters like porosity, permeability, fracture networks, and multi-phase flow play a crucial role in determining fluid migration patterns. In shale reservoirs, these complex fracture networks and multi-phase flow effects often introduce prediction biases, making it difficult for DCA models to accurately reflect actual production behaviors. Since DCA relies on simplified decline curves, its predictive performance is limited under complex reservoir conditions [
22,
23]. Each approach has its strengths and limitations, and combining traditional methods with modern techniques can enhance predictive accuracy and robustness in shale gas production forecasting [
24].
Physics-based methods are characterized by their ability to consider the physical processes that control fluid flow and rock–fluid interactions [
25]. Physics-based models can reflect the underlying mechanisms that influence production in complex shale reservoirs and can simulate various conditions, reproducing the dynamics of flow regimes and fracture behaviors that empirical models cannot capture. However, detailed simulations often require significant computational resources and are sensitive to outliers [
26,
27,
28].
Machine learning models provide flexibility and can handle complex, nonlinear relationships but may suffer from overfitting and require large datasets, potentially lacking physical interpretability [
13]. Existing studies usually adopt geological and engineering parameters as input variables, with gas production data as the target variable. Wang et al. [
29] utilized the multi-layer perceptron (MLP) network and the long short-term memory (LSTM) network to predict shale gas production. The comparison results demonstrate that the LSTM model outperforms the MLP model, as it not only considers geological and fracturing reservoir parameters but also accounts for the time series relationships in production data. Wang et al. (2024) [
30] analyzed the production patterns of six deep coalbed methane (CBM) wells in the Ordos Basin, noting an initial decline in pressure followed by stable gas rates. They applied the LSTM model enhanced with Bayesian optimization (BO) for gas production prediction, demonstrating its effectiveness using only casing pressure and liquid level as inputs. This approach offers a valuable tool for managing and optimizing deep CBM reservoirs.
DCA is a classic production prediction method based on historical production data and is widely applied in shale gas and oil engineering fields [
31]. Compared with complicated physics-based simulations and machine learning algorithms, DCA models based on parameter linearization or direct curve regression are much more straightforward and efficient. Furthermore, unlike some machine learning algorithms, DCA provides explicit mathematical formulations that are easily interpretable, and the development of mathematical models requires only gas production data.
Among the various DCA models utilized in the industry, the Arps model [
32], the Duong model [
33], and the logistic growth model [
34] are the most commonly employed. The Arps decline equation, which originates from the concept of loss ratio, has gained widespread acceptance due to its simplicity and practicality. However, this model often yields overly optimistic production estimates because it assumes boundary-dominated flow conditions, an assumption that does not hold true for shale gas wells. Shale formations typically exhibit transient flow regimes over extended periods due to their low permeability [
35,
36].
To address the limitations of traditional DCA models, researchers have developed advanced models to enhance prediction accuracy under various production conditions. For example, the Stretched Exponential Decline (SEPD) model extends conventional decline analysis by accounting for more complex flow behaviors, which are often present in heterogeneous shale reservoirs [
37,
38]. The Fractional Decline Curve (FDC) model, on the other hand, introduces fractional calculus to better represent the irregular diffusion of fluids within shale formations, providing a more accurate prediction of production rates over time [
39,
40].
Despite these advancements, no single-DCA model has proven consistently effective across all production environments due to the inherent variability and complexity of shale gas reservoirs. Therefore, we propose an Improved DCA approach that leverages ensemble learning to combine the strengths of multiple models. This method integrates machine learning with traditional DCA, aiming to enhance forecasting accuracy and robustness by drawing on the complementary advantages of each model.
This paper proposes an Improved DCA-based prediction method for shale gas production using historical production data. The method’s structure consists of several base empirical DCA models and a meta-model. In this study, base empirical DCA models refer to individual DCA models that serve as foundational components within the ensemble learning framework. The meta-model combines and weighs the predictions of each base model, leveraging their strengths to enhance prediction accuracy and robustness. First, each base empirical DCA model is fitted to the historical production data of shale gas wells to obtain the corresponding fitting parameters. To address the limitations of individual DCA models, we employ a meta-learner to establish an ensemble framework that integrates the outputs of multiple DCA models. Finally, we validate the effectiveness of the proposed method using historical production data from 22 shale gas wells in region L, China, and evaluate the model’s performance based on three evaluation metrics: mean absolute error (MAE), mean squared error (MSE), and R-squared. The experimental results demonstrate that the ensemble-based DCA method not only outperforms other prediction models across the three evaluation metrics but also shows enhanced robustness and stability across all samples.
2. Methodology
In this paper, we propose an Improved DCA model integrating traditional models and ensemble learning techniques. The architecture of the proposed Improved DCA framework can be seen in
Figure 1.
Step 1: We collect raw production data from shale gas wells, including daily gas production rates and operational parameters (daily production duration in hours). In the preprocessing stage, we eliminate anomalous data points to improve data quality and extract the decline phase of the production data for analysis.
Table 1 provides an example of the raw production data collected from shale gas wells. The table includes key fields such as the well number, production days, daily production duration (in hours), and daily gas production (in units of
). Production days represent the number of days each well has been actively producing gas, while daily production duration (hours) indicates the number of hours that each well produces gas per day. This table format serves as the initial data structure before preprocessing and analysis.
Step 2: In the data preprocessing stage, we need to remove outliers. In actual production, shale gas wells sometimes require shutdown operations, which can result in shorter production times and lower gas production on shutdown days compared to normal operation. We defined these data points as outliers and excluded them. Additionally, we renumbered the production days of each gas well accordingly. After preprocessing, we develop three base empirical DCA models to capture different aspects of the production decline behavior: SEPD model, Duong model and Logistic Growth Model (LGM). Each model is independently fitted to the historical production data to obtain the corresponding fitting parameters and generate initial production forecasts.
Step 3: To overcome the limitations of individual DCA models, we adopt an ensemble learning method to develop the Improved DCA method. Here, we select the Random Forest algorithm as the meta-learner due to its robustness and ability to model nonlinear relationships. We then feed the outputs of the three base empirical DCA models into the meta-learner as input features. The operation process of the ensemble model is illustrated in
Figure 2.
After fitting the LGM, Duong, and SEPD models to historical production data from shale gas wells, we use the fitted results from these models at each time $t$ as input features for the Random Forest model. The Random Forest model constructs $n$ decision trees based on the fitted outputs of the LGM, Duong, and SEPD models at each time $t$ and then aggregates the predictions from these $n$ decision trees to produce the final output. Additionally, we provide the detailed steps for model fitting and experiments based on simulated data in
Appendix A.
Step 4: The ensemble model (meta-learner) is trained to minimize the loss function and produce the final prediction results. First, the ensemble model adjusts its parameters to minimize the difference between its predictions and the actual production data, typically using a suitable loss function such as MSE or MAE. After training, the ensemble model outputs the aggregated prediction results, which serve as the final forecast for shale gas production.
This methodology incorporates the strengths of both traditional DCA models and ensemble learning to enhance the predictive accuracy and stability of shale gas production forecasting. The details of the proposed improved framework are listed in Algorithm 1.
Algorithm 1: Improved Decline Curve Based on Ensemble Learning
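The outlier removal and day renumbering described in Step 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `DailyRecord` type, the `preprocess` function, and the `min_hours` threshold of 24 hours are all assumptions introduced here, and the extraction of the decline phase is not covered.

```python
from dataclasses import dataclass


@dataclass
class DailyRecord:
    well: str     # well number
    day: int      # production day index in the raw data
    hours: float  # daily production duration (hours)
    gas: float    # daily gas production (same units as the raw data)


def preprocess(records, min_hours=24.0):
    """Drop shutdown days and renumber production days per well.

    min_hours is an assumed threshold: a day is kept only if the well
    produced for at least this many hours.
    """
    cleaned = []
    next_day = {}  # well -> next consecutive production day number
    for rec in sorted(records, key=lambda r: (r.well, r.day)):
        if rec.hours < min_hours:
            continue  # treated as an outlier (shutdown day)
        day = next_day.get(rec.well, 1)
        next_day[rec.well] = day + 1
        cleaned.append(DailyRecord(rec.well, day, rec.hours, rec.gas))
    return cleaned


# Hypothetical records, for illustration only.
raw = [
    DailyRecord("W1", 1, 24.0, 10.0),
    DailyRecord("W1", 2, 6.0, 2.1),   # shutdown day, removed
    DailyRecord("W1", 3, 24.0, 9.4),
    DailyRecord("W2", 1, 24.0, 7.0),
]
clean = preprocess(raw)
```

After this step, the remaining days of each well are numbered consecutively from 1, so the decline models see a gap-free time axis.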
2.1. Stretched Exponential Production Decline Model
The Stretched Exponential Production Decline (SEPD) model was proposed to overcome the limitations of the traditional Arps model [
41]. The SEPD model assumes that the production rate declines according to a stretched exponential decay function, which captures the complexities of transient flow in unconventional oil and gas reservoirs [
42].
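The SEPD model is based on the premise that the production rate $q$ at time $t$ follows a stretched exponential decay function, which is as follows:
$$\frac{dq}{dt} = -n \left(\frac{t}{\tau}\right)^{n} \frac{q}{t}$$
where $n$ is the stretch exponent, which describes the deviation from a standard exponential decline, and $\tau$ is the characteristic time constant, reflecting the inherent time scale of the production decline.
By integrating this equation, the production rate at any time $t$ is given by
$$q(t) = q_i \exp\left[-\left(\frac{t}{\tau}\right)^{n}\right]$$
where $q_i$ is the initial production rate at $t = 0$. Based on the stretched exponential decay function and the production rate equation, the cumulative production $Q$ at time $t$ can be derived:
$$Q(t) = \frac{q_i \tau}{n} \left\{ \Gamma\left[\frac{1}{n}\right] - \Gamma\left[\frac{1}{n}, \left(\frac{t}{\tau}\right)^{n}\right] \right\}$$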
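As a numerical check on the SEPD relations, the sketch below evaluates the rate $q(t) = q_i \exp[-(t/\tau)^n]$ and integrates it with the trapezoidal rule. Because the Python standard library has no incomplete gamma function, the comparison uses only the long-time limit of the cumulative production, $q_i \tau \Gamma(1/n) / n$. The function names and parameter values are illustrative assumptions, not values from this study.

```python
import math


def sepd_rate(t, q_i, tau, n):
    """SEPD production rate q(t) = q_i * exp(-(t / tau) ** n)."""
    return q_i * math.exp(-((t / tau) ** n))


def sepd_cumulative_numeric(t_end, q_i, tau, n, steps=20000):
    """Cumulative production from 0 to t_end by the trapezoidal rule."""
    h = t_end / steps
    total = 0.5 * (sepd_rate(0.0, q_i, tau, n) + sepd_rate(t_end, q_i, tau, n))
    for k in range(1, steps):
        total += sepd_rate(k * h, q_i, tau, n)
    return total * h


def sepd_long_time_limit(q_i, tau, n):
    """Limit of the cumulative production as t grows without bound."""
    return q_i * tau / n * math.gamma(1.0 / n)


# Hypothetical parameters, chosen only to exercise the formulas.
q_i, tau, n = 10.0, 50.0, 0.5
limit = sepd_long_time_limit(q_i, tau, n)
numeric = sepd_cumulative_numeric(20000.0, q_i, tau, n)
```

For a sufficiently long horizon, `numeric` approaches `limit`, which gives a quick sanity check on fitted values of $q_i$, $\tau$, and $n$.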
2.2. Duong Decline Model
The Duong decline model, proposed by Duong [33], differs from traditional decline models in that it accounts for the prolonged transient flow period, making it particularly effective in modeling the early production phase, where hyperbolic or exponential decline models may not provide an accurate fit. The model is expressed as
$$q(t) = q_i \, t^{-m} \exp\left[\frac{a}{1-m}\left(t^{1-m} - 1\right)\right]$$
where $q_i$ is the initial production rate (the rate at $t = 1$), $a$ is the decline constant that adjusts the overall decline rate, and $m$ is the decline exponent, which shapes the curve and controls the transition between the transient and boundary-dominated flow. Based on the formula for daily production, the cumulative production at time $t$ can be expressed as
$$Q(t) = \frac{q_i}{a} \exp\left[\frac{a}{1-m}\left(t^{1-m} - 1\right)\right]$$
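The consistency between the Duong rate and cumulative expressions can be checked numerically. The sketch below assumes the standard Duong forms $q(t) = q_i t^{-m} \exp[a (t^{1-m} - 1)/(1-m)]$ and $Q(t) = (q_i/a) \exp[a (t^{1-m} - 1)/(1-m)]$, and verifies that the derivative of $Q$ matches $q$. The function names and parameter values are hypothetical.

```python
import math


def duong_rate(t, q_i, a, m):
    """Duong production rate; q_i is the rate at t = 1."""
    return q_i * t ** (-m) * math.exp(a / (1.0 - m) * (t ** (1.0 - m) - 1.0))


def duong_cumulative(t, q_i, a, m):
    """Duong cumulative production, whose time derivative is duong_rate."""
    return q_i / a * math.exp(a / (1.0 - m) * (t ** (1.0 - m) - 1.0))


# Hypothetical parameters, chosen only to exercise the formulas.
q_i, a, m = 100.0, 1.2, 1.15
t, dt = 200.0, 1e-3

# Central difference approximation of dQ/dt at time t.
slope = (duong_cumulative(t + dt, q_i, a, m)
         - duong_cumulative(t - dt, q_i, a, m)) / (2.0 * dt)
rate = duong_rate(t, q_i, a, m)
```

Here `slope` and `rate` should agree to within the finite-difference error, which confirms that the two expressions describe the same decline curve.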
2.3. Logistic Growth Model
LGM was first proposed to model the regrowth of livers. Since then, LGM has been widely used in the research fields of biological population reproduction, organ growth and economic analysis, but there are certain differences in the models used in different fields [
43,
44].
Many studies have shown that the generalized LGM is flexible. It can fit many types of curves, including hyperbolic decline curves. Clark et al. [
34] proposed applying the LGM in the production decline analysis of unconventional gas wells and combined the production characteristics of gas wells to establish a cumulative gas production formula, which has been proven applicable for fitting the production of unconventional gas wells [
45]. Here, for LGM, we have
$$Q(t) = \frac{K t^{n}}{a + t^{n}}$$
where $Q$ represents the cumulative gas production of the well, $K$ denotes the carrying capacity, $n$ is the hyperbolic exponent, and $a$ is the constant parameter that controls the shape of the curve.
The daily gas production $q$ can be derived from the cumulative gas production equation, which can be expressed as
$$q(t) = \frac{dQ}{dt} = \frac{K n a \, t^{n-1}}{\left(a + t^{n}\right)^{2}}$$
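A short sketch of the LGM relations follows, assuming the form $Q(t) = K t^n / (a + t^n)$ from Clark et al. and its derivative $q(t) = K n a t^{n-1} / (a + t^n)^2$. It checks that the rate is the derivative of the cumulative curve and that $Q$ approaches the carrying capacity $K$ at long times. The function names and parameter values are illustrative assumptions.

```python
def lgm_cumulative(t, K, a, n):
    """LGM cumulative production Q(t) = K * t**n / (a + t**n)."""
    return K * t ** n / (a + t ** n)


def lgm_rate(t, K, a, n):
    """LGM daily production, the time derivative of lgm_cumulative."""
    return K * n * a * t ** (n - 1.0) / (a + t ** n) ** 2


# Hypothetical parameters, chosen only to exercise the formulas.
K, a, n = 5000.0, 80.0, 0.9
t, dt = 300.0, 1e-3

# Central difference approximation of dQ/dt at time t.
slope = (lgm_cumulative(t + dt, K, a, n)
         - lgm_cumulative(t - dt, K, a, n)) / (2.0 * dt)
rate = lgm_rate(t, K, a, n)

# At very long times the cumulative production approaches K.
late_fraction = lgm_cumulative(1e12, K, a, n) / K
```

The carrying capacity $K$ therefore acts as an upper bound on the recoverable volume implied by a fitted LGM curve.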
2.4. Ensemble Learning
Ensemble learning is a powerful technique used to improve the predictive performance of machine learning models by combining the predictions of multiple base learners. The core idea behind ensemble learning is that, by leveraging the strengths of several models, the final prediction will be more accurate and robust than any single model [
46].
In this study, we adopt multiple DCA models as base learners and the Random Forest [
47] machine learning model as the meta-learner to combine the results of the base models. We then develop an improved hybrid prediction method with enhanced performance compared with a single-DCA model. The improved prediction method integrates predictions from various DCA models and passes them to a meta-learner, which produces a final output by capturing information that individual DCA models could not extract.
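The stacking idea can be sketched with scikit-learn's `RandomForestRegressor` as the meta-learner. This is an illustration under stated assumptions, not the authors' pipeline: the observations are synthetic, the base-model parameters are hard-coded stand-ins for values that would come from curve fitting, and the comparison is in-sample only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def sepd(t, q_i, tau, n):
    return q_i * np.exp(-((t / tau) ** n))


def duong(t, q_i, a, m):
    return q_i * t ** (-m) * np.exp(a / (1.0 - m) * (t ** (1.0 - m) - 1.0))


def lgm(t, K, a, n):
    return K * n * a * t ** (n - 1.0) / (a + t ** n) ** 2


# Synthetic "observed" rates (assumption): an SEPD curve with 3% noise.
t = np.arange(1.0, 401.0)
rng = np.random.default_rng(0)
observed = sepd(t, 10.0, 60.0, 0.6) * (1.0 + 0.03 * rng.standard_normal(t.size))

# Base-model outputs at each time t. The parameters are hypothetical
# placeholders for values obtained by fitting each model to the data.
features = np.column_stack([
    sepd(t, 9.5, 55.0, 0.6),
    duong(t, 9.0, 1.1, 1.2),
    lgm(t, 900.0, 70.0, 0.9),
])

# Meta-learner: maps the three base predictions to the observed rate.
meta = RandomForestRegressor(n_estimators=100, random_state=0)
meta.fit(features, observed)
ensemble = meta.predict(features)


def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))
```

Because the meta-learner is evaluated on the same points it was trained on, a low `mae(observed, ensemble)` here only shows that the mechanics work; a real evaluation would hold out part of each well's history.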
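Let the base learners be denoted as $f_1, f_2, \ldots, f_m$, where $m$ is the number of base models, and let the output of the ensemble model $\hat{y}$ be expressed as a combination of the base learners,
$$\hat{y} = g\left(f_1(t), f_2(t), \ldots, f_m(t)\right)$$
where $g$ is the function learned by the meta-learner, which can be trained using a loss function $L$. The function $g$ can be determined by minimizing the error of the ensemble model. This is typically formulated as an optimization problem
$$\min_{g} \sum_{i=1}^{N} L\left(y_i, \hat{y}_i\right)$$
where $N$ is the number of data points, $y_i$ is the actual production for the $i$-th data point, and $\hat{y}_i$ is the ensemble prediction for the $i$-th data point. $L$ is the loss function, where we utilize the mean absolute percentage error (MAPE) formula as follows:
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$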
The MAPE loss function provides an intuitive measure of prediction accuracy in percentage terms, making it easily interpretable in practical applications. Additionally, MAPE is scale-independent, which is crucial for handling the wide range of production rates in shale gas reservoirs.
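For reference, MAPE and the three evaluation metrics used in this paper (MAE, MSE, and R-squared) can be computed as in the sketch below. The function names and sample values are illustrative assumptions. Note that MAPE is undefined when an observed value is zero, so zero-production days must be excluded beforehand.

```python
def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def r_squared(y, y_hat):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def mape(y, y_hat):
    """Mean absolute percentage error, in percent; requires nonzero y."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


# Hypothetical observed and predicted values, for illustration only.
y_true = [2.0, 4.0, 5.0]
y_pred = [2.5, 3.0, 5.0]
scores = {
    "MAE": mae(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "R2": r_squared(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
}
```

MAE and MSE depend on the units of production, whereas MAPE and R-squared are dimensionless, which is why the latter are easier to compare across wells with very different rates.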
4. Results and Discussion
In this study, we developed various models based on historical production data from 22 shale gas wells located in region L. The performance of these models was evaluated using three metrics. To enhance the reliability of our experiments, we introduced three baseline models for comparison with the proposed Improved DCA models. The average metric scores of the various models across the 22 wells are presented in
Table 4 and
Figure 3.
As depicted in
Table 4 and
Figure 3, the Improved DCA model demonstrates optimal average metric scores in terms of MAE, MSE, and $R^2$, outperforming the other six traditional DCA models. Specifically, the improved model significantly reduces both the MAE and MSE metric scores, achieving an MAE of 0.0660, an MSE of 0.0272, and an $R^2$ value of 0.9882, indicating higher accuracy in predicting production rates. The high $R^2$ value further confirms that the model explains a substantial portion of the variance in the observed data.
LGM demonstrated robust performance in predicting gas production rates. Its average scores, while worse than those of the Improved DCA model, surpassed those of all other traditional models evaluated, with an MAE of 0.3572, an MSE of 0.3615, and an $R^2$ value of 0.8424.
The baseline models (PLE, Arps, and EEDC) exhibited limitations in capturing the complexities of shale gas production: PLE and Arps could not adapt to changing production dynamics, and EEDC produced less accurate predictions than LGM and the Improved DCA model.
When compared with the three base learner DCA models (Duong, LGM, and SEPD), the performance of the Improved DCA model achieved better scores, suggesting that the proposed ensemble learning framework could enhance the overall efficiency of the models. This outcome implies that integrating multiple models through ensemble learning effectively captures the unique characteristics of the production data, leading to better predictive performance.
To further analyze the proposed model and the baseline models in terms of fitting speed and fitting accuracy across different data sizes, we designed and conducted experiments with varying sample sizes. The specific procedure is as follows: (1) Based on the total production days of each well in the original data, wells are categorized into subsets with specific intervals: 501–1000 days, 1001–1500 days, 1501–2000 days, 2001–2500 days, and 2501–3000 days. See
Table 5 for details. (2) The proposed Improved DCA model and the baseline models are fit on each subset. (3) Each model’s MAE, MSE, R-squared and time consumption across different sample sizes are calculated.
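Step (1) of this procedure, grouping wells by total production days, can be sketched as follows. The interval bounds are the ones listed above; the well names and day counts are hypothetical and do not correspond to the wells in Table 5.

```python
from collections import defaultdict

# Production-day intervals used to define the subsets (inclusive bounds).
INTERVALS = [(501, 1000), (1001, 1500), (1501, 2000), (2001, 2500), (2501, 3000)]


def bucket_wells(days_by_well):
    """Group wells by the interval containing their total production days."""
    groups = defaultdict(list)
    for well, days in days_by_well.items():
        for low, high in INTERVALS:
            if low <= days <= high:
                groups[(low, high)].append(well)
                break  # each well belongs to at most one interval
    return dict(groups)


# Hypothetical wells and production-day counts, for illustration only.
days_by_well = {"W1": 730, "W2": 1490, "W3": 2600, "W4": 980}
subsets = bucket_wells(days_by_well)
```

Each model can then be fit and timed per subset, for example with `time.perf_counter()` around the fitting call, to obtain the time consumption reported alongside the accuracy metrics.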
Table 6 presents a comparative analysis of the proposed Improved DCA model and several baseline models (including SEPD, Duong, LGM, EEDC, Arps, and PLE) in terms of fitting accuracy and computational efficiency across different sample sizes. By examining metrics such as MSE, MAE, R-squared, and time consumption, the performance variations of these models with increasing data size are evaluated.
Overall, the Improved DCA model demonstrates superior performance, consistently achieving lower MSE and MAE values across all sample sizes. For instance, with a sample size of 501–1000 days, the Improved DCA model achieves an MSE of 0.0131 and an MAE of 0.0498, whereas SEPD records an MSE of 0.1730 and an MAE of 0.2695. This trend persists with larger sample sizes, underscoring the Improved DCA’s accuracy advantage over the baseline models. As sample sizes increase, the Improved DCA model maintains a stable level of accuracy, while baseline models, particularly SEPD and Duong, experience significant increases in error. For example, with a sample size of 2501–3000 days, the Improved DCA model shows an MSE of 0.0242, while the Duong model’s MSE rises to 0.7743. This trend demonstrates that the Improved DCA model is more resilient to increasing data volume, whereas baseline models struggle to maintain accuracy under these conditions.
In terms of computational time, as shown in
Figure 4, the Improved DCA model incurs slightly higher time costs than some baseline models; however, it consistently requires less time than the PLE model across all sample sizes.
In addition, to exclude the impact of the data distribution imbalance, such as the presence of outliers, we conducted a comprehensive analysis of the performance of all models across the 22 shale gas wells. We calculated the scores for the three metrics and performed data visualization to provide a clearer insight into the models’ performance. The details of this analysis are illustrated in
Figure 5.
As depicted in
Figure 5, the horizontal axis represents the well numbers, while the vertical axis denotes the scores of the corresponding metrics. Different-colored lines in the visualization indicate the performance of various models. The Improved DCA model achieved the best metric scores across all wells, indicating superior predictive accuracy compared to other models. Furthermore, its scores for different wells are relatively close to each other, exhibiting a smaller degree of dispersion. This consistent performance suggests that the proposed model is more stable and robust than single traditional models. The reduced variability in the scores further reinforces the reliability of the Improved DCA model in accurately estimating production rates across varying well conditions.
In contrast, the performance of traditional DCA models across the 22 shale gas wells exhibited significant variability, indicating a lack of stability. For example, the Duong model demonstrated strong performance with an MSE score close to 0.2 for well number 6. However, this model’s predictive capability declines sharply for well number 19, where the MSE approaches 1.2. This discrepancy underscores the inherent challenges of using traditional DCA models, which can lead to unreliable predictions under varying conditions. The substantial difference in MSE scores highlights the necessity for more robust models that can maintain consistent performance across diverse production scenarios.
The results of all experiments demonstrate that the Improved DCA method, which combines ensemble learning with traditional models, enhances the ability to capture the dynamic changing trends between shale gas production and production times and compensates for the weaknesses of single-DCA models that cannot describe all fields. Specifically, the comprehensive analysis across all wells indicates that the proposed method also exhibits better robustness and stability compared to single-DCA models.
5. Conclusions
This study presents an Improved DCA method for shale gas production forecasting through ensemble learning. Integrating base empirical DCA models and using a meta-learner overcomes the limitations of conventional single-DCA models. For instance, the Arps model performs well under boundary-dominated flow conditions, typically observed in later production stages, but it often fails to capture the transient flow in the early stages due to its assumption of a constant decline rate [
50]. In contrast, the Duong model is suited for early production phases characterized by fracture-dominated flow in unconventional reservoirs, but its accuracy may decline as production stabilizes [
33]. The SEPD model addresses variability and heterogeneity within unconventional reservoirs, making it useful for highly fractured networks; however, it requires substantial historical data to estimate decline parameters reliably, which may limit its accuracy in wells with shorter production histories [
51].
Through evaluating the proposed method and six baseline models based on 22 shale gas wells in region L, China, we draw several conclusions.
(1) To overcome the limitations of traditional single-DCA models, this study proposes an Improved DCA method using ensemble learning for forecasting shale gas production. The proposed method integrates multiple base empirical DCA models and combines their outputs through a meta-learner. This approach effectively addresses the shortcomings of traditional DCA methods, such as their inability to capture complex production dynamics and handle uncertainties in the data.
(2) The proposed method was tested on historical production data from 22 shale gas wells in region L, China. The experimental results demonstrated that the Improved DCA model consistently outperforms traditional models such as the Arps, EEDC, and PLE models. Specifically, the improved model achieved an MAE of 0.0660, an MSE of 0.0272, and an $R^2$ value of 0.9882, all of which are significantly better than the scores obtained by the best-performing traditional model, LGM. Furthermore, a detailed comparative analysis across all gas wells confirmed the stability and robustness of the improved model. The model consistently delivered superior performance across various scenarios, highlighting its capacity to capture the complex dynamics of shale gas production.
(3) By leveraging the unique advantages of each base model, the ensemble approach significantly enhances the overall prediction accuracy. This hybrid method captures the nuances of production dynamics that single-DCA models may miss, leading to more reliable long-term forecasts. The results of this study demonstrate that combining single-DCA models with ensemble learning could balance simplicity and accuracy, making it highly applicable for real-world shale gas production forecasting and optimization.
While the Improved DCA model has demonstrated effectiveness in capturing shale gas production dynamics, several limitations remain. The model’s accuracy is highly dependent on data, requiring a substantial amount of historical data to reliably estimate decline parameters. In cases where production data are sparse or cover only a short period, the model’s performance may be impacted. Additionally, while the ensemble approach enhances accuracy across various production phases, it also increases computational demands, which may limit its practicality for real-time applications in large-scale deployments. To address these limitations, future work could focus on incorporating additional geological data inputs, such as total organic carbon (TOC) and porosity, to improve model flexibility and generalizability. Reducing the computational load by simplifying the ensemble model through feature selection or dimensionality reduction could also make it more suitable for real-time monitoring. These enhancements aim to expand the model’s applicability across diverse production environments and improve its responsiveness to evolving production conditions.