Application of the TPE-XGBoost Model in Predicting Breakdown Pressure for Horizontal Drilling Based on Physical Constraints

Wang, Haibiao; Pang, Mingyue; Yuan, Zheng; Dong, Changyin; Xu, Fengxiang; Xin, Yicheng

doi:10.3390/pr14040630

by Haibiao Wang ^1,2, Mingyue Pang ^1,2, Zheng Yuan ^1,2, Changyin Dong ^3,*, Fengxiang Xu ^1,2 and Yicheng Xin ^1,2

¹ China Oilfield Services Limited, Tianjin 300459, China
² National Key Laboratory of Offshore Oil and Gas Exploitation, Beijing 102209, China
³ School of Petroleum Engineering, China University of Petroleum (East China), Qingdao 266000, China
* Author to whom correspondence should be addressed.

Processes 2026, 14(4), 630; https://doi.org/10.3390/pr14040630

Submission received: 14 January 2026 / Revised: 2 February 2026 / Accepted: 9 February 2026 / Published: 11 February 2026

(This article belongs to the Topic Petroleum and Gas Engineering, 2nd edition)


Abstract

Horizontal well fracturing serves as a critical technology for enhancing production from tight sandstone gas reservoirs, where accurate prediction of formation breakdown pressure is essential for optimizing fracture design and improving stimulation effectiveness. This study proposes a novel fusion-driven workflow for predicting breakdown pressure in horizontal wells by synergistically integrating physics-based mechanistic modeling with data-driven machine learning. The approach overcomes the computational limitations of conventional analytical models and mitigates the data scarcity constraints inherent in purely empirical methods by using high-fidelity mechanistic simulations to generate physically consistent training samples. Results demonstrate that the hybrid dataset, with an optimal fusion ratio of 1:1.5 between field data and mechanistic-derived samples, yields the highest predictive accuracy. The proposed model, built on an XGBoost algorithm whose hyperparameters are efficiently optimized via a tree-structured Parzen estimator (TPE), exhibits superior generalization capability and robustness, achieving an average prediction error of 7.45% on unseen well data. This work confirms that the fusion framework provides a reliable and practical tool for breakdown pressure prediction in cased horizontal wells, which can directly support the design and implementation of efficient and sustainable fracturing operations in tight gas reservoirs.

Keywords: tight sandstone reservoirs; breakdown pressure prediction; mechanism-based modeling; machine learning integration; fracturing optimization

1. Introduction

The Surig tight gas reservoir in the Ordos Basin, China, is characterized by low porosity, low permeability, low pressure, and low gas abundance [1,2]. The main producing layer, the Shan 1 section, has an average porosity of 8.0% and an average permeability of 0.503 × 10⁻³ µm², while the Box 8 section shows an average porosity of 8.9% and an average permeability of 0.782 × 10⁻³ µm². The reservoir exhibits rapid lateral variations and multilayer vertical development, which complicates the large-scale deployment of horizontal wells. The formation breakdown pressure is a critical parameter in designing and executing hydraulic fracturing, so accurate prediction is essential for effective fracturing design and for improving operational success [3]. Breakdown pressure is typically determined either by empirical formulae based on mechanism models or by machine learning models driven by field data. Breakdown pressure modeling dates back to 1957, when Hubbert and Willis proposed a calculation method for vertically fractured open-hole wells based on the theory of tensile failure of a linear elastic wellbore wall [4]. Later, Haimson et al. introduced the H-F model for permeable poroelastic formations with isotropic rock properties [5], while Eaton et al. adjusted this model [6,7]. Q. Dong’s model considered the impact of non-uniform horizontal tectonic stresses on breakdown pressure [8], and Li and colleagues examined the impact of perforation dimensions on breakdown pressure in vertical wells [9]. Liu and collaborators proposed a model for fracture initiation pressure in inclined wells with naturally fractured rocks [10]. Wang et al. further enhanced the breakdown pressure model by incorporating fluid–rock interaction, fracture development, and anisotropic effects [11,12,13,14,15,16].
Despite these advances, calculating breakdown pressure remains challenging due to the intricate stratigraphic conditions in tight sandstones, and the complexity of the mechanism models leads to inefficient iterative calculations. Furthermore, no universally applicable mechanism model exists for accurate and efficient breakdown pressure prediction.

In recent years, the use of machine learning in the oilfield industry has surged, fueled by big data and computing power [17,18]. Scholars have shifted from mechanism models to data-driven approaches, building relationships between geological parameters and breakdown pressure from real field data. Algorithms such as radial basis function networks, BP neural networks, and multivariate linear regression have been used to develop prediction models for breakdown pressure [19,20,21,22]. Gao et al. proposed the MMGPT4LF model, which optimizes the GPT-2 architecture and the multimodal cross-attention mechanism to improve load forecasting. Chen et al. proposed a motion–appearance decoupling representation method to solve the problem of event camera representation learning. Wen et al. designed an adaptive degradation perception self-instruction model to achieve integrated restoration of various types of weather-degraded images. These studies have advanced their respective fields, but there is still room for optimization, which provides an opportunity for this research [23,24,25]. Tariq et al. combined rock mechanics experiments with machine learning models for breakdown pressure prediction [26], and Zhaohui Lu et al. applied neural networks to predict breakdown pressure in horizontal wells [27]. However, overfitting remains a challenge in data-driven models that rely solely on field data.

Although machine learning has shown potential in predicting fracturing pressures, purely ‘black box’ data-driven models are often met with caution by the engineering community because they lack physical interpretability. To address this, a series of hybrid methods that combine the advantages of physical mechanisms and data-driven approaches have been proposed. These methods fall broadly into two categories. The first integrates physical laws as soft or hard constraints into the model training process (as in physics-informed neural networks) to ensure that the model outputs conform to basic physical principles. The second integrates multi-source information when generating the training data, to construct a more comprehensive and representative dataset. Most current research focuses on the former, and less attention has been paid to how to systematically construct a high-quality training set that integrates mechanisms and field data. This paper aims to fill this gap by proposing a systematic data fusion and enhancement strategy. Our core innovation lies in generating a mixed dataset by coupling numerical simulation with field data. The dataset itself embodies both physical mechanisms and engineering reality, laying the foundation for training a more robust and reliable prediction model.

Drawing from the “data-driven + physical guidance” approach proposed by Academician Li Yang at the 3rd China Petroleum and Petrochemical Intelligent Technology Exchange Conference, this paper proposes a fusion-driven model [28]. This hybrid approach seeks to overcome the limitations of both methods. We establish a breakdown pressure computational model for horizontal wells in the Surig block and use machine learning to build a data mining model for breakdown pressure. By generating diversified mechanism samples from the breakdown pressure mechanism model, we constrain the machine learning algorithm to improve the accuracy and generalization ability of the breakdown pressure prediction model. This approach provides a decision-making tool for the design and execution of horizontal well fracturing. The specific technical roadmap is shown in Figure 1.

2. Study on Predicting Breakdown Pressure Using a Mechanistic Model-Based Approach

Hossain et al. employed the superposition method to develop a stress field model around the wellbore [29]. For fracturing in a horizontal well, the stress analysis must consider the initial in situ stress, the internal wellbore pressure, thermal effects on the rock, fluid seepage during fracturing, and the influence of the casing and cement ring. An iterative algorithm is then applied to these factors, together with the rock’s breakdown criterion, to predict the pressure required to initiate rock breakdown.

2.1. Modelling the Stress Field Surrounding Horizontal Drilling and Staged Fracturing

The stress state around a horizontal wellbore and its perforation tunnels in an actual formation is highly complex. To simplify the calculation model, the following assumptions are made in this study:

(1): The rock is isotropic and exhibits uniform properties;
(2): The rock deforms under linear elastic conditions;
(3): A consistent and uniform fluid pressure is applied both inside the borehole and within the perforations;
(4): The interaction between the rock and the incoming fluid, and its effect on the mechanical properties, is not considered;
(5): The cement ring is assumed to be strongly bonded.

The model assumes that the rock and cement ring materials follow a linear elastic constitutive relationship during loading. This assumption is applicable to brittle rocks at stress states below the breakdown pressure, where nonlinear deformation can usually be ignored. The formation is regarded as an isotropic homogeneous medium. Although the actual formation often exhibits anisotropy, this study aims to evaluate the macroscopic breakdown pressure, and the available field logging data mostly reflect vertical or radial averages, so an isotropic model is an effective engineering simplification that is consistent with the scale of the input data. The model also assumes that the cement ring is perfectly bonded to the casing and the formation, with no micro-annuli or weak interfaces. This assumption is the basis for calculating the theoretical breakdown pressure and represents the ideal state of wellbore integrity. It is reasonable when the data selected for fusion come from well sections with good cementing quality.

The assumptions of this model have reasonable applicability in conventional reservoirs with good wellbore integrity, brittle rocks, and stress states without observable plastic behavior. The data fusion strategy of this study is to compare and complement the results of the mechanism model based on these idealized assumptions with the field data reflecting actual complexity, thereby constructing an enhanced dataset that includes both theoretical limits and engineering deviations. Future work can consider introducing more complex constitutive models to expand the application scope of this method.

The stress state around the wellbore is analyzed by means of coordinate transformation. The wall of the horizontal well is treated as a porous medium undergoing small deformation, while the horizontal wellbore and the perforation tunnel are treated as two intersecting boreholes of different sizes. By applying the principle of superposition, the stress distribution on the wall of the perforation tunnel in the horizontal well is determined, taking into account the combined influence of multiple factors [22] (Figure 1).

\left\{
\begin{aligned}
σ_{r} ={}& P_{w} - δ ϕ (P_{w} - P_{p}) - \frac{α_{T} E ΔT}{1 - 2ν} - P_{w} \, \frac{2 R_{i}^{2} (1 + ν_{c}) (1 - ν_{c})}{E_{c} (R_{o}^{2} - R_{i}^{2})} \Big/ \left[ \frac{1 + ν}{E} + \frac{R_{o}^{2} (R_{i}^{2} + (1 - 2ν_{c})) (1 + ν_{c})}{E_{c} (R_{o}^{2} - R_{i}^{2})} \right] \\
σ_{θ'} ={}& -2 P_{w} (1 + \cos 2θ') + (σ_{xx} + σ_{yy} + σ_{z}) + 2 (σ_{xx} + σ_{yy} - σ_{z}) \cos 2θ' - 2 (σ_{xx} - σ_{yy}) \cos 2θ \, (1 + 2 \cos 2θ') \\
& - 4 σ_{z} \sin 2θ' - \frac{α_{T} E ΔT}{1 - 2ν} - 2δ \left[ \frac{α (1 - 2ν)}{1 - ν} - ϕ \right] (P_{w} - P_{p}) (1 + \cos 2θ') \\
& + P_{w} \, \frac{2 R_{i}^{2} (1 + ν_{c}) (1 - ν_{c})}{E_{c} (R_{o}^{2} - R_{i}^{2})} \Big/ \left[ \frac{1 + ν}{E} + \frac{R_{o}^{2} (R_{i}^{2} + (1 - 2ν_{c})) (1 + ν_{c})}{E_{c} (R_{o}^{2} - R_{i}^{2})} \right] \\
σ_{z} ={}& -c P_{w} + σ_{zz} - ν \left[ 2 (σ_{xx} - σ_{yy}) \cos 2θ \right] - \frac{α_{T} E ΔT}{1 - 2ν} - δ \left[ \frac{α (1 - 2ν)}{1 - ν} - ϕ \right] (P_{w} - P_{p}) \\
τ_{rθ} ={}& 0 \\
τ_{θz} ={}& 2 τ_{yz} \cos θ \\
τ_{rz} ={}& 0
\end{aligned}
\right.

(1)

where δ is the permeability coefficient; c is the correction factor; τ_{rθ}, τ_{θz}, and τ_{rz} are the shear stress components at the well wall (MPa); σ_{r} is the wellbore radial stress (MPa); σ_{z} is the wellbore axial stress (MPa); and α is the poroelastic coefficient.

2.2. Mechanism Model Solution and Error Analysis

The three principal stresses under cased, perforated completion conditions are:

\left\{
\begin{aligned}
σ_{1} &= σ_{r} \\
σ_{2} &= \frac{1}{2} \left[ (σ_{θ'_{0}} + σ_{z}) + \sqrt{(σ_{θ'_{0}} - σ_{z})^{2} + 4 τ_{θz}^{2}} \right] \\
σ_{3} &= \frac{1}{2} \left[ (σ_{θ'_{0}} + σ_{z}) - \sqrt{(σ_{θ'_{0}} - σ_{z})^{2} + 4 τ_{θz}^{2}} \right]
\end{aligned}
\right.

(2)
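As a minimal sketch of how Equation (2) can be evaluated, the function below computes the three principal stresses from the wall stress components. The function name and the numerical inputs are illustrative assumptions, not values from the study.

```python
import math


def principal_stresses(sigma_r, sigma_theta, sigma_z, tau_theta_z):
    """Return (sigma_1, sigma_2, sigma_3) as defined in Equation (2), in MPa."""
    mean = 0.5 * (sigma_theta + sigma_z)
    radius = 0.5 * math.sqrt((sigma_theta - sigma_z) ** 2 + 4.0 * tau_theta_z ** 2)
    return sigma_r, mean + radius, mean - radius


# Hypothetical wall stresses in MPa, chosen only to exercise the formula.
s1, s2, s3 = principal_stresses(sigma_r=30.0, sigma_theta=20.0, sigma_z=40.0,
                                tau_theta_z=0.0)
```

With zero shear stress, the second and third principal stresses reduce to the larger and smaller of the tangential and axial stresses, which is a quick sanity check on the implementation.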

In Block S, the tight gas reservoirs lack well-developed natural fracture systems and sedimentary bedding. Hydraulic fractures are therefore initiated by tensile failure of the rock matrix. Specifically, when the tensile stress at the well wall exceeds the tensile strength of the rock, the rock fails and a fracture forms.

\left\{
\begin{aligned}
σ_{\max} (θ_{0}) &= \frac{1}{2} \left[ (σ_{θ_{0}} + σ_{z}) - \sqrt{(σ_{θ_{0}} - σ_{z})^{2} + 4 τ_{θ_{0} z}^{2}} \right] - α P_{p} \geq σ_{t} \\
σ_{\max} (θ'_{0}) &= \frac{1}{2} \left[ (σ_{θ'_{0}} + σ_{z}) - \sqrt{(σ_{θ'_{0}} - σ_{z})^{2} + 4 τ_{θ_{0} z}^{2}} \right] - α P_{p} \geq σ_{t}
\end{aligned}
\right.

(3)

Because the breakdown pressure appears implicitly in the mechanism model, a direct analytical solution is not possible. The breakdown pressure at tensile initiation under perforated conditions is therefore determined by a trial calculation method, in which the bottomhole pressure is increased step by step until the initiation criterion is met. The calculation was implemented in code based on the computational model described above. Using cased horizontal wells A and B in the target block as examples, the basic data from Table 1 and the rock mechanical parameters at each fracture initiation point were input into the mechanism model to determine the breakdown pressure at each initiation location. The key input parameters and their value ranges in Table 1 are derived from the geological and engineering reports of the target block. The ranges of horizontal principal stress and pore pressure are based on measured rock mechanics experimental data from the study block. The range of rock tensile strength is estimated from the logging data of the target interval using empirical formulas.
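The trial calculation can be sketched as a simple stepping loop. In the sketch below, the stepping routine is generic, while the failure criterion is a deliberately simplified stand-in: the classical open-hole Hubbert-Willis form, not Equations (1) to (3). All names and input values are hypothetical.

```python
def trial_breakdown_pressure(fails, p_start, p_max, step=0.1):
    """Raise the bottomhole pressure in fixed steps until fails(p) is True.

    fails   -- callable returning True once the initiation criterion is met
    returns -- first trial pressure (MPa) meeting the criterion, or None
    """
    n_steps = int(round((p_max - p_start) / step))
    for k in range(n_steps + 1):
        p = p_start + k * step  # recompute from k to avoid accumulated round-off
        if fails(p):
            return p
    return None


def toy_criterion(sigma_H, sigma_h, p_pore, sigma_t):
    """Stand-in criterion (NOT Equations (1)-(3)): open-hole Hubbert-Willis
    form, compression positive; failure when the effective hoop stress at
    the wall drops to -sigma_t."""
    def fails(p_w):
        hoop = 3.0 * sigma_h - sigma_H - p_w - p_pore
        return hoop <= -sigma_t
    return fails


# Hypothetical inputs in MPa.
criterion = toy_criterion(sigma_H=50.0, sigma_h=40.0, p_pore=25.0, sigma_t=5.0)
p_b = trial_breakdown_pressure(criterion, p_start=25.0, p_max=100.0, step=0.1)
```

For this stand-in criterion the closed-form answer is 3σ_h − σ_H − P_p + σ_t = 50 MPa, so the stepping result can be checked to within one step. In the actual workflow, `fails` would wrap the full stress model and tensile criterion.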

The wellhead breakdown pressure was converted to bottomhole breakdown pressure by accounting for the friction at different pumping rates and the hydrostatic pressure of the liquid column. The breakdown pressure calculated by the mechanism model was then compared with the actual breakdown pressure observed in the field, as shown in Figure 2.
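A minimal sketch of this conversion is given below, assuming the usual balance of wellhead pressure plus hydrostatic column pressure minus friction losses. The function name, the lumping of all friction into a single term, and the input values are assumptions for illustration.

```python
G = 9.80665  # standard gravity, m/s^2


def bottomhole_pressure(p_wellhead, fluid_density, tvd, p_friction):
    """Convert wellhead pressure (MPa) to bottomhole pressure (MPa).

    fluid_density -- fracturing fluid density, kg/m^3
    tvd           -- true vertical depth of the initiation point, m
    p_friction    -- total friction loss at the given pumping rate, MPa
    """
    p_hydrostatic = fluid_density * G * tvd / 1.0e6  # Pa -> MPa
    return p_wellhead + p_hydrostatic - p_friction


# Hypothetical values: 50 MPa at surface, water-like fluid, 3000 m TVD.
p_bh = bottomhole_pressure(p_wellhead=50.0, fluid_density=1000.0,
                           tvd=3000.0, p_friction=10.0)
```

In practice the friction term depends on the pumping rate, fluid rheology, and tubular geometry, so it would be evaluated separately for each stage.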

Figure 2 shows that the mean absolute percentage error between the results of the mechanism model and the actual breakdown pressure observed in the field is 13.96%. This indicates that the mechanism model is reasonably accurate and can supply reliable data for integrating the mechanism model with the data-driven method for breakdown pressure prediction.

3. Investigation into the Estimation of Breakdown Pressure Using Machine Learning Models

Machine learning algorithms can establish accurate connections based on data-driven correlations between factors that may not be explicitly related or are difficult to quantify. Because they do not rely on predefined assumptions, these algorithms are better suited for analyzing complex geological phenomena. Therefore, this section presents a data mining study that investigates breakdown pressure in the field using machine learning algorithms.

3.1. Data Integration and Data Pre-Processing

The raw data collected consists of breakdown pressures measured at 236 fracture locations across 20 cased-completion units, along with rock mechanical properties measured at these specific locations in the target block. The machine learning algorithm used breakdown pressure as the output parameter. The input parameters for the algorithm were selected based on the mechanism model, resulting in eight parameters for modeling, including tensile strength and maximum horizontal principal stress.

The data used in this study come from real fracturing records of cased horizontal wells in the target block. The accuracy of the model’s predictions may be affected by missing values or outliers in the dataset, so data cleaning and other preprocessing steps were carried out before applying machine learning. After removing outliers and filling in missing values, a total of 200 valid samples were compiled.
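The text does not state which outlier rule or imputation method was used. The sketch below illustrates one common choice for a single column: an interquartile-range (IQR) filter followed by median filling. Both choices, the function name, and the sample values are assumptions.

```python
import statistics


def clean_series(values, k=1.5):
    """Drop IQR outliers, then fill missing entries (None) with the median.

    The IQR rule and median fill are illustrative assumptions, not the
    documented procedure of the study.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    # Keep missing entries for now; remove only values outside the fences.
    kept = [v for v in values if v is None or low <= v <= high]
    fill = statistics.median(v for v in kept if v is not None)
    return [fill if v is None else v for v in kept]


# Hypothetical breakdown pressures (MPa) with one gap and one outlier.
raw = [50, 52, None, 51, 49, 120, 53]
cleaned = clean_series(raw)
```

Here the value 120 is removed as an outlier and the gap is filled with the median of the remaining observations.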

3.2. Indicators for Model Evaluation

The mean absolute percentage error (MAPE), root mean square error (RMSE), and coefficient of determination (R²) were chosen as metrics to evaluate the accuracy of the model’s breakdown pressure predictions.

\mathrm{MAPE} = \frac{100\%}{n} \sum_{i = 1}^{n} \left| \frac{\hat{y}_{i} - y_{i}}{y_{i}} \right|

(4)

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})^{2}}

(5)

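R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(6)

where y_{i} is the measured breakdown pressure of sample i, \hat{y}_{i} is the corresponding prediction, \bar{y} is the mean of the measured values, and n is the number of samples.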
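The three metrics can be written directly from Equations (4) to (6). The sketch below uses only the Python standard library; the function names and the small example vectors are illustrative.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, Equation (4), in percent."""
    n = len(y_true)
    return 100.0 / n * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred))


def rmse(y_true, y_pred):
    """Root mean square error, Equation (5)."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination, Equation (6)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted breakdown pressures (MPa).
y_true = [40.0, 50.0, 60.0]
y_pred = [44.0, 50.0, 57.0]
```

Note that the denominator of R² uses the deviation of the measurements from their mean, so a model that always predicts the mean scores zero.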

Cross-validation is a technique used to assess the accuracy of a machine learning model during the modeling process. K-fold cross-validation is a widely employed method for model evaluation. This approach helps to reduce the likelihood of prediction errors and enhances the model’s overall applicability. It involves calculating the average RMSE over K iterations of the cross-validation process. In this study, a fivefold cross-validation method was used to assess the model’s performance.
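The fivefold procedure can be sketched as follows. To keep the example self-contained, a one-variable least-squares line stands in for the actual learners; the function names and the synthetic data are assumptions.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices once and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in model)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def cross_val_rmse(xs, ys, k=5, seed=0):
    """Average RMSE over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((a + b * xs[i] - ys[i]) ** 2 for i in held_out) / len(held_out)
        scores.append(math.sqrt(mse))
    return sum(scores) / len(scores)


# Synthetic, exactly linear data: the held-out error should be essentially zero.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
avg_rmse = cross_val_rmse(xs, ys, k=5)
```

Each sample is held out exactly once, so the averaged score uses every observation for validation without ever scoring a model on data it was trained on.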

3.3. Investigation into the Estimation of Breakdown Pressure Using Traditional Machine Learning Techniques

Breakdown pressure prediction models for cased horizontal wells in the target block were constructed using four machine learning algorithms (linear regression, KNN, SVR, and Lasso). The models were trained on the 200 sets of field breakdown pressure samples obtained through data preprocessing. Eight parameters, including tensile strength and maximum horizontal principal stress, were used as inputs, and breakdown pressure was the output. Figure 3 displays the breakdown pressure predictions of the four traditional machine learning techniques.

Table 2 shows that the four conventional machine learning algorithms can predict the breakdown pressure of cased horizontal wells in the target block only to a limited extent, with an average coefficient of determination of 0.724 and an average absolute percentage error of 27.01%. All four methods also exhibited significant overfitting. This problem arises because only 200 sets of field breakdown pressure data were available for model training; the samples are scarce, lack diversity, are of poor quality, and are highly variable. Furthermore, the correlation between the geological and engineering parameters of the target block and the breakdown pressure is not well established. As a result, relying on traditional learning techniques alone to model and forecast breakdown pressure for the target block leads to inadequate prediction accuracy and overfitting.

3.4. Study on the Estimation of Breakdown Pressure Using an Integrated Learning Algorithm

Ensemble learning is a powerful approach that combines multiple weak learners to enhance model performance. Predictions generated by ensemble learning typically outperform those of individual models, particularly when the dataset contains a limited number of samples. Therefore, using an ensemble learning system to predict breakdown pressure can yield better results compared to relying on a single learner.

In this study, four common ensemble learning techniques (Random Forest, AdaBoost, GBDT, and XGBoost) are employed to build the breakdown pressure prediction model. The prediction results of the four ensemble algorithms are shown in Figure 4. The predictive capabilities of these models are evaluated using fivefold cross-validation to identify the most suitable algorithm for predicting breakdown pressure in cased horizontal wells in tight sandstone reservoirs.

Table 3 presents a comparison of the model evaluation results for different ensemble learning algorithms. The average coefficient of determination for the four ensemble learning algorithms is 0.840, which is 13.8% higher than the average coefficient of determination for the four single-learner algorithms. The ensemble learning algorithms have an average RMSE of 9.500 and an average MAPE of 16.39%, which are 29% and 26% lower, respectively, than the corresponding values for the single-learner algorithms. This indicates that the ensemble learning algorithms offer more accurate predictions for breakdown pressure than the single learners.

Since no single machine learning algorithm is definitively superior to the others, the most suitable algorithm is selected by analyzing prediction performance on the dataset. Among the four ensemble learning methods, the GBDT algorithm is the most computationally efficient. The running times of the AdaBoost and XGBoost algorithms are similar when using default hyperparameters. The XGBoost algorithm outperforms the others, with an RMSE of 0.208 on the training set and 9.223 on the test set, the latter being lower than that of the other models. Its mean absolute percentage error is 13.31%, and its coefficient of determination is 0.887, which is higher than that of the other models. Therefore, the XGBoost algorithm demonstrates better prediction capabilities than the other learning algorithms, with potential for further parameter tuning.

While the XGBoost algorithm effectively modeled and tested breakdown pressure for horizontal wells in Block S, it is important to validate its ability to predict unseen data. Therefore, the XGBoost model was used to predict the breakdown pressure at 10 fracturing points in the new well C. The prediction results are shown in Figure 5.

Compared to the actual breakdown pressure in the field, the maximum absolute percentage error is 36.62%, the minimum is 3.93%, and the average is 17.90%. This indicates that the generalization ability and stability of the model are weak in practical application. The primary reasons for this include the small fluctuation amplitude of the breakdown pressure samples in the field data, the lack of diversity and quantity of the data, and the fact that when simulation is carried out solely through data mining, the underlying function space of the approximation is undefined and directionless. As a result, the model can only fit the data blindly using parameters, and the stability of this fit cannot be guaranteed.

4. Prediction of Breakdown Pressure Using the Integration of Data Mining and Mechanistic Modelling

Experience has demonstrated that relying solely on data-driven approaches is insufficient in the oil and gas industry. To achieve the research objectives of oil and gas artificial intelligence, a combination of multiple approaches and models is necessary. A more effective strategy is to integrate data-driven methods with mechanistic models. The data-driven modeling process can incorporate domain knowledge by integrating it into the design of the deep neural network model structure, embedding it into the model evaluation, and developing a novel loss function that maintains a balanced weight for the regularization term.

4.1. Study on the Estimation of Breakdown Pressure Using an Integrated Learning Algorithm

Chaochao Zhao et al. [30] analyzed the principle behind improving the generalization ability of machine learning algorithms by incorporating mechanism samples, providing a theoretical foundation for fusion-driven modeling of field data and physical mechanisms. The generalization capability of a learning algorithm refers to the ability of a model trained by that algorithm to predict unseen data. This property is inherently crucial, as the generalization error represents the expected risk associated with the learned model.

R_{\mathrm{exp}} (\hat{f}) = E_{P} [L (Y, \hat{f} (X))] = \int_{\mathcal{X} \times \mathcal{Y}} L (y, \hat{f} (x)) P (x, y) \, dx \, dy

(7)

To fuse data mining with the mechanism model, let N_{1} be the number of field data samples and N_{2} the number of mechanistic simulation samples obtained from the breakdown pressure mechanism model. Substituting both the field data and the mechanistic samples into the machine learning loss function, with the mechanistic term weighted by ξ, gives

E (\bar{w}) = \sum_{j = 1}^{N_{1}} {(y_{j} - φ (x_{j}, \bar{w}))}^{2} + ξ \sum_{j = N_{1} + 1}^{N_{1} + N_{2}} {(y_{j} - φ (x_{j}, \bar{w}))}^{2}

(8)

Taking a linear regression model as an example, with the mechanism model output written as A^{*} x_{j}, the optimal weights are

{\bar{w}}^{*} = \underset{\bar{w}}{\arg\min} \, E \left[ \sum_{j = 1}^{N_{1}} (y_{j} - \bar{w} x_{j})^{2} + ξ \sum_{j = N_{1} + 1}^{N_{1} + N_{2}} (A^{*} x_{j} - \bar{w} x_{j})^{2} \right]

(9)

Here, by the Cauchy–Schwarz inequality, the second term is bounded as

ξ \sum_{j = N_{1} + 1}^{N_{1} + N_{2}} {(A^{*} x_{j} - \bar{w} x_{j})}^{2} \leq ξ \sum_{j = N_{1} + 1}^{N_{1} + N_{2}} {∥ \bar{w} - A^{*} ∥}^{2} {∥ x_{j} ∥}^{2}

(10)

Because \bar{w} - A^{*} does not depend on j, the factor ξ \sum_{j = N_{1} + 1}^{N_{1} + N_{2}} {∥ x_{j} ∥}^{2} can be treated as a single quantity, denoted

ψ = ξ \sum_{j = N_{1} + 1}^{N_{1} + N_{2}} {∥ x_{j} ∥}^{2}

(11)

Thus, it can be determined that

{\bar{w}}^{*} = \underset{\bar{w}}{\arg\min} \, E \left[ \sum_{j = 1}^{N_{1}} (y_{j} - \bar{w} x_{j})^{2} + ψ {∥ \bar{w} - A^{*} ∥}^{2} \right]

(12)

where the first term represents the fitting error between the learned model and the field data, and the second term represents the deviation of the learned model from the theoretical result of the physical mechanism model. It is assumed that \bar{w} follows a high-dimensional normal distribution with mean A^{*}, where A^{*} can be regarded as the mean of \bar{w} obtained from the physical mechanism, so that \bar{w} - A^{*} is a normally distributed vector with mean 0 and covariance G^{*}. The optimizer {\bar{w}}^{*} is then the maximum a posteriori (MAP) estimate given A^{*}. Since the MAP estimate is equivalent to minimizing the expected risk, the equation above improves the generalization capability of the model through the regularization term ψ {∥ \bar{w} - A^{*} ∥}^{2}. Because A^{*} is obtained from the mechanism model, the physical mechanism constrains the machine learning algorithm and thereby enhances the generalization capability of the model.
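The regularizing effect can be illustrated with a one-dimensional linear model, where the fused loss of Equation (8) has a closed-form minimizer obtained by setting its derivative with respect to w to zero. The slope A*, the sample values, and the function names below are hypothetical.

```python
def fused_loss(w, field, mech, xi):
    """Equation (8) for the 1-D linear model phi(x, w) = w * x."""
    data_term = sum((y - w * x) ** 2 for x, y in field)
    mech_term = sum((y - w * x) ** 2 for x, y in mech)
    return data_term + xi * mech_term


def fused_minimizer(field, mech, xi):
    """Closed-form minimizer of fused_loss (derivative set to zero)."""
    num = sum(x * y for x, y in field) + xi * sum(x * y for x, y in mech)
    den = sum(x * x for x, _ in field) + xi * sum(x * x for x, _ in mech)
    return num / den


A_STAR = 2.0                                   # hypothetical mechanism slope
field = [(1.0, 2.6), (2.0, 4.4), (3.0, 7.5)]   # hypothetical noisy field samples
mech = [(x, A_STAR * x) for x in (1.0, 2.0, 3.0, 4.0)]  # mechanism samples

w_field_only = fused_minimizer(field, mech, xi=0.0)     # ordinary least squares
w_fused = fused_minimizer(field, mech, xi=1.0)
w_mech_heavy = fused_minimizer(field, mech, xi=1000.0)
```

As ξ grows, the fitted slope moves from the field-only estimate toward A*, which is the shrinkage behavior expressed by the regularization term in Equation (12).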

4.2. Breakdown Pressure Prediction Driven by Fusion of Mechanistic Models and Field Data

The breakdown pressure mechanistic model is an implicit model that requires complex, repeated computations, which makes it difficult to express directly as a regularization term during the training of machine learning algorithms. An effective alternative is to construct a dataset from the outputs of the mechanistic model: the parameters required by the mechanistic model serve as inputs, and the breakdown pressures it calculates are incorporated into the dataset. Integrating domain knowledge and mechanistic models in this way enriches the type and distribution of the data, which improves the accuracy and convergence of data-driven methods. The existing mechanistic model is used to calculate breakdown pressure and to provide a sufficient number of diverse mechanistic samples that constrain the machine learning model. To mitigate the impact of sample size on the model results and to remain consistent with the total number of real-world data samples, the total sample size was fixed at 200 groups, comprising 100 groups of mechanistic sample data and 100 groups of field breakdown pressure sample data.

The mechanism data were generated after determining the range of each input parameter. We used the Latin Hypercube Sampling (LHS) method to create N sets of input parameter combinations within the preset multi-dimensional parameter space. LHS is a quasi-random sampling strategy that achieves uniform coverage across each parameter dimension, so the generated sample set explores the entire parameter space efficiently. The sampling was performed independently of the field observation dataset later used for fusion, so that the mechanism dataset is broadly representative rather than limited to the distribution of the existing observations.
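A minimal sketch of Latin Hypercube Sampling is shown below, written with the standard library so the stratification logic is explicit. The parameter ranges are placeholders rather than the values in Table 1, and only three of the eight inputs are shown.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Return n_samples points; each dimension is split into n_samples
    equal strata and every stratum is sampled exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        strata = list(range(n_samples))
        rng.shuffle(strata)  # random pairing of strata across dimensions
        width = (high - low) / n_samples
        columns.append([low + (s + rng.random()) * width for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


# Hypothetical ranges (MPa): minimum horizontal stress, pore pressure,
# tensile strength. These are NOT the ranges used in the study.
bounds = [(35.0, 55.0), (20.0, 30.0), (2.0, 8.0)]
samples = latin_hypercube(100, bounds)
```

Each sampled parameter combination would then be passed to the mechanistic model to compute one breakdown pressure, yielding one mechanistic training sample. Library implementations of LHS also exist and may be preferable in production code.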

The breakdown pressure of cased horizontal wells in the target block was then predicted using the four ensemble algorithms. To obtain a reliable accuracy estimate and guard against overfitting, the average over fivefold cross-validation was used to assess each learning algorithm. The prediction results of the four ensemble learning algorithms are shown in Table 4. Figure 6 compares the test results obtained from the fused field and mechanistic samples with those obtained from pure field data samples.

The comparison of model evaluation metrics shows that, once mechanistic model samples are added to the field data samples, all four algorithms predict breakdown pressure more accurately than the corresponding models trained on pure field data. Among these models, the XGBoost algorithm delivers the most accurate predictions. When the ratio of field data samples to mechanistic model samples is 1:1, the mean absolute percentage error of the breakdown pressure prediction is 10.35%, with a coefficient of determination of 0.915 and a root mean square error (RMSE) of 7.796. All model evaluation metrics surpass those of the prediction model based solely on field breakdown pressure data. This further demonstrates that incorporating a well-established mechanistic model in the training process can significantly enhance the predictive performance of the model.

Several properties make XGBoost well suited to this task. First, as a gradient boosting tree model, it can automatically capture complex nonlinear interactions among input features through the combination of tree structures, which is crucial for modeling breakdown pressure, a phenomenon influenced by multiple coupled physical fields. Second, its built-in regularization term helps prevent overfitting and enhances the robustness of the model on the mixed dataset that combines mechanism data and field data. Finally, the feature importance scores provided by XGBoost contribute to physical interpretability, allowing us to identify the engineering and geological parameters that contribute most to the prediction and to cross-check them against physical understanding. Compared with other ensemble methods, the sequential optimization of gradient boosting usually yields higher prediction accuracy. Compared with deep neural networks, XGBoost often has better sample efficiency and a lighter hyperparameter tuning burden on small and medium-sized structured datasets, which matches the application scenario of this study.

4.3. Fusion Sample Set Proportion Preferences

Integrating mechanistic samples can enhance the model’s quality, but the effect depends on how many mechanistic samples are incorporated. This section examines how different ratios of mechanistic samples to field data influence model performance. The analysis is based on actual breakdown pressure data from cased horizontal drillings in the target block, with the corresponding mechanistic samples calculated using the mechanistic model. The field data and mechanistic model data were combined at different ratios, with a total of 200 samples. These datasets, with varying fusion ratios, were then used to train the XGBoost algorithm and build a predictive model for breakdown pressure.

For each training set formed by integrating mechanism data and field data in a specific ratio, we conducted K-fold cross-validation to select the model and optimize the hyperparameters. During this process, the mechanism samples and field samples were completely shuffled and mixed, and then randomly divided. That is to say, each fold contained subsets from both types of data, approximately forming a proportion based on the overall size. This mixed partition strategy ensures that the model can simultaneously learn the physical laws and the features of the field data in each round of training/verification, avoiding the bias that may be caused by isolated partitioning of data sources.
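The mixed partition described above can be illustrated with a minimal sketch, assuming a pool in which every sample carries a source label. The 80/120 split between field and mechanism samples, the seed, and the helper name `shuffled_kfold` are illustrative assumptions rather than values reported in this study; only the 200-sample total and the fully shuffled K-fold idea come from the text.

```python
import random

def shuffled_kfold(n_samples, k, seed=0):
    """Shuffle all indices, cut them into k near-equal folds, yield (train, val)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for m, f in enumerate(folds) if m != i for j in f]
        yield train, val

# Hypothetical pool: 80 field samples followed by 120 mechanism samples
# (the 80/120 split is an assumption made only for this illustration).
source = ["field"] * 80 + ["mechanism"] * 120

fold_counts = []
for train_idx, val_idx in shuffled_kfold(len(source), k=5, seed=42):
    n_field = sum(source[j] == "field" for j in val_idx)
    fold_counts.append((n_field, len(val_idx) - n_field))

for i, (n_field, n_mech) in enumerate(fold_counts, start=1):
    print(f"fold {i}: field={n_field}, mechanism={n_mech}")
```

Because the indices are shuffled before the folds are cut, each validation fold receives both sources in roughly the overall proportion, which is the property the mixed partition strategy relies on.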

After the optimal hyperparameters determined by the model in the cross-validation are fixed, the final performance evaluation is conducted on a completely new independent test set that has not participated in any training or validation process. This test set consists of two parts: (1) all data from a new well, used to evaluate the model’s generalization ability for new spatial positions; (2) additional samples reserved from the known wells at the beginning of the construction of the integrated dataset, randomly selected and always sealed, used to evaluate the model’s predictive ability for new data points in the known area.

To investigate the impact of different data source proportions, we constructed multiple fused datasets. For instance, a 1:1 fused dataset refers to simply merging N mechanism samples with N randomly selected samples from the field dataset, resulting in a training set with a total of 2N samples. Similarly, a 1:1.5 fused dataset contains N mechanism samples and 1.5N field samples. Before merging, both types of data were standardized in the same way. For each specified fusion ratio, the random selection of field samples was repeated five times to construct five different dataset replicas. The final model performance is the average over the five runs, which reduces the bias from random sampling.
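The replica construction can be sketched as follows. This is a minimal illustration under stated assumptions: the pool sizes, the integer IDs standing in for feature vectors, the helper names `build_replicas` and `average_score`, and the placeholder score are all hypothetical. The sample counts are passed explicitly so that the sketch does not depend on how a ratio is written.

```python
import random
from statistics import mean

def build_replicas(field_pool, mech_samples, n_field, n_replicas=5, seed=0):
    """Fuse a fixed set of mechanism samples with n_field randomly drawn field samples.

    Each replica uses a fresh random draw of field samples, so the replicas
    differ only in which field samples they contain.
    """
    rng = random.Random(seed)
    replicas = []
    for _ in range(n_replicas):
        drawn = rng.sample(field_pool, n_field)
        fused = [("mechanism", s) for s in mech_samples] + [("field", s) for s in drawn]
        rng.shuffle(fused)  # mix the two sources before any train/validation split
        replicas.append(fused)
    return replicas

def average_score(replicas, score_fn):
    """Average a per-replica score (e.g. a cross-validated RMSE) over all replicas."""
    return mean(score_fn(r) for r in replicas)

# Hypothetical pools: integer IDs stand in for real feature vectors.
field_pool = list(range(100, 180))   # 80 field samples (assumed count)
mech_samples = list(range(50))       # N = 50 mechanism samples (assumed count)

replicas = build_replicas(field_pool, mech_samples, n_field=50)  # 1:1 fusion, 2N = 100

# Placeholder score: fraction of field samples per replica. A real run would
# train and cross-validate a model on each replica here instead.
field_fraction = average_score(
    replicas, lambda r: sum(src == "field" for src, _ in r) / len(r)
)
print(len(replicas), len(replicas[0]), field_fraction)
```

Averaging over replicas, rather than reporting a single draw, is what keeps one lucky or unlucky selection of field samples from dominating the comparison between fusion ratios.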

Figure 8 demonstrates the impact of the proportion of mechanistic samples to field data samples on the model. For a given number of samples, the average test-set error of the fivefold cross-validation is minimized when the ratio of field data to mechanistic model data is 1:1.5, and the model exhibits the least overfitting at this ratio. Thus, with XGBoost set to its default parameters, the dataset for model construction was chosen with a ratio of 1:1.5 for field data to mechanistic model data. The prediction results of the XGBoost model with default settings are displayed in Figure 9. The largest absolute percentage error is 21.2%, the minimum is 0.11%, and the average is 7.62%.
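The error statistics quoted above (largest, smallest, and average absolute percentage error, together with RMSE) can be computed as in the following sketch. The pressure values are hypothetical and serve only to show the arithmetic; the function name `ape_summary` is likewise an assumption.

```python
import math

def ape_summary(y_true, y_pred):
    """Return max/min/mean absolute percentage error (in %) and the RMSE."""
    apes = [abs(p - t) / abs(t) * 100.0 for t, p in zip(y_true, y_pred)]
    mse = sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "max_ape": max(apes),
        "min_ape": min(apes),
        "mape": sum(apes) / len(apes),
        "rmse": math.sqrt(mse),
    }

# Hypothetical measured and predicted breakdown pressures (same unit for both).
y_true = [50.0, 60.0, 80.0, 100.0]
y_pred = [55.0, 57.0, 80.0, 90.0]

summary = ape_summary(y_true, y_pred)
print(summary)
```

MAPE is scale-free and therefore comparable across wells with different pressure levels, whereas RMSE keeps the unit of the pressure data and penalizes large individual misses more strongly.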

5. Hyperparameter Optimization of Fusion Models Based on the TPE Bayesian Algorithm

The primary goal of both machine learning and statistical models is to improve prediction accuracy and enhance the generalization ability of the model. The parameters and hyperparameters play a critical role in determining the outcomes of machine learning models. In this context, parameters are typically determined through mathematical computations. Once the method is selected, the calculation process is largely automated and does not require human intervention. However, model tuning involves adjusting the hyperparameters of the model. Modern machine learning and deep learning algorithms have numerous hyperparameters, and it is not possible to find the optimal solution through a purely mathematical approach. Therefore, human intervention is essential for fine-tuning the model.

5.1. Determining the Parameter Space for XGBoost Optimization

This study employed a hyperparameter optimization method based on the tree-structured Parzen estimator (TPE). The TPE algorithm is widely used because it efficiently handles high-dimensional hyperparameter spaces that mix continuous and discrete variables. This is particularly important here: the training data combine mechanistic and field samples, and their inherent complexity and potential noise make the response surface over the hyperparameter space more rugged and each evaluation more costly. TPE guides sampling sequentially by constructing a probability model and can converge to a well-performing region within a relatively small number of iterations, thereby efficiently determining a robust hyperparameter combination for the XGBoost model trained on mixed data.

The objective of hyperparameter optimization is to maximize the model’s performance. When employing the TPE technique for XGBoost, it is necessary to first identify the hyperparameters that require optimization and establish the spatial range for each hyperparameter (Table 5).

XGBoost distinguishes itself from other tree ensemble algorithms due to its wide array of parameters, which significantly influence the model by affecting the tree construction process. These parameters interact in a non-linear manner, and their impact on the final model outcome may not be immediately apparent during the tuning process, as some parameters are adjusted in conjunction with runtime factors.

To begin, breakdown pressure data for horizontal drillings with casing completion in the target block are collected, and the fused dataset with a 1:1.5 ratio of field data to mechanistic model data is used for training. Additionally, a function is defined to check for overfitting after the model iterations are complete. Learning curves for the training set, the test set, and the overfitting check function are then plotted against each parameter to determine the initial parameter ranges. As an example, consider the number of iterations:

The default number of iterations was set to 100, with initial trials extended up to 200. A learning curve for the number of iterations can then be plotted, as shown in Figure 10.

Figure 10 shows that the curve flattens once the number of iterations reaches approximately 75, beyond which additional iterations have minimal impact on the model. Furthermore, the reduction in loss beyond 100 trees becomes insignificant once the score (RMSE) drops below 8. Based on this, the initial range for the number of iterations is set to (50, 150, 5), i.e., from 50 to 150 in steps of 5.
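Reading the plateau off a learning curve can also be done programmatically. The sketch below is only an illustration: the curve values are hypothetical (chosen so that the plateau falls near 75 iterations, as described in the text), and the tolerance `tol` and the helper name `plateau_round` are assumptions.

```python
def plateau_round(rounds, val_rmse, tol=0.05):
    """Return the first round count after which no later point improves
    the validation RMSE by more than tol."""
    for i, r in enumerate(rounds):
        best_later = min(val_rmse[i:])
        if val_rmse[i] - best_later <= tol:
            return r
    return rounds[-1]

# Hypothetical learning curve: validation RMSE versus number of boosting rounds.
rounds = [25, 50, 75, 100, 125, 150, 175, 200]
val_rmse = [9.60, 8.40, 7.90, 7.88, 7.87, 7.87, 7.88, 7.89]

plateau = plateau_round(rounds, val_rmse)
print("plateau starts near", plateau, "rounds")
```

A search range that brackets the plateau on both sides, such as 50 to 150 here, lets the optimizer confirm the flat region without spending evaluations far beyond it.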

Learning curves are used to determine the parameter space for several key parameters, including the number of samples allowed on each node and the regularization coefficients. For bounded parameters (e.g., sampling proportions) or parameters with a fixed set of options (e.g., the weak evaluator), no such exploration is needed to define the parameter space. For parameters with small values (e.g., the learning rate) or those typically adjusted downward (e.g., maximum depth), the parameter space is generally defined by extending the default value in both directions. Typically, the initial search explores a wider, sparser range of parameters. As the search progresses, the range is gradually narrowed, and the dimensionality of the parameter space is reduced. The resulting initial parameter space for all parameters is summarized in Table 6.
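The (low, high, step) triples used for the parameter ranges can be expanded into explicit candidate values as sketched below. Two assumptions are made here: the upper bound is treated as inclusive, and the range names are mapped to XGBoost parameter names following Table 5 (for example, "number of iterations" to `num_boost_round`). The helper `expand` is hypothetical.

```python
def expand(low, high, step, ndigits=6):
    """Expand a (low, high, step) triple into candidate values, upper bound included."""
    n = int(round((high - low) / step))
    return [round(low + i * step, ndigits) for i in range(n + 1)]

# A few ranges from the preliminary parameter space (Table 6); the mapping to
# XGBoost parameter names is an assumption based on Table 5.
space = {
    "num_boost_round": (50, 150, 5),
    "eta": (0.05, 2.05, 0.05),
    "max_depth": (2, 30, 2),
    "subsample": (0.1, 1, 0.1),
}

grid = {name: expand(*triple) for name, triple in space.items()}

n_combinations = 1
for name, values in grid.items():
    n_combinations *= len(values)
    print(f"{name}: {len(values)} candidates from {values[0]} to {values[-1]}")
print("combinations for these four parameters alone:", n_combinations)
```

Even four parameters already produce a grid far too large to enumerate, which is the practical reason a sequential method such as TPE is preferred over exhaustive grid search.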

5.2. Optimization of the XGBoost Algorithm Based on the TPE Approach

Bayesian optimization is a search algorithm used to automatically optimize the hyperparameters of a model. It works by building surrogate functions from probabilistic models, which are derived from the objective function and the results of prior evaluations. The primary task of the hyperparameter optimization technique is to maximize the Expected Improvement (EI), as described in Equation (13). The TPE algorithm is a non-standard Bayesian optimization algorithm based on tree-structured Parzen density estimation, proposed by Bergstra et al., which models both $p(x \mid y)$ and $p(y)$ instead of directly modelling $p(y \mid x)$ as a Gaussian process does.

$$EI_{y^{*}}(x) = \int_{-\infty}^{+\infty} \max(y^{*} - y, 0)\, p_{M}(y \mid x)\, dy \tag{13}$$

where $y^{*}$ is the threshold of the objective function; $y$ is the observed value of the objective function; $x$ is the hyperparameter set; and $p_{M}(y \mid x)$ is the surrogate probabilistic model representing the probability of $y$ under the hyperparameter set $x$.

According to Bayes’ theorem,

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} \tag{14}$$

In the TPE algorithm, $p(x \mid y)$ is defined as follows:

$$p(x \mid y) = \begin{cases} l(x), & y < y^{*} \\ g(x), & y > y^{*} \end{cases} \tag{15}$$

where $l(x)$ is the density formed from the observations $x^{(i)}$ whose loss is smaller than $y^{*}$, and $g(x)$ is the density formed from the observations $x^{(i)}$ whose loss is larger than $y^{*}$.

That is, TPE constructs different densities for the observations $x$ on either side of the threshold $y^{*}$. The threshold $y^{*}$ is chosen as a quantile of the observed values of $y$, controlled by the hyperparameter $\gamma$, such that

$$p(y < y^{*}) = \gamma \tag{16}$$

Marginalizing Equation (15) over $y$ and using Equation (16) gives

$$p(x) = \int p(x \mid y)\, p(y)\, dy = \gamma l(x) + (1 - \gamma) g(x) \tag{17}$$

Substituting Equations (14)–(17) into Equation (13) yields the final expression for the Expected Improvement:

$$EI_{y^{*}}(x) = \frac{\gamma y^{*} l(x) - l(x) \int_{-\infty}^{y^{*}} y\, p(y)\, dy}{\gamma l(x) + (1 - \gamma) g(x)} \propto \left(\gamma + \frac{g(x)}{l(x)} (1 - \gamma)\right)^{-1} \tag{18}$$
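The intermediate steps between Equation (13) and Equation (18) can be written out explicitly. The following is a sketch of the standard TPE argument using only the quantities defined in Equations (13)–(17); it assumes the surrogate $p_{M}(y \mid x)$ is identified with $p(y \mid x)$ in Equation (14).

```latex
\begin{aligned}
EI_{y^{*}}(x)
  &= \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y \mid x)\, dy
   = \int_{-\infty}^{y^{*}} (y^{*} - y)\, \frac{p(x \mid y)\, p(y)}{p(x)}\, dy \\
  &= \frac{l(x)}{p(x)} \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y)\, dy
   = \frac{\gamma y^{*} l(x) - l(x) \int_{-\infty}^{y^{*}} y\, p(y)\, dy}
          {\gamma l(x) + (1 - \gamma) g(x)} \\
  &\propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1}
\end{aligned}
```

The first line uses the fact that $\max(y^{*} - y, 0)$ vanishes for $y > y^{*}$ and then applies Bayes’ theorem. The second line uses $p(x \mid y) = l(x)$ for $y < y^{*}$ and $\int_{-\infty}^{y^{*}} p(y)\, dy = \gamma$. The last line follows because the factor $\gamma y^{*} - \int_{-\infty}^{y^{*}} y\, p(y)\, dy$ does not depend on $x$, so dividing the numerator and denominator by $l(x)$ leaves only the ratio $g(x)/l(x)$.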

From Equation (18), it can be seen that, to maximize the Expected Improvement and obtain the optimal hyperparameters, $x$ should be chosen such that $g(x)/l(x)$ takes its minimum value, i.e., points with low probability under $g(x)$ and high probability under $l(x)$.
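The selection rule can be illustrated with a toy one-dimensional implementation. This is a simplified sketch and not the implementation used in this study: it uses a fixed-bandwidth Gaussian Parzen estimator, a single continuous hyperparameter, and a toy quadratic loss, and the names `parzen_density`, `tpe_suggest`, and `objective` are hypothetical. Production libraries use adaptive bandwidths and handle mixed parameter types.

```python
import math
import random

def parzen_density(points, bandwidth):
    """Return a 1-D Gaussian kernel (Parzen) density built on the given points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
    return density

def tpe_suggest(history, gamma=0.25, n_candidates=64, bandwidth=0.1, rng=None):
    """One TPE step for a single hyperparameter; history is a list of (x, loss)."""
    if len(history) < 2:
        raise ValueError("need at least two observations")
    rng = rng or random.Random(0)
    ordered = sorted(history, key=lambda t: t[1])
    n_good = max(1, math.ceil(gamma * len(ordered)))
    n_good = min(n_good, len(ordered) - 1)          # keep at least one "bad" point
    good = [x for x, _ in ordered[:n_good]]         # losses below y*  -> l(x)
    bad = [x for x, _ in ordered[n_good:]]          # losses above y*  -> g(x)
    l = parzen_density(good, bandwidth)
    g = parzen_density(bad, bandwidth)
    # Draw candidates from l(x): pick a good point and add Gaussian noise.
    candidates = [rng.choice(good) + rng.gauss(0.0, bandwidth) for _ in range(n_candidates)]
    # Minimizing g(x)/l(x) maximizes the Expected Improvement (Equation (18)).
    return min(candidates, key=lambda x: g(x) / max(l(x), 1e-300))

def objective(x):
    """Toy loss with its minimum at x = 0.3 (stands in for a cross-validated RMSE)."""
    return (x - 0.3) ** 2

rng = random.Random(1)
history = [(x, objective(x)) for x in (rng.random() for _ in range(10))]
initial_best = min(loss for _, loss in history)

for _ in range(30):
    x_next = tpe_suggest(history, rng=rng)
    history.append((x_next, objective(x_next)))

best_x, best_loss = min(history, key=lambda t: t[1])
print(f"best x = {best_x:.3f}, loss = {best_loss:.5f}")
```

Candidates are drawn from $l(x)$ rather than uniformly, so the search concentrates on regions where good observations are dense, and the ratio $g(x)/l(x)$ then ranks those candidates.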

To assess the stability of the XGBoost hyperparameter search, the Bayesian optimization must be repeated several times. Initially, five Bayesian optimizations were conducted, and the outcomes are presented in Table 7.

Table 7 reveals that “reg:squarederror” was chosen as the evaluation measure in all five Bayesian optimizations, so this parameter was excluded from further searches. Likewise, the weak evaluator was chosen as “gbtree” in all five runs, confirming that “gbtree” is the better option for the present data. The remaining parameters are adjusted as follows: if the selected optimal value lies at the upper limit, the parameter space is shifted upward; if it lies at the lower limit, the parameter space is shifted downward; if it lies between the limits, the range is narrowed around the optimal values and the step size is reduced to increase the parameter density. For example, the number of iterations reached the lower limit once and approached the upper limit twice, so the original range (50, 150, 5) is widened to (20, 180, 5). The feature sampling ratio before node branching is biased towards 1.0, so its lower limit is raised, giving (0.5, 1, 0.05). The feature sampling ratio before tree building is spread evenly between 0.3 and 1, so the range is kept and only the step size is reduced, giving (0.3, 1, 0.05). The other parameters are treated in the same way, and the adjusted space is shown in Table 8.
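The first step of this adjustment rule, deciding where the repeated optima fall within the current range, can be sketched as a small classifier. The definition of "near a limit" as within one step of the bound, the action labels, and the function name `classify_search_result` are assumptions made for illustration; the actual adjustment in this study also involves judgment, for example for bounded ratios. The input values are taken from Table 7.

```python
def classify_search_result(low, high, step, best_values, eps=1e-9):
    """Classify where the optima of repeated searches fall within (low, high, step).

    A value counts as near a limit when it lies within one step of that limit
    (an assumed convention).
    """
    near_low = any(v - low <= step + eps for v in best_values)
    near_high = any(high - v <= step + eps for v in best_values)
    if near_low and near_high:
        return "widen_both"
    if near_high:
        return "shift_up"
    if near_low:
        return "shift_down"
    return "refine"

# Number of iterations: searched over (50, 150, 5), optima from Table 7.
iterations_action = classify_search_result(50, 150, 5, [75, 50, 60, 120, 145])

# Learning rate: searched over (0.05, 2.05, 0.05), optima from Table 7.
learning_rate_action = classify_search_result(0.05, 2.05, 0.05, [1.8, 1.65, 1, 0.8, 1.3])

print(iterations_action, learning_rate_action)
```

For the number of iterations the classifier returns "widen_both", in line with the widening from (50, 150, 5) to (20, 180, 5); for the learning rate it returns "refine", in line with the narrower range listed in Table 8.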

Five Bayesian searches were performed again on the tuned parameter space, and the results are shown in Table 9 below.

Table 9 displays the results of five Bayesian optimizations on the amended parameter space. Among these searches, the best (lowest) score achieved is 5.861. The validity of this set of parameters was then checked using the validation function: a fivefold cross-validation was conducted after each iteration, and the average values for the training set and test set were recorded. The validation results are displayed in Figure 11.

The fivefold cross-validation results in Figure 11 and the comparative analysis in Figure 12 show that, after optimizing XGBoost using the TPE algorithm, the average RMSE of the training set is 2.624 and the average RMSE of the test set is 5.936. The test-set RMSE is 6.62% lower than before optimization, indicating a significant improvement in model quality. Additionally, the test-set score has improved while the training-set score has decreased, indicating reduced overfitting.
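The 6.62% figure can be reproduced from the RMSE values listed in Table 10, taking the "Fusion model" row as the model with default hyperparameters and the "TPE-XGBoost" row as the optimized model. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0

rmse_default = 6.357   # "Fusion model" row of Table 10 (default hyperparameters)
rmse_tpe = 5.936       # "TPE-XGBoost" row of Table 10

reduction = relative_reduction(rmse_default, rmse_tpe)
print(f"{reduction:.2f}%")   # prints 6.62%
```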

The comparative analysis in Figure 13 and Figure 14 shows that after optimizing the GPT and CNN models using the TPE algorithm, the average RMSE values of the fivefold cross-validation training sets are 3.114 and 2.954, respectively, and the average RMSE values of the test sets are 7.835 and 8.643, respectively. Compared with the models before optimization, the RMSE values have decreased by 16.85% and 12.09%, respectively, indicating a significant improvement in model quality. Additionally, the scores of the test set have improved, while the scores of the training set have decreased, thereby reducing the occurrence of model overfitting.

The comparative analysis in Table 10 reveals that the TPE-XGBoost fusion model outperforms the traditional mechanistic model, the machine learning model based on field data, the XGBoost fusion model without parameter optimization, the TPE-GPT model, and the TPE-CNN model on the breakdown pressure prediction evaluation indexes. Figure 15 compares the breakdown pressure predicted by the optimized XGBoost model with the actual field values, at a field data to mechanistic model data ratio of 1:1.5. The average relative error of breakdown pressure prediction is 5.87%, indicating that the model can accurately predict the breakdown pressure of horizontal drillings with casing completion in the target block.

To test the generalization ability of the breakdown pressure prediction model that combines data mining and mechanistic modelling, the optimized model was used to predict the breakdown pressure at 10 fracturing points in new wells. The prediction results, shown in Figure 16, were compared with the actual breakdown pressure in the field. The maximum absolute percentage error was 11.58%, the minimum was 2.21%, and the average was 7.45%. The addition of mechanistic constraints considerably improves the generalization ability of the data mining model, as seen by a 58.34% reduction in MAPE, and results in better predictions on unseen breakdown pressure data.

6. Conclusions

(1) A computational model for calculating breakdown pressure in cased horizontal drillings is developed based on elastodynamics and seepage mechanics principles. The model incorporates stress superposition and accounts for tensile fracture along the casing. The average absolute error of the model is 13.96%, demonstrating improved accuracy in predicting breakdown pressure.

(2) For predicting breakdown pressure in cased completion, four traditional machine learning methods (KNN, linear regression, SVR, Lasso) and four ensemble algorithms (Random Forest, AdaBoost, GBDT, XGBoost) were applied to perform data mining on horizontal drillings in Block S. The results show that ensemble learning algorithms outperform single models in terms of accuracy. Specifically, the XGBoost algorithm exhibits the highest prediction accuracy for breakdown pressure.

(3) The proposed method enhances the generalization ability of machine learning models for predicting breakdown pressure in horizontal drillings by combining mechanistic sample data with field data. This fusion-driven modeling approach integrates mechanistic models and data mining techniques. The dataset for predicting breakdown pressure is analyzed using four ensemble algorithms, which achieve superior prediction accuracy compared to the pure field data-driven model. The results confirm that incorporating mechanistic model data into the field dataset effectively constrains the machine learning algorithm. When a sufficient number of samples is ensured, different data fusion ratios are explored for modeling predictions. The highest prediction accuracy is achieved when the field data to mechanistic model data ratio is 1:1.5, highlighting the varying constraining effects of different fusion ratios.

(4) The hyperparameters of the fusion model were optimized using the TPE algorithm. Compared to the fusion model with default hyperparameters, the optimized model reduced the MAPE by 22.93% and the RMSE by 6.62%. Consequently, the overall model quality improved significantly. The generalization ability of the TPE-XGBoost fusion model was validated, with an average percentage error of 7.45%, indicating that the optimized fusion model offers higher prediction accuracy and stronger generalization ability than both the single-mechanism model and the field-data-driven learning model.

(5) This study not only resulted in a highly accurate TPE-XGBoost model for predicting breakdown pressure but also proposed and verified a systematic, engineering-practice-oriented ‘physical mechanism–data fusion’ machine learning workflow. For oil and gas field development engineers who face similar issues of data scarcity and of balancing data against physical constraints, this workflow can be directly applied to other blocks or to similar engineering parameter prediction problems, helping to integrate data-driven methods more robustly and systematically into engineering decisions.

Author Contributions

H.W.: writing—original draft, investigation, formal analysis, data curation, methodology. M.P.: writing—review and editing, investigation, formal analysis. Z.Y.: supervision, project administration. C.D.: investigation, formal analysis. F.X. and Y.X.: investigation, formal analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This work is a major national science and technology project on new oil and gas exploration and development. It is funded by China Oilfield Services Limited. Project number: 2025ZD1407402.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Acknowledgments

We acknowledge the High-Performance Computing platform of Southwest Petroleum University for providing the computational resources and Materials Studio software.

Conflicts of Interest

Authors Haibiao Wang, Mingyue Pang, Zheng Yuan, Fengxiang Xu and Yicheng Xin were employed by the companies China Oilfield Services Limited and National Key Laboratory of Offshore Oil and Gas Exploitation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Figure 1. This research’s technical roadmap.

Figure 2. Schematic diagram of shot hole borehole and horizontal drilling bore model.

Figure 3. Comparison of site construction breakdown pressure and mechanism model breakdown pressure for different perforation location in wells (A,B).

Figure 4. Fivefold cross-validation root mean square error for 4 classical machine learning algorithms.

Figure 5. Fivefold cross-validation root mean square error for 4 ensemble learning algorithms.

Figure 6. Analysis of the predictive effect of pure data mining models on unknown data.

Figure 7. Comparison of model evaluation indicators before and after dataset incorporation. (a) Root mean square error comparison; (b) comparison of correlation coefficients; (c) comparison of mean absolute percentage error.

Figure 8. Comparative analysis of different sample proportions on model scores.

Figure 9. Breakdown pressure prediction results for the default hyperparameter case of the fusion-driven model.

Figure 10. Learning curve for the number of iterations.

Figure 11. Fivefold cross-validation training set and test set RMSE plots.

Figure 12. Comparative analysis before and after model optimization.

Figure 13. Comparative analysis before and after TPE-GPT model optimization.

Figure 14. Comparative analysis before and after TPE-CNN model optimization.

Figure 15. Breakdown pressure prediction results after model optimization.

Figure 16. Prediction results of the fusion-driven model for unknown data.

Table 1. Basic data table.

Parameter	Value	Parameter	Value
Horizontal drilling azimuth (°)	90	Stratigraphic temperature increment (°C)	3.07
Coefficient of thermal expansion (°C⁻¹)	0.000045	Correction factor (dimensionless)	0.9
Casing inner diameter (mm)	97.18	Casing outer diameter (mm)	114.3
Casing modulus of elasticity (MPa)	206,000	Casing Poisson’s ratio (dimensionless)	0.3
Perforation azimuth (°)	60	Pore pressure (MPa)	29.3

Table 2. Comparative analysis of the prediction results of four classical machine learning algorithms on field breakdown pressure data.

Algorithm	Linear Regression	Lasso	SVR	KNN	Average
Fivefold CV test-set RMSE	14.064	12.782	14.253	13.457	13.639
Fivefold CV training-set RMSE	1.599	3.396	5.440	1.876	3.078
Fivefold CV MAPE	28.36%	21.61%	31.92%	26.15%	27.01%
Fivefold CV R²	0.735	0.757	0.682	0.723	0.724

Table 3. Comparative analysis of the results of the four ensemble learning algorithms for the prediction of field breakdown pressure data.

Algorithm	Random Forest	GBDT	AdaBoost	XGBoost	Average
Fivefold CV running time	1.841	1.432	1.812	1.809	1.724
Fivefold CV test-set RMSE	9.409	9.791	9.575	9.223	9.500
Fivefold CV training-set RMSE	3.450	2.061	6.055	0.208	2.944
Fivefold CV MAPE	15.65%	19.16%	17.46%	13.31%	16.39%
Fivefold CV R²	0.859	0.791	0.822	0.887	0.840

Table 4. Comparative analysis of the results of the four ensemble learning algorithms for the prediction of fusion breakdown pressure data.

Algorithm	Random Forest	GBDT	AdaBoost	XGBoost
RMSE	9.216	8.931	8.459	7.796
MAPE	15.24%	14.69%	14.32%	10.35%
R²	0.863	0.846	0.871	0.915

Table 5. List of some parameters of the algorithm.

Category	Parameter	Description
Iterative process/loss function	num_boost_round	Actual number of boosting iterations
	eta	Learning rate; scales the contribution of each weak evaluator in the weighted sum
	objective	Loss function to be optimized
	base_score	Initial prediction H0
	max_delta_step	Maximum delta step allowed for each leaf output in one iteration
	gamma	Coefficient multiplying the number of leaves in the penalty term; increasing it controls overfitting
	lambda	L2 regularization coefficient; increasing it controls overfitting
	alpha	L1 regularization coefficient; increasing it controls overfitting
Weak evaluator	booster	Type of weak evaluator used in the iterative process: gbtree, dart, or gblinear
	max_depth	Maximum depth allowed for a weak evaluator
	min_child_weight	Minimum sum of sample weights required on a leaf node
	sample_type	Method used to draw the samples
	subsample	Proportion of samples drawn for each tree
	colsample_bytree	Proportion of features sampled when building each tree
	colsample_bylevel	Proportion of features sampled at each tree level
	colsample_bynode	Proportion of features sampled at each node split

Table 6. Preliminary determination of parameter space.

Parameter	Range
Number of iterations	Learning curve exploration, finalized at (50, 150, 5)
Learning rate	Extended in both directions around the default of 0.3, finally set at (0.05, 2.05, 0.05)
Weak evaluator	Two options: [“gbtree”, “dart”]
Feature sampling ratio: before tree building	Bounded in (0, 1); set as (0.3, 1, 0.1)
Feature sampling ratio: before node branching	Bounded in (0, 1); set as (0.1, 1, 0.1)
Coefficient before the number of leaf nodes $γ$	Learning curve exploration, set as (0, 3000, 100)
Regular term coefficient $λ$	Learning curve exploration, set as (0, 1500, 50)
Samples allowed on any node	Learning curve exploration, set as (0, 100, 2)
Maximum depth	Extended in both directions around the default of 6, with a wider right-hand range, set as (2, 30, 2)
Proportion of samples sampled	Bounded in (0, 1); set as (0.1, 1, 0.1)
Assessment of indicators	[“reg:squarederror”, “reg:squaredlogerror”]

Table 7. Comparison of the results of 5 Bayesian optimizations.

Parameter	First	Second	Third	Fourth	Fifth
Number of iterations	75	50	60	120	145
Weak evaluator	0	0	0	0	0
Assessment of indicators	0	0	0	0	0
Feature sampling ratio: before node branching	1	0.9	0.5	0.7	0.5
Feature sampling ratio: before tree building	0.5	0.7	0.8	0.3	0.6
Learning rate	1.8	1.65	1	0.8	1.3
Coefficient before the number of leaf nodes	0	0	100	0	100
Regular term coefficient	150	200	150	200	650
Maximum depth	5	6	6	8	3
Samples allowed on any node	12	6	30	8	46
Proportion of samples sampled	0.8	0.9	1	0.8	0.8
Score	6.053	5.982	6.350	6.143	7.106

Table 8. Adjusted parameter space.

Parameter	Range
Number of iterations	(20, 180, 5)
Learning rate	(0.5, 2, 0.05)
Weak evaluator	“gbtree”
Feature sampling ratio: before tree construction	(0.3, 1, 0.05)
Feature sampling ratio: before node branching	(0.5, 1, 0.05)
Coefficient before the number of leaf nodes	(0, 200, 10)
Regularization coefficient	(100, 700, 25)
Samples allowed on any node	(0, 50, 1)
Maximum depth	(2, 15, 1)
Proportion of samples sampled	(0.5, 1, 0.05)
Objective function	“reg:squarederror”
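
The source does not spell out the rule used to narrow the space. One plausible reading is that each adjusted range was chosen to bracket the values found in the first five runs. The sketch below checks that reading against the numbers in Tables 7 and 8; the dictionary keys are assumed XGBoost parameter names and the helper `covered` is hypothetical.

```python
# Values found in the first five runs (Table 7), one list per parameter.
# Keys are assumed XGBoost parameter names, not labels from the source.
first_round = {
    "n_estimators": [75, 50, 60, 120, 145],
    "learning_rate": [1.8, 1.65, 1, 0.8, 1.3],
    "colsample_bytree": [0.5, 0.7, 0.8, 0.3, 0.6],
    "colsample_bynode": [1, 0.9, 0.5, 0.7, 0.5],
    "gamma": [0, 0, 100, 0, 100],
    "reg_lambda": [150, 200, 150, 200, 650],
    "min_child_weight": [12, 6, 30, 8, 46],
    "max_depth": [5, 6, 6, 8, 3],
    "subsample": [0.8, 0.9, 1, 0.8, 0.8],
}

# Adjusted ranges (Table 8), read as (lower bound, upper bound, step).
adjusted = {
    "n_estimators": (20, 180, 5),
    "learning_rate": (0.5, 2, 0.05),
    "colsample_bytree": (0.3, 1, 0.05),
    "colsample_bynode": (0.5, 1, 0.05),
    "gamma": (0, 200, 10),
    "reg_lambda": (100, 700, 25),
    "min_child_weight": (0, 50, 1),
    "max_depth": (2, 15, 1),
    "subsample": (0.5, 1, 0.05),
}


def covered(values, bounds):
    """Return True if every value lies within the inclusive bounds."""
    low, high, _step = bounds
    return all(low <= v <= high for v in values)


report = {name: covered(first_round[name], adjusted[name]) for name in adjusted}
for name, ok in report.items():
    print(f"{name}: {'inside' if ok else 'outside'} the adjusted range")
```

Every first-round value falls inside its adjusted range, which is consistent with the bracketing interpretation, though it does not prove that this was the authors' procedure.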

Table 9. Comparison of the results of five Bayesian optimizations on the tuned parameter space.

Parameter	First	Second	Third	Fourth	Fifth
Number of iterations	165	110	170	160	150
Feature sampling ratio: before node branching	0.75	0.9	0.7	0.7	0.65
Feature sampling ratio: before tree construction	0.4	0.6	0.8	0.6	0.55
Learning rate	1.8	1.05	0.55	0.65	1.85
Coefficient before the number of leaf nodes	0	10	80	30	0
Regularization coefficient	275	275	150	125	650
Maximum depth	5	3	8	9	9
Samples allowed on any node	4	9	1	9	6
Proportion of samples sampled	0.5	0.75	0.9	0.85	0.9
Score	5.861	5.919	6.097	5.942	5.897
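
To compare the two rounds of optimization, the final rows of Tables 7 and 9 can be summarized side by side. The sketch below assumes the reported value is an error measure, so that lower is better; the source does not name the metric in these tables. The helper `summarize` is hypothetical.

```python
from statistics import mean

# Final row of Table 7 (initial space) and of Table 9 (adjusted space).
initial_scores = [6.053, 5.982, 6.350, 6.143, 7.106]
adjusted_scores = [5.861, 5.919, 6.097, 5.942, 5.897]


def summarize(scores):
    """Best, mean and spread of a set of runs (assumes lower is better)."""
    return {
        "best": min(scores),
        "mean": mean(scores),
        "spread": max(scores) - min(scores),
    }


for label, scores in (("initial", initial_scores), ("adjusted", adjusted_scores)):
    stats = summarize(scores)
    print(
        f"{label:>8}: best={stats['best']:.3f} "
        f"mean={stats['mean']:.3f} spread={stats['spread']:.3f}"
    )
```

On these numbers the adjusted space gives both a lower best value and a smaller run-to-run spread than the initial space.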

Table 10. Comparative analysis of breakdown pressure prediction by different models.

Method	Mean Absolute Percentage Error	RMSE
Mechanistic model	13.96%	7.892
Machine learning model	13.31%	9.223
Fusion model	7.62%	6.357
TPE-GPT	8.42%	7.835
TPE-CNN	9.13%	8.643
TPE-XGBoost	5.87%	5.936
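
The two indicators reported in Table 10 can be computed as follows. This is a minimal sketch that assumes the standard definitions of mean absolute percentage error and root mean square error; the article's own definitions appear in Section 3.2 and may differ in detail. The sample pressures are hypothetical and are not taken from the article.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


def rmse(actual, predicted):
    """Root mean square error, in the units of the inputs."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical measured and predicted breakdown pressures (illustration only).
measured = [50.0, 62.0, 58.0]
predicted = [52.5, 60.0, 59.0]

print(f"MAPE = {mape(measured, predicted):.2f}%")
print(f"RMSE = {rmse(measured, predicted):.3f}")
```

MAPE is scale-free while RMSE penalizes large individual errors more heavily, so the two columns of Table 10 need not rank the methods in the same order.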


