The primary goal of both machine learning and statistical models is to improve prediction accuracy and enhance the generalization ability of the model. The parameters and hyperparameters play a critical role in determining the outcomes of machine learning models. In this context, parameters are typically determined through mathematical computations. Once the method is selected, the calculation process is largely automated and does not require human intervention. However, model tuning involves adjusting the hyperparameters of the model. Modern machine learning and deep learning algorithms have numerous hyperparameters, and it is not possible to find the optimal solution through a purely mathematical approach. Therefore, human intervention is essential for fine-tuning the model.
5.1. Determining the Parameter Space for XGBoost Optimization
This study employed a hyperparameter optimization method based on the tree-structured Parzen estimator (TPE). The TPE algorithm is widely used because it handles high-dimensional search spaces that mix continuous and discrete hyperparameters efficiently. This is particularly important here: the training data combine mechanism-model data with field data, and their inherent complexity and potential noise make the response surface over the hyperparameter space more rugged and each evaluation more expensive. TPE guides sampling sequentially by constructing a probability model and can converge to a well-performing region within a relatively small number of iterations, thereby efficiently determining a robust hyperparameter combination for the XGBoost model trained on the mixed data.
The objective of hyperparameter optimization is to maximize the model’s performance. When employing the TPE technique for XGBoost, it is necessary to first identify the hyperparameters that require optimization and establish the search range for each hyperparameter (Table 5).
XGBoost distinguishes itself from other tree ensemble algorithms due to its wide array of parameters, which significantly influence the model by affecting the tree construction process. These parameters interact in a non-linear manner, and their impact on the final model outcome may not be immediately apparent during the tuning process, as some parameters are adjusted in conjunction with runtime factors.
To begin, the target block with casing completion is selected, and breakdown pressure data are gathered for its horizontal wells. The model is then trained on a dataset built at a 1:1.5 ratio. Additionally, a function must be defined to check for overfitting once model iteration is complete. Learning curves for the training set, the test set, and the overfitting check function are plotted against each hyperparameter to determine the initial parameter ranges. As an example, consider the number of iterations:
The default number of iterations was 100, and the initial trials were extended up to 200. The learning curve for the number of iterations is shown in Figure 9.
Figure 9 shows that once the number of iterations reaches approximately 75, further iterations have minimal impact on the model. Furthermore, beyond 100 trees the reduction in loss becomes insignificant, with the score (RMSE) having dropped below 8. Based on this, the initial range for the number of iterations was set to (50, 150, 10), i.e., from 50 to 150 in steps of 10.
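The reading of the learning curve described above can be sketched in code. The following is a minimal illustration, not the procedure used in this study: the helper `plateau_index`, the tolerance of 0.5%, and the synthetic RMSE curve are all assumptions introduced here, and the curve does not reproduce the data of Figure 9.

```python
import math

def plateau_index(rmse_curve, tol=0.005):
    """Smallest index i such that every step from i onward lowers the RMSE
    by less than `tol` (relative to the previous value)."""
    n = len(rmse_curve)
    # Relative gain of step j over step j - 1 (index 0 is a placeholder).
    gains = [0.0] + [
        (rmse_curve[j - 1] - rmse_curve[j]) / rmse_curve[j - 1]
        for j in range(1, n)
    ]
    i = n
    # Walk backwards while the gains stay below the tolerance.
    while i > 1 and gains[i - 1] < tol:
        i -= 1
    return i

# Synthetic, hypothetical curve: RMSE after 1..200 boosting iterations.
curve = [6.0 + 14.0 * math.exp(-k / 20) for k in range(200)]
knee = plateau_index(curve)
print("gains fall below tolerance from curve index", knee)
```

A search range for the number of iterations can then be centred on the plateau point, with the step chosen coarsely for the first search.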
The learning curve is used to determine the parameter space for several key parameters, including the permissible sample size at each node and the regularization term coefficient. For other bounded parameters (e.g., sample proportions) or parameters with fixed candidate values (e.g., weak evaluators), no such determination is necessary. For parameters with small values (e.g., the learning rate) or those typically adjusted downward (e.g., maximum depth), the parameter space is generally defined by expanding it around the default value in both directions. Typically, a wider and sparser range is explored during the initial search. As the search progresses, the range is gradually narrowed and the dimensionality of the parameter space is reduced. The final parameter space for all parameters is summarized in Table 6 below.
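As an illustration of how a (lower limit, upper limit, step) triple defines a discrete search space, a minimal sketch follows. Only the range for the number of iterations, (50, 150, 10), is taken from the text; the ranges for `learning_rate` and `max_depth` are hypothetical placeholders and are not the values of Table 6.

```python
import math

def grid(low, high, step):
    """Expand a (low, high, step) triple into an inclusive list of values."""
    n = int(round((high - low) / step))
    # Rounding suppresses floating-point drift such as 0.15000000000000002.
    return [round(low + k * step, 10) for k in range(n + 1)]

space = {
    "n_estimators": grid(50, 150, 10),        # range quoted in the text
    "learning_rate": grid(0.05, 0.5, 0.05),   # hypothetical range
    "max_depth": grid(2, 10, 1),              # hypothetical range
}

# Number of distinct hyperparameter combinations in this space.
n_combinations = math.prod(len(values) for values in space.values())
print(n_combinations)
```

The count grows multiplicatively with each added parameter, which is one reason a sequential sampler is preferred over an exhaustive grid search when each evaluation is expensive.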
5.2. Optimization of the XGBoost Algorithm Based on the TPE Approach
Bayesian optimization is a search algorithm used to automatically optimize the hyperparameters of a model. It works by building a surrogate function from a probabilistic model, which is fitted to the objective function and the results of prior evaluations. The primary task of the hyperparameter optimization technique is to maximize the Expected Improvement (EI), as described in Equation (13). The TPE algorithm is a non-standard Bayesian optimization algorithm based on tree-structured Parzen density estimation, proposed by Bergstra et al., which models both $p(x \mid y)$ and $p(y)$ instead of modelling only $p(y \mid x)$, as a Gaussian process does.

$$EI_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy \tag{13}$$

where $y^*$ is the threshold of the objective function; $y$ is the measured value of the objective function; $x$ is the hyperparameter set; and $p(y \mid x)$ is the surrogate probabilistic model representing the probability of $y$ under the hyperparameter set $x$.

According to Bayes’ theorem,

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} \tag{14}$$

In the TPE algorithm, $p(x \mid y)$ is defined as follows:

$$p(x \mid y) = \begin{cases} l(x), & y < y^* \\ g(x), & y \ge y^* \end{cases} \tag{15}$$

where $l(x)$ is the density formed from the observations $x^{(i)}$ whose loss is smaller than $y^*$, and $g(x)$ is the density formed from the observations $x^{(i)}$ whose loss is $y^*$ or larger.

That is, TPE constructs different distributions for the observations $x$ on either side of the threshold $y^*$. Setting a hyperparameter $\gamma$, which is a quantile with respect to $y$, thus produces the following:

$$\gamma = p(y < y^*), \qquad p(x) = \int p(x \mid y)\, p(y)\, dy = \gamma\, l(x) + (1 - \gamma)\, g(x) \tag{16}$$

Splitting the integral according to Equation (15) gives:

$$\int_{-\infty}^{y^*} (y^* - y)\, p(x \mid y)\, p(y)\, dy = l(x) \int_{-\infty}^{y^*} (y^* - y)\, p(y)\, dy = \gamma\, y^*\, l(x) - l(x) \int_{-\infty}^{y^*} y\, p(y)\, dy \tag{17}$$

Substituting Equations (16) and (17) into Equation (13) yields the final expression for the expected improvement:

$$EI_{y^*}(x) = \frac{\gamma\, y^*\, l(x) - l(x) \int_{-\infty}^{y^*} y\, p(y)\, dy}{\gamma\, l(x) + (1 - \gamma)\, g(x)} \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1} \tag{18}$$

From Equation (18), it can be seen that in order to maximize the expected improvement and obtain the optimal hyperparameters, $x$ should be found such that $g(x)/l(x)$ takes its minimum value, i.e., $x$ should have low probability under $g(x)$ and high probability under $l(x)$.
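The selection rule derived above (choose the candidate with the smallest ratio of the "bad" density g(x) to the "good" density l(x)) can be sketched for a single continuous hyperparameter. This is a simplified illustration under stated assumptions, not the implementation used in this study: the fixed-bandwidth Gaussian kernel density estimate, the toy objective, the quantile value of 0.25, and all function names are hypothetical.

```python
import math
import random

def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
    return density

def tpe_suggest(history, rng, gamma=0.25, n_candidates=50, bandwidth=0.1):
    """history: list of (x, loss). Return the candidate maximizing l(x) / g(x)."""
    ordered = sorted(history, key=lambda pair: pair[1])
    # The best gamma-fraction of observations defines l(x); the rest define g(x).
    n_good = min(len(ordered) - 1, max(1, math.ceil(gamma * len(ordered))))
    good = [x for x, _ in ordered[:n_good]]
    bad = [x for x, _ in ordered[n_good:]]
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Draw candidates from l(x): pick a good point and perturb it.
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    # Maximizing l/g is equivalent to minimizing g/l.
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))

def objective(x):
    """Hypothetical loss with its minimum at x = 0.3."""
    return (x - 0.3) ** 2

rng = random.Random(0)
history = [(x, objective(x)) for x in (rng.random() for _ in range(10))]
initial_best = min(loss for _, loss in history)
for _ in range(40):
    x = min(1.0, max(0.0, tpe_suggest(history, rng)))  # keep x inside [0, 1]
    history.append((x, objective(x)))
best_x, best_loss = min(history, key=lambda pair: pair[1])
print(best_x, best_loss)
```

Candidates are drawn from l(x) rather than uniformly, so the search concentrates on regions that have already produced low losses while the ratio against g(x) steers it away from regions known to perform poorly.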
To assess the stability of the XGBoost method, the Bayesian optimization must be repeated several times. Initially, five Bayesian optimizations were conducted, and the outcomes are presented in Table 7 below.
Table 7 reveals that “reg:squarederror” was chosen as the evaluation measure in all five Bayesian optimizations. Consequently, no further searches were conducted for this parameter. Likewise, the weak evaluator parameter was chosen as “gbtree” in all five optimizations, confirming that “gbtree” is the better option for the present data. The remaining parameters are adjusted as follows: if the selected optimal value is at the upper limit, the parameter space is shifted upward; if it is at the lower limit, the parameter space is shifted downward; if it lies between the lower and upper limits, the range around the optimal value is retained or expanded and the step size is reduced to increase the parameter density. For example, the number of iterations hit the lower limit once and approached the upper limit twice, so the original range (50, 150, 5) can be modified to (20, 180, 5). The results for the feature sampling ratio before node branching are biased towards 1.0, so the lower limit can be raised, giving (0.5, 1, 0.05). The results for the feature sampling ratio before tree building are spread uniformly between 0.3 and 1, so the range is kept and only the step size is reduced, giving (0.3, 1, 0.05). The other parameters are treated in the same way, as shown in Table 8.
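One possible reading of this adjustment rule is sketched below. It is an interpretation, not the authors' procedure: the function `adjust_range`, the "within one step of a limit" criterion, the margin of 30, and the lists of optimal values are hypothetical, since the individual results of Table 7 are not reproduced here. Only the ranges (50, 150, 5) and (20, 180, 5) are taken from the text.

```python
def adjust_range(low, high, step, optima, margin):
    """Widen a range toward any limit the optima touch; otherwise refine the step.

    An optimum counts as touching a limit when it lies within one step of it.
    """
    near_low = any(v - low <= step for v in optima)
    near_high = any(high - v <= step for v in optima)
    if not (near_low or near_high):
        # All optima are interior: keep the limits and halve the step.
        return (low, high, step / 2)
    new_low = low - margin if near_low else low
    new_high = high + margin if near_high else high
    return (new_low, new_high, step)

# Hypothetical optima: one at the lower limit, two at or near the upper limit.
iterations = adjust_range(50, 150, 5, [50, 145, 150, 100, 120], margin=30)
# Hypothetical optima spread over the interior of a sampling-ratio range.
ratio = adjust_range(0.3, 1.0, 0.1, [0.45, 0.6, 0.8, 0.55, 0.7], margin=0.1)
print(iterations, ratio)
```

For parameters with a hard bound, such as a sampling ratio that cannot exceed 1, the widened limits would additionally need to be clipped to the valid interval.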
Five Bayesian searches were performed again on the tuned parameter space, and the results are shown in Table 9 below.
Table 9 displays the results of five Bayesian optimizations on the amended parameter space. Among these searches, the best score achieved is 5.861. The validity of this set of parameters was then verified using the validation function: a fivefold cross-validation was conducted after each iteration, and the average values over the training sets and test sets of the cross-validation were recorded. The validation findings are displayed in Figure 10 below.
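The averaging of training and test scores over the folds can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the least-squares line stands in for the actual XGBoost model, and the toy data, the interleaved fold assignment, and the function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def fit_line(xs, ys):
    """Placeholder model: ordinary least-squares line, returned as a predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def kfold_rmse(xs, ys, fit, k=5):
    """Return (mean training RMSE, mean test RMSE) over k interleaved folds."""
    n = len(xs)
    train_scores, test_scores = [], []
    for fold in range(k):
        test_idx = list(range(fold, n, k))
        held = set(test_idx)
        train_idx = [i for i in range(n) if i not in held]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        train_scores.append(rmse([ys[i] for i in train_idx], [model(xs[i]) for i in train_idx]))
        test_scores.append(rmse([ys[i] for i in test_idx], [model(xs[i]) for i in test_idx]))
    return sum(train_scores) / k, sum(test_scores) / k

# Hypothetical data: a linear trend with an alternating +/-0.5 disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
train_rmse, test_rmse = kfold_rmse(xs, ys, fit_line, k=5)
print(train_rmse, test_rmse)
```

Reporting both averages side by side is what allows the training and test curves to be compared for signs of overfitting.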
The comparative analysis in Figure 11 shows that after optimizing XGBoost using the TPE algorithm, the average RMSE of the fivefold cross-validation training set is 2.624 and the average RMSE of the test set is 5.936, as shown in Figure 12. This represents a 6.62% reduction compared with the model before optimization, indicating a significant improvement in model quality. Additionally, the scores of the test set have improved while the scores of the training set have decreased, reducing the occurrence of model overfitting.
The comparative analysis in Figure 13 and Figure 14 shows that after optimizing the CPT and CNN models using the TPE algorithm, the average RMSE values of the fivefold cross-validation training sets are 3.114 and 2.954, respectively, and the average RMSE values of the test sets are 7.835 and 8.643, respectively. Compared with the models before optimization, the RMSE values have decreased by 16.85% and 12.09%, respectively, indicating a significant improvement in model quality. Additionally, the scores of the test set have increased, while the scores of the training set have decreased, thereby reducing the occurrence of model overfitting.
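The cross-validation averages reported above can be compared directly through the difference between test and training RMSE. The sketch below uses only the values stated in the text; treating this gap as a rough overfitting indicator is an assumption made here and is not the overfitting check function used in this study.

```python
# Average fivefold cross-validation RMSE values as reported in the text.
reported = {
    "TPE-XGBoost": {"train": 2.624, "test": 5.936},
    "TPE-CPT": {"train": 3.114, "test": 7.835},
    "TPE-CNN": {"train": 2.954, "test": 8.643},
}

# Train-test gap: a larger value suggests a stronger tendency to overfit.
gaps = {
    name: round(scores["test"] - scores["train"], 3)
    for name, scores in reported.items()
}
smallest_gap = min(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: test RMSE {reported[name]['test']}, gap {gap}")
```

Under this indicator the optimized XGBoost model has both the lowest test RMSE and the smallest gap of the three models.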
The comparative analysis in Table 10 reveals that the fusion-model TPE-XGBoost algorithm outperforms the traditional mechanistic model, the machine learning model based on field data, the XGBoost model without parameter optimization, the TPE-GPT model, and the TPE-CNN model in terms of breakdown pressure prediction and evaluation indexes.
Figure 15 compares the breakdown pressure predicted by the optimised XGBoost model with the actual field values, at a field-to-model data ratio of 1:1.5. The average relative error of the breakdown pressure prediction is 5.87%, indicating that the model can accurately predict the breakdown pressure of horizontal wells with casing completion in the target block.
To test the generalization ability of the breakdown pressure prediction model that combines data mining and mechanistic modelling, the optimised model was used to predict the breakdown pressure at 10 fracturing points in new wells. The prediction results, shown in Figure 16, were compared with the actual breakdown pressure in the field. The maximum absolute percentage error was 11.58%, the minimum was 2.21%, and the average was 7.45%. The addition of mechanism constraints considerably improves the generalization ability of the data mining model, as seen in a 58.34% reduction in MAPE. This enhancement results in a superior prediction effect on unseen breakdown pressure data.
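The error statistics quoted above (maximum, minimum, and mean absolute percentage error) follow from a simple per-point calculation, sketched below. The pressure values are hypothetical placeholders chosen for illustration only; they are not the ten fracturing points of Figure 16.

```python
def absolute_percentage_errors(actual, predicted):
    """Per-point absolute percentage error, in percent."""
    return [100.0 * abs(a - p) / abs(a) for a, p in zip(actual, predicted)]

# Hypothetical breakdown pressures (MPa), for illustration only.
actual = [100.0, 200.0, 80.0, 50.0]
predicted = [110.0, 190.0, 76.0, 51.0]

errors = absolute_percentage_errors(actual, predicted)
max_ape = max(errors)
min_ape = min(errors)
mape = sum(errors) / len(errors)  # mean absolute percentage error
print(max_ape, min_ape, mape)
```

Because each error is normalized by the measured value, the statistic is comparable across fracturing points with different pressure levels.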