Prediction of Oil–Water Two-Phase Flow Patterns Based on Bayesian Optimisation of the XGBoost Algorithm

Abstract: With the continuous advancement of petroleum extraction technologies, the importance of horizontal and inclined wells in reservoir exploitation has been increasing. However, accurately predicting oil–water two-phase flow regimes is challenging due to the complexity of subsurface fluid flow patterns. This paper introduces a novel approach to address this challenge by employing an XGBoost model whose hyperparameters are tuned with a Bayesian optimisation algorithm (BO-XGBoost).


Introduction
The study of multiphase flow patterns has consistently been a focal point for researchers in fluid flow [1][2][3][4][5], with oil–water two-phase flow research forming the foundation of this field. As oilfield development advances into its later stages, the influence of water on flow dynamics becomes increasingly significant, necessitating accurate prediction of oil–water two-phase flow behaviours during extraction. However, compared to vertical wells, the flow states of oil–water two-phase flows in horizontal and inclined wells are more challenging to predict. This difficulty is exacerbated by the constantly changing development environment: the drilling angle varies with the actual situation, and rapidly changing downhole conditions cause significant variations in fluid flow velocity, which in turn strongly affect the flow patterns.
Despite ongoing research, consensus on flow pattern definitions remains elusive because influencing factors and research emphases vary. Current practice relies primarily on subjective observation and flow pattern maps, which are affected by the observer's judgment, yielding qualitative rather than quantitative identification. Accurately predicting oil–water two-phase flow patterns is therefore crucial for process design, operational safety, and economic efficiency; it can also promote technological innovation and development, enhance production efficiency, reduce risks, and optimise resource utilisation.
In recent years, many scholars have adopted computer numerical simulation methods to study fluid flow patterns. Through numerical simulations, researchers can model the effects of different factors on flow patterns. However, the impact of logging instruments on the fluids in actual wells is often overlooked, leading to discrepancies between simulation results and actual downhole flow patterns. Physical experiments, meanwhile, must remain consistent with real wells, which is inconvenient: factors such as temperature and pressure are difficult to replicate fully, and such experiments provide limited data points, are labour-intensive, and are prone to errors.
The advent and continuous development of deep learning and machine learning have made data processing and analysis more efficient and accurate. These technologies have improved productivity and enhanced traditional methods, and machine learning algorithms are now widely used for data prediction in many fields. In the financial sector, for example, machine learning has been applied to predict goodwill impairment [6], helping investors identify goodwill impairment risks and mitigate its market impact. Researchers such as Zhang Yanan [7] and Zhang Xiangrong [8] have used optimisation algorithms and multi-kernel learning methods to improve the accuracy of financial risk predictions. Recent advances in the energy sector include the application of machine learning to optimise biodiesel production, as demonstrated by Sukpancharoen et al. (2023) [9], who explored the potential of transesterification catalysts through machine-learning approaches. In addition, Şahin (2023) conducted a comparative study of machine learning algorithms to predict the performance and emissions of diesel/biodiesel/isoamyl alcohol blends [10]. These studies highlight the growing importance of machine learning in improving the efficiency and sustainability of biofuel production and use.
Extreme gradient boosting (XGBoost) is an emerging machine learning algorithm known for exceptional modelling capability and fast computation, surpassing many other algorithms. XGBoost has already been widely applied in petroleum geology: for instance, Tang Qinxin et al. [11] employed the algorithm to build a model for predicting the productivity of fractured horizontal wells, while Zhao Ranlei et al. [12] used XGBoost for lithology identification in volcanic rocks. However, its application to predicting downhole fluid flow patterns remains underexplored.
This study aims to leverage the XGBoost algorithm to predict downhole fluid flow patterns and evaluate its performance. Given that the effectiveness of the algorithm is influenced by its hyperparameters [13], we utilised the Bayesian optimisation (BO) algorithm to optimise the hyperparameters of XGBoost. Bayesian optimisation, known for its global parameter search capability and high efficiency, has been successfully applied across various domains.
For example, in [14], Bayesian optimisation was used for precise detection and localisation of targets in remote sensing images, significantly enhancing the accuracy of detection boundaries. In [15], it was applied to optimise the hyperparameters of XGBoost, yielding the optimal parameter combination for a grain temperature prediction model; the findings indicated low prediction error and high accuracy, providing a valuable decision-making tool for temperature control management in granaries. Additionally, [16] proposed a coal spontaneous combustion grading warning model based on Bayesian-optimised XGBoost (BO-XGBoost), demonstrating the superior stability and classification accuracy of the BO-XGBoost model.
In this study, a multiphase flow simulation apparatus was used to conduct oil–water two-phase flow experiments, collecting 64 sets of flow pattern data. The Bayesian optimisation algorithm was then employed to optimise the hyperparameters of XGBoost, bringing the prediction results closer to actual conditions. This approach provides an effective method for predicting downhole fluid flow patterns, offering a scientific basis for practical engineering applications and fostering the integration of traditional industrial technology with cutting-edge innovations.
The novelty of this work lies in the integration of Bayesian optimisation with the XGBoost algorithm to enhance the prediction accuracy of oil–water two-phase flow patterns. Unlike traditional methods, this approach optimises hyperparameters more efficiently, improving model performance. By systematically combining experimental data with advanced machine learning techniques, this study introduces a robust methodology for accurately predicting complex subsurface fluid dynamics.

XGBoost Algorithm
XGBoost is an iterative supervised learning algorithm based on the classification and regression tree (CART) model [17]. The algorithm enhances calculation accuracy by performing a second-order Taylor expansion on the loss function (the loss function measures how well the model's predictions match the actual data, with lower values indicating better performance). It also adds a regularisation term to the objective function to reduce model complexity, effectively avoiding overfitting and improving generalisation ability. XGBoost works by iteratively constructing multiple decision trees and summing their predictions to obtain the final output, as represented in Equation (1):

\hat{y}_i = \sum_{k=1}^{n} f_k(x_i)    (1)

In Equation (1), n represents the total number of decision trees; f_k(x) denotes the predicted outcome generated by the k-th tree model; and x_i refers to the i-th sample.
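As an illustrative sketch (not the authors' code), the additive prediction of Equation (1) simply sums the outputs of all trained trees; the toy "trees" below are hypothetical stand-in functions:

```python
def ensemble_predict(trees, x):
    # Equation (1): the final prediction is the sum of every tree's output for sample x.
    return sum(tree(x) for tree in trees)

# Hypothetical stand-ins for trained trees, each mapping a sample to a score.
trees = [lambda x: 0.5, lambda x: -0.1, lambda x: 0.2]
score = ensemble_predict(trees, x=[1.0, 2.0])  # 0.5 - 0.1 + 0.2
```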
Prediction accuracy is determined by both bias and variance. Thus, the objective function consists of a loss function and a regularisation term: the loss function measures the difference between the predicted and actual values, while the regularisation term suppresses model complexity, as expressed in Equation (2):

Obj = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{n} \Omega(f_k)    (2)

In the equation, N represents the number of samples; y_i and \hat{y}_i are the actual and predicted values of the i-th sample, respectively; l(y_i, \hat{y}_i) is the loss function reflecting the deviation between the predicted value and the actual value; and \Omega(f_k) is a regularisation term representing the sum of the complexities of the n tree models. The specific calculation formula is shown in Equation (3).
In the specific calculation equation, γ represents the penalty coefficient for leaf nodes (the penalty coefficient is a factor that imposes a cost on model complexity, discouraging overly complex models and helping to prevent overfitting); T denotes the number of leaf nodes in the tree; λ is the L2 regularisation coefficient; and ω signifies the weights of the leaf nodes:

\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert \omega \rVert^2    (3)

By substituting Equation (3) into Equation (2) and applying the forward stagewise additive principle along with a second-order Taylor expansion of the loss function, the approximate objective function is derived, as expressed in Equation (4):

Obj^{(t)} \approx \sum_{i=1}^{N} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)    (4)

In the equation, g_i and h_i are the first and second derivatives of the loss function, respectively.
A decision tree model is defined by its branching structure and the weights of its leaf nodes, as given in Equation (5):

f_t(x) = \omega_{q(x)}, \quad q : \mathbb{R}^m \to \{1, \dots, T\}, \quad \omega \in \mathbb{R}^T    (5)

In the equation, q(x) represents the leaf node index of sample x, and \omega \in \mathbb{R}^T is the vector of leaf node weights with T real-valued dimensions. The complexity of the decision tree is determined jointly by the number of leaf nodes and the L2 norm of the vector composed of all weights (the L2 norm of a vector is a measure of its magnitude, calculated as the square root of the sum of the squares of its components). The new expression is given in Equation (6):

\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2    (6)
To simplify the calculation, the set of all samples x_i in leaf node j is defined as I_j = \{ i \mid q(x_i) = j \}. The objective function is then reformulated as Equation (7):

Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \tfrac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T    (7)

In the equation, G_j and H_j are the sums of the first and second derivatives, respectively, of the samples contained in leaf node j. The optimal weight of leaf node j is given in Equation (8):

\omega_j^{*} = -\frac{G_j}{H_j + \lambda}    (8)
During the construction of the decision tree, a greedy algorithm is employed to find the optimal split points for the leaf nodes. This approach enumerates all features of each leaf node; for each feature, the feature values of the training samples in that node are sorted to determine the best split point and calculate the split gain. The feature with the highest split gain is selected, its best split point is used as the split location, and new leaf nodes are created at that position. A gain function is thus defined to compute the split gain, as expressed in Equation (9):

Gain = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma    (9)

In the equation, the three fractions represent the score of the left subtree, the score of the right subtree, and the score without splitting. G_L and H_L denote the sums of g_i and h_i for the left subtree after the split, while G_R and H_R denote the corresponding sums for the right subtree. The objective function can therefore be optimised by reducing it to finding the minimum of a quadratic function.
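A minimal sketch of Equations (8) and (9), assuming scalar gradient sums (the variable names `lam` for λ and `gamma` for γ are my own):

```python
def leaf_weight(G, H, lam):
    # Equation (8): optimal leaf weight  w_j* = -G_j / (H_j + lambda)
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    # Equation (9): structure-score gain of a candidate split, minus the
    # complexity penalty gamma incurred by adding one more leaf.
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# The greedy splitter keeps a candidate split only if its gain is positive.
gain = split_gain(GL=2.0, HL=3.0, GR=-1.0, HR=2.0, lam=1.0, gamma=0.1)
```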

Bayesian Optimisation Algorithm
XGBoost has several hyperparameters, and tuning them is complex because their selection significantly impacts model performance; careful adjustment is therefore crucial. In previous studies, grid search (GS) has been applied to hyperparameter tuning for models with few hyperparameters, but this approach is not feasible for our XGBoost model, which contains many hyperparameters [18].
Bayesian optimisation is an effective method for the global optimisation of objective functions whose evaluation is expensive [19]. The Bayesian optimisation (BO) algorithm is primarily used for hyperparameter tuning in machine learning and deep learning models, and it is also widely applied in advanced fields such as meta-learning and neural architecture search (NAS). As a highly efficient global optimisation algorithm, its goal is to find the global optimum of the objective function. In hyperparameter tuning, Bayesian optimisation utilises Bayes' theorem, as shown in Equation (10):

p(f \mid D_{1:t}) = \frac{p(D_{1:t} \mid f)\, p(f)}{p(D_{1:t})}    (10)

In the equation, f denotes the defined objective function; D_{1:t} = \{(x_1, y_1), (x_2, y_2), \dots, (x_t, y_t)\} represents the observed dataset, where x_t is the decision vector; y_t = f(x_t) + \varepsilon_t denotes the observed value, with \varepsilon_t representing the observation error; p(D_{1:t} \mid f) signifies the likelihood distribution of f; p(f) denotes the prior probability distribution, an assumption about the black-box objective function; p(D_{1:t}) represents the marginal likelihood distribution, acting here as a normalising coefficient; and p(f \mid D_{1:t}) denotes the posterior probability distribution of f [20]. Bayesian optimisation leverages previous optimisation results to select the best observation points to approximate the minimum value of the objective function.
The Bayesian optimisation algorithm consists of two essential components: the probabilistic surrogate model and the acquisition function [20]. The probabilistic surrogate model continuously updates the prior based on a finite set of observation points and uses Bayes' theorem to estimate the posterior probability distribution p(f | D_{1:t}), which incorporates more data information and approximates the distribution of the target black-box function (observation points are specific sets of hyperparameters chosen to initiate the optimisation process, providing a starting dataset for the algorithm to learn from).
The Bayesian optimisation algorithm mainly consists of the following four steps [20]:
1. Initialise the model by randomly selecting several sets of x_t as observation points.
2. Use a probabilistic surrogate model to estimate the objective function.
3. Use the acquisition function to determine the next observation point x_t^* and substitute it into y_t = f(x_t) + \varepsilon_t to obtain the observation value y_t^*.
4. Add the obtained (x_t^*, y_t^*) to the historical dataset D_{1:t} and update the probabilistic surrogate model.
The Bayesian optimisation algorithm iterates through steps 2 to 4 to obtain the optimal value of the objective function.
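The four steps above can be sketched end to end. The following is a self-contained toy implementation (my own sketch, not the paper's code) that minimises a 1-D function with a Gaussian-process surrogate and an expected-improvement acquisition function evaluated over a fixed candidate grid:

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.3):
    # Squared-exponential kernel between two 1-D point sets.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Step 2: Gaussian-process surrogate -> posterior mean/variance on grid Xs.
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ K_inv @ Ks)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # Step 3: acquisition function (expected improvement, for minimisation).
    sigma = np.sqrt(var)
    z = (best - mu) / sigma
    Phi = 0.5 * (1.0 + np.array([erf(v / sqrt(2)) for v in z]))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (best - mu) * Phi + sigma * phi

def bayes_opt(f, n_init=4, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n_init)            # step 1: random observation points
    y = np.array([f(x) for x in X])
    grid = np.linspace(0.0, 1.0, 200)
    for _ in range(n_iter):                      # steps 2-4, repeated
        mu, var = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(expected_improvement(mu, var, y.min()))]
        X, y = np.append(X, x_next), np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()

# Toy objective with a known minimum at x = 0.7.
x_best, y_best = bayes_opt(lambda x: (x - 0.7) ** 2)
```

In a real BO-XGBoost run, the toy objective would be replaced by the cross-validated accuracy of an XGBoost model, and the 1-D grid by the multidimensional hyperparameter space.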

Data Preprocessing
Due to the diverse well conditions encountered in actual production, experimental data were collected under various conditions: well inclination angles of 0°, 60°, 85°, and 90°; water cut rates of 20%, 40%, 60%, 80%, and 90%; and flow rates of 100 m³/d, 300 m³/d, and 600 m³/d. The significant differences among these data points could give certain data an undue influence on the final results if used directly, so data standardisation is necessary. A range normalisation method was used to map all data values to the [0, 1] interval, using Equation (11):

x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}    (11)

where x_min is the minimum value in the dataset, and x_max is the maximum value in the dataset.
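Equation (11) is straightforward to apply; a minimal sketch, using the experimental flow rates as example values:

```python
def min_max_scale(values):
    # Equation (11): range normalisation mapping every value into [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Example: the experimental flow rates, in m^3/d.
scaled = min_max_scale([100, 300, 600])  # [0.0, 0.4, 1.0]
```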

Bayesian Optimisation XGBoost
After standardisation, the dataset was divided into training and testing sets. The training set was utilised for model training and parameter optimisation, while the testing set was employed to evaluate the model's performance on unseen data. Bayesian optimisation was adopted to fine-tune the hyperparameters of the XGBoost model. The algorithmic principles of the Bayesian Optimisation XGBoost model (BO-XGBoost) are depicted in Figure 1, and the specific steps are as follows:
1. Define the objective function: The objective function was established as the mean accuracy of 10-fold cross-validation, with the maximum number of iterations for the Bayesian optimisation algorithm set to 200.
2. Initial observation point selection: Within the predefined search ranges of the XGBoost model's hyperparameters (such as n_estimators, learning_rate, gamma, max_depth, and subsample), several sets of hyperparameters were randomly selected as initial observation points. These points were used to train the model and obtain the initial distribution of the objective function and the initial observation set D.
3. Gaussian process estimation: Based on the observation set D, a Gaussian process was employed as the probabilistic surrogate model to estimate the objective function (a Gaussian process is a statistical method used to predict unknown values by assuming that the function values follow a Gaussian distribution, allowing for a probabilistic approach to modelling and estimation).
4. Acquisition function calculation: The acquisition function was utilised to calculate the next observation point x_t^* and to compute its corresponding observation value y_t^*, which represents the model's predicted accuracy.
5. Update the observation set: The new observation point (x_t^*, y_t^*) was added to the historical observation set D, and the Gaussian process surrogate model was updated.
6. Iteration judgment: It was determined whether the maximum number of iterations had been reached. If not, steps 3 onwards were repeated; if so, the optimal hyperparameter combination and the corresponding optimal value of the objective function were output, and the model's performance was evaluated using the testing set.
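Step 1's objective, the mean accuracy of k-fold cross-validation, can be sketched as follows (a simplified stand-in: a trivial majority-class "learner" takes the place of XGBoost, and all names are illustrative):

```python
import random

def kfold_cv_accuracy(train_fn, X, y, k=10, seed=42):
    # Objective for the Bayesian optimiser: mean accuracy over k folds.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_fn([X[i] for i in train], [y[i] for i in train])
        hits = sum(model(X[i]) == y[i] for i in fold)
        accs.append(hits / len(fold))
    return sum(accs) / k

# Trivial stand-in "learner": always predicts the most common training label.
def majority_learner(X_train, y_train):
    label = max(set(y_train), key=y_train.count)
    return lambda x: label

acc = kfold_cv_accuracy(majority_learner, X=list(range(20)), y=[0] * 6 + [1] * 14, k=5)
```

In the actual pipeline, `train_fn` would fit an XGBoost classifier with one candidate hyperparameter set, and the returned mean accuracy would be the value the Gaussian-process surrogate models.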
Through a maximum of 200 iterations, the optimal hyperparameters for the BO-XGBoost model were obtained. This BO-XGBoost model was then applied to the prediction of fluid flow patterns in wells. During the prediction process, the same feature set as the training data was used, and the standardised data were input into the BO-XGBoost model for prediction. The prediction results were used to guide the determination and control of fluid flow patterns in wells during actual production.

XGBoost
XGBoost is an efficient gradient-boosting framework that optimises a model by iteratively training decision trees. In the initial phase of the training process, XGBoost sets the prediction values of all training instances to a constant, providing a baseline prediction for the model. The algorithm then enters the iterative phase, where each iteration aims to correct the deficiencies of the current model by constructing new decision trees.
In each iteration, XGBoost calculates the gradient of the loss function with respect to the current model's predictions. This gradient information gives the model the adjustment direction, guiding it to modify the predictions to reduce the loss. The construction of each new decision tree leverages the gradient information, selecting a subset of features for splitting to minimise the loss function. Each node's split is based on the gradient information to achieve optimal improvement in model fitting.
To control model complexity and prevent overfitting, the contribution of each tree is scaled by the learning rate. Additionally, XGBoost incorporates a regularisation term in the objective function, which penalises model complexity and further prevents overfitting.
XGBoost offers various parameters to control the structure of the trees, such as maximum tree depth, minimum child weight, and the gamma parameter. These parameters enable effective pruning, ensuring the model is manageable and fits the training data well.
XGBoost makes predictions on the training set through an ensemble of trees, with each tree's prediction weighted by the learning rate and summed to produce the final prediction. The same method is used for the test set, where the final prediction is the weighted summation of the ensemble's predictions. Figure 2 shows the XGBoost training flow chart.
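The iterative correction described above can be illustrated with a deliberately tiny example: squared loss and constant base learners in place of trees (my own sketch, not the library's internals). Each round fits the mean of the negative gradient (the residuals) and adds it scaled by the learning rate:

```python
def gradient_boost_constants(y, n_rounds=50, lr=0.3):
    # Toy additive boosting: start from a constant baseline, then repeatedly
    # move predictions along the negative gradient of the squared loss.
    pred = [0.0] * len(y)                              # baseline prediction
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]   # negative gradient
        step = sum(residuals) / len(residuals)         # best constant learner
        pred = [p + lr * step for p in pred]           # scaled by learning rate
    return pred

preds = gradient_boost_constants([1.0, 2.0, 3.0])      # converges towards the mean, 2.0
```

Real XGBoost replaces the constant learner with a regularised regression tree, so different samples receive different corrections, but the additive, learning-rate-scaled update is the same.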

Experiment

Design of the Experiment
A multiphase flow (oil–water) simulator in the multiphase flow laboratory was used to conduct the experimental work. The experiments were conducted at ambient temperature (20 °C) and atmospheric pressure. Industrial white oil and tap water were utilised instead of actual downhole oil and water; Table 1 details the density, viscosity, and surface tension of the oil and water used. In these experiments, a well inclination of 90° corresponds to a horizontal well. During the experiments, raw data and photographs were recorded. Figure 3 shows the oil–water two-phase flow patterns and a schematic of the high-speed camera recordings, including smooth stratified flow, interface mixed stratified flow, water-in-oil emulsion, dispersed water-in-oil and oil-in-water, and dispersed oil-in-water and water, each representing a different flow state.

The schematic diagram of the experimental setup is shown in Figure 4. In this experiment, a total of 64 sets of valid and accurate data were obtained using the simulation apparatus. These data were categorised into five distinct flow patterns: bubble flow, emulsion flow, frothy flow, wavy flow, and stratified flow. For ease of subsequent processing, these patterns were assigned codes from 0 to 4, as detailed in Table 2, which also shows the actual image corresponding to each flow type.

The experimental data, along with actual field data, were used: the experimental data served as the dataset for the XGBoost and BO-XGBoost algorithms to learn from, and the actual field data were then input into the trained algorithms for prediction. The predicted results were compared with the actual fluid flow patterns to test the feasibility of the algorithms. The experimental data included variables such as well inclination angle, fluid flow rate, water cut, temperature, and pipe diameter. By analysing accuracy under different flow rates, inclinations, and water cut rates, the effectiveness of the algorithms was assessed.

The hyperparameters of the XGBoost model, optimised by the Bayesian optimisation algorithm, are shown in Table 3.
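For context, a search space over the five hyperparameters named earlier might be declared as below; the bounds are illustrative only (the values actually used are those reported in Table 3):

```python
import random

# Hypothetical search ranges for the tuned XGBoost hyperparameters.
SEARCH_SPACE = {
    "n_estimators": (50, 500),     # number of boosted trees (integer)
    "learning_rate": (0.01, 0.3),  # shrinkage applied to each tree
    "gamma": (0.0, 5.0),           # minimum loss reduction required to split
    "max_depth": (2, 10),          # maximum tree depth (integer)
    "subsample": (0.5, 1.0),       # fraction of rows sampled per tree
}

def sample_point(space, rng):
    # Draw one random configuration, e.g. to seed the initial observation set.
    point = {}
    for name, (lo, hi) in space.items():
        v = rng.uniform(lo, hi)
        point[name] = round(v) if isinstance(lo, int) else v
    return point

point = sample_point(SEARCH_SPACE, random.Random(0))
```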

Prediction Results Analysis
After training, 16 sets of known data with varying well inclinations, water cuts, and flow rates were randomly selected for testing. Following data preprocessing, the trained models predicted flow patterns, and these predictions were compared with actual data to evaluate accuracy. The performance of the models was illustrated using confusion matrices and scatter plots.
Figure 5a shows the unstandardised confusion matrix of the XGBoost algorithm's predictions on the training set, while Figure 5b presents the standardised confusion matrix on the test set.In the confusion matrix, rows represent observed flow pattern categories, and columns represent predicted categories.The numbers on the axes correspond to the five flow patterns listed in Table 2. Correct predictions are indicated by blue squares on the diagonal.Off-diagonal squares represent incorrect predictions.models predicted flow patterns, and these predictions were compared with actual data to evaluate accuracy.The performance of the models was illustrated using confusion matrices and scatter plots.
Figure 5a shows the unstandardised confusion matrix of the XGBoost algorithm's predictions on the training set, while Figure 5b presents the corresponding standardised matrix. In the confusion matrix, rows represent observed flow pattern categories, and columns represent predicted categories. The numbers on the axes correspond to the five flow patterns listed in Table 2. Correct predictions are indicated by blue squares on the diagonal. Off-diagonal squares represent incorrect predictions.
In Figure 5a, the numbers on the diagonal indicate the number of correctly predicted samples, while the off-diagonal squares give the number of incorrectly predicted samples. In Figure 5b, the numbers on the diagonal represent the probability of correctly predicting the corresponding flow pattern, and the off-diagonal numbers represent the probability of predicting an incorrect flow pattern.
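The construction described above can be reproduced in a few lines. The sketch below is generic (not the code used in this study); class labels are assumed to follow the integer encoding of Table 2.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Counts matrix: rows are observed classes, columns are predicted classes."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

def normalise_rows(m):
    """Row-normalised ('standardised') matrix: each cell becomes the
    probability of predicting that column given the row's true class."""
    out = []
    for row in m:
        s = sum(row)
        out.append([c / s if s else 0.0 for c in row])
    return out

# Tiny example with two classes: one class-0 sample is mispredicted as class 1.
m = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], 2)
# m == [[1, 1], [0, 2]]; normalise_rows(m) == [[0.5, 0.5], [0.0, 1.0]]
```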
Figures 5-7 depict the confusion matrices and scatter plots for the XGBoost model's predictions on the training and test sets.
From Figures 5 and 7, it can be observed that the XGBoost model made five erroneous predictions in the training set results. Figures 6 and 7 illustrate the test set results, where two bubbly flows were predicted as dispersed flows, one frothy flow as a bubbly flow, and one dispersed flow as a frothy flow. The overall accuracy reached 75%. The XGBoost algorithm demonstrated some level of accuracy in flow pattern prediction, but there is significant room for improvement. Figure 11 displays the ROC curve for the XGBoost model, with AUC values as follows: Class 0 (0.964), Class 1 (0.857), Class 2 (0.873), Class 3 (1.000), and Class 4 (1.000). Figure 12 shows the ROC curve for the BO-XGBoost model, with AUC values as follows: Class 0 (0.982), Class 1 (0.929), Class 2 (0.921), Class 3 (1.000), and Class 4 (1.000). A higher AUC value indicates better classification accuracy.
The BO-XGBoost model exhibited higher AUC values for Class 0, Class 1, and Class 2 compared to the traditional XGBoost model. Both models achieved perfect AUC values of 1.000 for Class 3 and Class 4, likely due to small sample sizes.
In summary, the comparative analysis of the ROC curves and their respective AUC values highlights the superior predictive capability of the BO-XGBoost model following Bayesian optimisation. The BO-XGBoost model consistently achieved higher AUC values across most classes, indicating a more robust and accurate classification of oil-water two-phase flow patterns.
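The per-class AUC values quoted above follow the standard one-vs-rest construction, where class k is treated as positive and all other classes as negative. A minimal sketch of binary AUC as the rank statistic (the probability that a positive sample is scored above a negative one, counting ties as half) is shown below; it is a generic illustration, not the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """Binary AUC via the Mann-Whitney rank statistic:
    the fraction of positive/negative score pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: three of four positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```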
Table 5 shows that XGBoost accuracy decreased notably for inclinations of 0°, 60°, and 85°. At 85°, with a flow rate of 300 m³/d and a water cut of 80%, both XGBoost and BO-XGBoost failed to predict accurately. However, at 90°, both models demonstrated accurate predictions. XGBoost achieved 75% overall accuracy, while BO-XGBoost achieved 93.75%, demonstrating the feasibility and precision of both algorithms. Figure 13 shows the prediction accuracy of each flow type for the two models. BO-XGBoost's superior accuracy is due to its advanced hyperparameter optimisation strategy. Unlike traditional XGBoost models, BO-XGBoost uses Bayesian optimisation to find optimal hyperparameters, adapting better to data characteristics and improving accuracy. Despite slightly lower accuracy, XGBoost is advantageous in training speed and ease of implementation, recognised for its stable and efficient performance.
Considering all factors, BO-XGBoost has demonstrated higher prediction accuracy in this study, providing a robust choice for applications requiring high-precision predictions. However, we acknowledge that the choice of model and tuning strategy should be based on specific application needs and resource constraints.
Future research will focus on exploring and studying the performance of BO-XGBoost and XGBoost across various datasets and problem environments, with the goal of providing deeper insights into the selection of machine learning models.

Model Interpretability and Feature Analysis
While the BO-XGBoost model achieves high prediction accuracy through training, it remains largely a black-box model in terms of interpretability. To address this issue, we employed Shapley additive explanations (SHAP) [21] to interpret the experimental results of the model, analysing the contribution of each feature to the prediction outcomes. Figure 14 illustrates the key features influencing the oil-water two-phase flow patterns.

From Figure 14, it is evident that the most significant feature affecting the flow pattern was the well inclination angle, followed by the daily production flow rate, with the water cut having the least impact.
In addition to the feature importance plot, we generated detailed feature explanation plots to obtain richer information. Figure 15 presents the global interpretation of features such as well inclination angle (Angle), flow rate (Flow), and water cut (Con). This comprehensive visualisation explains the contribution of these features to the prediction target, integrating feature values and multi-feature presentations. From Figure 15, it can be observed that the well inclination angle had the most substantial impact on the model output, with SHAP values ranging from −1 to 1.5. This indicates that the well inclination angle played a decisive role in predicting the oil-water two-phase flow patterns. The SHAP values for flow rate and water cut varied within narrower ranges but still significantly influenced the model output. The SHAP values for flow rate ranged from 0.0 to 1.0, suggesting a positive impact on the prediction outcome. In contrast, the SHAP values for water cut ranged from −0.5 to 0.5, indicating that in some cases, water cut may have a negative impact on the prediction results.
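For intuition, the attribution that SHAP assigns can be written down exactly for a model with few features. The sketch below enumerates every feature coalition, replacing features outside the coalition with a baseline value; this brute-force form is feasible only for small feature counts, which is why the SHAP library relies on model-specific approximations such as TreeExplainer. It is an illustrative sketch, not the analysis code used in this study.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley attribution by enumerating every feature coalition."""
    n = len(x)

    def value(subset):
        # Features outside the coalition are fixed at their baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi.append(total)
    return phi

# Sanity check on a hypothetical linear model f(z) = 2*z0 + 3*z1 with a zero
# baseline: each feature's Shapley value equals its weight times its value.
phi = shapley_values(lambda z: 2 * z[0] + 3 * z[1], [1.0, 1.0], [0.0, 0.0])
```

By construction, the attributions sum to the difference between the model output at `x` and at the baseline, which is the property that makes SHAP summary plots like Figure 15 interpretable.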

Limitations of the Study
While our research demonstrates significant improvements in the predictive accuracy of oil-water two-phase flow regimes using the Bayesian-optimised XGBoost algorithm, it is important to acknowledge certain limitations to provide a comprehensive understanding of our study.
Firstly, the experimental data used in this study were obtained under controlled laboratory conditions, which may only partially replicate the complexities and variabilities of real-world reservoir environments. Factors such as temperature variations, reservoir heterogeneities, and the presence of impurities in the fluids were not accounted for in our simulations, potentially affecting the generalisability of our findings.
Secondly, the dataset was limited by the range of water cut, well inclination angles, and flow rates explored, as well as the sample size. Although we endeavoured to cover a broad spectrum of conditions, certain flow patterns occurring under extreme or less common operational scenarios might not have been adequately represented. These limitations indicate a need for further studies encompassing a wider range of parameters to enhance the robustness of the predictive model.

Lastly, our study primarily focused on the application of the BO-XGBoost algorithm. Comparative studies involving other advanced machine learning algorithms and optimisation techniques could provide further insights into the relative advantages and potential limitations of our approach. Additionally, incorporating real-time data from field operations and conducting validation studies in actual reservoir conditions would be critical steps toward translating our laboratory-based findings into practical, field-applicable solutions.

Conclusions
This study presents the BO-XGBoost model for predicting oil-water two-phase flow patterns. The BO-XGBoost model achieved significant improvements over the traditional XGBoost model, with an accuracy of 93.8% compared to 75%. Precision, recall, and F1-score metrics also demonstrated substantial enhancements, highlighting the effectiveness of Bayesian optimisation in refining model hyperparameters. Our experimental setup, utilising data from a multiphase flow simulation apparatus, confirmed the superior performance of BO-XGBoost. Key features, such as well inclination angle, water cut, and flow rate, were effectively captured, leading to accurate predictions. SHAP analysis further emphasised the importance of these features in the model's predictions. Compared to the existing literature, our approach offers greater accuracy and robustness in predicting flow patterns. Future research will focus on multi-step predictions and exploring additional machine learning techniques, such as ensemble learning and reinforcement learning, to further improve model performance. In conclusion, the BO-XGBoost model provides a robust methodology for investigating complex subsurface fluid dynamics, with significant implications for petroleum engineering. Continuous refinement and optimisation of the model hold promise for further advancements in this field.
Table 1 lists the density, viscosity, and surface tension of the oil and water used. In the experiments, a well inclination of 90° corresponded to the horizontal position. During the experiments, raw data and photographs were recorded. Figure 3 shows the oil-water two-phase flow patterns and a schematic of high-speed camera recordings, including smooth stratified flow, interface mixed stratified flow, water-in-oil emulsion, dispersed water-in-oil and oil-in-water, and dispersed oil-in-water and water, each representing a different flow state.

Figure 3 .
Figure 3. Schematic of oil-water flow patterns (left) and the photographed diagram (right).

Figure 4 .
Figure 4. Schematic diagram of the experimental setup.

Table 2 .
Flow patterns and encoding.


Figure 5 .
Figure 5. Confusion matrix of prediction results of the XGBoost algorithm training set. (a) Non-normalized data; (b) Normalized data.

Figure 6 .
Figure 6. Confusion matrix of prediction results of the XGBoost algorithm test set. (a) Non-normalized data; (b) Normalized data.



Figure 7 .
Figure 7. Scatter plot of the XGBoost training set and test set flow prediction results.

Figures 8 - 10
Figures 8-10 show the confusion matrices and scatter plots for the BO-XGBoost model's predictions on the training and test sets.

Figure 8 .
Figure 8. Confusion matrix of the prediction results of the BO-XGBoost algorithm training set. (a) Non-normalized data; (b) Normalized data.


Figure 9 .
Figure 9. Confusion matrix of the prediction results of the BO-XGBoost algorithm test set. (a) Non-normalized data; (b) Normalized data.


Figures 8 and 9
Figures 8 and 9 show the BO-XGBoost model's confusion matrices on the training and test sets, respectively. Figure 10 shows the scatter plots. From Figures 8 and 10, it is evident that the BO-XGBoost model achieved 100% accuracy on the training set, demonstrating significantly better performance than the XGBoost model. Figures 9 and 10 show only one misprediction in the test set, where one frothy flow was mispredicted as a bubbly flow, giving the BO-XGBoost model an accuracy of 93.75%. The results highlight the BO-XGBoost algorithm's marked improvement in learning and predicting flow patterns. Table 4 compares the accuracy and generalisation performance of the BO-XGBoost model with the traditional XGBoost model on the test dataset.

Figure 10 .
Figure 10. Scatter plot of the BO-XGBoost training set and test set flow prediction results.


Table 4
Table 4 compares the accuracy and generalisation performance of the BO-XGBoost model with the traditional XGBoost model on the test dataset. The BO-XGBoost model demonstrated a significant improvement, with 93.75% accuracy compared to the XGBoost model's 75%. Precision increased from 0.788 to 0.967, recall from 0.791 to 0.971, and the F1 score from 0.784 to 0.966, further validating the BO-XGBoost model's superiority. These results indicate that Bayesian optimisation significantly enhanced the XGBoost model's predictive accuracy and classification performance. To comprehensively assess the classification performance of the models, we also utilised receiver operating characteristic (ROC) curves. In multi-class classification problems, ROC curves and the area under the curve (AUC) values provide an overall view of the model's classification capability. The ROC curves are shown in Figures 11 and 12.
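The precision, recall, and F1 values in Table 4 are aggregates over the five flow-pattern classes. The sketch below computes macro-averaged metrics (mean of the per-class scores); the averaging scheme is an assumption, since the text does not state it explicitly, and the code is a generic illustration rather than the evaluation script used here.

```python
def macro_scores(actual, predicted, n_classes):
    """Macro-averaged precision, recall, and F1 over all classes."""
    precisions, recalls, f1s = [], [], []
    for k in range(n_classes):
        # One-vs-rest counts for class k.
        tp = sum(a == k and p == k for a, p in zip(actual, predicted))
        fp = sum(a != k and p == k for a, p in zip(actual, predicted))
        fn = sum(a == k and p != k for a, p in zip(actual, predicted))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    n = n_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Small two-class example: one class-0 sample is mispredicted as class 1.
p, r, f = macro_scores([0, 0, 1, 1], [0, 1, 1, 1], 2)
```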

Figures 11 and 12
Figures 11 and 12 compare the ROC curves of the traditional XGBoost model and the BO-XGBoost model across different classes of oil-water two-phase flow patterns.

Figure 15 .
Figure 15. Feature global explanation image. In these plots, the vertical axis represents the feature names, and the horizontal axis represents the SHAP values. Each point corresponds to the SHAP value of a feature for a specific sample. Positive SHAP values indicate a positive impact on the prediction, while negative SHAP values indicate a negative impact. The colour of the points represents the feature values, with red points indicating higher values and blue points indicating lower values.


Table 1 .
Parameters of the oil and water.

