1. Introduction
With the depletion of reservoirs that are easy to drill, complete, and produce, the oil and gas industry has started to drill in harsher and more hostile environments [1]. Typically, oil and gas exploration and development is advancing toward “two deeps and one non”, that is, deep formations, deep water, and unconventional resources. Drilling in these complex formations faces many challenges, such as strong uncertainty of formation information, prominent well control risks, and frequent complex accidents [2].
In drilling operations, when the density of the drilling fluid is too low to balance the pressure of the formation fluid, the wellbore pressure falls below the formation pressure, and the oil, gas, and water in the formation are forced into the wellbore, resulting in overflow. For example, the 2010 Macondo well blowout in the Gulf of Mexico was caused by operator miscalculation and the failure of the wellhead blowout preventer, which allowed high-pressure oil and gas to enter the casing (kick) and quickly develop into an uncontrollable blowout. The time available to deal with an overflow is very short. If it is not controlled in time, it may develop into a blowout, causing huge safety, economic, and environmental hazards. By identifying or predicting overflows early and controlling and treating them in a timely manner, it is possible to avoid complex conditions such as gas cuts, kicks, and blowouts, resulting in safer and more efficient drilling operations [3].
Traditional overflow detection methods rely on monitoring surface parameters [4]. Installing more sensors on the rig can help detect kicks [5], but detection is still delayed. Moreover, the traditional wellhead blowout prevention system is difficult to operate, carries high risk, responds with a serious lag, and makes the follow-up well-killing operation difficult [6]. These methods judge overflow only through obvious changes in physical parameters and surface phenomena. They generally have a serious time lag, rely on the experience, responsibility, and awareness of the monitoring personnel, do not make effective use of prior drilling parameters, and cannot provide early identification and prediction of overflow [6,7].
Compared with traditional methods, artificial intelligence and big data technology have significant advantages in solving complex drilling problems. Intelligent drilling and completion technology is regarded as a transformative technology and has become a hot topic of research and development in the oil and gas industry [8]. In recent years, researchers have conducted extensive research on pre-drilling risk prediction, in-drilling operation risk monitoring, and risk-based decision optimization [2], and a large number of intelligent algorithms have been applied to overflow warning and prediction. Based on logging data, Zizhen Wang et al. [3] established a machine learning model for early gas kick monitoring by adopting an ensemble learning algorithm. Raed Alouhali et al. [5] developed and evaluated five models to optimize kick detection, and the two best models were the decision tree model and the K-nearest neighbor model. Zhang Z et al. [9] proposed a fault diagnosis method based on the feature clustering of loss and kick time series data in the drilling process. GAO X et al. [10] proposed an early overflow identification model based on slow feature analysis (SFA) for the process of managed pressure drilling. Shouzhi Zhong [11], addressing the uncertainty of labeled samples of drilling accidents, first re-labeled samples using a clustering algorithm and then modeled the data using the C4.5 model. The results show that this method can effectively improve the identification performance of drilling faults. Muojeke et al. [12] used artificial neural networks to establish a binary classifier for overflow occurrence based on downhole parameters monitored during overflow. Liang H et al. [13] used a GA-BP (genetic algorithm and BP neural network) model to establish a drilling overflow diagnosis model, which realized the early diagnosis of drilling overflow and reduced the probability of false alarms and missed alarms. Dinh Minh Nhat et al. [5] realized early kick detection by monitoring downhole parameters using a data-driven Bayesian network (BN) model. However, the overflow warning technology most commonly used in drilling at present is based on comprehensive logging data [14], which are relatively coarse. The fine pressure data obtained by managed pressure drilling, in contrast, contain time series data that are difficult to process using traditional models. Therefore, time series analysis methods are emerging. Augustine Osarogiagbon et al. [15] used a long short-term memory recurrent neural network (LSTM-RNN) to capture the temporal relationship in time series data composed of d-exponent data and riser pressure data and achieved early kick detection. Jiasheng Fu et al. [7] and Yuhao Wang et al. [16] achieved continuous predictions and real-time early warnings of overflow through LSTM-CNN.
It can be seen that existing research mainly focuses on the early warning and diagnosis of drilling process risks, while research on prediction and risk grade assessment is still in progress [8]. Therefore, accurately identifying overflow in the early warning stage and then continuing with risk prediction has become a mainstream research subject.
The above are some methods for identifying and predicting overflows in the oil and gas field, both without and with time series data. In other fields, Ahmed Serag et al. [17] applied the sliding window method and a multi-class random forest classifier to high-dimensional feature vectors for accurate segmentation. Huasong Cao et al. [18] realized online fault diagnosis of a small pressurized water reactor by using the PCA-SVM model based on the sliding window method. Mao Y et al. [19] proposed a cable fault detection method based on the Long Short-Term Memory Neural Network and Support Vector Machine (LSTM-SVM) model. Qi Zhou et al. [20] adopted four kinds of machine learning techniques based on the sliding window method to enhance effective environment inspection, showing that classification machine learning models based on the sliding window model are comparable to the widely used deep learning sequence model and long short-term memory model. Sungwoo Park et al. [21] proposed a robust sliding-window-based LightGBM model for short-term load prediction using anomaly detection and repair and verified its excellent forecasting performance. These machine learning models and algorithms combined with time series analysis provide new ideas and methods for overflow prediction in managed pressure drilling with time series data.
To this end, this paper starts from real managed pressure drilling data, including collected real drilling engineering parameters, fine managed pressure data, and model calculation data, and takes the processing of time series data into account. An overflow identification and early warning model based on the fusion of the sliding window model and the random forest algorithm is proposed. Key features are determined through feature analysis, the data are processed through data processing and the sliding window model, and the established random forest model is then used to classify and identify the overflow risk. The proposed method, based on a serial fusion data-driven model, provides guidance for the fusion application of data and models. It realizes the timely identification and early warning of overflow risk in managed pressure drilling, which helps to further reduce the overflow risk in the managed pressure drilling process of oil and gas engineering.
  2. Method
  2.1. Calculation of Equivalent Circulating Density of Drilling Fluid
The equivalent circulating density (ECD) of drilling fluid is defined as the sum of the density of the drilling fluid itself and the equivalent density of the flow resistance generated when the drilling fluid circulates in the annulus. However, the density of drilling fluid is not a fixed constant under the influence of various factors. Additionally, in some special drilling operations, such as managed pressure drilling, a certain back pressure is applied at the wellhead. Therefore, the bottom hole circulating pressure is composed of the hydrostatic column pressure, the wellhead back pressure, and the annular pressure loss, as shown in Equation (1).
$$P_b = P_h + P_{bp} + \Delta P_f \tag{1}$$
Here Pb denotes the bottom hole circulating pressure.
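Then the calculation equation of ECD can be obtained, as shown in Equation (2).
$$\mathrm{ECD} = \frac{P_h + P_{bp} + \Delta P_f}{g H} \tag{2}$$
Here H is the vertical depth and g is the gravitational acceleration, with all quantities expressed in consistent units.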
Among them, the wellhead back pressure Pbp can be measured directly by the pressure sensor at the wellhead with relatively high accuracy, so the value recorded in the managed pressure data can be used directly.
The hydrostatic column pressure Ph and the annular pressure loss ∆Pf are calculated from the data collected in the field according to Equations (3) and (4) [22].
        
In the process of managed pressure drilling, the collected managed pressure data include real drilling engineering data, PWD data, and fine managed pressure data. Based on the collected data, the values of hydrostatic column pressure, annular pressure loss, and ECD are calculated in real time through the physical calculation model and then added to the collected data to form the final data set. Calculating these characteristic values with physical models enriches and improves the quality of the input data of the machine learning model and realizes the series fusion of data, physical models, and machine learning models. When the model identifies overflow, the corresponding feature information is output for reference to guide parameter regulation.
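As a minimal sketch of this calculation step, the Python snippet below computes the bottom hole pressure and the ECD from the three pressure components described above. It is an illustration, not the code used in this study: the function names are hypothetical, the annular pressure loss is taken as a given input because its equation is not reproduced here, and the factor 0.00981 assumes pressures in MPa, depth in m, and density in g/cm3.

```python
# Illustrative sketch only; names and the unit convention are assumptions.
G_FACTOR = 0.00981  # MPa per (g/cm^3 * m)


def hydrostatic_pressure(mud_density_gcc: float, vertical_depth_m: float) -> float:
    """Hydrostatic column pressure P_h in MPa for a static mud column."""
    return G_FACTOR * mud_density_gcc * vertical_depth_m


def equivalent_circulating_density(p_h: float, p_bp: float, dp_f: float,
                                   vertical_depth_m: float) -> float:
    """ECD in g/cm^3 from P_h, back pressure P_bp, and annular loss dP_f (all in MPa)."""
    if vertical_depth_m <= 0:
        raise ValueError("vertical depth must be positive")
    bottom_hole_pressure = p_h + p_bp + dp_f
    return bottom_hole_pressure / (G_FACTOR * vertical_depth_m)


# Hypothetical example values: 1.20 g/cm^3 mud at 5000 m vertical depth.
p_h = hydrostatic_pressure(1.20, 5000.0)
ecd = equivalent_circulating_density(p_h, p_bp=2.0, dp_f=3.0, vertical_depth_m=5000.0)
```

With no back pressure and no annular pressure loss, the function returns the static mud density, which is a quick sanity check on the unit convention.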
  2.2. Data Processing
Data processing is the foundation of a data-driven model. Problem-oriented data processing improves the quality of the data and reduces the overall complexity of the model and the time cost of training, identification, and prediction.
For the managed pressure data adopted in this paper, the data preprocessing methods include target feature coding and data cleaning. Before the feature matrix is imported into a model, time series data type conversion and data normalization of the overall data need to be carried out, which is implemented automatically in Python code. The specific methods and steps of data processing are as follows.
  2.2.1. Target Feature Coding
Target feature coding is a feature engineering technique used in the experimental testing stage of the model. Its purpose is to label each data sample as overflow or non-overflow so that the model can more easily process and learn the relationship between features. Meanwhile, the quantified target results are also the direct basis for evaluating the performance of the model after training.
This paper analyzes the complex conditions in the data source through drilling logs and well history data, determines the normal and overflow conditions in the data, and adds the corresponding working condition to the last column of the data as the value of the target feature overflow for model training and testing.
In order to facilitate the training and testing of the model, 0–1 one-hot encoding was adopted. The target feature is named overflow, and its value is 0 or 1, where “0” indicates that no overflow occurs and “1” indicates that overflow occurs.
  2.2.2. Data Cleaning
Considering that some data may not be collected (recorded as −999.25 in this data set) or may be vacant due to human selection or downhole equipment problems, data cleaning was carried out.
We eliminated the samples in three cases: the feature values of a row are not collected, are empty, or are all 0. Given the complexity of the drilling process, we retained all valid values in the remaining engineering parameter data. Considering that the objective of this study is overflow identification and prediction in the real drilling process of managed pressure drilling, the samples from non-drilling periods such as hole drilling, pump shutdown, plug pressure test, and single connection were excluded when preparing the training data. To balance the classes of the target feature, we under-sampled the data, removed the samples far from the overflow section, and kept the samples of the continuous drilling process.
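The row-level cleaning rule can be sketched as follows. This is a hedged illustration: the column names are hypothetical, and the snippet assumes that a row is dropped as soon as any feature is uncollected or empty, which the text does not state explicitly.

```python
MISSING_SENTINEL = -999.25  # value used in the data for "not collected"


def is_usable(row: dict) -> bool:
    """Return False for rows with uncollected, empty, or all-zero feature values."""
    values = list(row.values())
    if any(v is None or v == MISSING_SENTINEL for v in values):
        return False
    if all(v == 0 for v in values):
        return False
    return True


def clean(rows: list) -> list:
    """Keep only usable rows, preserving their original time order."""
    return [row for row in rows if is_usable(row)]


# Hypothetical rows with two made-up feature columns.
raw = [
    {"spp": 21.5, "flow_out": 30.1},
    {"spp": -999.25, "flow_out": 29.8},  # not collected
    {"spp": None, "flow_out": 30.0},     # empty
    {"spp": 0, "flow_out": 0},           # all zero
]
cleaned = clean(raw)
```

Preserving the time order matters here because the cleaned rows are later sliced into consecutive sliding windows.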
  2.2.3. Time Series Data Type Conversion
The real drilling managed pressure data were collected with a time resolution of seconds. The values of the Timestamp-type time feature were stored as strings, which are not in a standardized form and are difficult to use directly. Therefore, we first converted the Timestamp-type time data to datetime objects in the standard format “%Y/%m/%d %H:%M:%S”, then converted them to Unix timestamps, and finally obtained numerical time data.
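A minimal sketch of this conversion with the Python standard library is shown below. Treating the recorded times as UTC is an assumption made here so that the result does not depend on the machine's local time zone; the helper names are hypothetical.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def to_unix_seconds(text: str) -> float:
    """Parse a time string and return a Unix timestamp in seconds (assumes UTC)."""
    parsed = datetime.strptime(text, TIME_FORMAT)
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def from_unix_seconds(seconds: float) -> str:
    """Convert a Unix timestamp back to the original string format (assumes UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(TIME_FORMAT)


numeric_time = to_unix_seconds("2000/01/01 00:00:00")
restored = from_unix_seconds(numeric_time)
```

The inverse function is useful when identified overflow samples need to be reported with their original time stamps.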
  2.2.4. Data Normalization
In order to further reduce the impact of data biases and outliers on a model, it is usually necessary to standardize or normalize the data before the model uses them. Both data standardization and normalization are data transformations, which use mathematical functions to transform the data so that they better fit the needs of the selected analysis algorithm and model. However, data standardization requires the data to approximately obey a normal distribution. Therefore, we chose the MinMaxScaler data normalization method to map the data linearly to the range between 0 and 1. The method is shown in Equation (5).
$$x' = a + \frac{(x - x_{min})(b - a)}{x_{max} - x_{min}} \tag{5}$$
In the equation, x is the value of a feature in the original data. x′ is the normalized value. xmin is the minimum value of this feature in the training set. xmax is the maximum value of this feature in the training set. a is the lower bound of the scaled range (0). b is the upper bound of the scaled range (1).
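The scaling can be sketched in plain Python as follows. The helper names are hypothetical, and mapping a constant feature to the lower bound is an assumption made to avoid division by zero.

```python
def fit_min_max(train_values):
    """Return (x_min, x_max) of one feature, computed on the training set only."""
    return min(train_values), max(train_values)


def min_max_scale(x, x_min, x_max, a=0.0, b=1.0):
    """Map x linearly so that x_min -> a and x_max -> b."""
    if x_max == x_min:
        return a  # assumption: a constant feature is mapped to the lower bound
    return a + (x - x_min) * (b - a) / (x_max - x_min)


train = [10.0, 20.0, 30.0]
x_min, x_max = fit_min_max(train)
scaled = [min_max_scale(v, x_min, x_max) for v in train]
```

Because xmin and xmax come from the training set, a test value outside the training range can fall outside [0, 1].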
  2.3. Feature Analysis
In order to make accurate and effective use of the data, we carried out feature analysis on the managed pressure data collected in real drilling. We identified which features had a greater impact on overflow and then screened the key features to improve the running speed and effect of the model.
  2.3.1. Empirical Feature Selection
First, manual feature selection was carried out on the data based on field experience and the mechanism of overflow conditions. Feature extraction was then combined with the grey correlation analysis algorithm to determine the number of input features of the final model.
The common parameters used to judge overflow include direct and indirect characteristic parameters, which mainly involve drilling fluid parameters, drilling tool parameters, gas component parameters, etc. Direct characteristic parameters include outlet flow and total pool volume. The corresponding overflow signals are as follows: the total pool volume of drilling fluid increases during circulation (when the pump rate is not increased and the drill string is not lowered), drilling fluid flows out of the wellhead by itself after the pump is stopped, the amount of drilling fluid injected while tripping out is less than the volume of the drilling tool, or the amount of drilling fluid returned while tripping in is greater than the volume of the drilling tool. Indirect characteristic parameters include drilling rate, vertical pressure, hook load, gas measurement, and drilling fluid properties. The corresponding overflow signals are a sudden acceleration of drilling rate or blowout, a decrease or an increase in vertical pressure, a decrease in hook load, an increase in the hydrocarbon content measured by gas logging, and a decrease in outlet drilling fluid density [7]. C Peng et al. [23] used seven parameters, namely stand pipe pressure (MPa), mud pool volume (m3), difference of inlet and outlet flow rates (L/s), density difference of inlet and outlet drilling fluids (kg/m3), temperature difference of inlet and outlet drilling fluids (°C), conductivity difference of inlet and outlet drilling fluids (S/m), and drilling time (min), as feature parameters for early overflow detection. Z Wang et al. [3] used eight parameters, including hook load (WOH), drilling pressure (WOB), torque (TOR), flow rate (FLW), drilling speed (ROP), riser pressure (SPP), conductivity (CON), and mud outlet density (DEN). These are typical feature parameters used to identify overflow. On this basis, we screened the relevant feature parameters in the MPD data.
  2.3.2. Algorithmic Feature Selection
In addition to empirical feature selection, we used the Pearson correlation coefficient, a correlation analysis method, for feature selection. The Pearson correlation coefficient is a statistic that measures the degree of linear correlation between two variables. It is defined as the quotient of the covariance of the two variables and the product of their standard deviations, is usually represented by r, and is given by the following equation:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
In the equation, xi and yi are the values of the two variables at the i-th observation point, respectively. x̄ and ȳ are the sample means of the two variables, respectively. n is the total number of observation points.
In previous studies, Yashuang Mu et al. [24] established a decision tree based on the Pearson correlation coefficient, demonstrating its good parallel performance in reducing the computation time of large-scale data classification problems. Mei K et al. [25] established a weighted Pearson correlation random forest classification model for predicting protein structures from large protein data sets, which greatly reduced the classification error rate. It can be seen that the Pearson correlation coefficient algorithm has good applicability to the data of classification models.
In this study, we calculated the Pearson correlation coefficient among all features to obtain the correlation between each feature and the target feature, and we excluded the non-target features whose correlation coefficient with overflow was 0 or relatively small.
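The screening step can be sketched as follows. The snippet is illustrative only: the feature names and values are made up, comparing the absolute value of r with the threshold is an assumption, and the threshold of 0.1 is taken from the value quoted in Section 3.4.2.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold):
    """Keep the features whose |r| with the target reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson_r(values, target)) >= threshold]


# Hypothetical data: one informative feature and one constant feature.
overflow = [0, 0, 1, 1]
columns = {
    "total_pool_volume": [50.0, 50.1, 51.0, 52.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
}
kept = select_features(columns, overflow, threshold=0.1)
```

Since r only captures linear dependence, a feature dropped by this rule may still carry nonlinear information, which is one reason the engineering judgment described in the text is applied alongside it.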
  2.4. Sliding Window
For continuous time series data such as MPD data, the downhole working conditions may remain the same within a given period of time. Each data point evolves from the previous one, and the next data point in turn evolves from the current one. With fine time series partitioning, the data within a short period of time do not change abruptly from the previous data. Therefore, the MPD data are suitable for windowing with the sliding window method, which allows the machine learning model to capture the relationship between adjacent and recent data.
After the sliding window size and time step are set, the sliding window method can be used to slice the data. The window traverses the entire time series, and each time a segment of the time series with the size of the sliding window is taken from the current position to form a sample feature matrix for model training, performance testing, identification, and prediction.
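The slicing described above can be sketched in a few lines of Python. The function name is hypothetical, and dropping a trailing segment shorter than the window is an assumption.

```python
def sliding_windows(rows, window_size, step):
    """Slice a time-ordered sequence into overlapping windows.

    A window starts every `step` rows; incomplete windows at the end are dropped.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(rows) - window_size
    return [rows[start:start + window_size]
            for start in range(0, last_start + 1, step)]


rows = list(range(10))           # stand-in for ten consecutive samples
windows = sliding_windows(rows, window_size=4, step=3)
```

Here consecutive windows share one sample because the step (3) is smaller than the window size (4); a smaller step increases the overlap and the number of windows.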
  2.5. Random Forest Algorithm
In this study, the Random Forest algorithm was used as the classification model. It is a machine learning algorithm based on ensemble learning with high prediction accuracy and generalization ability [25,26]. It is insensitive to outliers and noise and capable of handling high-dimensional data. This ensemble learning approach has demonstrated good performance and robustness in practical applications and thus has been widely used in the fields of data mining and machine learning.
  2.5.1. Principles of Random Forest Algorithm
As shown in Figure 1, the Random Forest algorithm is designed to improve the overall prediction performance by constructing multiple decision trees and combining their predictions. Each decision tree is trained on a random subset of the original data set, and only a portion of randomly selected features is considered for splitting at each node. The specific algorithm flow is as follows [27,28]:
- (1)
- Bootstrap sampling
Multiple subsets are randomly drawn from the original data set with replacement; these subsets are called bootstrap samples.
- (2)
- Random feature selection
During the construction of a tree, only a portion of randomly selected features is considered each time a node is split, which prevents certain features from overly dominating the decision-making process of the entire random forest.
- (3)
- Construction of decision trees
For each bootstrap sample, a decision tree is constructed using the CART (Classification and Regression Trees) algorithm. Because each tree is trained on different data, the diversity of the model increases.
- (4)
- Prediction
For new sample data, each decision tree gives a prediction. In binary classification problems, majority voting is usually used to decide the final prediction.
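The three randomization and aggregation steps above can be illustrated with the standard library alone. This is a sketch with hypothetical helper and feature names; it does not build the CART trees themselves.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (step 1)."""
    return [rng.choice(rows) for _ in range(len(rows))]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct candidate features for one node split (step 2)."""
    return rng.sample(feature_names, k)


def majority_vote(tree_predictions):
    """Return the class predicted by most trees (step 4)."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)                      # fixed seed for repeatability
rows = list(range(10))                       # stand-in for ten training samples
sample = bootstrap_sample(rows, rng)
candidates = random_feature_subset(["ecd", "spp", "flow_out", "torque"], 2, rng)
final = majority_vote([1, 0, 1, 1, 0])       # three of five trees vote "overflow"
```

Note that scikit-learn's RandomForestClassifier combines the trees by averaging their predicted class probabilities rather than by counting hard votes, so the voting function here illustrates the principle rather than that library's exact aggregation.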
  2.5.2. Gini Impurity
Commonly used metrics in the construction of random forest algorithms include information gain and Gini impurity. These metrics are used to select the best features for node partitioning to construct a CART decision tree. In the random forest implementation of Scikit-learn for Python, Gini impurity is used by default as a measure of split quality.
Gini impurity is used to measure the impurity of a data set; the smaller the Gini impurity, the purer the data, i.e., the more the samples in the data set tend to belong to the same category. The equation is as follows:
$$Gini(D) = 1 - \sum_{k \in C} P(k|D)^2$$
In the equation, Gini(D) is the Gini impurity of data set D. C is the set of all possible categories in the data set. P(k|D) is the proportion of samples belonging to category k in data set D.
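A short sketch of the Gini calculation is given below. The weighted split score is the usual CART criterion and is included as an assumption about how candidate splits are compared; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini_impurity(left_labels)
            + len(right_labels) / n * gini_impurity(right_labels))


pure = gini_impurity([0, 0, 0, 0])     # a single class
mixed = gini_impurity([0, 1, 0, 1])    # two classes in equal proportion
```

For a binary target the impurity ranges from 0 for a pure node to 0.5 for an evenly mixed node, so a split that separates overflow from normal samples lowers the weighted score.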
  2.6. Model Evaluation
  2.6.1. Accuracy Rate
The effectiveness and performance of the trained model are usually evaluated by the accuracy rate, which can be calculated as follows:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
In the equation, TP, TN, FP, and FN are the entries of the confusion matrix, as shown in Table 1.
  2.6.2. Confusion Matrix
The confusion matrix is a 2 × 2 matrix containing the counts of the four classification cases, as shown in Table 1. It therefore shows exactly how a binary classification model classifies the samples. These predictions are only the identification of test data results, or of real-time incoming data in the actual drilling process, and are not predictions of future conditions.
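The four counts and the accuracy rate can be computed as in the sketch below. It assumes the usual convention in which overflow (1) is the positive class; the labels are made-up example values.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels with 1 as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all samples that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
```

In an overflow setting the FN count (a missed overflow) is usually the costlier error, which is why it is worth reading the matrix and not only the accuracy.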
  2.6.3. ROC Curve
The ROC curve is a curve with the False Positive Rate (FPR) as the horizontal axis and the True Positive Rate (TPR) as the vertical axis. In the binary classification problem, the TPR represents the proportion of correctly predicted positive samples among all positive samples. The FPR represents the proportion of negative samples that are incorrectly predicted to be positive among all negative samples. The closer the ROC curve is to the top-left corner, the better the performance of the classifier. The AUC value is the area under the ROC curve, ranging between 0 and 1, and is used to measure the performance of the classifier. The larger the value, the better the model effect. To this end, we introduced the ROC curve as an additional evaluation standard for the performance of the model.
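As a hedged illustration, the sketch below computes one ROC point for a given threshold and the AUC through its pairwise interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half). The scores are made-up example values.

```python
def tpr_fpr(y_true, scores, threshold):
    """One ROC point: samples with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
point = tpr_fpr(y_true, scores, threshold=0.5)
auc = roc_auc(y_true, scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a library routine would normally be used on a full test set.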
  2.6.4. Other Evaluation Indicators
It can be seen from the accuracy equation that the accuracy rate reflects only the overall effect; it cannot evaluate the actual effect of the model in detail when the data are unbalanced and the cost of errors is large. Although the confusion matrix further shows the details of binary classification, it is inconvenient for evaluating the effect of the model on a specific class of samples. For this purpose, multiple evaluation criteria such as precision, recall, and F1 score can be used, which are computed for each of the two classes. The three evaluation indicators are calculated as follows:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
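The three indicators follow directly from the confusion matrix counts, as the sketch below shows. Returning 0.0 when a denominator is zero is an assumption for the degenerate case, and the counts are made-up example values.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

Swapping the roles of the two classes (treating normal as positive) gives the indicators for the other class.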
Based on the methods described above, we fused all methods in series and obtained the overall flow chart of the series fusion method, as shown in Figure 2.
  3. Case Study
  3.1. Data Selection
The research data used in this paper were collected during real managed pressure drilling of several wells with frequent overflow occurrence in the middle block of the Tarim Oilfield in China, involving engineering parameter data, PWD data, fine managed pressure data, and calculation parameters. There were a total of 55 features, as shown in Appendix A, used as the initial data source.
  3.2. Data Preprocessing
First, the data were preprocessed: the target feature column for overflow was added, and the target feature was encoded for each sample. According to well history data and site working condition records, each sample was coded by overflow occurrence time, followed by under-sampling and data cleaning. A total of 37,646 samples were obtained, among which the ratio of normal working conditions to overflow working conditions was 29,692:7953.
  3.3. Physical Model Added Features
Using Python code, the annular pressure loss (MPa) and hydrostatic column pressure (MPa) were calculated from the managed pressure data collected during real drilling with the physical models of annular pressure loss and hydrostatic column pressure. ECD (g/cc) was calculated using the wellhead back pressure from the managed pressure data combined with the calculated annular pressure loss and hydrostatic column pressure. These three features were added as additional features, plus the target feature overflow, for a total of 59 features.
In addition, computer simulation tests show that, with the pure physical model code, calculating and adding the annular pressure loss, hydrostatic column pressure, and ECD for each piece of data took only milliseconds, which is close to real-time calculation.
  3.4. Feature Analysis
  3.4.1. Empirical Feature Selection
First, we removed the feature parameters that were not collected (−999.25) or whose data value was always 0, including the following 13 feature parameters: kelly-in (m), casing pressure (MPa), pump stroke 1 (spm), pump stroke 3 (spm), H2S (ppm), PWD depth (m), PWD vertical depth (m), PWD drill string pressure (MPa), PWD annulus temperature (°C), PWD measurement ECD (g/cc), injection volume (m3), and back-up depth (m). Then, after adding the target feature overflow, there were 46 feature parameters.
Secondly, we removed feature parameters that were imprecise (marked in the data description), not directly related to overflow, or duplicated. This excluded the following 15 feature parameters: time, hook position (m), hook velocity (m/s), vertical pressure log (MPa), late time (min), inlet flow log (L/s), outlet flow log (%), outlet density log (g/cc), PWD inclination (°), PWD bearing (°), circulation pressure (MPa), fixed point depth (m), fixed point vertical depth (m), fixed point pressure (MPa), and wellhead regulating pressure (MPa). With the target feature overflow included, 31 features were left.
  3.4.2. Pearson Correlation Coefficient Feature Selection
For the 31 feature parameters mentioned above, excluding the five column features related to time and well depth, we normalized the remaining 26 feature parameters. Subsequently, the Pearson correlation coefficient was used to conduct feature analysis on the 26 feature parameters, and the correlation coefficient matrix was displayed through a correlation heat map, as shown in Figure 3.
From the correlation analysis heat map, we quantitatively obtained the correlation between the various features. Considering the size of the correlation coefficient between the feature parameters and the target (less than 0.1) and the actual engineering needs, the three feature parameters of the flow rate of the back pressure pump (L/s), additional back pressure (MPa), and hydrostatic pressure (MPa) were removed.
In the end, we obtained the following 28 feature parameters: DateTime, well depth (m), vertical depth (m), bit depth (m), bit vertical depth (m), drilling time (min/m), bit weight (KN), suspension weight (KN), rotational speed (rpm), torque (KN.m), pump stroke 2 (spm), total pool volume (m3), mud spill (m3), entry density log (g/cc), inlet temperature (°C), outlet temperature (°C), total hydrocarbon (%), C1 (%), C2 (%), PWD annulus pressure (MPa), vertical pressure (MPa), measured back pressure (MPa), export flow (L/s), export density (g/cc), inlet flow rate (L/s), ECD (g/cc), annular pressure loss (MPa), and overflow. The relationship between the main features and the overflow correlation coefficient is shown in Figure 4.
  3.5. Sliding Window Settings
We set the size of the sliding window to 30 and the time step to 3. The features within each window and the corresponding target variables were extracted from the managed pressure data through cyclic iteration: segments of the window size were taken successively at intervals of the time step and used as the feature matrices of the model input. The sliding window principle diagram of this research process is shown in Figure 5.
However, at the drilling site, the driller and other personnel pay more attention to the occurrence and identification of complex working conditions than to normal drilling conditions. In an actual drilling situation, when overflow occurs, it is necessary to regulate the parameters or even stop drilling, which reduces drilling efficiency and increases non-productive time (NPT). For this reason, we set the sliding window model so that if any one of the 30 samples in a window was labeled overflow, the result of the whole window was determined to be overflow. In this way, the model achieves advance identification of overflow and avoids unnecessary continuous prediction after overflow has occurred, which is in line with practical application.
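The window labeling rule and the resulting number of windows can be sketched as follows. The helper names are hypothetical, and rounding the 20% test share up is an assumption about how the 8:2 split is implemented.

```python
import math


def window_label(labels):
    """A window is labeled overflow (1) if any sample in it is overflow."""
    return 1 if any(label == 1 for label in labels) else 0


def count_windows(n_samples, window_size, step):
    """Number of complete windows obtained when sliding with the given step."""
    if n_samples < window_size:
        return 0
    return (n_samples - window_size) // step + 1


# Settings from this section applied to the 37,646 cleaned samples.
n_windows = count_windows(37646, window_size=30, step=3)
n_test = math.ceil(0.2 * n_windows)   # assumption: the test share is rounded up
```

Under these assumptions, the 37,646 samples give 12,539 windows, and a 20% test share comes to 2508 windows, which agrees with the number of test windows reported in Section 3.7.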
  3.6. Feature Matrix Processing
After time type processing, the time column “DateTime” was converted into a numerical type suitable for model input, which was used as the time column of the sliding window model, and the overflow column was used as the target column of the sliding window.
We normalized the data after sliding window processing and divided the training and test sets according to a ratio of 8:2 as the input data of the random forest model.
  3.7. Model Building and Testing
We defined the RandomForestClassifier model and input the feature matrix data after sliding window processing into the model for training. After training, the test set after sliding window processing was input into the model for testing. A total of 2508 sliding windows and 2508 corresponding model prediction results were obtained. In the actual results, the ratio of normal to overflow was 1971:537, while in the results predicted by the random forest model, the ratio was 1972:536, as shown in Figure 6. The accuracy of the test was 99.9601%, and only one overflow incident was predicted by the model to be normal. As shown in Figure 7, the area under the ROC curve (AUC) of the model was 0.999. The overall results were excellent, and the results required for overflow identification were achieved.
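As a worked check of these figures, the snippet below recomputes the indicators from the reported counts. The individual confusion matrix entries are inferred here, not reported directly: with 2508 test windows and exactly one overflow window predicted as normal, the reported accuracy implies TN = 1971, FP = 0, FN = 1, and TP = 536.

```python
# Confusion matrix entries inferred from the reported results (an assumption).
tn, fp, fn, tp = 1971, 0, 1, 536

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)       # share of overflow windows that were identified
precision = tp / (tp + fp)    # share of overflow alarms that were correct

accuracy_percent = round(accuracy * 100, 4)
```

The accuracy of 2507/2508 reproduces the reported 99.9601%, while the recall for the overflow class is 536/537, slightly lower than the accuracy because the single error falls in the smaller class.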
In order to analyze the prediction effect of the model on the test set in detail, we took the sliding window samples from 800 to 1000 and from 1400 to 1500 and switched the model to probabilistic prediction while keeping all other conditions unchanged. The value of 0.5 was selected as the threshold of the probability-based binary classification to explore the probability predictions of the trained random forest model on the two batches of data. By comparing the predicted probability distribution with the actual situation, we could judge whether the prediction result of the model was always fixed at 0 or 1 and whether overfitting occurred.
As shown in Figure 8, we plotted the predicted probabilities for sliding window samples 800 to 1000 as a scatter plot. By comparing them with the actual working conditions, we found that the prediction results of the model were consistent with reality and that the model did not always predict exactly 0 or 1. As shown in Figure 9, the predicted probabilities for sliding window samples 1400 to 1500 were plotted as a line chart, and comparison with the actual working conditions gave similar results. Therefore, we concluded that the proposed series fusion data-driven model is effective in practice for overflow identification.
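The probability check described in this section amounts to two small operations: thresholding the predicted overflow probability at 0.5, and verifying that the probabilities are not all pinned at 0 or 1. The sketch below is illustrative only. The function names are hypothetical, the probabilities are made up, and treating a probability of exactly 0.5 as overflow is an assumption. With scikit-learn, such probabilities would typically come from the classifier's `predict_proba` method.

```python
def classify_by_threshold(probabilities, threshold=0.5):
    """Map overflow probabilities to labels: 1 = overflow, 0 = normal."""
    return [1 if p >= threshold else 0 for p in probabilities]


def is_pinned_to_extremes(probabilities, tolerance=1e-9):
    """Return True if every probability is numerically 0 or 1."""
    return all(p <= tolerance or p >= 1.0 - tolerance for p in probabilities)


# Made-up probabilities for six consecutive windows (not data from the paper).
example_probabilities = [0.02, 0.10, 0.47, 0.63, 0.91, 1.00]
example_labels = classify_by_threshold(example_probabilities)
print(example_labels)                                # [0, 0, 0, 1, 1, 1]
print(is_pinned_to_extremes(example_probabilities))  # False
```

Intermediate values such as 0.47 or 0.63 are the kind of output that shows the model is not simply emitting fixed 0 or 1 predictions.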
  3.8. Overflow Identification and Early Warning
In order to make the overflow identification results easier to act on, we set an overflow risk warning box to pop up with priority whenever an overflow risk is identified, as shown in Figure 10. The feature data corresponding to the identified overflow conditions are output for the early warning of overflow risk. This makes it convenient for on-site drilling engineers and other personnel to adjust the drilling fluid density in time based on the characteristic values, especially the ECD and the measured drilling fluid density, and thus to avoid complex conditions such as overflow in wells with narrow safety density windows. Figure 11 shows the characteristic data corresponding to two overflow conditions identified by the model in the test set. We output all the corresponding key characteristic data for the early warning of overflow risk.
  4. Discussion
  4.1. The Impact of Time Series Data
The time series data collected by MPD are in the format “year/month/day hour:minute:second” and belong to the Timestamp type. In Python, the input data of methods and models usually need to be numeric rather than strings, dates, or other types. Without type conversion, sliding window processing, data normalization, and the subsequent model training and testing cannot be applied directly. Therefore, it is necessary to ensure that the input data are of the correct type and can be converted to floating point numbers. To do this, we converted the Timestamp data to Unix timestamps (in seconds) and, if the output was needed in the original format, converted them back to the corresponding Timestamp at the end.
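A round trip between the timestamp string and Unix seconds can be sketched with the Python standard library. The format string below is inferred from the “year/month/day hour:minute:second” layout, and interpreting the timestamps as UTC is an assumption made so that the result does not depend on the local time zone. The function names are hypothetical.

```python
from datetime import datetime, timezone

# Assumed layout of the "DateTime" column: year/month/day hour:minute:second.
MPD_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def to_unix_seconds(text):
    """Parse a timestamp string (assumed to be UTC) into Unix seconds as a float."""
    parsed = datetime.strptime(text, MPD_TIME_FORMAT)
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def from_unix_seconds(seconds):
    """Convert Unix seconds back to the original string layout."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(MPD_TIME_FORMAT)


stamp = "2023/06/15 08:30:05"  # made-up value, not from the data set
seconds = to_unix_seconds(stamp)
restored = from_unix_seconds(seconds)
print(restored == stamp)  # True
```

Because the numeric value is a plain float, it can pass through the sliding window and normalization steps and be converted back only when a readable timestamp is needed in the output.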
  4.2. The Influence of Adding Feature Data on the Model
In order to explore the effect of the features obtained from the series physical model, with all other conditions unchanged, we tested three data sets: conventional feature data (excluding fine pressure control data features and ECD), conventional feature data plus ECD, and the feature data used in this case study. Testing these three data sets allowed us to compare the influence of adding ECD feature data on the accuracy of model overflow recognition.
After testing, we obtained the overflow recognition accuracy of the model in the three cases, shown in Figure 12. In addition, the other two cases were further evaluated to obtain the confusion matrices shown in Figure 13 and Figure 14. From the three figures, it can be seen that after the addition of ECD, the accuracy of the model’s overflow identification on the test set increased, with two fewer false identifications than without ECD. Moreover, after adding both fine pressure control data and ECD, the accuracy was further improved compared with the previous two cases, reaching the result reported in the case study of this paper. Therefore, the use of managed pressure drilling data and ECD data improves the accuracy of overflow identification.
  4.3. The Influence of Different Algorithm Models
In order to test and compare different machine learning classification algorithms, we used the same data and data preprocessing and took the final feature matrix as the input of each model. We compared the models by prediction accuracy and checked whether the random forest model showed overfitting.
  4.3.1. Accuracy Rate
To this end, we established a logistic regression model, an SVM model, a multi-layer perceptron neural network model, and a decision tree model, and trained and tested them under the same conditions. The identification accuracy was 90.1116% for the logistic regression model, 97.6077% for the SVM model, 99.0032% for the neural network model, and 99.3222% for the decision tree model. The identification accuracy of the random forest model on the test set, given in Section 3.7, was 99.9601%. The accuracy of the five models is shown in Figure 15. It can be seen that the random forest model had the highest accuracy.
  4.3.2. Confusion Matrix
In order to further evaluate how well the models identify each class of samples, we combined the confusion matrix with other evaluation criteria for comparative analysis. The confusion matrices of the four newly established models for overflow risk identification are shown in Figure 16, Figure 17, Figure 18 and Figure 19.
According to the prediction accuracy of the models and the confusion matrices, there were 1971 normal conditions and 537 overflow conditions in the test data. The logistic regression model correctly identified 1938 normal conditions and 322 overflow conditions, and its overall effect was mediocre. The SVM model correctly identified 1962 normal conditions and 486 overflow conditions, and its overall effect was good. The neural network model correctly identified 1969 normal conditions and 514 overflow conditions, and its overall effect was good. The decision tree model correctly identified 1966 normal conditions and 525 overflow conditions, and its overall effect was better. However, there was still a gap between these four models and the random forest model.
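The counts above can be checked against the accuracies reported in Section 4.3.1 with a short calculation. The sketch below uses only numbers stated in the text. The variable and function names are hypothetical, and the fourth pair of counts is attributed to the decision tree because it reproduces that model's reported accuracy.

```python
TEST_NORMAL = 1971    # normal windows in the test set
TEST_OVERFLOW = 537   # overflow windows in the test set

# (correctly identified normal windows, correctly identified overflow windows)
CORRECT_COUNTS = {
    "logistic regression": (1938, 322),
    "SVM": (1962, 486),
    "neural network": (1969, 514),
    "decision tree": (1966, 525),  # attribution inferred from the reported accuracy
}


def accuracy_percent(correct_normal, correct_overflow):
    """Accuracy in percent: correct identifications divided by all test windows."""
    total = TEST_NORMAL + TEST_OVERFLOW
    return 100.0 * (correct_normal + correct_overflow) / total


for name, (correct_normal, correct_overflow) in CORRECT_COUNTS.items():
    print(f"{name}: {accuracy_percent(correct_normal, correct_overflow):.4f}%")
```

The printed values are 90.1116%, 97.6077%, 99.0032%, and 99.3222%, which match the accuracies reported for the four models.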
  4.3.3. ROC Curve
It can be seen from the ROC curves that the ROC values of the logistic regression, SVM, neural network, and decision tree models were 0.791, 0.950, 0.978, and 0.988, respectively, while the ROC value of the random forest model was 0.999, which shows that the random forest model also performed best under the ROC curve criterion.
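The text does not state how the ROC values were computed. One reading that is consistent with all five reported numbers is that each value is the area under a ROC curve built from hard 0/1 predictions, which reduces to the mean of the true positive rate and the true negative rate. The sketch below checks this reading against the counts in Section 4.3.2. It is an inference rather than a confirmed description of the authors' procedure, the function name is hypothetical, and the random forest counts are derived from the single misclassified overflow window reported in Section 3.7.

```python
def hard_label_roc_score(correct_normal, total_normal, correct_overflow, total_overflow):
    """Area under a one-threshold ROC curve: (TPR + TNR) / 2."""
    true_positive_rate = correct_overflow / total_overflow
    true_negative_rate = correct_normal / total_normal
    return (true_positive_rate + true_negative_rate) / 2.0


TOTAL_NORMAL, TOTAL_OVERFLOW = 1971, 537

# (correct normal, correct overflow) per model, taken from the text.
MODEL_COUNTS = {
    "logistic regression": (1938, 322),
    "SVM": (1962, 486),
    "neural network": (1969, 514),
    "decision tree": (1966, 525),
    "random forest": (1971, 536),  # derived from Section 3.7
}

roc_scores = {
    name: hard_label_roc_score(cn, TOTAL_NORMAL, co, TOTAL_OVERFLOW)
    for name, (cn, co) in MODEL_COUNTS.items()
}
for name, score in roc_scores.items():
    print(f"{name}: {score:.3f}")
```

The printed values are 0.791, 0.950, 0.978, 0.988, and 0.999. A ROC curve built from predicted probabilities rather than hard labels would generally give different areas.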
  4.3.4. Other Evaluation Indicators
Finally, the prediction effect of the five models on the test set was evaluated in more detail for the two classes, normal (0) and overflow (1), through the precision rate, recall rate, and f1 score, as shown in Figure 24, Figure 25, Figure 26, Figure 27, Figure 28 and Figure 29.
By comparing the precision rate, recall rate, and f1 score of the five models under normal and overflow conditions, we found that the random forest model had the best score on each metric. Therefore, the feasibility of the random forest model in overflow risk identification was further confirmed.
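For reference, the three per-class metrics follow directly from confusion matrix counts. The sketch below applies the standard definitions to the random forest result for the overflow class, using counts derived from Section 3.7 (536 of 537 overflow windows identified and no normal window misclassified). The function name is hypothetical, and the per-class values shown in the figures are not reproduced here.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Standard precision, recall, and F1 for one class; 0.0 when undefined."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2.0 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Overflow class for the random forest model (counts derived from Section 3.7).
precision, recall, f1 = precision_recall_f1(
    true_positive=536, false_positive=0, false_negative=1
)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Calling the same function with the roles of the two classes swapped (1971 true positives, 1 false positive, 0 false negatives) gives the corresponding metrics for the normal class.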
Based on the above four categories of evaluation indicators, we can conclude that the models rank, from best to worst in overflow risk identification, as follows: the random forest model, the decision tree model, the neural network model, the SVM model, and the logistic regression model. As an ensemble learning method built on decision trees, the random forest model has better prediction ability than a single decision tree model.
  4.4. The Influence of the Series Fusion Model on MPD Drilling Overflow Identification
Overflow, and even blowout, develops through a process. Usually, during this development, there are obvious abnormal changes in the values of key characteristic parameters, such as an abnormal increase in outlet flow and total hydrocarbon values. However, if these changes are not identified and controlled in time and an overflow accident develops, drilling costs increase and drilling efficiency falls. If the corresponding well control measures are not taken in time, a blowout accident may also follow, resulting in wellbore scrapping, environmental pollution, and casualties. Therefore, the early identification of overflow is very important. When the series fusion model in this paper is used for overflow identification, once the key features deviate significantly from their normal values, the model immediately identifies the anomaly and issues an early warning.
Existing methods and models for MPD overflow risk identification have not combined physical models with machine learning models, and they do not provide additional key feature parameters for reference. Meanwhile, pure binary classification models cannot quantify risk, whereas the model proposed in this article is also suitable for the quantitative, probability-based identification of risk. In addition, the series fusion model in this article uses sliding windows to avoid unnecessary continuous overflow risk identification, achieves high test accuracy, and provides overflow risk warnings and feature information output.
  5. Conclusions
This paper used managed pressure drilling data for overflow identification. On the basis of feature analysis, after preprocessing and ECD calculation by a physical model, the input feature data were processed by the sliding window model, and the overflow was then classified and identified by the random forest algorithm. Finally, the individual steps were connected in series to form a series fusion data-driven method for drilling overflow identification, which was further used for overflow risk warning and the prediction of future overflow risk, thereby achieving the identification, early warning, and prediction of drilling overflow risk. Based on this study, the following conclusions can be drawn:
(1) The influence of different feature data and machine learning algorithms on the overflow recognition accuracy of the whole model was tested. The final data-driven model, based on the series fusion of the physical model, data processing, the sliding window model, and the random forest algorithm, achieved an accuracy of more than 99.9% in overflow risk identification.
(2) An early warning method for overflow risk was proposed. When overflow risk is identified, an early warning prompt box pops up and corresponding characteristic information is output.
(3) This research provides a new method for identifying and warning of overflow risk in managed pressure drilling, which is of great significance for promoting the safety of drilling systems and the automation of the drilling process. The proposed series fusion approach forms a complete method for constructing fusion data-driven models, which can improve the overall mechanistic rationality and accuracy of the model and provide a reference for model fusion methods.