A Well-Overflow Prediction Algorithm Based on Semi-Supervised Learning

Abstract: Oil drilling is the core process of oil and natural gas exploitation, and well overflow is one of the biggest threats to safe drilling. Predicting overflow in advance can effectively prevent this kind of accident. However, historical drilling data are imbalanced, and labeling data is a time-consuming and laborious job. To address this issue, an overflow-prediction algorithm based on semi-supervised learning is designed in this paper, which can accurately predict overflow 10 min in advance when the labeled data are limited. Firstly, a three-step feature-selection algorithm is applied to extract 22 features, and the time-series samples are constructed through a sliding window of width 500 with step size 1. Then, the Mean Teacher model with Jitter noise is employed to train on the labeled and unlabeled data at the same time, in which a fused CNN-LSTM network is built for time-series prediction. Compared with supervised learning and other semi-supervised learning frameworks, the results show that the proposed model with only 200 labeled samples achieves an effect comparable to that of the supervised learning method using 1000 labeled samples, and the prediction accuracy reaches 87.43% 10 min in advance. As the proportion of unlabeled samples increases, the performance of the model continues to rise within a certain range.


Introduction
As the "blood" of modern industry, oil is an important primary energy source. It not only plays an important role in basic necessities but also serves as an indispensable strategic resource for national survival and development that promotes the economy and safeguards security. Drilling is a key step in oil and gas exploitation, in which overflow is one of the greatest threats to the safety of the operation. If it is not handled properly, the overflow will evolve into a blowout, resulting in wellbore scrapping, which will not only cause great economic losses but also endanger the lives and property of drilling workers and surrounding people. The most effective prevention approach is early detection of overflow. Therefore, predicting the occurrence of overflow based on real-time drilling data can gain precious time to control the overflow and thus reduce safety risks in a timely and effective manner.
In the traditional oil drilling technique, overflow is usually judged by the drilling engineers on the ground using relevant instrument data, by analyzing changes in drilling feature parameters such as standpipe pressure and the difference between inlet and outlet flow. However, manual judgment depends heavily on the experience of the engineers and places them under great work pressure. With the development of machine learning technology, more and more scholars have constructed machine-learning models to predict overflow risk. Hargreaves et al. (2001) analyzed deep-sea acoustic data to monitor overflow with a Bayesian model and calculated the probability of overflow [1]. Lian (2013) fused rough sets and a support vector machine (RS-SVM) to monitor the occurrence of overflow [2]. Lind et al. (2014) proposed a radial basis function (RBF) neural network based on the k-means clustering algorithm to predict drilling risk [3]. Li et al. (2015) put forward an overflow prediction method based on a fuzzy expert system [4]. Liang et al. (2018) proposed a fuzzy multilevel algorithm that uses particle swarm optimization (PSO) to optimize a support vector regression machine (SVR), and realized real-time dynamic evaluation of drilling risk [5]. Liang et al. (2019) established a model for overflow diagnosis based on monitoring the standpipe pressure and casing pressure in pressure wave transmission with a genetic algorithm and BP neural network (GA-BP). In this model, the genetic algorithm was used to accelerate the convergence of the neural network and to avoid falling into local extrema. The early diagnosis of drilling overflow was realized, and the misjudgment rate of drilling overflow was reduced [6]. In the same year, based on the correlation between overflow accidents and the trend of casing pressure, Liang et al. proposed an intelligent early warning method for drilling overflow accidents based on an improved DBSCAN clustering method. The early warning method used the idea of time-series scanning and hierarchical rule clustering to improve the speed and accuracy of clustering [7]. Zhu et al. (2019) collected data such as geological lithology, designed well structure, real-time drilling fluid performance, rock physical properties of backflow cuttings, and drilling engineering parameters to build an artificial neural network that predicts the risk probability of stuck pipe [8]. Sergey Borozdin et al. (2020) used a deep learning method and created a drilling simulator, which makes it possible to recreate a digital twin of a real well and simulate an almost unlimited number of complications of various kinds on it [9]. Mohammad Sabah et al. (2020) combined a number of heuristic search algorithms, including the genetic algorithm (GA), particle swarm optimization (PSO), and the cuckoo search algorithm (COA), with a multilayer perceptron (MLP) neural network and a least squares support vector machine (LSSVM) to present different hybrid algorithms for the prediction of lost circulation [10]. Liu et al. (2021) developed a dynamic Bayesian network to create a dynamic risk assessment model for evaluating the safety of deep-water drilling operations [11]. In the same year, Yin et al. applied a similar method to the risk analysis of offshore blowout [12], and Liang et al. established a random forest model for overflow accident identification and classification based on bat algorithm optimization [13]. Wang et al. (2022) proposed a drilling identification method based on an optimized SVM [14].
According to the above literature, machine-learning and deep-learning models, such as the support vector machine (SVM), the artificial neural network, and the long short-term memory network (LSTM), have become the mainstream approach to predicting overflow. The accuracy of these supervised learning-based methods depends heavily on a large amount of labeled training data. In practice, the drilling data produced by one well are massive, and labeling data manually is time-consuming and heavily dependent on the experience of engineers. Besides, overflow data are very rare. The resulting limited generalization ability restricts the application of the above models in drilling engineering. Therefore, to solve the problem of a small amount of labeled data and a large amount of unlabeled data, a semi-supervised learning model is proposed that can predict overflow with limited labeled data.

Semi-Supervised Learning
Semi-supervised learning (SSL) is a kind of learning method that combines supervised learning with unsupervised learning. An SSL model is built from a small number of labeled samples and a large number of unlabeled samples. In practice, collecting labeled samples is often difficult, expensive, and time-consuming, while unlabeled samples are easy to obtain. In this case, SSL is more suitable, since it can effectively utilize the unlabeled data to improve model performance. The applicability of SSL depends on the hypothesis of the model. When the model hypothesis holds, unlabeled data can improve the learning performance of the model with high probability; when it does not hold, they may not. The main SSL assumptions are as follows [15]:
1. Smoothing hypothesis. When two samples are very close in a high-density data region, their class labels are likely to be the same; on the contrary, when a low-density data region separates the two samples, they are likely to have different class labels.
2. Clustering hypothesis. When two samples belong to the same cluster, their class labels are probably the same. This hypothesis is also called the low-density separation hypothesis, which means that the classification decision surface should be located in a low-density data region instead of a high-density data region. The decision surface should not split the samples from the same high-density data region onto both sides of the surface.
3. Manifold hypothesis. On the one hand, in high-dimensional space, the data volume increases exponentially as the dimension increases, so it is difficult to estimate the real data distribution. On the other hand, if the input data lie on some low-dimensional manifold, a low-dimensional representation can be found from unlabeled data, and the simplified task can then be solved with labeled data. Therefore, the manifold hypothesis maps the high-dimensional data to the low-dimensional manifold, and if two samples are located in the same local neighborhood of the low-dimensional manifold, their class labels are likely to be the same.
It is worth noting that, when the semi-supervised model hypothesis is not valid, unlabeled data will actually degrade the learning performance of the model.

Mean Teacher Algorithm
The Mean Teacher [16] method is a consistency regularization method based on the smoothing hypothesis. Its main idea is to reduce over-fitting of the neural network through consistency regularization on unlabeled data; that is, the model is trained to give consistent predictions for a given unlabeled sample and its perturbed version. The structure of the Mean Teacher algorithm is shown in Figure 1. Mean Teacher consists of a student model and a teacher model, which share the same network architecture based on supervised learning but have different parameters. Following the notation of [16], θ denotes the parameters of the student model and θ′ denotes the parameters of the teacher model; f(x, θ, η) denotes the output of the student model, while f(x, θ′, η′) denotes the output of the teacher model. In each training iteration, the same sample is input into the student model and the teacher model under different noise interference: the disturbance of the student model is η, and the disturbance of the teacher model is η′.
The classification cross-entropy loss (Ls) between the label predicted by the student model and the real label of the sample is calculated to ensure that the model fits the labeled samples. The consistency loss (Lu) between the labels predicted by the student model and the teacher model is calculated to keep the two predictions similar under different noise disturbances. The overall loss function is obtained by weighting the cross-entropy classification loss and the consistency loss, and the parameter weights of the student model are then updated by back propagation. During training, both the classification cross-entropy loss and the consistency loss are computed for the labeled data, whereas only the consistency loss is computed for the unlabeled data. The teacher model is not trained by back propagation directly; instead, an exponential moving average (EMA) of the student model parameters, given by Equation (1), is used as the parameters of the teacher model, where N is the period size. The overall loss function of the Mean Teacher model is constructed as Equation (2), in which the classification term is the cross-entropy of the student model.
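The two ingredients described above, the moving-average update of the teacher and the weighted combination of the two losses, can be illustrated with a minimal Python sketch. The smoothing coefficient `alpha`, the choice of mean squared error as the consistency measure, the loss weight `weight`, and all function names are assumptions made for illustration; they are not taken from Equations (1) and (2).

```python
import math

def ema_update(teacher_params, student_params, alpha=0.99):
    """Move each teacher parameter toward the matching student parameter."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

def cross_entropy(probs, label):
    """Classification loss for one labeled sample (probs sums to 1)."""
    return -math.log(probs[label])

def consistency(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n

def total_loss(student_probs, teacher_probs, label=None, weight=1.0):
    """Weighted sum; the classification term is skipped for unlabeled data."""
    loss = weight * consistency(student_probs, teacher_probs)
    if label is not None:
        loss += cross_entropy(student_probs, label)
    return loss

# One EMA step with an assumed alpha of 0.9.
teacher = ema_update([0.0, 1.0], [1.0, 3.0], alpha=0.9)
```

Passing `label=None` corresponds to an unlabeled sample, for which only the consistency term contributes.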
The specific training process of Mean Teacher is shown in Algorithm 1.
Algorithm 1. Mean Teacher learning algorithm.
1. The perturbed labeled data are input into the student model, and the classification cross-entropy loss between the predicted labels and the real labels of the training set is calculated.
2. All perturbed data, including labeled and unlabeled data, are input into the student model and the teacher model, and the consistency loss between the labels predicted by the student model and the teacher model is calculated.
3. According to Equation (2), the parameter weights of the student model are updated by back propagation.
4. According to Equation (1), the exponential moving average of the student model parameters is taken as the parameters of the teacher model.
5. The above process is repeated until the network converges.
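The order of operations in Algorithm 1 can be sketched in plain Python as follows. This is only a toy illustration under stated assumptions: a one-layer logistic model stands in for the CNN-LSTM network, the consistency loss is a squared difference, the noise is Gaussian, training runs for a fixed number of epochs instead of testing convergence, and the values of `lr`, `alpha`, `weight`, and `sigma` are hypothetical.

```python
import math
import random

def predict(w, x):
    """Probability of class 1 from a one-layer logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def jitter(x, sigma, rng):
    """Add independent Gaussian noise to every input component."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def train_mean_teacher(labeled, unlabeled, epochs=200, lr=0.5,
                       alpha=0.95, weight=1.0, sigma=0.05, seed=0):
    rng = random.Random(seed)
    dim = len(labeled[0][0])
    student = [0.0] * dim
    teacher = [0.0] * dim
    n = len(labeled) + len(unlabeled)
    for _ in range(epochs):
        grad = [0.0] * dim
        # Step 1: cross-entropy gradient on perturbed labeled data (student only).
        for x, y in labeled:
            xs = jitter(x, sigma, rng)
            p = predict(student, xs)
            for i in range(dim):
                grad[i] += (p - y) * xs[i]
        # Step 2: consistency gradient on all data; each model gets its own noise.
        for x in [x for x, _ in labeled] + unlabeled:
            xs, xt = jitter(x, sigma, rng), jitter(x, sigma, rng)
            p, q = predict(student, xs), predict(teacher, xt)
            for i in range(dim):
                grad[i] += weight * 2.0 * (p - q) * p * (1.0 - p) * xs[i]
        # Step 3: gradient step on the student parameters.
        student = [w - lr * g / n for w, g in zip(student, grad)]
        # Step 4: the teacher becomes an exponential moving average of the student.
        teacher = [alpha * t + (1.0 - alpha) * s
                   for t, s in zip(teacher, student)]
    # Step 5 (repeat until convergence) is approximated by the fixed epoch count.
    return student, teacher

# Toy data: the second input component acts as a constant bias term.
labeled = [([1.0, 1.0], 1), ([-1.0, 1.0], 0)]
unlabeled = [[2.0, 1.0], [-2.0, 1.0]]
student, teacher = train_mean_teacher(labeled, unlabeled)
```

Note that the teacher is never updated by a gradient; it only tracks the student through the averaging step.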

Dataset and Feature Selection
In this study, 10 overflow samples have been collected based on the historical drilling data from one real well in an oil field. Each sample consists of the Well Logging, Pressure While Drilling (PWD), and Managed Pressure Drilling (MPD) data around the overflow once per second, with a total of 56 features, as shown in Table 1.
In this paper, features are selected in three steps to obtain the best feature subset: Analysis of Variance (ANOVA), Recursive Feature Elimination (RFE), and Mutual Information Coefficient (MIC), which together ensure the maximum classification accuracy in the subsequent process [17]. ANOVA is used to filter out features with small variance. RFE is a greedy algorithm for selecting the best subset of features; its main principle is to build a machine-learning model repeatedly, delete the worst features based on the weights of the model, and iterate this process until all features are traversed. The redundant relationship between features is not considered in these two feature selection methods. The MIC method is used to capture the relationship between each feature and the label to further screen features. The feature selection process is shown in Figure 2, and the final selected features are shown in Table 2.
The model constructed in this paper takes 10 min of drilling data as input to predict whether overflow will occur in the next 10 min. The MPD device stores 50 pieces of data in 1 min, so a time series of 500 samples is used to predict whether overflow will occur in the next 500 samples. On the intercepted data, a sliding time window is employed to construct time-series samples, in which the window size is 500 timesteps and the sliding step size is 1. We collect the data of 500 timesteps as the sample feature X, and whether overflow occurs in the following 500 timesteps as the label y. If the feature data of timesteps 1-500 are taken as the first sample feature X1 (the blue box in Figure 3), the corresponding label y1 is determined by checking whether overflow occurs in timesteps 501-1000: it is 1 if overflow occurs and 0 otherwise. Taking the feature data of timesteps 2-501 as the second sample feature (the red box in Figure 3), the corresponding label y2 is determined by checking whether overflow occurs in timesteps 502-1001: it is 1 if overflow occurs and 0 otherwise, and so on. Finally, in order to eliminate the influence of physical dimension, Min-Max Normalization is applied to the original data, and the data range is rescaled to [0, 1] to reduce the training error.
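The sample construction described above can be sketched in plain Python. The function names, the representation of the data as lists, and the per-timestep 0/1 overflow flag are assumptions made for illustration; only the window width of 500, the horizon of 500, and the step of 1 come from the text.

```python
def min_max_normalize(column):
    """Rescale one feature column to [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def make_windows(features, overflow, width=500, horizon=500, step=1):
    """Build (X, y) pairs with a sliding window.

    features: list of per-timestep feature rows.
    overflow: list of 0/1 flags, one per timestep (hypothetical encoding).
    X covers `width` timesteps; y is 1 if any overflow flag is set in the
    `horizon` timesteps that immediately follow the window.
    """
    samples = []
    last_start = len(features) - width - horizon
    for start in range(0, last_start + 1, step):
        x = features[start:start + width]
        future = overflow[start + width:start + width + horizon]
        samples.append((x, 1 if any(future) else 0))
    return samples

# Small demonstration with width 3 and horizon 2 instead of 500 and 500.
demo_features = [[float(t)] for t in range(8)]
demo_overflow = [0, 0, 0, 0, 0, 0, 1, 0]
demo = make_windows(demo_features, demo_overflow, width=3, horizon=2)
demo_labels = [y for _, y in demo]
```

With the default arguments, the first window covers timesteps 1-500 and is labeled from timesteps 501-1000, matching the description of the first sample.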

The Prediction Model
Overflow prediction is essentially a multi-variable, multi-step time-series prediction problem. The overflow condition can be determined from the trends of multiple drilling parameters, and a convolutional neural network (CNN) can extract and map these parameters to produce higher-level and more extensive effective features. Given the long timesteps involved in overflow prediction, the vanishing gradient problem can be mitigated by using the Long Short-Term Memory (LSTM) model. In this paper, we build a CNN-LSTM fusion network for overflow prediction, using CNN and LSTM to merge a variety of effective features, capture the long-term dependence of the time series, and avoid vanishing gradients. The network structure is shown in Figure 4, and the detailed structure is given in Supplementary Table S1.
The CNN-LSTM model consists of a one-dimensional convolutional network (1D-CNN), an LSTM layer, three fully connected layers, and an output layer. Specifically, the input layer dimension is (500, 22). The convolution layers perform convolution on the one-dimensional sequence and complete feature extraction through the convolution operation. The one-dimensional convolutional network built in this paper is composed of three sets of convolution and pooling layers. The first convolution layer has 64 convolution kernels, and the kernel size is 11 × 22. The second layer has 128 convolution kernels, and the kernel size is 7 × 64. The third layer has 128 convolution kernels, and the kernel size is 10 × 128. The output size is reduced through the pooling layers, and finally the features extracted by the 1D-CNN are output. The LSTM layer, the fully connected layers, and the output layer follow the CNN part. In order to avoid over-fitting of the neural network, we add a batch normalization layer after each convolution layer and a dropout layer after each fully connected layer. Finally, the output layer is a SoftMax layer.
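As a rough worked example of how the three convolution stages shrink the 500-step input, the following sketch traces the sequence length and counts the kernel weights. The kernel lengths and kernel counts come from the text; the stride of 1, the absence of padding, and the pooling size of 2 are assumptions, since the pooling configuration is given only in Supplementary Table S1. Bias terms are not counted.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1

def pool_out_len(length, pool):
    """Output length of non-overlapping pooling with window `pool`."""
    return length // pool

# (input channels, number of kernels, kernel length) for the three stages.
layers = [(22, 64, 11), (64, 128, 7), (128, 128, 10)]

length = 500  # timesteps in one input window
trace = []
for in_channels, num_kernels, kernel_len in layers:
    weights = in_channels * kernel_len * num_kernels
    length = conv1d_out_len(length, kernel_len)
    length = pool_out_len(length, pool=2)  # pooling size 2 is assumed
    trace.append((num_kernels, length, weights))

# Each entry: (output channels, sequence length after pooling, kernel weights).
```

Under these assumptions the LSTM layer would receive a sequence of 55 steps with 128 channels; a different pooling size or padding would change that length but not the weight counts.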

Construction of Semi-Supervised Framework
In this paper, the Mean Teacher algorithm is used to build a semi-supervised learning framework, and the input data need to be enhanced according to the characteristics of the Mean Teacher algorithm. Based on the characteristics of the drilling data, noise injection is the most suitable approach for our data; it injects a small amount of noise or abnormal values into the time series without changing the corresponding label [18]. Terry et al. used methods such as Jitter, Scale, MagWarp, and TimeWarp to enhance wearable sensor data, and showed that appropriate enhancement can improve classification performance from 77.54% to 86.88% [19]. Jitter usually simulates additional sensor noise, and we use Gaussian noise in this paper. Scale resizes the data in the time window by multiplying by a random scalar. MagWarp changes the magnitude of each sample by multiplying the data window by a smooth curve. TimeWarp changes the temporal position of samples by smoothly distorting the time intervals between samples. These data-enhancement methods can improve robustness to multiplicative and additive noise.
In order to explore the effect of the above four kinds of data enhancement, we randomly select a data sample with the size of (500, 22). Because Stand Pipe Pressure (MPa), Pump Impulse (spm), Wellhead Pressure (MPa), Outlet Flow (L/s), and Inlet Flow (L/s) change significantly when overflow occurs, these five features are selected and enhanced in the four ways above. As shown in Figure 5, Jitter adds noise to the time-series data but does not change their trends. Therefore, Jitter, that is, the addition of Gaussian noise, is used to enhance the input data of Mean Teacher.
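Two of the four enhancement methods discussed above, Jitter and Scale, are simple enough to sketch in a few lines of Python. The standard deviations `sigma`, the function signatures, and the list-of-rows window layout are assumptions for illustration; MagWarp and TimeWarp are omitted because they require generating smooth random curves.

```python
import random

def jitter(window, sigma=0.05, rng=random):
    """Add independent Gaussian noise to every value (additive noise)."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in window]

def scale(window, sigma=0.1, rng=random):
    """Multiply the whole window by one random scalar (multiplicative noise)."""
    factor = rng.gauss(1.0, sigma)
    return [[v * factor for v in row] for row in window]

# A tiny window with 2 timesteps and 2 features, in place of (500, 22).
window = [[1.0, 2.0], [3.0, 4.0]]
jittered = jitter(window, sigma=0.05, rng=random.Random(0))
scaled = scale(window, sigma=0.1, rng=random.Random(1))
```

Because Jitter perturbs each value independently with zero-mean noise, the overall trend of the window is preserved, which is the property that motivates its use here.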
The overall framework of Mean Teacher is as follows. Firstly, the training dataset is obtained by pre-processing the original data. Then, the input data are enhanced by Jitter to effectively avoid over-fitting of the model. Finally, the CNN-LSTM model [20] is used as both the student model and the teacher model of Mean Teacher for time-series prediction. The CNN-LSTM model has been shown to be effective for overflow prediction, with an accuracy of 89% 10 min in advance. The framework is shown in Figure 6.

Model Training
In the experiment, 15% of the data are randomly selected as the verification set; that is, the ratio of the training set to the verification set is 17:3. In the training stage, to make full use of the unlabeled data, the unlabeled data ratio λ is defined as Equation (3), where L is the number of labeled samples and Lu is the number of unlabeled samples.

$$\lambda = \frac{L_u}{L} \qquad (3)$$

Firstly, L pieces of data are randomly selected from the training set as the labeled data set. Secondly, the corresponding amount of data Lu is randomly selected from the remaining data at the ratio λ, and the labels are deleted to form the unlabeled data set. In each iteration, the labeled batch size is set to N, which means that each training batch contains N labeled samples and λN unlabeled samples. In order to verify the experimental results, a Pseudo-label [21] semi-supervised framework and a CNN-LSTM supervised learning model are built for comparison. In this experiment, almost the same hyper-parameters are used for Pseudo-label and Mean Teacher: Adam is used as the optimizer, the learning rate lr has an initial value of 0.001, the weight decay is set as exponential decay with a value of 0.0001, and λ is set to 5. The Mean Teacher, Pseudo-label, and supervised learning models are compared under different numbers of labeled training samples, namely 50, 200, and 1000. The experimental parameters are shown in Table 3.
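The construction of the labeled and unlabeled sets and of one training batch can be sketched as follows. The function names, the fixed seed, and the use of uniform random sampling for batches are assumptions for illustration; the labeled count of 50 and the ratio of 5 are among the settings reported in the text, while the batch size of 8 is hypothetical.

```python
import random

def split_labeled_unlabeled(train, num_labeled, ratio, seed=0):
    """Pick `num_labeled` labeled samples and `ratio * num_labeled` unlabeled ones.

    train: list of (x, y) pairs. Labels of the unlabeled part are dropped.
    """
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)
    labeled = shuffled[:num_labeled]
    num_unlabeled = int(ratio * num_labeled)
    rest = shuffled[num_labeled:num_labeled + num_unlabeled]
    unlabeled = [x for x, _ in rest]  # the label is deleted here
    return labeled, unlabeled

def make_batch(labeled, unlabeled, n, ratio, rng):
    """One batch with n labeled samples and ratio * n unlabeled samples."""
    return rng.sample(labeled, n), rng.sample(unlabeled, int(ratio * n))

# Toy training set of 400 samples with alternating labels.
train = [([float(i)], i % 2) for i in range(400)]
labeled, unlabeled = split_labeled_unlabeled(train, num_labeled=50, ratio=5)
batch_l, batch_u = make_batch(labeled, unlabeled, n=8, ratio=5,
                              rng=random.Random(1))
```

Drawing the unlabeled samples only from the data left over after the labeled selection keeps the two sets disjoint.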

Model Results
The model is implemented in PyTorch and trained on a 32-core Tesla P40 computing card. Accuracy is selected as the evaluation metric, and the results are shown in Table 4. MeanTeacher+ represents MeanTeacher with Jitter data enhancement. Supervised represents a CNN-LSTM model that uses only labeled data. The training process is shown in Figure 7. It can be seen that the model converges after 70 epochs. According to Figure 8, the accuracy of MeanTeacher+ is higher than that of supervised learning in the cases of 50, 200, and 1000 labeled samples. This is because supervised learning uses only a small number of labeled samples for model training, so the model over-fits easily, resulting in poor performance. In contrast, MeanTeacher+ also inputs the unlabeled data into the model for training, and the consistency loss provided by the unlabeled data can make the classification decision boundary fall in the low-density area. By comparison, the performance of Pseudo-label is slightly worse. Pseudo-label first uses the labeled data to train the model, and the trained model is then used to predict pseudo-labels for the unlabeled data. Because the initial labeled data are few and the accuracy of the model is limited, the pseudo-labels of the unlabeled data are likely to be wrong. A large percentage of wrong pseudo-labels are then used for training, which has a negative impact on the performance of the model. When the ratio of unlabeled samples to labeled samples is 5, in the case of 50 labeled samples, the MeanTeacher+ model is 3.56% better than supervised learning; in the case of 200 labeled samples, the highest prediction accuracy 10 min in advance is 87.43%, which is only 1% lower than that of supervised learning using 1000 labeled samples. The results suggest that MeanTeacher+ with 200 labeled samples can achieve a predictive result similar to that of supervised learning, which needs more than 1000 samples. In the case of 1000 labeled samples, the accuracy of MeanTeacher+ is almost equal to that of supervised learning using all labeled samples.
To further verify the effect, ablation experiments are conducted to compare different machine-learning models and to assess the effect of the main components of MeanTeacher+. Machine-learning models such as SVM, Random Forest, LightGBM, XGBoost, and CNN-LSTM are trained in supervised mode with 200 labeled samples. MeanTeacher− means MeanTeacher+ without feature selection and Jitter noise, while MeanTeacher means MeanTeacher+ without Jitter noise. All the MeanTeacher models are trained with 200 labeled samples under λ = 5. The results are listed in Table 5.

Sensitivity Analysis
To verify the effect of the main parameter, the unlabeled data ratio λ, we vary λ and compare the accuracies for a given number of labeled samples L. Figure 9 shows the accuracy of MeanTeacher with different values of λ in each training batch, given 50 labeled samples. It can be seen that the accuracy of the algorithm improves as λ increases. This result is valuable for real-world engineering applications: unlabeled samples are quite easy to obtain, so it is possible to improve the accuracy of the model by increasing the number of unlabeled samples. The sliding window width is another key parameter of our model. The width represents the model's field of view. The larger the width, the more information the model can obtain. More information can help the model improve its performance, but it also brings expensive training costs and makes convergence harder. Hence, the width should be set properly. Several experiments are conducted to find a suitable width, and the results are shown in Figure 10. It can be seen that 500 is a proper width; when the width exceeds 500, the accuracy begins to decrease.

Conclusions
In order to predict drilling overflow, a three-step feature selection algorithm is applied to extract 22 effective features from historical drilling data. The Mean Teacher semi-supervised learning framework is employed to train on labeled and unlabeled data at the same time, in which the CNN-LSTM fusion network works as the time-series prediction model, and Jitter noise is added to the time-series data as enhancement to prevent over-fitting. Compared with supervised learning and other semi-supervised learning frameworks, the results show that, with a labeled-to-unlabeled data ratio of 1:5, our model needs only 200 labeled samples to achieve the effect of the supervised learning method with 1000 samples, and the prediction accuracy reaches 87.43% 10 min in advance. Therefore, the drilling overflow prediction algorithm based on semi-supervised learning designed in this paper can predict drilling overflow accidents in advance to ensure drilling safety, even when the amount of labeled data is limited.
The success of a deep-learning model is based on the assumption that the distributions of the training data and the test data are consistent. Because the data distributions of different wells differ, a deep-learning model has poor prediction ability when facing new wells. To predict overflow for a new well, it is still necessary to label the new well data for model re-training and re-prediction. However, manual labeling is time-consuming and depends heavily on the experience of practitioners, and often only a small amount of labeled data can be obtained. The main advantage of this algorithm is that it achieves high accuracy with a small amount of labeled data. It not only reduces the labeling workload but also predicts overflow accurately.
According to the above process, when facing the prediction problem of a new well, the limitation of this algorithm is that it still needs a small amount of labeled data from the new well, and this labeled data must contain both normal and overflow data. However, it is quite difficult to acquire such data completely during the initial drilling process, which means that it is hard to apply the model to a new well within a short time. To address this issue, we will study transfer learning and realize semi-supervised domain adaptation with a small amount of labeled data from the new well and source-domain samples from previous wells. Besides, the proposed model is a general model, and we will transfer it to predict other complex working conditions such as mud losses and pipe sticking.

Conflicts of Interest:
The authors declare no conflict of interest.

Figure 5. Enhancement effect of different time series. V.P. represents Stand Pipe pressure. P.I.2 represents Pump impulse. M.B.P. represents Wellhead pressure. I.F. represents Inlet flow. O.F. represents Outlet flow.

Figure 8. Accuracy of various algorithms under different labeled samples.

Figure 10. Model accuracy and loss with different sliding window widths.

Author Contributions:
Data curation, M.C.; funding acquisition, W.L.; investigation, W.L.; methodology, W.L. and X.H.; project administration, J.F.; resources, J.F. and M.C.; software, J.F., M.C. and X.H.; visualization, Y.L.; writing-original draft, Y.L. and X.H.; writing-review & editing, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding:
The authors are grateful for the support of the National Key Research and Development Program of China (2019YFA0708304, 2021YFF1201203, 2021YFF1201205), the National Natural Science Foundation of China (61972174 and 62172187), the Key core technology research project of CNPC (2020B-4019), the Science and Technology Planning Project of Guangdong Province (2020A0505100018), Guangdong Universities' Innovation Team Project (2021KCXTD015) and Guangdong Key Disciplines Project (2021ZDJS138).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
Not applicable.

Table 1. Descriptive statistics of data.

Table 2. Data feature table after feature selection.

Table 4. Accuracy of various algorithms under different label samples.

Table 5. Results of ablation experiments. The table shows that CNN-LSTM performs better than the other supervised models, and that feature selection and Jitter noise improve the performance of MeanTeacher.