An Anomaly Detection Model for Oil and Gas Pipelines Using Machine Learning

Abstract: Detection of minor leaks in oil or gas pipelines is a critical and persistent problem in the oil and gas industry. Many organisations have long relied on fixed hardware or manual assessments to monitor leaks. With rapid industrialisation and technological advancements, innovative engineering technologies that are cost-effective, faster, and easier to implement are essential. Herein, machine learning-based anomaly detection models are proposed to solve the problem of oil and gas pipeline leakage. Five machine learning algorithms, namely, random forest, support vector machine, k-nearest neighbour, gradient boosting, and decision tree, were used to develop detection models for pipeline leaks. The support vector machine algorithm, with an accuracy of 97.4%, outperformed the other algorithms in detecting pipeline leakage and thus proved its efficiency as an accurate model for detecting leakage in oil and gas pipelines.


Introduction
In the oil and gas industry, various problems and anomalies could damage oil and gas pipelines, which could ultimately result in human injuries and financial loss. A few examples of these anomalies include corrosion, leakage, and rust. Oil and gas leakage can be dangerous for people's health and the surrounding environment. Additionally, leakage of gases such as isobutane and propane into the atmosphere is very harmful because of their effect on ozone depletion or global warming. Therefore, a number of studies have been published to develop a gas leak detection model [1]. Recent advancements in artificial intelligence (AI) and data sensing have created new opportunities to solve challenging problems in environmental monitoring, such as solid waste, air, and wastewater pollution [2].
AI is one of the most useful technologies of this age. It encompasses a wide array of technologies, including machine learning (ML) and deep learning (DL), which can be used in various fields such as industry, health, and the economy [3]. Furthermore, AI plays a pivotal role in improving the oil and gas industry, and various ML- and DL-based AI techniques have been used to detect anomalies in pipelines. In previous studies, several ML and DL models were implemented to detect oil and gas leakage in pipelines. In ref. [4], the authors aimed to reduce environmental pollution by developing an ML model to detect oil and gas leakage. The model achieved an accuracy of 98.57%. Similarly, the model presented in ref. [5] achieved an accuracy of 99.4% in detecting leakage. In another study, a mask region-based convolutional neural network (Mask R-CNN) and the Visual Geometry Group 16 (VGG-16) model were employed to locate and identify oil spills in pipelines [6]. The model achieved an accuracy of 93%, which is higher than that of the model of ref. [7], which also used a CNN to demonstrate recent findings. The main contributions of this study are as follows:
1. An automated system is developed to identify anomalies in oil and gas pipelines;
2. A comparison of five ML algorithms to detect pipeline leakage using industrial datasets is performed;
3. An evaluation methodology in terms of accuracy, precision, recall, F1-score, and ROC-AUC is proposed;
4. An optimisation technique is used to increase the performance of the proposed models.
Improving the economic performance of several large companies in the oil and gas industry will strengthen the economy of many countries, such as Saudi Arabia. As stated earlier, the authors developed a novel ML-based solution for pipeline leak detection. Five ML models, namely, support vector machine (SVM), k-nearest neighbours (KNN), random forest (RF), gradient boosting (GB), and decision tree (DT), were used and compared. The proposed model is anticipated to bring many benefits to the Saudi market and even to the global market.
The remainder of this paper is organised as follows. Section 2 summarises the related studies in leakage detection systems using AI techniques. Section 3 introduces the proposed methodology. Section 4 discusses the results of the proposed model. Finally, Section 5 presents the concluding remarks and some suggestions for future work.

Related Work
Wang et al. [4] proposed a model that fuses distributed vibration and temperature information to detect oil and gas pipeline leaks, based on the measurement ability of distributed optical fibre sensors. Their prime goal was to reduce environmental pollution and economic loss by monitoring pipelines and recognising leakage in a timely manner. The researchers used five different classification models, namely naive Bayes, KNN, DT, RF, and backpropagation neural networks. The models can recognise normal operation, interference, and the leakage state. A comparison of the classifiers showed that RF performed best with five vibration attribute values and six temperature attribute values. The RF classifier reached an accuracy of 98.57% in recognising oil and gas pipeline leakage.
Lu et al. [9] proposed a model that extracts pipeline features to detect leakage. The continuous expansion of pipeline networks and the lack of research on pipeline leakage recognition using leak features were the two main driving factors of this study. A combination of variational mode decomposition and SVM was proposed to extract pipeline leakage characteristics. The researchers employed three kernel functions, namely the polynomial kernel, linear kernel, and radial basis function (RBF) kernel. They found RBF to be the optimum kernel function, with 96% accuracy, 92% specificity, and 100% sensitivity. In addition to demonstrating the effectiveness of the proposed method on the experimental data, the researchers assessed the method in a practical application.
Xiao [5] proposed a model that uses acoustic signals to detect gas pipeline leakage. The main goal of this research was to protect society from damage caused by gas pipeline leaks. The proposed method employed SVM and the wavelet transform. The latter was used to preprocess the acoustic sensor signals, and an entropy-based algorithm was used to select the optimal wavelet basis, followed by the extraction of leak-related information from the acoustic signals. The Relief-F algorithm was used for feature selection, and its output was fed as an input to the SVM model to detect gas pipeline leakage. The proposed method proved effective, reaching an accuracy of 99.4% in classifying events as leaks or no leaks using the three most discriminative features and 95.6% using the five most discriminative features.
De Kerf et al. [7] proposed a model for detecting oil leakage inside a port environment using thermal infrared (IR) cameras and unmanned aerial vehicles (UAVs). The IR images were necessary to detect oil leakage at night. The researchers presented a method to annotate the red, green, and blue (RGB) images and match them with the IR images to build the dataset. The collected images were resized and used to train a CNN. Once trained, the network enabled frequent inspection for oil leakage on the water at a low cost. During the test stage, the researchers were able to detect oil leakage on water with an accuracy of 89%. The implemented solution can decrease the cost of cleaning up oil leakage in water, minimise human interaction during the process, and increase the detection rate. Further improvements could be achieved in the future using other camera technologies and more advanced preprocessing techniques.
Ghorbani and Behzadan [6] developed different models for oil spill detection to help people take adequate action effectively and immediately and to mitigate the overall damage. Two deep learning models, Mask R-CNN and VGG-16, were used to locate and identify oil leakage. A dataset containing 1292 images was created through web mining. The VGG-16 model was used for image classification to predict whether an image shows oil leakage, and it reached an accuracy of 93%. The Mask R-CNN model was used for segmentation to detect oil leakage and to mark the boundaries of the spill at the pixel level, and it yielded an average recall and precision of 70% and 61%, respectively. The resultant models can create more opportunities for integrating data analytics and AI into up- and downstream operations in the oil and gas industry, as well as for detecting environmental pollutants using non-intrusive techniques. To increase the likelihood of technology adoption by the oil and gas industry and to reduce the implementation cost, the researchers worked with RGB images as inputs because the drones used are equipped with RGB cameras. With certain modifications, the same method can be used for other types of inputs (e.g., thermal and infrared images).
Melo et al. [10] introduced different techniques for the detection of natural gas leakage in oil facilities. Different CNNs were proposed to detect the leakage of natural gas. The dataset contained 2980 images divided into two classes, namely, 'with leak' (980 images) and 'without leak' (2000 images). The performance of 27 different CNN models was evaluated to find the most accurate one. The best-performing model used the SGDM optimisation algorithm, an architecture with 18 convolution layers, and the dropout regularisation technique, and it yielded an accuracy of 99.78% and a false-negative rate of 0%. In the future, the researchers plan to evaluate the generalisation ability of the model on unseen images of different types.

Methodology
The methodology followed in this study comprises the main steps for building an ML model. The first step involves the collection of the required dataset and a preprocessing phase. The second step involves training the proposed model and evaluating its performance. A more detailed description of the methodology is given in this section. Figure 1 summarises the methodology of this study.

Data Collection
An open-source dataset obtained from GitHub was used in this study [11]. The dataset was released for public use in ML and other statistical studies. It was originally proposed with a regression target, the corrosion defect. The dataset contains eight numerical features and 10,293 instances. Additionally, the dataset can be used for both regression and classification problems, and it is split into training and testing sets. Table 1 describes its features.

| Features | Description |
| --- | --- |
| | The temperature of the wellhead |
| Wellhead press (psi) | The pressure of the wellhead |
| MMCFD gas | Million standard cubic feet per day of gas |
| BOPD | Barrel of oil produced per day |
| BWPD | Barrel of water produced per day |
| BSW | Basic solid and water |
| CO2 mol. | Molecular mass of CO2 |
| Gas Grav. | Gas gravity |
| CR | Corrosion defect |
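For readers who wish to load data shaped like Table 1, one possible in-memory representation is sketched below. The field names, the CSV header, and the two sample rows are assumptions made purely for illustration; none of them is taken from the dataset of ref. [11].

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass
class PipelineRecord:
    """One instance: the eight features of Table 1 plus the CR target."""
    wellhead_temp: float   # temperature of the wellhead
    wellhead_press: float  # pressure of the wellhead (psi)
    mmcfd_gas: float       # million standard cubic feet per day of gas
    bopd: float            # barrels of oil produced per day
    bwpd: float            # barrels of water produced per day
    bsw: float             # basic solid and water
    co2_mol: float         # CO2 mol.
    gas_grav: float        # gas gravity
    cr: float              # corrosion defect (regression target)


def load_records(text):
    """Parse CSV text whose header matches the (assumed) field names."""
    reader = csv.DictReader(io.StringIO(text))
    names = [f.name for f in fields(PipelineRecord)]
    return [PipelineRecord(**{n: float(row[n]) for n in names}) for row in reader]


# Hypothetical rows, invented only to exercise the parser.
SAMPLE_CSV = """wellhead_temp,wellhead_press,mmcfd_gas,bopd,bwpd,bsw,co2_mol,gas_grav,cr
60.0,1200.0,5.5,800.0,150.0,15.8,2.1,0.75,0.18
72.5,1450.0,7.2,950.0,300.0,24.0,3.4,0.81,0.27
"""

records = load_records(SAMPLE_CSV)
print(len(records), records[0].cr)  # 2 0.18
```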

Data Preprocessing
The success of ML algorithms depends on various factors. The first factor is the quality and representation of the instances in the dataset. The training phase requires reliable data that does not contain noisy or redundant values. Data preparation and filtering are therefore important steps in solving ML problems. Data preprocessing includes data cleaning, feature normalisation, and feature extraction [12].
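The cleaning step can be illustrated with a minimal sketch that drops instances with missing values and removes redundant (duplicate) instances. The row layout, the encoding of missing readings as None or NaN, and the sample values are assumptions; the text does not state which cleaning operations were actually applied.

```python
import math


def clean_rows(rows):
    """Drop rows with missing values and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        # A missing reading is represented here as None or NaN (an assumption).
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in row):
            continue
        if row in seen:  # redundant instance
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


# Hypothetical rows of (pressure, gas rate, corrosion defect).
raw = [
    (1200.0, 5.5, 0.18),
    (1200.0, 5.5, 0.18),          # duplicate of the first row
    (1450.0, None, 0.27),         # missing value
    (1300.0, float("nan"), 0.2),  # missing value encoded as NaN
    (1500.0, 7.2, 0.31),
]
print(clean_rows(raw))  # [(1200.0, 5.5, 0.18), (1500.0, 7.2, 0.31)]
```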

Label Binarizing
There is a noteworthy difference between regression and classification problems. The former is concerned with predicting a quantity, while the latter is concerned with predicting a label. Thus, label binarizing was used in this study to convert a regression problem into a classification problem [13].
The authors converted the target attribute from regression to classification to build the said models. Values less than or equal to 0.211 were treated as 'low', and values greater than 0.211 were treated as 'high'. Figure 2 illustrates label binarizing. The figure shows the number of instances in each class; the 'high' class contains 5491 samples, whereas the 'low' class contains 4801 samples.
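The thresholding rule described above can be sketched as follows. The cut-off of 0.211 and the class names come from the text, whereas the function name and the sample values are hypothetical.

```python
from collections import Counter

THRESHOLD = 0.211  # cut-off stated in the text


def binarize(cr_value, threshold=THRESHOLD):
    """Map a continuous corrosion-defect value to a class label."""
    return "low" if cr_value <= threshold else "high"


# Hypothetical target values, chosen to fall on both sides of the cut-off.
cr_values = [0.05, 0.211, 0.212, 0.4, 0.19]
labels = [binarize(v) for v in cr_values]
print(labels)           # ['low', 'low', 'high', 'high', 'low']
print(Counter(labels))  # number of instances in each class
```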

Features Scaling
Feature scaling is a technique used to normalise the range of the independent features or variables of the data. It is performed during the preprocessing stage and is also known as data normalisation. Feature scaling can be carried out using either data standardisation or normalisation [14].
Data normalisation improves the performance of the ML model and helps to generate a model that predicts with high accuracy. It is also known as min-max normalisation or min-max scaling, and it rescales the range of each feature to [0,1]. Normalisation uses the general formula
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
Here, x' is the new value, x is the original value, and min(x) and max(x) are the minimum and maximum values of the feature, respectively [14,15].
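Min-max scaling of a single feature column can be sketched as follows. The function name and the sample pressures are hypothetical, and mapping a constant column to 0.0 is a choice made for this sketch to avoid division by zero, not something the text prescribes.

```python
def min_max_scale(values):
    """Rescale a feature column to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical wellhead pressures in psi.
pressures = [1000.0, 1250.0, 1500.0, 2000.0]
scaled = min_max_scale(pressures)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

When the data are split, the minimum and maximum are usually computed on the training sample only and then reused to scale the testing sample.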

Classification and Model Design
After preprocessing and cleaning the dataset, the authors built ML models. To build the models, the dataset was divided into samples to train and test the model. The performance of the model was measured in terms of the accuracy of the model on the testing sample [16]. The most common approaches for splitting the dataset are 7:3 (training:testing) and 10-fold cross-validation (CV). In the 7:3 approach, the dataset is divided into two samples, one for training and the other for testing. The training sample represents 70% of the dataset, and the testing sample is the remaining 30% [17]. The training sample is used to train the model and enhance its ability to learn the complexity behind the features of the dataset, while the testing sample is used to measure the performance of the model on unseen data.
In a 10-fold CV, the dataset is divided into 10 folds, and the model is trained 10 times. In each iteration, nine folds are used to train the model, and the remaining fold is used to test its performance. The average accuracy is calculated at the end of this process [18].
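Both splitting schemes reduce to simple index bookkeeping, as the following sketch suggests. The function names, the fixed shuffling seed, and the interleaved fold assignment are assumptions; the text does not say whether the data were shuffled or stratified.

```python
import random


def split_70_30(n_samples, seed=0):
    """Shuffle the indices and return (train, test) in a 7:3 ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = n_samples * 7 // 10
    return indices[:cut], indices[cut:]


def k_fold(n_samples, k=10):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]  # every k-th index, starting at `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


N_INSTANCES = 10293  # dataset size stated in the text

train_idx, test_idx = split_70_30(N_INSTANCES)
print(len(train_idx), len(test_idx))  # 7205 3088

folds = list(k_fold(N_INSTANCES, k=10))
print([len(test) for _, test in folds])  # three folds of 1030, seven of 1029
```

A model would be trained on each `train` list and scored on the matching `test` list; averaging the ten scores gives the cross-validated accuracy.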

Support Vector Machine
SVM, a supervised learning approach, is one of the most popular and simplest ML techniques because its solutions are often optimal and unique. In addition, it has good generalisability due to the principle of structural risk minimisation. This principle reduces the confidence interval while keeping the value of the training error constant [19,20]. SVM can be used for both regression and classification because it draws on ML theory to maximise the predictive accuracy while avoiding over-fitting [21]. When using SVM, it must be considered that it is a nonparametric (sparse) technique: all the training data must be stored in memory during the training phase to determine the model's parameters, whereas future predictions rely only on the support vectors, which are a subset of the training cases [19].
As shown in Figure 3, the support vectors are the points lying closest to a straight line called the hyperplane, which is used to separate and classify the data. The idea of SVM is to find the hyperplane that achieves maximum separation [20]. Furthermore, the representation of this hyperplane varies depending on whether the data can be easily separated, which results in two types of SVM classifiers, linear and nonlinear. Several candidate hyperplanes are shown in Figure 3, and an SVM selects the best among them.
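To make the idea of a separating hyperplane concrete, the sketch below trains a linear SVM by plain sub-gradient descent on the regularised hinge loss. This is a swapped-in, simplified optimiser that only approximates the maximum-margin hyperplane; it is not the solver used in this study, and the data, the learning rate `lr`, the regularisation strength `lam`, and the number of epochs are all hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def train_linear_svm(points, labels, lam=0.01, lr=0.1, n_epochs=10):
    """Sub-gradient descent on lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b))."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(n_epochs):
        for x, y in zip(points, labels):
            margin = y * (dot(w, x) + b)
            grad_w = [lam * wi for wi in w]  # contribution of the regulariser
            grad_b = 0.0
            if margin < 1:  # point violates the margin: hinge term is active
                grad_w = [g - y * xi for g, xi in zip(grad_w, x)]
                grad_b = -y
            w = [wi - lr * g for wi, g in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify by the side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical, linearly separable points with labels +1 / -1.
points = [(2.0, 1.0), (1.0, 2.0), (-2.0, -1.0), (-1.0, -2.0)]
labels = [1, 1, -1, -1]

w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])  # [1, 1, -1, -1]
```

The points with the smallest value of `y * (dot(w, x) + b)` are the ones that constrain the hyperplane and play the role of the support vectors.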

Decision Tree
DT is a supervised algorithm used to solve both classification and regression problems. It creates a predictive model that predicts the value or category of the target by learning simple decision rules derived from the training data. To predict the class of a record, the algorithm starts from the root of the tree and compares the value of the root attribute with the corresponding attribute of the record. Based on this comparison, it follows the branch corresponding to that value and moves to the next node [22]. Figure 4 shows the DT classifier.
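The root-to-leaf traversal described above can be sketched as follows. The tree is built by hand rather than learned, and its feature names, thresholds, and class labels are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Internal node (feature and threshold) or leaf (label only)."""
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold
    label: Optional[str] = None


def predict(node, record):
    """Walk from the root to a leaf, comparing one attribute per node."""
    while node.label is None:
        value = record[node.feature]
        node = node.left if value <= node.threshold else node.right
    return node.label


# Hypothetical tree: feature names and thresholds are invented for illustration.
tree = Node(
    feature="wellhead_press", threshold=1300.0,
    left=Node(label="low"),
    right=Node(
        feature="co2_mol", threshold=2.5,
        left=Node(label="low"),
        right=Node(label="high"),
    ),
)

print(predict(tree, {"wellhead_press": 1200.0, "co2_mol": 3.0}))  # low
print(predict(tree, {"wellhead_press": 1450.0, "co2_mol": 3.0}))  # high
```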

Random Forest
RF is one of the most widely used supervised ML algorithms. It builds an ensemble of decision trees and trains them using the bagging method; therefore, it is called a 'random forest' [24]. Bagging aims to integrate several learning models to improve the overall performance of the result [24]. In recent years, this algorithm has gained popularity owing to its simplicity and its versatility in being applied to both classification and regression problems. Moreover, each individual tree in the forest predicts a class, and the class that obtains the highest number of votes becomes the prediction of the model [24]. Figure 5 depicts the functioning of the RF algorithm.
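Bagging and majority voting can be sketched with one-split trees (stumps) on a single feature. The data and the hand-picked bootstrap draws are hypothetical and chosen so that the example is reproducible; the per-split random feature selection that random forests usually add is deliberately left out.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows):
    """One-split tree on (value, label) rows, chosen to minimise training errors."""
    fallback = majority([label for _, label in rows])
    best = (None, fallback, fallback)  # (threshold, label if <=, label if >)
    best_errors = sum(label != fallback for _, label in rows)
    values = sorted({value for value, _ in rows})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = majority([l for v, l in rows if v <= threshold])
        right = majority([l for v, l in rows if v > threshold])
        errors = sum(l != (left if v <= threshold else right) for v, l in rows)
        if errors < best_errors:
            best, best_errors = (threshold, left, right), errors
    return best


def stump_predict(stump, value):
    threshold, left, right = stump
    return left if threshold is None or value <= threshold else right


def bootstrap_draws(n_rows, n_trees, seed=0):
    """Row indices sampled with replacement, one list per tree."""
    rng = random.Random(seed)
    return [[rng.randrange(n_rows) for _ in range(n_rows)] for _ in range(n_trees)]


def forest_predict(forest, value):
    """Each tree votes; the class with the most votes wins."""
    return majority([stump_predict(stump, value) for stump in forest])


# Hypothetical (feature value, class) rows.
rows = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
        (7.0, "high"), (8.0, "high"), (9.0, "high")]

# Hand-picked draws keep the example reproducible; bootstrap_draws(len(rows), 3)
# would produce random ones instead.
draws = [[0, 2, 2, 3, 5, 5], [0, 1, 1, 3, 3, 4], [2, 2, 4, 4, 5, 5]]
forest = [train_stump([rows[i] for i in sample]) for sample in draws]

print([stump[0] for stump in forest])  # thresholds: [5.0, 4.5, 5.5]
print(forest_predict(forest, 4.8))     # low (two votes to one)
```

Because each tree sees a different bootstrap sample, the trees learn slightly different thresholds, and the vote settles the cases on which they disagree.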

Multiple uncorrelated trees working together have been shown to be more effective than a single isolated tree [26]. The trees protect and shield each other from their individual errors, and this protection holds as long as they do not all err in the same direction. A key question is how the RF algorithm ensures that the behaviour of each individual tree does not correlate too strongly with that of the other trees in the model.
Finally, RF is a very useful and versatile algorithm that can be used for both regression and classification. Furthermore, if there are enough trees in the forest, the overfitting problem is largely avoided, and highly accurate predictions are achieved [27]. An unfavourable aspect of the RF algorithm is that, when many trees are used, it can become inefficient and extremely slow for real-time predictions. More trees are required for a more accurate prediction, which results in a slower model [28]. In most real-world applications, RF works well, but when run-time performance is critical, other approaches may be more effective [28].

K-Nearest Neighbour
KNN is a supervised algorithm used to solve both classification and regression problems. It is known as a lazy learning algorithm because it only stores the data during the training phase without performing any calculations on it. The algorithm predicts the category of a test point by computing the distance between that point and the training data. It determines the k training points closest to the test point, calculates the probability of the test point belonging to each category among these k neighbours, and chooses the category with the highest probability. The parameter k represents the number of nearest neighbours included in the voting process. The distance between a data point and its neighbours can be calculated as the Euclidean, Manhattan, Hamming, or Minkowski distance, among others. Among these metrics, the Euclidean distance is the most widely used [29]. Figure 6 shows the Euclidean distance of the KNN classifier.
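The distance metrics listed above can be sketched as follows; the Euclidean and Manhattan distances are the special cases p = 2 and p = 1 of the Minkowski distance. The function names and the sample points are chosen for illustration only.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean(a, b):
    return minkowski(a, b, 2)


def manhattan(a, b):
    return minkowski(a, b, 1)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))            # 5.0
print(manhattan(p, q))            # 7.0
print(hamming("10110", "10011"))  # 2
```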
Figure 6 depicts a graph containing two classes, A and B, and a new data point whose class needs to be predicted. Using the Euclidean distance equation with a value of k equal to 5, the distances between the new point and the other data points can be calculated to obtain the nearest neighbours [30]. Figure 7 shows the classification of KNN.
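The neighbour search can be sketched as follows. The coordinates are hypothetical stand-ins for the points plotted in Figure 6, which the text does not list, and `nearest_neighbours` is an illustrative helper rather than a library function.

```python
import heapq
import math

# Hypothetical coordinates standing in for the two classes of Figure 6.
class_a = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.5), (0.5, 3.0)]
class_b = [(4.0, 4.0), (4.5, 3.0), (5.5, 4.5), (6.0, 6.0)]
new_point = (3.0, 2.5)

labelled = [(pt, "A") for pt in class_a] + [(pt, "B") for pt in class_b]


def nearest_neighbours(query, samples, k):
    """Return the k (distance, label) pairs closest to `query`."""
    distances = [(math.dist(query, pt), label) for pt, label in samples]
    return heapq.nsmallest(k, distances)


neighbours = nearest_neighbours(new_point, labelled, k=5)
print([label for _, label in neighbours])  # ['A', 'A', 'B', 'B', 'A']
```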
As shown in Figure 7, three of the five nearest neighbours are from class A and two are from class B, so the new point is assigned to class A [30].
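The vote itself takes only a few lines. The labels below restate the three class-A and two class-B neighbours mentioned above; the variable names are illustrative.

```python
from collections import Counter

# Labels of the five nearest neighbours, as described for Figure 7.
neighbour_labels = ["A", "A", "A", "B", "B"]

votes = Counter(neighbour_labels)
probabilities = {label: count / len(neighbour_labels) for label, count in votes.items()}
predicted = votes.most_common(1)[0][0]

print(probabilities)  # {'A': 0.6, 'B': 0.4}
print(predicted)      # A
```

With two classes, an odd value of k rules out tied votes.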

Gradient Boosting
GB is a supervised algorithm used to build a predictive ML model. To integrate individual decision trees into the model, a method called 'boosting' is used. Boosting means developing a strong learner by merging several weak learners into a single chain. The DTs in this algorithm represent the weak learners. The model is characterised by high efficiency and accuracy because each tree works to fix the errors of the tree that precedes it. However, although the sequential addition of trees improves performance, it slows the learning process. In addition, the model relies on the loss function to detect the residuals. For example, the logarithmic loss is used in classification tasks, whereas regression tasks typically use the squared error. Figure 8 shows how the GB algorithm works [31].
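The sequential error correction can be sketched with regression stumps fitted to the residuals of the logarithmic loss. This is a simplified illustration under stated assumptions (one feature, labels 0 and 1, a hypothetical learning rate of 0.5, and leaf values equal to the mean residual); it is not the implementation used in this study.

```python
import math


def fit_stump(xs, residuals):
    """Regression stump: the split with the lowest squared error on the residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=3, learning_rate=0.5):
    """Chain stumps so that each one corrects the residuals left by its predecessors."""
    scores = [0.0] * len(xs)  # log-odds of class 1, starting at p = 0.5
    stumps, losses = [], []
    for _ in range(n_trees):
        probs = [1.0 / (1.0 + math.exp(-s)) for s in scores]
        losses.append(-sum(math.log(p if y == 1 else 1.0 - p)
                           for p, y in zip(probs, ys)) / len(ys))
        residuals = [y - p for y, p in zip(ys, probs)]  # negative gradient of the log loss
        threshold, left, right = fit_stump(xs, residuals)
        # Shrink the correction by the learning rate before adding it to the chain.
        stump = (threshold, learning_rate * left, learning_rate * right)
        stumps.append(stump)
        scores = [s + (stump[1] if x <= threshold else stump[2])
                  for s, x in zip(scores, xs)]
    return stumps, losses


def predict(stumps, x):
    """Class 1 when the summed corrections give positive log-odds."""
    score = sum(left if x <= threshold else right for threshold, left, right in stumps)
    return 1 if score > 0 else 0


# Hypothetical one-feature data: 0 = 'low', 1 = 'high'.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]

stumps, losses = boost(xs, ys)
print([predict(stumps, x) for x in xs])  # [0, 0, 1, 1]
print(losses)  # log loss measured before each tree is added
```

In this example the loss shrinks with every added stump, which is the sequential improvement the paragraph describes; the price is that the stumps cannot be trained in parallel.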

Parameter Tuning
Hyperparameters are the configuration values that define the model architecture and control the learning process. Hyperparameter tuning refers to the process of searching for and selecting the hyperparameter values that yield the best model performance. The value of a hyperparameter cannot be estimated from the data and must be set before initiating the learning process [32].

Grid Search
Grid search is a basic hyperparameter tuning method. GridSearchCV performs the grid search by generating candidates from a grid of parameter values. Furthermore, the GridSearchCV instance implements the usual estimator application programming interface. When the grid search is fitted on the dataset, all the possible combinations of parameter values are evaluated and the best combination is retained [33]. Table 2 shows the optimal values for each parameter.
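The exhaustive evaluation performed by a grid search can be sketched without any library. The snippet below is an illustrative stand-in, not the GridSearchCV implementation: the parameter names `C` and `gamma`, the candidate values, and the `toy_score` function (which takes the place of training a model and returning its validation accuracy) are all hypothetical and are not the values reported in Table 2.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in `param_grid` and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # The Cartesian product enumerates all possible parameter combinations
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(C, gamma):
    """Hypothetical score that peaks at C=10, gamma=0.1 (stands in for validation accuracy)."""
    return 1.0 - (abs(C - 10) / 100 + abs(gamma - 0.1))


param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the number of candidate values per parameter (here 4 × 3 = 12 evaluations), which is why grid search is usually restricted to a small number of hyperparameters. In practice, each evaluation would be a cross-validated model fit rather than a closed-form score.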

Evaluation Metrics
In addition to accuracy, other metrics, namely, the confusion matrix, precision, recall (or sensitivity), specificity, F-score, and receiver operating characteristic-area under the curve (ROC-AUC), are used to measure the performance of the models [34].
The confusion matrix measures the performance of the ML model by comparing the predicted values with the real values. Figure 9 shows a confusion matrix for a binary classification problem [35].

Figure 9. Confusion matrix [36].
The symbols TP, TN, FP, and FN indicate true positive, true negative, false positive, and false negative, respectively [35].
Accuracy represents the percentage of correctly predicted samples among all the samples in the testing set [37].
Precision represents the percentage of correctly predicted positive samples among all the positive predictions [37].
Recall (also known as sensitivity) represents the percentage of the positive samples that were correctly predicted among all the real positive samples [37].
Specificity represents the percentage of the negative samples that were correctly predicted among all the real negative samples [37].
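The four ratios defined above follow directly from the confusion-matrix counts. The sketch below shows the arithmetic; the counts passed in are made up for illustration and are not the results of this study, and the function assumes that no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # correct predictions over all samples
        "precision": tp / (tp + fp),                  # correct positives over predicted positives
        "recall": tp / (tp + fn),                     # correct positives over real positives
        "specificity": tn / (tn + fp),                # correct negatives over real negatives
    }


# Hypothetical counts for a binary problem with 200 test samples
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, recall (0.9) exceeds specificity (0.8), which shows how two models with the same accuracy can still differ in the kind of error they make.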
The F1-score is the harmonic mean of precision and recall. It is used to evaluate how well the model balances these two measures across the two classes [34,37]. The ROC curve plots the true positive rate against the false positive rate at various classification thresholds, and the area under it (AUC) summarises the ability of the model to distinguish between the two classes [38].

Results and Discussion
Table 3 shows the results of evaluating the models on the oil and gas pipeline leakage dataset prior to parameter optimisation. As evident from Table 3, the SVM model achieved the best performance with an accuracy of 96.1%, followed by the RF model with an accuracy of 91.56%. The other models achieved accuracies below 90%. Figure 10 shows the confusion matrix for the SVM model.
The confusion matrix shows that the model misclassified 178 samples in the 'high corrosion' class (0) and 224 samples in the 'low corrosion' class (1). This means that the model's ability to identify high corrosion is very high, which is needed to predict pipeline leakage. Table 4 shows the results of evaluating the models on the oil and gas pipeline leakage dataset after parameter optimisation.
As shown in Table 4, the performance of all the models improved after parameter optimisation. The SVM model achieved the best performance with an accuracy of 97.43%, followed by the RF model with an accuracy of 91.81%. The accuracy of RF improved only marginally compared with the previous experiment (from 91.56% to 91.81%). The performance of the GB model improved significantly, from 87.39% to 90.25%. Figure 11 shows the confusion matrix for the SVM model.
The confusion matrix shows that the model misclassified 119 samples in the 'high corrosion' class (0) and 145 samples in the 'low corrosion' class (1). The number of misclassified samples was reduced in both classes. Although the accuracy of the SVM model improved by only about 1.3 percentage points after optimisation (from 96.1% to 97.43%), the confusion matrix shows a great improvement in the model's ability to distinguish between the two classes. Moreover, the model's ability to identify high corrosion is very high, which is needed to predict pipeline leakage. Figure 12 shows the ROC-AUC curve of the SVM model.
The ROC-AUC curve shows that the ability of the model to differentiate between low and high corrosion is very high (0.97), which means that the model can be used in real-world applications with a high level of confidence.
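One way to read an ROC-AUC value is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. The following sketch computes the AUC through this pairwise interpretation; the labels and scores are made up for illustration and are unrelated to the dataset used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical ground-truth labels and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.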

Conclusions
This paper addressed one of the most prominent issues faced by oil and gas companies, namely, leakage in oil and gas pipelines. Several previous studies were reviewed to identify proposed solutions to the leakage problem and the algorithms that can be used. An appropriate dataset was identified, several predictive models were built using different ML algorithms, and the models were compared to select the best one in terms of performance. The models were evaluated on the oil and gas pipeline leakage dataset in two experiments, the first before parameter optimisation and the second after it. The results of the first experiment showed that all the proposed models achieved good performance in anomaly detection, scoring more than 83% on all the evaluation metrics. The SVM model outperformed the other models with an accuracy of 96.1%, followed by the RF model with an accuracy of 91.56%. In the second experiment, with the optimised models, the performance of all the models improved. The SVM model remained the best, with an accuracy of 97.43% and 97% in precision, recall, F1-score, and ROC-AUC. It was followed by the RF model with an accuracy of 91.81% and 92% in all the other metrics. The confusion matrix shows the model's ability to detect corrosion and distinguish between the two classes of high and low corrosion. According to these results, the proposed model achieved good performance on the industrial data used, meeting the goal of this study of being usable in the real world.
Using the proposed models, it is possible to develop systems capable of effectively identifying the unusual event of oil and gas pipeline leakage, thus facilitating the proper operation of the industry and avoiding any potential damage to the industrial companies and the surrounding environment. The only difficulty in this study was collecting a real