A Data-Driven Machine Learning Approach for Corrosion Risk Assessment—A Comparative Study

Abstract: Understanding the corrosion risk of a pipeline is vital for maintaining health, safety and the environment. This study implemented a data-driven machine learning approach that relied on Principal Component Analysis (PCA), Particle Swarm Optimization (PSO), Feed-Forward Artificial Neural Network (FFANN), Gradient Boosting Machine (GBM), Random Forest (RF) and Deep Neural Network (DNN) to estimate the corrosion defect depth growth of aged pipelines. By tuning the hyperparameters of the FFANN algorithm with PSO and using PCA to transform the operating variables of the pipelines, different Machine Learning (ML) models were developed and tested for the X52 grade of pipeline. The accuracy of the estimated corrosion defect growth was compared for PCA-transformed and non-transformed training data to establish the influence of the PCA transformation on the accuracy of the models. The analysis showed that ML modelling with PCA-transformed data has an accuracy that is 3.52 to 5.32 times better than modelling carried out without PCA transformation. The PCA-transformed GBM model had the best modelling accuracy amongst the tested algorithms; hence, it was used for computing the future corrosion defect depth growth of the pipelines. This helped to compute the corrosion risks using the failure probabilities at different lifecycle phases of the asset. The results of this study indicate that the technique is vital for the prognostic health monitoring of pipelines because it provides information for maintenance and inspection planning.


Introduction
The corrosion of pipelines has caused numerous problems for the operators of oil and gas companies due to the costs associated with its management. These problems have been predominantly caused by the operating parameters that are associated with water chemistry, flow attributes, the material properties of the pipeline and microbiological activities [1,2]. Corrosion mitigation is one of the challenges for oil and gas companies because of the complexities of corrosion defect initiation, stabilization and progression. To this end, relevant stakeholders have developed architectures that are capable of intelligently estimating the corrosion defect growth.
Uniform corrosion of carbon steel material depends on an electrochemical reaction at the steel surface, a diffusion process across the porous oxide film surfaces and the migration of the corrosion species [3]. This movement of the corrosive species is caused by the unpredictability of the concentration of the solvents in which the exchange of the cations and anions occurs [3]. This in turn results in the build-up of the protective oxide films that temporarily coat the surface of the steel material to prevent more corrosion [4]. However, due to the mechanical actions of the abrasive species and turbulence coming from the fluid flowing in the pipelines, the protective oxide coatings are eroded in a short time, resulting in more corrosion. The work of many researchers in this area has hinged on models that are mainly deterministic, stochastic and empirical, using field, simulated and experimental data [4][5][6][7][8][9][10]. Among these are the time-dependent stochastic models and the characterization of corrosion defect growth by exponential and logarithmic models in [11] and [12]. The researchers determined the reliability of pipelines at different times after exposure to corrosion and computed the corrosion growth rate over time. Other authors [13][14][15] applied the Markov decision process as a continuous-time, non-homogenous linear growth pure birth model and estimated the time-dependent corrosion growth. This method helped the researchers to evaluate corrosion defect depth using multi-regression and power law models, which were also used by other experts in corrosion studies [11,12].
Notwithstanding the gains made in corrosion evaluation via systematic modelling of the defect growth with time, numerous questions on corrosion defect growth estimation remain unanswered. In particular, many experts have not reached a conclusion on the trends and the expected timing of corrosion defect growth. As a result of these doubts, other scholars have turned toward artificial intelligence as a workable tool for establishing the corrosion behaviour of pipelines. For instance, the authors referenced in [13] predicted the metal loss in pipelines using an Artificial Neural Network (ANN) by considering field data that encompassed the geometric profile and flow characteristics of the pipeline. They proposed a non-deterministic artificial intelligence method that estimated the corrosion at different sections of the pipeline. Many other investigations on the use of artificial intelligence in the estimation of the corrosion defect growth of carbon steel materials used for pipelines also abound in the literature [14][15][16][17][18][19][20][21][22][23]. Notable among these studies is the work on fatigue crack growth where ANN was employed to investigate the corrosion-fatigue crack of a dual phase steel at different stress intensities in consideration of a martensite content of 32-76% [23]. Other researchers [17] also estimated the corrosion-fatigue growth with ANN. However, these authors concentrated on the length of the crack and the cyclic loading time of the pipeline during operation. Other research papers [14][15][16][17][18] also used machine learning strategies for establishing the corrosion defect growth of pipelines in varying conditions of operation.
Pipelines are one of the most important assets in the oil and gas industry because of the pivotal role they play in the transportation of products from the producing fields to the satellites where they are processed and transported to the refineries. However, due to the presence of corrosive and abrasive species in the oil and gas extracted from the reservoirs, the pipelines are continually subjected to internal corrosion that results in the loss of the pipe wall thickness. Because of the risk the pipe wall thickness loss poses to the integrity of the pipeline, operators make efforts to assess the integrity of the pipeline and implement mitigation programs such as the addition of corrosion inhibitors [16] to reduce the corrosion rate and the failure risk. Unfortunately, the addition of corrosion inhibitors has not stopped the corrosion and erosion mechanisms in the pipelines because of the presence of rough solid substances (such as sand) that tend to cause abrasive wear. Corrosion inhibitors have also been found to have a limited impact on the top of the line [24]. This has ramifications for pipeline corrosion at the 9 o'clock to 3 o'clock section (upper half) of the pipeline. Since older pipelines are more prone to corrosion failure and most of them are only monitored with corrosion coupons and probes installed at discrete points, it becomes important that other cost-effective corrosion defect depth monitoring strategies be adopted.
Since a simple or complex mathematical relationship between the corrosion defect depths and the operating parameters cannot be easily formulated, a machine learning implementation is presented in this paper. Consequently, it is assumed that the historical behaviour of the corrosive and abrasive actions of the operating parameters will continue to result in corrosion defect growth. Although intelligent pigging of pipelines is a well-established corrosion estimation practice in the oil and gas industry [25,26], the cost of running the operation is high and many older transmission pipelines were not designed for intelligent pigging. As a result, the corrosion growth rates of such pipelines can only be determined with the installed corrosion probes and coupons or by using ultrasonic thickness measurement. The use of these techniques requires physical measurement of the corrosion defects at different times and sections of the pipeline prior to modelling the corrosion defect growth. Given that a poor estimate of the future corrosion defect depth growth can lead to poor implementation of the corrosion risk assessment program, it is important that a modelling procedure that relies on computer algorithms be implemented. The use of a machine learning model will also make decisions on corrosion defects faster, since the entire procedure will depend on the data obtained from historic events, and will provide another line of defense in managing the risks associated with the operation of pipelines.
Since the importance of data-driven corrosion risk assessment cannot be overemphasized per the discussions so far, this research will determine the effect of different operating parameters, namely temperature (TM), operating pressure (PS), CO2 partial pressure (PCO2), chloride ion concentration (CL), sulphate ion concentration (SO), Basic Sediments and Water (BSW), oil production rate (BOPD), water production rate (BWPD), gas production rate (MMCFD), iron content (FE), alkalinity concentration (HCO3) and calcium concentration (CA), on the corrosion defect depths of pipelines by developing a technique for using the robust PCA transformed machine learning algorithm to determine the failure probability and reliability of the pipeline using the hazard function of the pipe-wall thickness loss.
Since the pipe-wall thickness loss of pipelines subjected to corrosion and erosion is a continuous process [27], this study will develop a model for the instantaneous and time-dependent corrosion defect depth growth, and the failure probability and reliability of pipelines. This enables experts to make decisions on the corrosion risk status of the pipeline at different instances. Furthermore, this model will be vital for short, medium and long-term planning of the pipeline integrity program and will help to reduce the cost and risk associated with the repeated monitoring of corrosion in unfriendly terrains.

Research Methodology
This study developed a corrosion risk assessment approach that relies on the corrosion defect depth growth of the pipelines using the historic operating parametric values and measured corrosion defect depths. The Machine Learning (ML) algorithms considered in this study, namely Feed-Forward Artificial Neural Network (FFANN), Gradient Boosting Machine (GBM), Random Forest (RF) and Deep Neural Network (DNN), were used to compute the corrosion defect depth growth for the PCA transformed training data and for training data that were not transformed. The best performing algorithm was used to determine the instantaneous and time-dependent corrosion defect depth growth using the historic trend of the operating parameters. This enabled the estimation of the failure probability, which is vital for corrosion risk assessment, inspection and maintenance planning.

Development of the Corrosion Risk Assessment (CRA) Model
The first step in the CRA model development is to determine, from the many parameters routinely measured by oil and gas companies, the input parameters required for estimating the corrosion defect depth. The parameters used as the input in the model, which are related to the water chemistry and flow characteristics of the oil and gas and are shown in Table 1, have been the subject of numerous pipeline corrosion studies in the recent past [1,2,9,27,28]. Stochastic techniques for the reliability and failure probability of corroded pipelines have been utilized by many researchers [5][6][7][10][11][12] for the CRA of aged pipelines. However, the complications associated with the physical modelling of many interacting operating parameters make it difficult to develop effective models. As a result, the combination of Machine Learning Algorithms (MLAs) and some of the established reliability and failure probability estimation techniques (Figure 1) will ease the intricacies of the physical model development and improve the accuracy of the CRA. The input parameters and the hyperparameters of the MLAs will be modified with PCA and PSO, respectively. This modification is aimed at improving the input data processing and, in turn, the estimated corrosion defect depths. The MLAs applied to the rudimentary (untransformed) and PCA transformed data (see Figure 1) will be compared using the Mean Absolute Error Percentage (MAEP) shown in Equation (1). This allows the best estimate of the corrosion defect depth to be used, thereby minimizing the level of uncertainty associated with the CRA and improving the integrity management procedure of the asset. In Equation (1), $DD_{act}$ is the actual defect depth, $DD_{prd}$ is the predicted defect depth and $n$ represents the number of samples.
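The MAEP comparison can be illustrated with a short sketch. The form used here, the mean of |DD_act - DD_prd| / |DD_act| expressed as a percentage, is an assumption based on the metric's name and the symbols defined for Equation (1); the function name `maep` is hypothetical.

```python
def maep(actual, predicted):
    """Mean absolute error percentage between actual and predicted defect depths.

    Assumed form: (100 / n) * sum(|DD_act - DD_prd| / |DD_act|).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for dd_act, dd_prd in zip(actual, predicted):
        # Relative absolute error of one sample; dd_act must be non-zero.
        total += abs(dd_act - dd_prd) / abs(dd_act)
    return 100.0 * total / len(actual)


# Two samples with relative errors of 50% and 25%.
print(maep([2.0, 4.0], [1.0, 5.0]))
```

A lower MAEP indicates a closer match between the predicted and measured defect depths.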
The instantaneous defect depth (IDD) represents the estimated corrosion defect depth obtained at any given instance of predicting the status of the pipeline. This is based on the prevailing operating conditions over the time intervals between successive measurements of the defect depths. Since the time-dependent defect depth (TDD) is the cumulative value of the IDD, the status of the pipeline over a given period of operation can be estimated. It is important to note that the behaviour of the pipeline is assumed to be within the scope of the historic information of the operating parameters and the corrosion defect depths. The IDD and TDD growth will therefore be modelled per Equation (2), where $X_1(t), X_2(t), \ldots, X_i(t)$ represent the operating parameters of the pipeline at a given time $t$, $T$ is the cumulative time and $f_\alpha$ represents the MLAs used with the original or modified forms of the parameters and the hyperparameters.
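The relationship between the two quantities can be sketched as follows, assuming that the IDD at each time step is the model output for that step's operating parameters and that the TDD is the running sum of the IDD values. The `toy_model` function and its coefficients are hypothetical stand-ins for a trained MLA.

```python
from itertools import accumulate


def predict_idd(model, operating_rows):
    """Instantaneous defect depth for each time step's operating parameters."""
    return [model(row) for row in operating_rows]


def time_dependent_depth(idd_values):
    """Time-dependent defect depth as the cumulative sum of the IDD values."""
    return list(accumulate(idd_values))


def toy_model(row):
    # Hypothetical stand-in for a trained MLA: a fixed linear response.
    temperature, pressure = row
    return 0.01 * temperature + 0.02 * pressure


rows = [(50.0, 10.0), (55.0, 12.0), (60.0, 11.0)]
idd = predict_idd(toy_model, rows)
print(time_dependent_depth(idd))
```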
The corrosion defect depth at any given instance will be modelled as a nonlinear complex dynamic system that depends on the operating characteristics and their interactions. It will be assumed that the influence of microbiological organisms such as Sulphate-Reducing Bacteria (SRB), which tend to contribute to corrosion defect growth [28], has been minimized by the fluid flow velocity and turbulence in the pipeline.

PCA-PSO-FFANN Modelling Algorithm
Beyond its usefulness in dimensionality reduction and in retaining the original pattern in a set of new variables, PCA is also a useful tool for determining the significance of the input variables and their influence on the output variable [29]. The PCA transformation in this study helps not only in dimensionality reduction but also in the transformation of the variables, which enhances the hyperspace for the optimal combination of the hyperparameters. To ensure that the model variables are optimized in the FFANN, PSO was introduced for iteratively obtaining near-optimal weights for the hidden layers of the network (Figure 2). The PSO technique searches the hyperspace for the global optimum of the weights by allowing the particles to move at their own velocities while adjusting to the velocities of other particles [30,31]. This procedure makes it possible to get the global best position of the particles per the relationship shown in Equation (3), where $v_i^k$ represents the velocity of individual particle $z_i^k$ at iteration $k$, $r_1$ and $r_2$ are random variables between 0 and 1, $P_i^k$ is the best position of the individual particle at iteration $k$, $P_g^k$ is the global best position of the swarm at iteration $k$, $c_1$ and $c_2$ are constants that represent the cognitive parameter and social parameter respectively, and $\omega$ is the inertia weight, which decreases linearly with the number of iterations per Equation (4) [31], where $\omega_{max}$ and $\omega_{min}$ represent the maximum and minimum inertia weights respectively and $k_{max}$ represents the maximum number of iterations.
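The particle update can be illustrated with a minimal sketch. It assumes the textbook PSO velocity and position updates and a linearly decreasing inertia weight, which match the symbols defined for Equations (3) and (4) but are not taken from the paper's equations. The objective function, the bounds, the swarm size and the `w_min` value are hypothetical; in the PCA-PSO-FFANN setting the position vector would hold the network weights and the objective would be the training error.

```python
import random


def pso_minimize(objective, dim, n_particles=30, k_max=200,
                 c1=0.5, c2=0.5, w_max=0.9, w_min=0.4,
                 bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with a basic particle swarm (illustrative sketch)."""
    rng = random.Random(seed)
    lo, hi = bounds
    z = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [row[:] for row in z]                  # best position per particle
    p_best_val = [objective(row) for row in z]
    g_idx = min(range(n_particles), key=p_best_val.__getitem__)
    g_best, g_best_val = p_best[g_idx][:], p_best_val[g_idx]

    for k in range(k_max):
        # Inertia weight decreasing linearly from w_max toward w_min.
        w = w_max - (w_max - w_min) * k / k_max
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - z[i][d])   # cognitive pull
                           + c2 * r2 * (g_best[d] - z[i][d]))     # social pull
                z[i][d] += v[i][d]
            val = objective(z[i])
            if val < p_best_val[i]:
                p_best[i], p_best_val[i] = z[i][:], val
                if val < g_best_val:
                    g_best, g_best_val = z[i][:], val
    return g_best, g_best_val


def sphere(x):
    # Hypothetical test objective with its minimum at the origin.
    return sum(t * t for t in x)


best, best_val = pso_minimize(sphere, dim=2)
print(best, best_val)
```

Because the personal and global bests are only replaced by better positions, the returned value can never be worse than the best particle of the initial swarm.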
In the model, a set of $\ell$ observations of the input and output variables given by $\{(x_i, y_i)\}_{i=1}^{\ell}$, with $D$-dimensional feature vectors $x_i \in \mathbb{R}^{D}$ and $\Upsilon$-dimensional corresponding outcomes $y_i \in \mathbb{R}^{\Upsilon}$, was initially normalized using Equation (5) prior to the PCA transformation, where $\eta_{norm}$, $\eta$, $\eta_{min}$ and $\eta_{max}$ respectively represent the normalized, original, minimum and maximum values of the input or output variable.
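A minimal sketch of this normalization step is given below. It assumes the usual min-max form, (η - η_min) / (η_max - η_min), which is consistent with the four quantities defined for Equation (5); the function name is hypothetical.

```python
def min_max_normalize(values):
    """Scale one variable to [0, 1] using its minimum and maximum (assumed form)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant variable")
    return [(v - lo) / (hi - lo) for v in values]


# Example: three temperature readings.
print(min_max_normalize([10.0, 15.0, 20.0]))
```

Each input and output variable would be normalized separately, so that variables measured in different units contribute on a comparable scale before the PCA transformation.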

Principal Component Analysis (PCA) Transformation
The normalized input and output variables given by $\{(x_i^{(n)}, y_i^{(n)})\}_{i=1}^{\ell}$ were PCA transformed from the $\mathbb{R}^{\ell}$ to the $\mathbb{R}^{\wp}$ dataspace using Equations (6) and (7) [29], where $f_x$ and $f_y$ are the functions of the normalized input and output variables, $\mu_x, \mu_y \in \mathbb{R}^{\ell}$ correspond to the mean values of $x^{(n)}$ and $y^{(n)}$, and $\nu_x^{\wp}$ and $\nu_y^{\wp}$ are the $\ell \times \wp$ matrices with $\wp$ orthogonal unit vectors of the input and output variables respectively. The values $\lambda^{(x)} \in \mathbb{R}^{\nu}$ and $\lambda^{(y)} \in \mathbb{R}^{\nu}$ are the projections of the normalized values of the input and output variables onto the new dimensions, whose number is expected to be less than or equal to that of the original dimensions.
Effective construction of the new PCA dimensions of the normalized input and output variables will involve minimizing the error value in k iterations per Equation (7).
The Singular Value Decomposition (SVD) technique has been employed for determining the principal components by some researchers who solved Equation ( 7) [32,33].
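The idea of projecting centred data onto orthogonal unit vectors can be sketched without external libraries. The example below swaps SVD for power iteration on the sample covariance matrix and recovers only the leading principal component, so it is an illustration of the concept and not the procedure of [32,33]. All function names are hypothetical.

```python
def first_principal_component(rows, n_iter=200):
    """Leading principal direction and scores via power iteration (not SVD)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(n_iter):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]       # keep the direction at unit length
    # Projection of each centred observation onto the leading direction.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return mean, vec, scores


# Points lying exactly on the line y = 2x.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mean, direction, scores = first_principal_component(data)
print(direction)
```

For data that lie on a single line, one component captures all of the variance, which is the sense in which the transformed dataspace can have fewer dimensions than the original.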

Feed-Forward Artificial Neural Network (FFANN) Modelling
To compute the output variables from the normalized and PCA transformed values of the input variables, this model assumes that a nonlinear relationship exists between the input and output variables and uses the tanh and sigmoid ($\sigma$) activation functions shown in Equation (8) to transform the weighted sums computed in the hidden layers and output layer shown in Figure 2.
The PCA-PSO-FFANN model computes the output parameter as the normalized and PCA transformed input variables $x_i^{(np)}$ traverse the hidden layers. The model for the various layers in the PCA-PSO-FFANN is determined with the relationships shown in Equations (9)-(12). In the pre-activation of hidden layer 1 (Equation (9)), $W^{(1)}$ represents the particle swarm optimized input weights at the input layer and $b^{(1)}$ is the bias at that layer, whose dimension is equivalent to the number of neurons in hidden layer 1, given by $H^{(1)}$. In the pre-activation of hidden layer 2 (Equation (10)), $W^{(2)}$ represents the particle swarm optimized input weights at hidden layer 1 and $b^{(2)}$ is the bias at that layer, whose dimension is equivalent to the number of neurons in hidden layer 2, given by $H^{(2)}$. The activation in hidden layer 2 is followed by the pre-activation of the output layer, where $W^{(3)}$ represents the particle swarm optimized input weights at the output layer and $b^{(3)}$ is the output layer bias, whose dimension is equivalent to the number of neurons in the output layer.
The estimated values of the output are computed by activating the result in Equation (11) with a sigmoidal activation function and adding the error value $\varepsilon$.
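The layer-by-layer computation can be sketched as a forward pass through two hidden layers and an output layer. This sketch assumes tanh activations in both hidden layers and a sigmoid at the output; the text names both functions but the exact assignment to the hidden layers is an assumption. The weights here are random placeholders, whereas in the PCA-PSO-FFANN they would be the positions found by PSO. The hidden layer sizes follow the 2 × I_V + 4 and 2 × I_V + 2 rule stated in the application section, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, bias, inputs):
    """Pre-activation of one layer: W x + b, one row of W per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def hidden_sizes(n_inputs):
    """Neurons in hidden layers 1 and 2 for I_V input variables."""
    return 2 * n_inputs + 4, 2 * n_inputs + 2


def init_params(n_inputs, n_outputs=1, seed=0):
    """Random placeholder weights; PSO would supply these in the real model."""
    rng = random.Random(seed)
    h1, h2 = hidden_sizes(n_inputs)

    def layer(n_out, n_in):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        return weights, [0.0] * n_out

    return [layer(h1, n_inputs), layer(h2, h1), layer(n_outputs, h2)]


def forward(x, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = [math.tanh(a) for a in dense(w1, b1, x)]    # hidden layer 1 (assumed tanh)
    h2 = [math.tanh(a) for a in dense(w2, b2, h1)]   # hidden layer 2 (assumed tanh)
    return [sigmoid(a) for a in dense(w3, b3, h2)]   # sigmoid output layer


params = init_params(n_inputs=12)
print(forward([0.5] * 12, params))
```

The sigmoid output lies in (0, 1), which is compatible with output variables that were min-max normalized before training.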

PCA-Gradient Boosting Machine (GBM) Algorithm
GBM uses a learning procedure that builds new models sequentially in order to properly estimate the responses of the variables, by ensuring that the base-learners are maximally correlated with the negative gradient of the loss function of the ensemble [34]. For a given experimental dataset $\{(x_i, y_i)\}_{i=1}^{\ell}$ that is normalized and PCA transformed using Equations (6) and (7), the GBM algorithm determines a functional dependency $x^{(np)} \xrightarrow{f} y^{(np)}$ for an ensemble estimator $\hat{f}(x^{(np)})$ that minimizes the loss function $\Psi(y^{(np)}, f(x^{(np)}))$ given by Equation (13) [35,36].
If $f(x^{(np)})$ and $\hat{f}(x^{(np)})$ represent the true functional dependency and the estimated functional dependency respectively, and the search space is restricted to the parametric family of functions $f(x^{(np)}, \theta)$, then the function $\hat{f}(x^{(np)})$ shown in Equation (13) can be written per Equation (14), where $\theta$, $\hat{\theta}$, $E_{x^{(np)}}$ and $E_{y^{(np)}}$ represent the set of parameters in the model, the set of parameter estimates, the expectation function of the explanatory variable and the expectation function of the dependent variable, respectively.
Since Equation (14) has no closed-form solution and must be solved iteratively by numerical optimization [34], the parameter estimates $\hat{\theta}$ can be determined over $k$ iteration steps as shown in Equation (15).
The empirical loss function $J(\theta)$ over the observed dataset $\{(x_i^{(np)}, y_i^{(np)})\}_{i=1}^{\wp}$ can be determined with steepest gradient descent following Equation (16) [34]. Since the improvement along the loss function gradient $\nabla J(\theta)$ gives rise to the optimized solution, the gradient of the loss function and the parameter estimate $\hat{\theta}$ for iteration steps 1 to $k$ can be determined by using Equation (17), where $\hat{\theta}_k$ is the new incremental parameter estimate of the ensemble at a new iterative step $k$.
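The stepwise fitting to the negative gradient can be illustrated with a small boosting loop. This sketch assumes a squared-error loss, for which the negative gradient is simply the residual, a single input feature and decision stumps as base-learners; these are simplifications for illustration and not the configuration used in the study. All names and the learning rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive ensemble of stumps fitted to successive residuals."""
    base = sum(ys) / len(ys)          # initial constant estimate
    learners = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in learners)

    return predict


model = gradient_boost([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0])
print(model(1.0), model(4.0))
```

Each round adds a small correction in the direction that reduces the loss, which mirrors the incremental parameter estimates described for Equations (15) to (17).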

PCA-Random Forest (RF) Algorithm
Random forest is one of the ensemble learning procedures that generates numerous classifiers that are aggregated with a bagging technique and randomization to give extra weights to the nodes. This helps to enhance the prediction accuracy of the model via the bootstrapping of the sampled dataset [37][38][39]. New data are predicted from the trees by growing a set number of trees from bootstrap samples of the original dataset, growing unpruned classification trees and randomly selecting the best branches.
The collection of the tree predictors $h(x^{(np)}, \theta_k)$, $k = 1, \ldots, K$, helps to minimize the individual prediction error of the trees, where $x^{(np)}$ is the covariate vector of the PCA transformed training input variables and the $\theta_k$ are independent and identically distributed (iid) random vectors of the model parameters. The RF predictor uses unweighted averages over the collection ($\hat{h}$) shown in Equation (18) to generalize the prediction by minimizing the loss function and avoiding overfitting via the convergence of Equation (19) [40]. The loss function of the model has a least-squares formulation, with E(*) being the expectation function of the various parameters.

PCA-Deep Neural Network (DNN) Algorithm
The DNN uses a multilayer perceptron procedure to learn the pattern and behaviour of data with multiple levels of abstractions and applies a hierarchical architecture and backpropagation technique for minimizing error in the computation [41].
For PCA transformed data $x$, a functional dependency $f: \mathbb{R}^{N} \to \{0, \ldots, L\}$ that minimizes the loss function is determined by solving Equation (20), where $N$ is the number of training samples and $L$ is the number of labels.

Time-Dependent Reliability Modelling
The use of the MLAs described in the previous section to train a model for the instantaneous and time-dependent defect depth growth of the pipeline makes it possible to compute the failure probabilities and reliabilities. This is attainable because the leak and burst failure susceptibility of the pipeline increases with the growth of the corrosion defect. The external forces resulting from the operational and environmental conditions of the pipeline reduce the resistance of the asset to failure as the corrosion defect depth grows. To this end, understanding and estimating the status of the pipeline at any time will give the operators information for effective integrity management and risk reduction.
Since the reliability of the pipeline at any given time is the ability of the pipeline to efficiently transport the oil and gas from the fields without failure, and the failure is expected to occur when there is an extreme load on the asset, the time-dependent reliability is proposed as a function of the pipe-wall thickness loss. This is estimated with the operating conditions over a given time boundary. Although the limit state function has been utilized with Monte Carlo simulation estimated failure rates for reliability assessment by numerous researchers [42,43], this reliance increased the subjectivity of the results. In this study, the level of uncertainty associated with the estimated reliability is reduced, as the rate of the pipe-wall thickness loss is based on realistic estimates from the field operating conditions. The dependence of this model on the difference between the stress concentrations on the pipeline at zero corrosion defect depth and at the predicted corrosion defect depth at a given time increases the accuracy of the computed reliability.
The reliability of the pipeline ($R$) at any given instance $t$ can be determined with Equation (21), which has $F_R$ representing the cumulative distribution function, $f_R$ representing the probability density function, and $c$ and $T$ standing for the mean pipe wall thickness loss function and the time to failure, respectively [44,45]. The rate of the pipe wall thickness loss $\lambda_p(t)$ due to corrosion can be estimated as a hazard function represented in Equation (22) [44].
If the rate of the pipe wall thickness loss is directly correlated with the probability of failure of the pipeline at a given time, and the probability of failure for 90% pipe-wall thickness loss is 1, the reliability, failure probability ($f_P$) and the probability density function ($f_R$) at any given time can be computed with Equations (23) and (24).
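The link between a hazard function and reliability can be illustrated with the textbook identities R(t) = exp(-∫λ dt), F(t) = 1 - R(t) and f(t) = λ(t)R(t). These are general reliability relationships and are not claimed to be the exact forms of Equations (21) to (24). The hazard values below are hypothetical, and the integral is approximated with the trapezoidal rule.

```python
import math


def reliability_from_hazard(times, hazard):
    """Reliability at each time from sampled hazard values (trapezoidal rule)."""
    cumulative = 0.0
    reliability = [1.0]               # no accumulated hazard at the first time
    for j in range(1, len(times)):
        dt = times[j] - times[j - 1]
        cumulative += 0.5 * (hazard[j] + hazard[j - 1]) * dt
        reliability.append(math.exp(-cumulative))
    return reliability


def failure_probability(reliability):
    """Failure probability as the complement of reliability."""
    return [1.0 - r for r in reliability]


# Hypothetical constant hazard of 0.1 per year sampled at years 0, 1 and 2.
r = reliability_from_hazard([0.0, 1.0, 2.0], [0.1, 0.1, 0.1])
print(r, failure_probability(r))
```

With a time-varying hazard derived from the predicted wall thickness loss rate, the same routine would give the reliability at each inspection interval.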

Industrial Experiment and Application
To ascertain the functionality of the model developed in this study, field data from onshore oil and gas gathering pipelines were used. The dataset was obtained from fields in the Niger Delta region of Nigeria from sixty X52 grade pipelines for a period of ten years. The corrosion defect depths were measured with the pulse-echo technique of Ultrasonic Thickness Measurement (UTM) while the operating parameters were routinely obtained from the fields over the period of monitoring. A comprehensive description of the experimental procedure can be obtained from the previous studies on the fields in references [12] and [46]. The descriptive statistics of the maximum values of the corrosion defect depths and the mean values of the operating parameters are summarized in Table 2. Due to missing values resulting from unforeseen circumstances in the field data collection procedure, and the need to have a robust dataset for training, testing and validation of the model, a multivariate polynomial regression, Equation (25), was used to determine the relationship between the corrosion defect depth growth rate and the operating parameters. Operating data from 60 wells, comprising over 8300 records obtained over a period of ten years, were used to develop the regression model by relying on the mean values of the maximum corrosion rates and the mean values of the operating parameters, where $\beta^{(1)}_{1,\ldots,m}$, $\beta^{(2)}_{1,\ldots,m}$, $\beta^{(3)}_{1,\ldots,m}$ and $\rho$ are the coefficients of the operating parameters $\varphi$, $\varphi^2$, $\varphi^3$ and the product of $\varphi$, respectively, while $D_d$ is the corrosion defect depth growth rate.
A comparison between the field measured corrosion defect depth growth rate and the ones predicted with Equation (25) is shown in Figure 3. This polynomial model provided a baseline for generating the dataset for training the machine learning model as per the following procedure:
i. Randomly generate 20,000 uniformly distributed samples of the operating parameters to sufficiently represent different data combinations of the operating parameters and corrosion defect depth growth.
ii. Compute the corrosion defect depth growth rate based on Equation (25).
iii. Randomly select 5000 samples of the corrosion defect depth and the operating parameters, ensuring that the values are within ±25% of the original values. This variability allows for the noise that is characteristic of field data from pipelines operating in varying conditions. Since small changes in some of the operating parameters of the pipelines can result in corrosion defect growth changes [47][48][49], it is important to account for possible variation of the data beyond the original boundaries. This is vital for improving the ability of the model to cope with unexpected changes in the characteristics of the operating parameters and the attendant impact on the corrosion defect growth.
iv. Use the dataset for training and validation in a 5-fold cross-ensemble validation training model having 20% of the original dataset for validation.
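The four steps above can be sketched as follows. The sketch reads step iii as a random perturbation of each selected value by up to ±25%, which is one possible interpretation of the procedure. The polynomial `toy_poly`, the parameter ranges and the small sample counts in the example are hypothetical placeholders for Equation (25) and the field ranges.

```python
import random


def synthesize(poly, ranges, n_generate=20000, n_select=5000,
               noise=0.25, seed=0):
    """Generate, evaluate, subsample and perturb synthetic training records."""
    rng = random.Random(seed)
    population = []
    for _ in range(n_generate):
        # Step i: uniformly distributed operating parameters.
        params = [rng.uniform(lo, hi) for lo, hi in ranges]
        # Step ii: growth rate from the regression polynomial.
        population.append((params, poly(params)))
    # Step iii: random subsample, each value perturbed by up to +/- noise.
    selected = rng.sample(population, n_select)
    noisy = []
    for params, rate in selected:
        jittered = [p * (1.0 + rng.uniform(-noise, noise)) for p in params]
        noisy.append((jittered, rate * (1.0 + rng.uniform(-noise, noise))))
    return noisy


def k_fold_indices(n, k=5, seed=0):
    """Step iv: yield (train, validation) index lists for k-fold validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def toy_poly(p):
    # Hypothetical stand-in for Equation (25).
    return 0.01 * p[0] + 0.002 * p[1] ** 2


records = synthesize(toy_poly, [(20.0, 80.0), (5.0, 15.0)],
                     n_generate=200, n_select=50)
for train, val in k_fold_indices(len(records)):
    print(len(train), len(val))
```

With five folds, each validation fold holds 20% of the records, which matches the split described in step iv.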
The model training was done in two phases, with phase one involving the PCA-PSO-FFANN and the PCA modified GBM, DNN and RF algorithms. The second phase of the model training involved the same procedures and algorithms, but the datasets were not PCA transformed prior to the training. This second phase was used as a control for establishing the effect of the PCA transformation on the MLAs.
The PSO-FFANN algorithm was implemented with the cognitive ($c_1$) and social ($c_2$) parameter constants of 0.5, an inertia weight ($\omega$) of 0.9 and the best individual particle position ($P^k$) of 2.5, while using 100 particles and 1000 maximum iterations. The number of hidden neurons in the first and second layers of the FFANN was determined as 2 × I_V + 4 and 2 × I_V + 2, respectively, where I_V is the number of input variables to the model.

Results and Discussion
The comparison of the 5-fold cross-ensemble validation results of the PCA transformed dataset and the untransformed dataset was determined with the Root Mean Square Error (RMSE) and Mean Absolute Error Percentage (MAEP) measurements shown in Table 3. It can be seen from Table 3 that the performance of the algorithms with the PCA transformed dataset is better than those without the PCA transformation. The PCA transformation of the PSO-FFANN model resulted in an accuracy improvement of 4.3 times more than that obtained from the model prior to the PCA transformation of the dataset. The same case was noted with the PCA-GBM, PCA-RF and PCA-DNN that improved 5.32 times, 4.19 times and 3.52 times, respectively. These significant improvements in the accuracy of prediction with PCA transformations highlight the potency of the technique for machine learning and prediction of future states of corrosion defect depth growth. To further evaluate the performance of the PCA transformation, a dataset of simulated pipelines, obtained from Equation (25) (Table 4), that are corroding at low, mild, high and severe levels based on NACE classification [50], were modelled. This simulation is important because the result shown in Table 3 is for pipelines whose corrosion defect depth comprises a mixture of low, mild, high and severe corrosion defect depth growth rates. Hence, understanding the behaviour of the model in different corrosion scenarios will be important in the determination of the robustness of the MLAs.
The performance of the algorithms, when the various classes of corrosion are considered on the pipelines as measured with the RMSE and MAEP, is shown in Figures 4 and 5. Again, it can be inferred from the information in Figures 4 and 5 that the PCA transformed algorithms have superior performance when compared with the algorithms with datasets that were untransformed. However, the accuracy of PCA transformed algorithms of the severe corrosion category is significantly lower than that of the other corrosion categories. PCA-DNN showed the worst accuracy level for the PCA transformed dataset when compared with the other models. However, the PCA-PSO-FFANN model did not perform better than the PCA-GBM and PCA-RF.
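For readers who want to reproduce this kind of comparison, the two error measures can be sketched as follows. The definitions are assumptions: RMSE is the standard form, MAEP is taken to be the mean of the absolute errors relative to the actual values, expressed as a percentage, and the "times better" figure is read as the ratio of the untransformed error to the PCA transformed error. The sample values are invented for illustration and are not taken from Table 3.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def maep(actual, predicted):
    """Mean Absolute Error Percentage (assumed definition, actual values != 0)."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)

def improvement_factor(error_untransformed, error_pca):
    """How many times smaller the error is after PCA (assumed reading)."""
    return error_untransformed / error_pca

# Invented growth rates in mm/yr, for illustration only.
actual = [0.10, 0.20, 0.40]
pred_pca = [0.11, 0.19, 0.41]   # hypothetical predictions with PCA
pred_raw = [0.14, 0.16, 0.44]   # hypothetical predictions without PCA

factor = improvement_factor(rmse(actual, pred_raw), rmse(actual, pred_pca))
```

In this toy case every PCA prediction is off by 0.01 mm/yr and every untransformed prediction by 0.04 mm/yr, so the RMSE ratio is about 4.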

Estimation of the Instantaneous Defect Depth (IDD) and Time-Dependent Defect Depth (TDD) Growth
Since the IDD is vital for the estimation of the TDD growth and reliability estimation of the pipeline, as per the previous comments in this paper, the PCA algorithms are used for estimating the future corrosion defect behaviour of the corroded pipelines for different corrosion categories. For the pipelines with the corrosion defect depth characteristic shown in Table 4, the original simulated future instantaneous and time-dependent corrosion defects for a period of 50 years as exemplified with some of the corrosion categories are compared in Figures 6 and 7.
Considering the fact that the status of a pipeline depends significantly on the corrosion defect depth growth at any given time, there is a high tendency to implement Corrosion Risk Assessment (CRA) of the pipelines with knowledge of this corrosion induced pipe wall thickness loss. The figures also indicated that the MLAs have predicted the low corrosion category better than the other categories, with PCA-GBM making a better prediction than the other algorithms. The prediction errors are most pronounced in the severe and mild corrosion categories, with PSO-FFANN and PCA-DNN showing a more distinctive lower accuracy of the predicted TDD growth than the other models.

Time-Dependent Reliability Estimation
Considering the better prediction of the PCA transformed MLAs, and the comparatively more accurate prediction of the PCA-GBM than the PCA-PSO-FFANN and other MLAs (see Figures 4 and 5), the model was used for estimating the time-dependent corrosion defect depth growth vis-à-vis the reliability of the pipelines. Based on the PCA-GBM and Equations (21)-(24), the probability density function, the probability of failure and the reliability of the pipelines undergoing different corrosion categories were determined (Figures 8-10).
The right skewed tails of the probability density plots in Figure 8 give an indication of the degradation failure resulting from the loss of the pipe-wall thickness over time of the operation of the pipelines. Although the loss of pipe-wall thickness is the key to pipeline failure, the probability of the wear-off is significantly higher for the severe corrosion category than for any of the other categories. This extensively high wear-off probability for the severe corrosion category has been attributed to numerous factors that include the high concentration of the corrosive species from the reservoirs [28] and the microstructural flaws of the pipe material. It is expected that the corrosion deterioration rate of the pipelines will be higher at the early stages of the lifecycles of the pipelines but will gradually reduce with the ageing of the assets as the probability of failure gradually reduces (Figure 9). This phenomenon can be attributed to the passivity that results in the reduction of the electrochemical reactivities of the pipeline material in its environment, as protective films form on the corroded surfaces [4]. Although passivity can be short-lived due to erosion and turbulence in the pipelines [4], the prolonged complex interactions of the chemical species in the corrosive environment and the changing characteristics of the oil and gas flowing through the pipelines [2] can also contribute to the systematic inhibition reactions of the corrosive species.
The estimation of the failure probability and reliability of the pipelines will help pipeline integrity management since knowledge of the expected pipe-wall thickness loss will be a major trigger for different inspection, maintenance, repair and replacement operations. As expected, the pipelines with severe corrosion rates will be exposed to a higher risk of failure at a given time when compared with pipelines degrading at the other corrosion categories. For instance, from the information shown in Figure 10, it can be deduced that after 5 years of the pipeline service, it is expected that the low corroding pipeline will have a reliability of ~95%, the mildly corroded pipeline will be at ~78% reliability, the highly corroded pipeline will be at ~55% reliability and the pipeline undergoing severe corrosion will only have ~5% reliability. This variation in the reliabilities of the pipelines is very significant in asset integrity management. As such, the extracted knowledge from this machine learning algorithm will be effective in guiding corrosion mitigation strategies. To this end, fields that have very high corrosion tendencies could be managed with high priorities to reduce the risk of failure while prolonging the lifespan of the pipeline. This can be done through some specially planned integrity management programs, which can modify the characteristics of the operating parameters. Seeing that the corrosion risk level that an organization is willing to accept will depend on the failure probability at any instance, effective CRA can be designed and implemented in real-time following the modelled conditions of the pipeline at such instances.
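The prioritisation argument can be made concrete with a short sketch. It uses the approximate 5-year reliabilities read from Figure 10, converts each to a failure probability through the complement P_f = 1 - R, and flags the categories whose failure probability exceeds an acceptance limit. The limit of 0.25 is a hypothetical value chosen for illustration and is not proposed in this study.

```python
# Approximate reliabilities after 5 years of service, as read from Figure 10.
reliability_at_5_years = {"low": 0.95, "mild": 0.78, "high": 0.55, "severe": 0.05}

def failure_probability(reliability):
    """Failure probability as the complement of reliability."""
    return 1.0 - reliability

def categories_needing_action(reliabilities, acceptable_pf):
    """Return categories above the acceptable failure probability, worst first."""
    flagged = {name: failure_probability(r)
               for name, r in reliabilities.items()
               if failure_probability(r) > acceptable_pf}
    return sorted(flagged, key=flagged.get, reverse=True)

# acceptable_pf is a hypothetical organisational risk limit.
priority = categories_needing_action(reliability_at_5_years, acceptable_pf=0.25)
```

A stricter limit would also pull the mild category into the list, which is how the accepted risk level drives the inspection and maintenance schedule.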

Conclusions
The need to improve the accuracy of the estimated corrosion defect depth growth and the reliability of corroded and aged pipelines cannot be overemphasized, hence the need for developing a data-driven machine learning strategy for Corrosion Risk Assessment (CRA) of pipelines. Principal Component Analysis (PCA) and Particle Swarm Optimization (PSO) were used to develop ML models of corrosion defect growth. To establish the efficiency of the PCA in corrosion defect depth estimation, MLAs such as Gradient Boosting Machine (GBM), Random Forest (RF) and Deep Neural Network (DNN) were used for modelling corrosion defect depth growth after PCA transformation of the datasets.
Although the PCA-PSO-FFANN algorithm and the other models, such as PCA-GBM, PCA-RF and PCA-DNN, were able to predict the corrosion datasets to a higher accuracy than the algorithms used for the modelling without PCA transformation, the PCA-GBM showed a superior performance in comparison to the other algorithms. To this end, the PCA-GBM was used to implement a corrosion defect depth growth model of pipelines corroding at different categories: low (<0.025 mm/yr), mild (0.025 mm/yr to 0.13 mm/yr), high (0.13 mm/yr to 0.25 mm/yr) and severe (>0.25 mm/yr). By using the time-dependent corrosion defect depth growth rate estimated with the PCA-GBM and the hazard function-based failure probability and reliability models, the future status of the pipelines was determined.
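The four corrosion categories listed above translate directly into a small lookup, sketched below. The assignment of the boundary values (0.13 mm/yr counted as mild and 0.25 mm/yr counted as high) is an assumption, since the stated ranges share their end points.

```python
def corrosion_category(rate_mm_per_year):
    """Map a corrosion defect depth growth rate (mm/yr) to its category."""
    if rate_mm_per_year < 0:
        raise ValueError("growth rate must be non-negative")
    if rate_mm_per_year < 0.025:
        return "low"
    if rate_mm_per_year <= 0.13:   # boundary handling is an assumption
        return "mild"
    if rate_mm_per_year <= 0.25:   # boundary handling is an assumption
        return "high"
    return "severe"

# Invented example rates, one per category.
labels = [corrosion_category(r) for r in (0.01, 0.10, 0.20, 0.30)]
```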
This PCA-GBM model was tested with uniform corrosion datasets of onshore pipelines using the flow characteristics of the oil and gas and the water chemistry information from the reservoirs, which were obtained from routine quality control of the pipelines. Following these findings, it is possible to use the operating parameters of the pipelines over a given period to determine the status of the pipelines in the future, using a data-driven PCA-GBM machine learning algorithm. This algorithm will also enhance the real-time estimation of the corroded state of the aged pipelines since the technique can rely on historical information to estimate the future status of the corrosion defect depths. The reliability and failure probability estimation models will also help to enhance the integrity of the pipelines through short, medium and long-term integrity management planning. This will inevitably assist in the reduction of the failure risk of the pipelines, by ensuring that real-time inspection, maintenance, repairs and replacement of the ageing corroded pipelines are carried out.
Finally, the implementation of this model will help the operators of the oil and gas pipelines to:
i. Understand the expected time-dependent changes in the corrosion defect depth growth trajectory using the variabilities in the historical operating parameters and the corrosion defect depth growth rates.
ii. Provide a handy tool for planning the pipeline integrity management by providing a guide to experts on the expected pipe wall thickness loss over a given time interval.
iii. Give baseline information for effective management of the quality of the operating parameters, thereby maintaining a low cost of production in a safe operating environment.
iv. Implement a microscale corrosion defect depth estimation since corrosion degradation in the pipeline is a continuous process, thereby opening a new frontier for quick and effective decisions on pipeline integrity management.

Figure 2. Framework for the particle swarm optimization of the feed-forward artificial neural network model.

Figure 3. Comparison of the field measured corrosion defect depth with that estimated with Equation (25).

Figure 4. Comparison of the RMSE of the PCA transformed and untransformed datasets of low, mild, high and severe corroding pipelines-(a): PCA transformed dataset; (b): Untransformed dataset.

Figure 5. Comparison of the MAEP of the PCA transformed and untransformed datasets of low, mild, high and severe corroding pipelines-(a): PCA transformed dataset; (b): Untransformed dataset.

Figure 6. Comparison of the Instantaneous Defect Depth (IDD) of pipelines undergoing low, mild, high and severe corrosion categories-(a): low corrosion; (b): mild corrosion.

Figure 8. PCA-GBM estimated time-dependent probability density function of the pipelines undergoing low, mild, high and severe corrosion categories.

Figure 9. PCA-GBM estimated time-dependent probability of failure of the pipelines undergoing low, mild, high and severe corrosion categories.

Figure 10. PCA-GBM estimated time-dependent reliability of the pipelines undergoing low, mild, high and severe corrosion categories.

Table 1. Input variables used for the determination of the corrosion defect depth of the pipelines.

Table 2. Descriptive statistics of the studied parameters.

Table 3. Summary of the Root Mean Square Error (RMSE) and Mean Absolute Error Percentage (MAEP) values of the MLAs for the PCA transformed and untransformed datasets (std: standard deviation).

Table 4. Summary of the corrosion defect depth growth rates of the pipelines degrading at different corrosion categories.