Data Augmentation of a Corrosion Dataset for Defect Growth Prediction of Pipelines Using Conditional Tabular Generative Adversarial Networks

Owing to the nature of corrosion, corrosion datasets suffer from data scarcity and uneven distributions, and collecting high-quality data is time-consuming and sometimes difficult. Therefore, this work introduces a novel data augmentation strategy that uses a conditional tabular generative adversarial network (CTGAN) to enhance corrosion datasets of pipelines. First, the corrosion dataset is subjected to data cleaning and variable correlation analysis. The CTGAN is then used to generate external environmental factors as input variables for corrosion growth prediction, and a hybrid model based on machine learning is employed to generate corrosion depth as the output variable. The fake data are merged with the original data to form the synthetic dataset. Finally, the proposed data augmentation strategy is verified by analyzing the synthetic dataset using different visualization methods and evaluation indicators. The results show that the synthetic and original datasets have similar distributions, indicating that the strategy can learn the distribution of real corrosion data and sample fake data that are highly similar to the real data. Predictive models trained on the synthetic dataset perform better than predictive models trained using only the original dataset. In comparative tests, the proposed strategy outperformed other data generation methods.


Introduction
Pipelines are the most economical and safest way of transporting oil and gas over long distances [1]. As pipelines undergo aging and coatings experience degradation, corrosion will occur on the outer surface of the buried pipeline [2,3]. Corrosion is a phenomenon characterized by the deterioration of metals resulting from electrochemical processes occurring on the metal surface exposed to its surrounding environment. The steady loss of pipe metal on the external surface of the pipeline will result in a reduction in its service life and compromise its structural integrity [4,5], which may cause significant losses to human safety and the economy and may also have a catastrophic impact on the environment (for example, through soil contamination, toxic spills, and explosions) [6].
The corrosion growth model can accurately assess the corrosion condition of buried pipelines, providing a scientific basis for pipeline integrity management [7,8]. In the framework of Industry 4.0, multiple systems that involve monitoring operations, managing risks, and maintaining and inspecting oil and gas pipelines are being extensively digitized [9]. This advancement facilitates the rapid expansion and application of data-driven models for predicting corrosion damage in pipelines [10]. Data-driven models have the ability to learn linear or non-linear relationships between environmental factors and corrosion rate/depth from rich datasets. Ben Seghier et al. [2] used a range of artificial intelligence models to study the relationship between the maximum pitting depth and factors that contribute to pitting. Yazdi et al. [11] proposed a methodology for pipeline integrity management that utilizes a dynamic Bayesian network approach, taking into account the effects of microbial corrosion. Akhlaghi [6] explored the potential of deep learning models to predict the maximum pitting depth, with model training taking into account various characteristics of the soil as well as different types of coatings on the pipelines. However, data-driven models use historical data to train model parameters to predict future changes in corrosion depth. When constructing predictive models, it is therefore imperative to give careful consideration not only to the algorithm, but also to the dataset used to train and test the model.
A reliable database should contain sufficient quantities of high-quality data, covering as much information as possible in the research. Due to long periods of corrosion, the high cost of detection and the privacy of the data, it is sometimes difficult to collect large amounts of real corrosion data comprising varying ages of pipelines and associated soil properties [12]. Therefore, most current models are established using corrosion data sourced from publicly available datasets. Existing datasets that contain information on soil parameters as well as the actual corrosion depths of long-life pipelines are the National Institute of Standards and Technology (NIST) dataset and Velázquez's dataset [13]. The NIST dataset was obtained by field investigations conducted on several pipes that were buried in 128 locations across the United States. These studies spanned a period of up to 17 years and included soils that were representative of the various regions. Velázquez's dataset was obtained from 259 underground pipelines in operation in southern Mexico over the course of three years. Velázquez's dataset has extensive information on various kinds of coating and cathodic protection, in contrast to the NIST dataset, which only includes data from uncoated pipes without cathodic protection [14]. Thus, Velázquez's dataset is regarded as a more accurate representation of actual pipeline corrosion and is highly favored by scholars. However, the quality and quantity of this dataset are drawbacks that limit the performance of predictive models. Firstly, due to the spatial and temporal uncertainty of soil data, outliers will inevitably appear in the collected data. Outliers not only affect modeling accuracy but may also cause model overfitting. More importantly, the sample size of the dataset is too small and the distribution of coating types is uneven, which may affect the generalizability and prediction accuracy of the prediction model. Unfortunately, none of these issues have received their due attention.
In cases where the size of the real dataset cannot be scaled up due to cost and data scarcity, the use of data augmentation techniques to generate fake data in place of the real data to improve model performance is an approach worth considering [15]. Common deep generative models include variational auto-encoders (VAEs), generative adversarial networks (GANs) and their variants. The GAN is a data augmentation approach that has been recently developed to increase the sample variety and sample size based on real data and has been widely applied in the fields of image processing and classification. It is an improvement on VAE, and the accuracy of generated samples is better than in other synthetic data oversampling methods [16]. Douzas and Bacao [17] used the conditional version of generative adversarial networks (CGANs) to approximate the real data distribution and generate fake data for the minority class of various imbalanced datasets. Tang et al. [18] used CGANs and CTGANs to generate more formation characteristics data for shale reservoirs from a small number of samples. He and Zhou [19] used tabular generative adversarial networks to generate synthetic full-scale blasting test data for corroded pipelines. Woldesellasse and Tesfamariam [20] used a CGAN to handle the class imbalance in soil data by generating synthetic samples. Habibi et al. [21] used a CTGAN and machine learning for imbalanced tabular data modeling to improve IoT botnet attack detection. However, data augmentation research for regression has received less attention than for classification, especially for tabular data that contain both continuous data and discrete data with uneven distributions. Datasets for regression contain both input and output variables. When a GAN is used directly to generate corrosion data, it does not distinguish between input and output variables, but treats each equally [22]. This will result in a situation where the generated output variables do not match up with the generated input variables, even when the generated input variables have a similar distribution and correlation with the real input variables. As a result, the overall quality of the generated dataset is reduced, and models built on the synthetic dataset comprising these generated data may exhibit low prediction accuracy.
The primary objective of this work is to explore the use of data augmentation to generate corrosion data, thereby furnishing an accurate and robust dataset for predicting the corrosion depth growth in corroded pipelines. The research endeavors to address the following challenges:
1. Outlier detection: deleting anomalous data points in the original corrosion dataset.
2. Tabular data handling: the corrosion dataset consists of data with a variety of structures and has different distributions for continuous variables and uneven distributions for discrete variables.
3. Oversampling: ensuring that the model can capture the real data distribution and generate new samples following the same distribution. If the real data are randomly sampled during the training process, rows belonging to the least frequent categories will not be fully represented, so the GAN may not be trained correctly. If the real data are oversampled, the GAN will learn the oversampled distribution instead of the real data distribution.
4. Regression challenges: it is imperative that the relationship between environmental factors and corrosion depth remain unchanged in the generated data.
Therefore, a novel data augmentation strategy is proposed in this paper. The data generation in this strategy consists of two parts: the generation of input variables (environmental factors) and output variables (corrosion depth) for corrosion growth prediction. For the outliers in the corrosion dataset, multiple detection methods are used for data cleaning. For tabular data and oversampling, CTGAN, which introduces mode-specific normalization and training-by-sampling, is used to learn real data to generate new environmental factors. For the regression challenge, a corrosion growth model based on machine learning algorithms is applied to generate corrosion depths corresponding to the new environmental factors. The fake data are merged with the original data to form the synthetic dataset. Finally, the effectiveness of the proposed strategy is verified by analyzing the distribution and credibility of the synthetic dataset using various visualization methods and evaluation metrics. This technique can provide data support for future corrosion growth modeling, and it can also be applied to data generation for regression problems in other fields.
The rest of this paper is structured as follows. In Section 2, the buried pipeline corrosion dataset is introduced, and data cleaning and correlation analysis are performed on this dataset. Section 3 proposes the data augmentation strategy and introduces the theory of key algorithms. Section 4 verifies the credibility of the synthetic data and the advancement of the strategy. In Section 5, the main work of this research is summarized.

The Buried Pipeline Corrosion Dataset
The corrosion dataset utilized in this study comprises 241 samples derived from Velázquez's dataset. The maximum pitting depth ($d_{max}$), soil properties, pipe age (t), coating type (ct), and pipe-to-soil potential (pp) were recorded for each sample. In this dataset, any metal loss resulting from corrosion with a diameter equal to or less than twice the wall thickness of the pipeline is denoted the maximum pitting depth. The soil properties include pH, soil resistivity (re), water content (wc), redox potential (rp), chloride content (cc), sulfate content (sc), bicarbonate content (bc), bulk density (bd), and soil texture (class). The variables ct and class in the dataset are discrete. The coating types include non-coated (NC), asphalt-enamel-coated (AEC), wrap tape-coated (WTC), coal tar-coated (CTC), and fusion-bonded epoxy (FBE). The soil texture is classified according to U.S. Department of Agriculture standards into clay (C), sandy clay loam (SCL), or clay loam (CL) [23]. Table 1 lists the statistics of the buried pipeline corrosion data. For continuous variables, Mean, Min, Max and std. represent the mean, minimum, maximum and standard deviation of the variable, respectively. It is evident that the majority of variables do not conform to a normal distribution. For discrete variables, min and max represent the smallest and largest category counts in the variable. There is a substantial disparity in the distribution of ct categories, with certain categories exhibiting a notably low count: the number of NC is 31, the number of AEC is 6, the number of WTC is 95, the number of CTC is 101, and the number of FBE is 8.

Data Cleaning
In statistics, an outlier is an observation that differs from other well-structured data. Common outlier detection methods include statistics-based methods, distance-based methods, clustering-based methods, and model-based methods [24]. To mitigate the risk of over-reliance on a single method and excessive elimination of data, this paper selects one technique from each of these four categories based on the distribution and characteristics of the variables. A data point is treated as an outlier if it is flagged by at least three of these four techniques.
In statistics-based methods, the interquartile range is used to measure statistical dispersion and data variability by dividing the dataset into quartiles [25].
In distance-based methods, the Mahalanobis distance is used to determine whether a data point is an outlier by calculating the covariance-scaled distance between the data point and the center of the data distribution.
In the clustering-based method, DBSCAN is used to cluster the data points into different clusters and then determine whether the data points are outliers or not by judging whether they belong to a certain cluster [26].
In the model-based method, quantile regression is used to describe the distribution of data at different quantiles by fitting regression lines at different quantiles, thereby identifying data points that deviate greatly from the normal situation [27].
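The voting rule described above can be illustrated with a minimal sketch. Only the interquartile-range detector is implemented here; the flags attributed to the Mahalanobis, DBSCAN, and quantile-regression detectors are hypothetical placeholders, and the multiplier `k=1.5`, the function names, and the toy values are assumptions rather than settings reported in this paper.

```python
from statistics import quantiles


def iqr_flags(values, k=1.5):
    """Flag values lying outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


def vote(flag_lists, min_votes=3):
    """Mark a row as an outlier when at least `min_votes` detectors flag it."""
    return [sum(row) >= min_votes for row in zip(*flag_lists)]


# Toy values for one variable (not taken from the corrosion dataset).
depths = [1, 2, 3, 4, 5, 6, 7, 8, 100]
iqr = iqr_flags(depths)

# Hypothetical flags standing in for the other three detectors.
mahalanobis = [False] * 8 + [True]
dbscan = [True] + [False] * 7 + [True]
quantile_regression = [False] * 9

outliers = vote([iqr, mahalanobis, dbscan, quantile_regression])
```

Requiring agreement between detectors means that a point flagged by a single method, such as the first row under the placeholder DBSCAN flags, is retained.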
Taking into account the physical meaning of the variables, this paper deletes a total of 43 outliers. In order to illustrate the statistical properties of the modified dataset and its probability density distribution, that is, the median and quartiles, upper limit and lower limit, etc., the violin plot is drawn. The violin plot is a graphical technique used to represent continuous data and can be considered a combination of box plots and kernel density plots. As depicted in Figure 1, the data distribution after outlier detection is more concentrated.

Relational Analysis
The Spearman correlation coefficient is used to assess the monotonic relationship between variables when analyzing the correlation between variables in soil properties. It is a statistical measure that is applicable when the relationship between variables is non-linear or when the variables do not follow a normal distribution [28]. The Spearman correlation coefficient does not assume that the data come from a specific distribution, which makes it more flexible in terms of the form and distribution of the data.
$$\rho = \frac{\sum_{i=1}^{n}\left(R(x_i)-\overline{R(x)}\right)\left(R(y_i)-\overline{R(y)}\right)}{\sqrt{\sum_{i=1}^{n}\left(R(x_i)-\overline{R(x)}\right)^{2}\sum_{i=1}^{n}\left(R(y_i)-\overline{R(y)}\right)^{2}}}$$
where $R(x_i)$ and $R(y_i)$ are the ranks of the $i$-th observation of $x$ and $y$, $\overline{R(x)}$ and $\overline{R(y)}$ are their average values, and $n$ is the number of observations.
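As a minimal sketch, the Spearman coefficient can be computed by ranking each variable and applying the Pearson formula to the ranks. The function names and the toy values below are illustrative assumptions; tied values receive the average of their rank positions, which is the common convention.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_rx, mean_ry = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mean_rx) ** 2 for a in rx) * sum((b - mean_ry) ** 2 for b in ry))
    return num / den


# Toy example: a monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
rho = spearman(x, y)
```

Because only the ordering matters, the quadratic relationship in the toy example is scored as perfectly monotonic even though it is not linear.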



The Proposed Data Augmentation Strategy
The data augmentation strategy proposed in this paper is divided into two modules. The first module is data generation, which combines machine learning and deep generative models. Its main task is to generate data on environmental factors as well as corrosion depth. The focus is to ensure that the correlation and joint probability density between the input variables of the fake dataset are the same as those of the original dataset, while ensuring that the output variables still correspond to the input variables. The second module is data verification, in which the synthetic dataset is analyzed using a variety of evaluation indicators.
The details of this strategy are shown in Figure 3. The corrosion dataset after data cleaning is divided into a training set and a testing set [29]. The training set is used to generate fake data and the testing set is used to verify the credibility of the synthetic dataset. Since the corrosion dataset is tabular and unevenly distributed, CTGAN is introduced to learn the distribution and correlation of the input variables $X_{train}$ in the real data and to generate fake data $X_{fake}$ similar to the real data. In order to generate accurate output variables, this paper employs a hybrid method integrating a support vector machine with a firefly algorithm (FA) to establish a corrosion growth model, which has been shown by El Amine Ben Seghier et al. [30] to obtain corrosion depth with higher prediction accuracy than other methods. $X_{fake}$ is input to the corrosion growth model to obtain the corresponding $Y_{fake}$. $X_{fake}$ and $Y_{fake}$ together form the fake dataset. Data in the fake dataset that do not conform to the physical meaning of the variables, such as data with pH or t less than 0, are eliminated. The fake dataset is merged with the original dataset to form a synthetic dataset. To verify the credibility of the synthetic dataset, the dataset $D_{syn\_train}$, which integrates the fake dataset with the training set, is used to establish corrosion growth models based on three different machine learning algorithms, and the predictive performance of these models on the testing set is compared with that of a corrosion growth model built using only the training set.
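The two-stage flow described above (generate inputs, filter non-physical rows, predict the matching outputs, merge) can be sketched as follows. This is a schematic under stated assumptions, not the implementation used in this paper: `augment`, `toy_sampler`, and `toy_fit` are hypothetical names, the sampler stands in for a trained CTGAN, and the regressor stands in for the FA-tuned SVM.

```python
def augment(train_rows, sample_inputs, fit_regressor, is_physical, n_new):
    """Generate fake inputs, drop non-physical ones, then predict their outputs.

    train_rows    : list of (x, y) pairs, where x is a dict of input variables
    sample_inputs : callable(train_inputs, n) -> list of generated x dicts
                    (stand-in for a trained CTGAN)
    fit_regressor : callable(train_rows) -> callable(x) -> y
                    (stand-in for the corrosion growth model)
    is_physical   : callable(x) -> bool, the physical-meaning filter
    """
    train_inputs = [x for x, _ in train_rows]
    x_fake = [x for x in sample_inputs(train_inputs, n_new) if is_physical(x)]
    predict = fit_regressor(train_rows)
    fake_rows = [(x, predict(x)) for x in x_fake]
    return fake_rows, train_rows + fake_rows


def toy_sampler(inputs, n):
    """Hypothetical generator: reuses training inputs and corrupts every second draw."""
    generated = []
    for i in range(n):
        row = dict(inputs[i % len(inputs)])
        if i % 2:
            row["t"] -= 30.0  # produces a negative pipe age that must be filtered out
        generated.append(row)
    return generated


def toy_fit(rows):
    """Hypothetical regressor: depth grows linearly with pipe age t."""
    rate = sum(y / x["t"] for x, y in rows) / len(rows)
    return lambda x: rate * x["t"]


train = [({"pH": 6.0, "t": 10.0}, 1.0), ({"pH": 7.0, "t": 20.0}, 2.0)]
fake, d_syn_train = augment(
    train, toy_sampler, toy_fit,
    is_physical=lambda x: x["pH"] >= 0 and x["t"] >= 0,
    n_new=4,
)
```

Keeping the output generation separate from the input generation is what ties each generated corrosion depth to its own generated environmental factors.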


Generative Adversarial Networks
GAN is a framework designed to train generative models using adversarial techniques [31]. It jointly optimizes two models, namely the generator G and the discriminator D. The structure can be observed in Figure 4. Both models are independent neural networks. The generator takes random noise as input and captures the distribution of the training dataset. Its objective is to generate fake data that are indistinguishable from real data to fool the discriminator. Both the fake data and real data are then input into the discriminator. The objective of the discriminator is to differentiate between fake data and real data, with its decisions subsequently being sent as feedback to the generator. Throughout the training of GAN, G and D continuously learn from each other through adversarial interaction. Once the generator has successfully learned the distribution of real data and generates fake data that closely resemble the real ones, the discriminator will no longer be able to properly discern the legitimacy of the input samples, indicating that the training has been completed.
The optimization process of GAN is essentially a game process, wherein the objective is to identify the extremum and achieve equilibrium. The distribution of noise $z$ is denoted as $P_z(z)$, while the distribution of real data is denoted as $P_{data}(x)$. $D(x)$ represents the probability that the sample comes from real data, taking on values between 0 and 1. $G(z)$ represents the data generated by G through the learning process [32]. To attain equilibrium between G and D, the objective function of GAN is formulated as follows [31]:
$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim P_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z\sim P_{z}(z)}\left[\log\left(1-D(G(z))\right)\right]$$
where the first term indicates the discriminator's probability for the training data, and the second term indicates the discriminator's prediction for the fake data.
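To make the two terms of the GAN objective concrete, the sketch below evaluates the value function on a batch of discriminator outputs. The function name `gan_value` and the toy probabilities are illustrative assumptions; expectations are approximated by batch means.

```python
from math import log


def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G).

    d_real : discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake : discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(log(p) for p in d_real) / len(d_real)
    fake_term = sum(log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well yields a high value ...
confident = gan_value([0.9, 0.9], [0.1, 0.1])
# ... whereas at equilibrium it outputs 0.5 everywhere and V equals -log(4).
equilibrium = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to push this quantity up while the generator tries to push it down, which is the minimax game described above.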

Conditional Tabular Generative Adversarial Networks
The corrosion dataset comprises 11 columns of continuous variables and 2 columns of discrete variables. Every column can be considered a random variable following an unidentified distribution, whereas each row can be seen as a specific instance of the joint probability distribution of the variables. Hence, the corrosion dataset presents various distinctive characteristics that pose a challenge to the GAN model. These include the presence of mixed data types, non-Gaussian and multimodal distributions, and highly imbalanced categorical columns.
Therefore, this paper introduces CTGAN to address these problems; it applies mode-specific normalization and uses a conditional generator with training-by-sampling [33].
The variational Gaussian mixture model is employed to estimate the number of modes and fit a Gaussian mixture for every continuous column $C_i$:
$$P_{C_i}(c_{i,j}) = \sum_{k=1}^{K} \mu_k \, \mathcal{N}\left(c_{i,j}; \eta_k, \phi_k\right)$$
where $c_{i,j}$ is the value in the $i$-th column and $j$-th row, $K$ is the estimated number of modes, and $\mu_k$, $\eta_k$ and $\phi_k$ are the weight, mean, and standard deviation of the $k$-th mode, respectively. Therefore, the probability of each value $c_{i,j}$ coming from each mode can be calculated from the probability densities $\rho_k = \mu_k \, \mathcal{N}\left(c_{i,j}; \eta_k, \phi_k\right)$ of each mode. $c_{i,j}$ is then represented as a one-hot encoded vector $\beta_{i,j}$ specifying the mode and a scalar $\alpha_{i,j}$ referring to the specific value within that mode.
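Mode-specific normalization can be sketched for a single value as follows. The mixture parameters are toy numbers, the function names are hypothetical, and the scaling of the scalar by four standard deviations follows the original CTGAN formulation [33] rather than anything stated in this section. For determinism the sketch picks the most probable mode, whereas CTGAN samples the mode from these probabilities.

```python
from math import exp, pi, sqrt


def normal_pdf(c, mean, std):
    """Density of a univariate Gaussian at c."""
    return exp(-0.5 * ((c - mean) / std) ** 2) / (std * sqrt(2.0 * pi))


def mode_probabilities(c, weights, means, stds):
    """Normalized densities rho_k: the probability that c comes from each mode."""
    densities = [w * normal_pdf(c, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(densities)
    return [d / total for d in densities]


def encode(c, k, means, stds):
    """Return (alpha, beta): scalar position within mode k and one-hot mode indicator."""
    beta = [1 if j == k else 0 for j in range(len(means))]
    alpha = (c - means[k]) / (4.0 * stds[k])  # 4*std scaling as in the CTGAN paper
    return alpha, beta


# Toy two-mode mixture for one continuous column (illustrative values only).
weights, means, stds = [0.5, 0.5], [0.0, 10.0], [1.0, 2.0]
c = 9.0
probs = mode_probabilities(c, weights, means, stds)
k = max(range(len(probs)), key=probs.__getitem__)  # assumption: argmax instead of sampling
alpha, beta = encode(c, k, means, stds)
```

Encoding each value relative to its own mode keeps multimodal columns from being squashed into a single normalized range.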
For the imbalance problem in categorical columns, the vector cond is introduced to indicate the condition. There are two discrete columns in the corrosion dataset (the coating and soil categories), namely $D_1 = \{1, 2, 3, 4, 5\}$ and $D_2 = \{1, 2, 3\}$. The discrete columns $D_1$ and $D_2$ are represented as one-hot vectors $d_1$ and $d_2$. The mask vector $m_i$ is utilized to denote the associated one-hot vector $d_i$. For example, if the third coating is specified as the condition, we have $m_1 = [0, 0, 1, 0, 0]$, $m_2 = [0, 0, 0]$, and cond $= [0, 0, 1, 0, 0, 0, 0, 0]$. In addition, the cross-entropy between $m_i$ and $d_i$ is added to the generator loss as a penalty, forcing the conditional generator to produce samples that match the given condition.
The generator G and discriminator D of CTGAN use fully connected networks to capture all possible relationships between columns. There are two fully connected hidden layers in the network structure. Moreover, the generator G uses batch normalization and the ReLU activation function [33].
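The construction of the cond vector can be sketched as the concatenation of one mask per discrete column. The helper name `build_cond` and the column keys are illustrative assumptions; the category sets and the expected vector come from the example in the text.

```python
def build_cond(categories, column, value):
    """Concatenate one mask per discrete column; only the chosen category is set to 1.

    categories : dict mapping column name -> ordered list of category labels
                 (insertion order defines the concatenation order)
    """
    cond = []
    for name, labels in categories.items():
        cond.extend(1 if (name == column and label == value) else 0 for label in labels)
    return cond


# Coating (5 categories) and soil texture (3 categories), labeled as in the text.
categories = {"ct": [1, 2, 3, 4, 5], "class": [1, 2, 3]}
cond = build_cond(categories, "ct", 3)
```

Here `cond` reproduces the vector given for the third coating type, with the soil-texture mask left as all zeros.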

Machine Learning
In order to mitigate the risks associated with depending on a single algorithm style, three types of machine learning methods were chosen to validate the synthetic dataset. These algorithms include the artificial neural network (ANN), which is based on a neural network; random forest (RF), which is based on decision trees; and the support vector machine (SVM).
(1) Artificial neural networks
Artificial neural networks comprise several neuron nodes organized into an input layer, a hidden layer, and an output layer. Neurons in the hidden layer establish connections between the input and output layers using specific nonlinear functions [34]. Neural networks can make precise predictions by being trained on a set of sample data.
$$y = \sum_{j} w_j \, f\left(\sum_{i} w_{ji} x_i + b_j\right) + b$$
where $x_i$ is the $i$-th input, $y$ is the output, $f$ is the nonlinear function of the hidden layer, $b$ and $b_j$ represent the biases of the output node and the $j$-th hidden node, respectively, and $w_j$ and $w_{ji}$ denote the weights of the output and hidden nodes, respectively.
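A single-hidden-layer forward pass with hidden biases, hidden weights, output weights, and an output bias can be sketched as below. The choice of `tanh` as the nonlinear function, the function name, and all numbers are assumptions for illustration; the paper does not state the activation or the layer sizes.

```python
from math import tanh


def ann_predict(x, hidden_weights, hidden_biases, output_weights, output_bias,
                activation=tanh):
    """Forward pass: y = sum_j w_j * f(sum_i w_ji * x_i + b_j) + b."""
    hidden = [
        activation(sum(w * xi for w, xi in zip(row, x)) + bj)
        for row, bj in zip(hidden_weights, hidden_biases)
    ]
    return sum(wj * hj for wj, hj in zip(output_weights, hidden)) + output_bias


# Toy network with two inputs and one hidden node, using an identity activation
# so the arithmetic is easy to follow: 2 * (1*1 + 1*2 + 0) + 1 = 7.
y = ann_predict(
    x=[1.0, 2.0],
    hidden_weights=[[1.0, 1.0]],
    hidden_biases=[0.0],
    output_weights=[2.0],
    output_bias=1.0,
    activation=lambda v: v,
)
```

Training adjusts the weights and biases so that this output matches the measured corrosion depth as closely as possible.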
(2) Random forest
The fundamental principle of RF is to construct many base models and combine their predictions in order to achieve precise outcomes. RF employs bagging and random attribute selection techniques for constructing models [35]. The decision trees are constructed from bootstrap samples and cease to branch further once their mean squared error reaches the optimal level. The predictions from the individual decision trees are then combined (by voting for classification, or by averaging for regression tasks such as corrosion depth prediction) to generate the final result.
(3) Support vector machine
The concept of SVM is to find the hyperplane that partitions the dataset with the largest geometric margin. The hyperplane is defined using Equation (6).
$$f(x) = \omega^{T} x + b \qquad (6)$$
where $\omega$ and $b$ represent the regression coefficient vector and bias, respectively. The learning strategy of SVM is to maximize the margin, which can be formalized as a convex quadratic programming problem.

Evaluation Indicators
There is no unified evaluation indicator for generative adversarial networks. In this work, a combination of statistical analysis and machine learning methods is applied to evaluate the performance of the proposed strategy.
This paper performs statistical analysis on the synthetic dataset and explores correlations between variables in the dataset. The Kolmogorov-Smirnov (KS) test is used to measure whether the real data, with empirical distribution function $F_1(x)$, and the fake data, with empirical distribution function $F_2(x)$, come from the same distribution [36].
$$D_{KS} = \sup_{x}\left|F_1(x) - F_2(x)\right| \qquad (7)$$
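The two-sample KS statistic is the largest gap between the empirical distribution functions of the real and fake samples, and can be sketched as below. The function names and toy samples are illustrative assumptions; `ks_sum` adds the statistic over all variables, which is how a single score for a whole table can be formed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(f_a - f_b))
    return gap


def ks_sum(real_columns, fake_columns):
    """Sum of per-variable KS statistics for two tables given as dicts of columns."""
    return sum(ks_statistic(real_columns[name], fake_columns[name]) for name in real_columns)


identical = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A value near 0 indicates closely matching distributions, while a value of 1 means the two samples do not overlap at all.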
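The credibility of synthetic datasets is evaluated by machine learning models. Therefore, mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE) and the coefficient of determination ($R^2$) are selected as evaluation indicators. MAE and RMSE represent the deviation between the predicted value and field data. MAPE refers to the deviation between the predicted value and field data as a percentage of field data. $R^2$ shows the fit between the predicted value and the field data.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (8)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}} \qquad (9)$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (10)$$
$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}} \qquad (11)$$
In Equations (8)-(11), $y_i$, $\hat{y}_i$, and $\bar{y}$ are the field data, the predicted data, and the average value of the field data, respectively; $n$ is the number of samples in the dataset.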
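The four indicators can be computed together as in the following sketch. The function name and toy values are illustrative assumptions, and $R^2$ is computed with the mean of the field data in the denominator, which is the usual convention.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in percent) and R^2 for paired field and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when a field value is zero; the toy data avoid that case.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Toy field and predicted depths (not taken from the corrosion dataset).
metrics = regression_metrics([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
```

Lower MAE, RMSE, and MAPE and a higher $R^2$ indicate that a model trained on a given dataset predicts the testing set more accurately.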

Results and Discussions
The corrosion dataset after data cleaning contains a total of 197 data points; 70% of the data are used for the training set $D_{train}$, whereas 30% of the data are used for the testing set $D_{test}$. The $D_{train}$ and $D_{test}$ sets consist of 137 and 60 data points, respectively. The data analysis and model training are performed using the algorithm packages contained in Python 3.8.

The Evaluation of the Synthetic Dataset
Before data augmentation is used to generate the fake data, the hyperparameters of the algorithms in the strategy need to be tuned. The main hyperparameters that affect the performance of CTGAN are the learning rates of the generator and the discriminator, the number of epochs, and the batch size. This paper uses the Bayesian optimization algorithm to optimize the hyperparameters. Bayesian optimization is a method that determines the best combination of hyperparameters by using a surrogate function to fit the relationship between the hyperparameters and the evaluation score [37]. In this paper, the tree-structured Parzen estimator is chosen as the surrogate function, and the sum of the KS statistics of fake data and real data for each variable is used for the evaluation. The smaller the sum of the KS statistics, the closer the distributions of fake data and real data are. The total number of epochs is 1750 and the batch size is 150. The learning rates of generator G and discriminator D are 0.00037 and 0.000223, respectively. The main hyperparameters that affect the performance of SVM are the regularization parameter C and epsilon, which are tuned by the FA algorithm. The regularization parameter C is 1.198 and epsilon is 0.1691.
The input variables X_train in the training set are fed into CTGAN, and AEC and FBE in ct are selected as the conditional vector cond to generate 200 new data points each. After samples with negative values in any variable other than pp are excluded, a total of 378 X_fake samples remain. The training set D_train is used to train an SVM that serves as the corrosion growth model. X_fake is then input to this model to obtain Y_fake, which yields the fake dataset D_fake. The original dataset and the fake dataset are merged into a synthetic dataset D_syn with a total of 575 data points. Its statistics are shown in Table 2.

To evaluate whether the proposed data augmentation strategy can truly learn and generate realistic corrosion data, different visualization methods were used to analyze the synthetic dataset. First, the distribution of each input variable in the original dataset and in the synthetic dataset is compared visually, as shown in Figure 5. The original dataset is shown in orange and the synthetic dataset in green. For continuous variables, the two datasets have similar distributions. For the discrete variable ct, the data for FBE and AEC have been supplemented. Therefore, the statistical properties of the synthetic dataset closely match those of the original dataset.

Then, Spearman correlation analysis was performed on the synthetic dataset, as depicted in Figure 6. Compared with the correlation matrix of the original dataset, the synthetic dataset preserves the relationships between the variables. This indicates that the proposed strategy is able to generate data with a structure similar to that of the real data.

There are 10 input variables in the synthetic dataset. For multidimensional data, principal component analysis (PCA) can be used to reduce the data dimension: it orthogonally transforms a set of variables into a set of linearly uncorrelated principal components. Two principal components are extracted to represent the multidimensional data of the original dataset and the synthetic dataset, and the data are displayed in two-dimensional space, as depicted in Figure 7. The synthetic dataset has a distribution similar to that of the original dataset. Therefore, this strategy can capture the distribution of real corrosion data and sample fake data that are highly similar to the real data.
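The filtering, labelling, and merging steps described at the start of this subsection can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the column names `wc` and `depth`, the helper names, and the stand-in prediction function are assumptions made for the example, and in the paper the generated rows would come from CTGAN and the labels from the trained SVM.

```python
from typing import Callable, Dict, List, Sequence, Union

Value = Union[float, str]
Row = Dict[str, Value]


def is_valid(row: Row, signed_columns: Sequence[str] = ("pp",)) -> bool:
    """Return True if no numeric column, other than the signed ones, is negative."""
    for name, value in row.items():
        if isinstance(value, str) or name in signed_columns:
            continue  # categorical values and pp are not checked
        if value < 0:
            return False
    return True


def build_synthetic_dataset(
    real_rows: List[Row],
    generated_rows: List[Row],
    predict_depth: Callable[[Row], float],
    target: str = "depth",
) -> List[Row]:
    """Filter generated inputs, label them with a model, and merge with real data."""
    x_fake = [row for row in generated_rows if is_valid(row)]
    d_fake = [{**row, target: predict_depth(row)} for row in x_fake]
    return list(real_rows) + d_fake


def stand_in_model(row: Row) -> float:
    """Hypothetical placeholder for the trained corrosion growth model."""
    return 0.05 * row["wc"]


# Illustrative values only; they are not taken from the corrosion dataset.
real = [{"ct": "NC", "pp": -0.86, "wc": 23.0, "depth": 1.2}]
generated = [
    {"ct": "AEC", "pp": -0.91, "wc": 18.5},
    {"ct": "FBE", "pp": -0.75, "wc": -2.0},  # negative wc, so this row is dropped
]
d_syn = build_synthetic_dataset(real, generated, stand_in_model)
```

Keeping the validity check separate from the labelling step makes it easy to see how many generated rows survive the filter before any depth is predicted.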

The Credibility of the Synthetic Dataset
Since corrosion data augmentation involves a regression problem, it is necessary to ensure that the generated input variables and output variables retain the same relationship as in the original dataset. Therefore, this paper compares the predictive performance of models trained using synthetic data with that of models trained using real data, and uses different types of machine learning algorithms to check robustness. First, D_fake and D_train are merged into the training set D_syn_train of the synthetic dataset. Then, D_train and D_syn_train are each used to train three machine learning algorithms (ANN, RF, and SVM) to establish the corrosion growth models. The hyperparameters of the three algorithms are left at their default values. Finally, D_test is used to test the predictive performance of these corrosion growth models and thereby assess the credibility of the synthetic data. In addition, to further analyze the prediction performance of the models under each coating type, this paper groups D according to coating type.
Table 3 reports the prediction performance of the models trained on D_train (Model_Ori) and of the models trained on D_syn_train (Model_Syn) under different coating types. The smaller the values of MSE, MAE, and MAPE, the higher the prediction accuracy; the closer the R² value is to 1, the better the fit. For each ct, Model_Syn performs better than Model_Ori. Since FBE and AEC samples are very scarce in the original dataset, D_test contains only 2 and 1 of them, respectively. Therefore, the prediction performances of the models under these two coatings are similar, and the correlation coefficient cannot be calculated under AEC. Over all coatings, the prediction performance of Model_Syn is significantly improved compared with that of Model_Ori. Figure 8 further compares the predicted and real corrosion depths for Model_Syn and Model_Ori. The diagonal line indicates that the predicted depth equals the real depth, and the points of Model_Syn lie closer to it. Therefore, the synthetic dataset derived from the proposed data augmentation strategy is advantageous for establishing the corrosion growth model, and the model shows improved prediction performance for each coating. The synthetic dataset can be used as a substitute for real data for the purpose of corrosion growth prediction.
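For reference, the four indicators used in Table 3 can be written out in a few lines. This is a sketch using the standard textbook definitions, which are assumed to match those used for the table; the sample values are illustrative only. It also shows one plausible reason why a fit statistic is unavailable for AEC: with a single test sample the variance of the real depths is zero, so R² is undefined.

```python
from math import fsum
from typing import Optional, Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return fsum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return fsum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error in percent; assumes no true value is zero."""
    return 100.0 * fsum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> Optional[float]:
    """Coefficient of determination; None when the true values have zero variance."""
    mean_true = fsum(y_true) / len(y_true)
    ss_tot = fsum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        return None  # e.g. a test subset with a single sample
    ss_res = fsum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Illustrative depths, not values from the paper.
real_depth = [1.0, 2.0, 3.0]
predicted_depth = [1.1, 1.9, 3.2]
scores = {
    "MSE": mse(real_depth, predicted_depth),
    "MAE": mae(real_depth, predicted_depth),
    "MAPE": mape(real_depth, predicted_depth),
    "R2": r2(real_depth, predicted_depth),
}
single_sample_r2 = r2([2.5], [2.4])
```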

Comparison with Other Data Generation Methods
The corrosion depth can also be generated through methods such as deep generative models or empirical formulas. Therefore, to analyze the difference between the synthetic dataset obtained by the proposed method and the synthetic datasets obtained by other methods, this paper compares the prediction performance of corrosion growth models trained on the synthetic data obtained by the different methods. To improve the quality of the generated output variables, this experiment makes some adjustments to the proposed strategy. For example, when the hyperparameters of CTGAN are optimized, the RMSE between the predicted depth and the real depth is used as the evaluation metric, and five-fold cross-validation is used to tune the hyperparameters. The total number of epochs is 1300 and the batch size is 100. The learning rate of generator G is 0.000269, and the learning rate of discriminator D is 0.000227.
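The five-fold, RMSE-based evaluation mentioned above can be sketched in plain Python. The text does not specify how each hyperparameter candidate is trained and scored inside a fold, so the `fit_and_predict` callable below is a hypothetical placeholder (here it simply predicts the training mean); only the fold splitting and the RMSE averaging are illustrated. The folds are contiguous for simplicity, whereas a real run would normally shuffle the samples first.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds of (train, validation) indices."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between real and predicted depths."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_validated_rmse(
    y: Sequence[float],
    fit_and_predict: Callable[[List[int], List[int]], List[float]],
    k: int = 5,
) -> float:
    """Average validation RMSE over k folds for one hyperparameter candidate."""
    fold_scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        predictions = fit_and_predict(train_idx, val_idx)
        fold_scores.append(rmse([y[i] for i in val_idx], predictions))
    return sum(fold_scores) / len(fold_scores)


# Illustrative depths and a stand-in model that predicts the training mean.
depths = [0.5, 0.8, 1.1, 1.3, 1.9, 2.2, 2.4, 3.0, 3.3, 4.1]


def mean_predictor(train_idx: List[int], val_idx: List[int]) -> List[float]:
    mean_depth = sum(depths[i] for i in train_idx) / len(train_idx)
    return [mean_depth] * len(val_idx)


cv_score = cross_validated_rmse(depths, mean_predictor, k=5)
```

In a hyperparameter search, the candidate with the lowest `cv_score` would be retained.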
(1) The generation of corrosion depth through the deep generative model
D_train is input to CTGAN, and AEC and FBE in ct are selected as the conditional vector cond to generate 300 new samples each. Negative values in variables other than pp are eliminated, as depicted in Figure 9. Since CTGAN is prone to producing some outliers, samples for which the absolute difference between the fake value and the real value exceeds 0.5 times the real value are deleted. Here, the real value is replaced by the value obtained from the empirical formula. This measure also improves the quality of the data generated by deep generative models. Finally, the data X_ctgan and Y_ctgan are obtained through CTGAN and merged with D_train to form the synthetic dataset D_ctgan.

(2) The generation of corrosion depth through an empirical formula
The power function corrosion growth model is one of the most successful empirical models. Caleyo et al. [38] used multiple nonlinear regression analysis to establish a corrosion growth model based on the power function, which is widely used in current research.

(12)

where t_0 is the pit initiation time, x_i represents environmental factors, and k_i and n_j are the regression coefficients corresponding to each variable. For different soil categories, k_i and n_j take the corresponding values shown in Table 4. In Equation (12), the influence of the coating is modeled through an independent variable whose value is assigned by means of a scoring model, namely FBE = 0.3, AEC = 0.9, WTC = 0.8, NC = 1, CTC = 0.7. X_ctgan is input into Equation (12) to obtain Y_em. They are merged with D_train to form the synthetic dataset D_em.

(3) The generation of corrosion depth through the proposed strategy
X_ctgan is input into the SVM-based corrosion growth model to obtain Y_ml. They are merged with D_train to form the synthetic dataset D_ml.
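The empirical labelling route and the outlier rule from item (1) can be illustrated together. The sketch below assumes a commonly cited power-law form, depth = (k_0 + Σ k_i x_i)(t − t_0)^(n_0 + Σ n_j x_j); this form, the intercepts `k0` and `n0`, the variable keys `wc` and `coating`, and every coefficient value are assumptions for illustration and are not the entries of Table 4. Only the coating scores are taken from the text.

```python
from typing import Mapping

# Coating scores as stated in the text.
COATING_SCORE = {"FBE": 0.3, "AEC": 0.9, "WTC": 0.8, "NC": 1.0, "CTC": 0.7}


def power_law_depth(
    t: float,
    t0: float,
    x: Mapping[str, float],
    k: Mapping[str, float],
    n: Mapping[str, float],
    k0: float = 0.0,
    n0: float = 0.0,
) -> float:
    """Assumed power-law form: (k0 + sum k_i x_i) * (t - t0) ** (n0 + sum n_j x_j)."""
    if t <= t0:
        return 0.0  # no growth before the pit initiation time
    scale = k0 + sum(coef * x[name] for name, coef in k.items())
    exponent = n0 + sum(coef * x[name] for name, coef in n.items())
    return scale * (t - t0) ** exponent


def within_tolerance(generated: float, reference: float, tol: float = 0.5) -> bool:
    """Keep a generated depth if it differs from the reference by at most tol * reference."""
    return abs(generated - reference) <= tol * abs(reference)


# Hypothetical coefficients and inputs, for illustration only.
factors = {"wc": 20.0, "coating": COATING_SCORE["AEC"]}
reference_depth = power_law_depth(
    t=20.0, t0=3.0, x=factors, k={"wc": 0.01}, n={"coating": 0.5}, k0=0.1, n0=0.4
)
generated_depths = [0.9 * reference_depth, 1.8 * reference_depth]
kept = [d for d in generated_depths if within_tolerance(d, reference_depth)]
```

Here the second generated depth deviates from the reference by 80% and is discarded, which mirrors the 0.5 threshold described in item (1).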
Three machine learning algorithms (ANN, RF, and SVM) are trained on each of D_ctgan, D_em, and D_ml to obtain the corrosion growth models, and D_test is used to verify the prediction performance of these models. Table 5 shows that the predictive performance of the three machine learning algorithms trained on synthetic data obtained by the proposed strategy is better than that of models trained on the synthetic data generated by deep generative models and empirical formulas. Taylor diagrams show the standard deviation, root-mean-square difference, and correlation coefficient of the predicted values. This correlation coefficient is the Pearson correlation coefficient; the higher it is, the better the model's prediction performance. Figure 10 shows Taylor diagrams of the three machine learning models trained using the three datasets. Regardless of the machine learning algorithm used, the model built using D_ml has smaller errors and a higher correlation. Therefore, the strategy proposed in this paper outperforms both the empirical formula and the CTGAN-only method in these tests.
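The three quantities summarized by a Taylor diagram can be computed directly. The sketch below uses the standard definitions (population standard deviations, Pearson correlation, and the centered root-mean-square difference), which are assumed to correspond to what Figure 10 plots; the depth values are illustrative only. The diagram works because these quantities satisfy E'² = σ_p² + σ_o² − 2 σ_p σ_o R.

```python
from math import fsum, sqrt
from typing import Dict, Sequence


def taylor_statistics(observed: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Standard deviations, Pearson correlation, and centered RMS difference."""
    n = len(observed)
    mean_o = fsum(observed) / n
    mean_p = fsum(predicted) / n
    dev_o = [o - mean_o for o in observed]
    dev_p = [p - mean_p for p in predicted]
    std_o = sqrt(fsum(d * d for d in dev_o) / n)
    std_p = sqrt(fsum(d * d for d in dev_p) / n)
    # Pearson correlation; assumes neither series is constant.
    corr = fsum(a * b for a, b in zip(dev_o, dev_p)) / (n * std_o * std_p)
    # Centered RMS difference: the means are removed before differencing.
    crmsd = sqrt(fsum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return {"std_obs": std_o, "std_pred": std_p, "corr": corr, "crmsd": crmsd}


# Illustrative depths, not values from the paper.
stats = taylor_statistics([1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.3, 3.8])
identity_gap = stats["crmsd"] ** 2 - (
    stats["std_obs"] ** 2
    + stats["std_pred"] ** 2
    - 2 * stats["std_obs"] * stats["std_pred"] * stats["corr"]
)
```

A model whose point lies closer to the reference point on the diagram has a higher correlation and a smaller centered difference at the same time.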

Conclusions
To achieve an accurate prediction of corrosion depth growth, the data shortage problem in the establishment of data-driven models needs to be solved. Because continuous variables in the corrosion dataset follow non-Gaussian distributions and discrete variables are unevenly distributed, this paper proposes a data augmentation strategy using CTGAN to develop an accurate and robust dataset. Before data generation, data cleaning is performed using various outlier detection methods, and variables are analyzed using Spearman correlation coefficients. The input variables for predicting corrosion growth are generated using CTGAN, while the corresponding output variables are generated using a corrosion growth model based on a hybrid method. The fake data and real data are merged into a synthetic dataset. To verify the effectiveness of the proposed strategy, the synthetic dataset is analyzed and tested. The conclusions are as follows:
(1) The proposed strategy can capture the distribution of real corrosion data and generate fake data that are highly similar to the real data. The variables in the synthetic dataset have distributions and Spearman correlation coefficients similar to those of the real dataset, and the two principal components obtained by the principal component analysis of the two datasets are similar.
(2) The corrosion growth models established by using the synthetic dataset have better predictive performance than the models established by using the real dataset for any coating type. Therefore, the synthetic dataset can be used as a supplement to the real data for corrosion growth prediction.
(3) The superiority of the proposed strategy is demonstrated by comparing it with existing deep generative models and empirical formula methods. The results of this comparison show that the corrosion growth models established by using the synthetic dataset obtained by the proposed method have better prediction performance than those obtained via other methods.

Materials 2024, 17

continuous data and can be considered a combination of box plots and kernel density plots.

Figure 1. The violin plot of the modified dataset.

the levels of the i-th observation of x and y. $\overline{R(x)}$ and $\overline{R(y)}$ are their average values. The Spearman correlation coefficients between variables in soil properties are depicted in Figure 2. The correlation coefficient ranges from −1 to 1. The strength of the correlation between the two variables increases as the value approaches 1 or −1. Positive values signify a positive correlation, whereas negative values signify a negative correlation. Among the continuous variables, soil resistivity and water content exhibit a strong negative association. The remaining variables exhibit relatively weak correlations.
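The rank-based computation behind these coefficients can be sketched in plain Python. This is the standard definition of the Spearman coefficient (the Pearson correlation of the ranks, called levels in the text, with ties given their average rank) and is assumed to match the formula used in the paper; the resistivity and water content values are made up for illustration.

```python
from math import fsum, sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient; assumes neither variable is constant."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = fsum(rank_x) / len(rank_x)
    mean_y = fsum(rank_y) / len(rank_y)
    covariance = fsum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = fsum((a - mean_x) ** 2 for a in rank_x)
    spread_y = fsum((b - mean_y) ** 2 for b in rank_y)
    return covariance / sqrt(spread_x * spread_y)


# Made-up values: resistivity falls as water content rises.
water_content = [12.0, 18.0, 25.0, 31.0, 40.0]
soil_resistivity = [95.0, 60.0, 42.0, 45.0, 10.0]
rho = spearman(water_content, soil_resistivity)
```

For these made-up values the coefficient is strongly negative, the same qualitative pattern the text reports for soil resistivity and water content.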

Figure 2. The Spearman correlation coefficient between variables in soil properties.

Figure 3. The proposed data augmentation strategy.

Figure 4. The structure of GAN.

$P_z(z)$, while the distribution of real data is denoted as $P_{data}(x)$. $D(x)$ represents the probability that the sample comes from real data, taking on values between 0 and 1. $G(z)$

Figure 5. Histogram of the frequency distribution of the variable.

Figure 6. Heat map of the Spearman correlation coefficient of the synthetic dataset.

Figure 7. The principal components of the original dataset and synthetic dataset.

Figure 8. Comparison of predicted and real corrosion depth using Model_Syn and Model_Ori.

Figure 9. The framework for comparing with other data generation methods.

Figure 10. Taylor diagrams for comparison of models.

Table 1. The statistics of buried pipeline corrosion dataset.

Table 2. The statistics of the synthetic dataset.

Table 3. Prediction performances of models built with original and synthetic datasets under different coating types.

Table 4. Coefficients for the corrosion growth model.

Table 5. Predictive performance of models built with different synthetic datasets.