Prediction of Chloride Diffusion Coefﬁcient in Concrete Based on Machine Learning and Virtual Sample Algorithm

Abstract: The durability degradation of reinforced concrete is mainly caused by chloride ingress. Previous studies have used the component parameters of concrete to predict chloride diffusion by machine learning (ML)


Introduction
Given the advantages of combining steel and concrete, reinforced concrete (RC) has become the main component in modern structures. However, the production of concrete causes high energy consumption in the construction industry. Recently, sustainable materials, such as solid waste concrete, have been widely used to save energy and reduce carbon emissions. At the same time, an RC structure built in a severe environment may be damaged by freeze-thaw cycles, water seepage, salt corrosion, carbonation corrosion, etc., which can shorten its service life. Ingredients such as fly ash, slag, silica fume and basalt fiber are added to the concrete to improve the overall performance of RC structures.
Concrete is an inhomogeneous porous material with internal pores of varying shapes and sizes. External media, such as water, CO2, chloride ions, etc., can enter the interior of the concrete through microscopic cracks or pores [1][2][3][4]. Chloride attack is the main cause of structural deterioration of reinforced concrete structures: it can lead to the depassivation of steel bars and spalling of the protective cover, resulting in serious degradation of the structure and huge economic losses [5,6]. The more slowly chloride penetrates the pores, the lower the chloride diffusion coefficient of the concrete. Previous research showed a relationship between microstructural parameters (such as porosity, pore size distribution, pore connectivity and path tortuosity) and the chloride diffusion coefficient of concrete [7][8][9]. Some mathematical-physical models or empirical formulas have been established for specific circumstances. For instance, Jin et al. [10] presented an innovative empirical model to predict the long-term migration behavior of chloride in concrete that considered the influences of the water-cement ratio, time, bonding effect, temperature, relative humidity and concrete deterioration; the reliability of the model was validated against measurements of chloride concentration in concrete exposed long term to the marine environment reported in the literature. Qi et al. [11] proposed a variable-coefficient sulfate-chloride transport model for concrete based on the law of conservation of mass, Fick's second law and the theory of porous media, which was verified by comparison with experimental results. Shazali et al. [12] evaluated the effect of chloride binding, in terms of the isotherm formulations, on the time to corrosion activation resulting from chloride transport in saturated concrete through nonlinear finite element analyses.
It should be mentioned that concrete is a heterogeneous material with pores of different shapes and sizes, so idealized or simplified mathematical-physical models cannot accurately describe the relationship between the complex microstructure and the macroscopic properties of concrete [13][14][15].
Machine learning (ML) refers to computational algorithms designed to simulate human intelligence by learning from given cases; it can find patterns and derive value from large amounts of data. With the development of artificial intelligence technology, ML has been widely used in different areas. Its ability to make accurate predictions from existing data eliminates complex calculations, making ML particularly suitable for predicting the chloride diffusion of concrete.
Researchers have used ML methods to solve the problem of chloride salt erosion. For example, Cai et al. [16] built a database containing 642 groups of concrete with free chloride concentration, in which the prediction approach was established based on linear regression (LR), Gaussian process regression (GPR), support vector machine (SVM), multilayer perceptron artificial neural network (MLP-ANN) and random forest (RF) models, with mix proportion, environmental conditions and exposure time as the input variables. Liu et al. [17] established a database containing 653 groups of chloride diffusion coefficients, in which a prediction model based on an artificial neural network (ANN) was proposed and validated by statistical analyses; a total of 13 parameters, including concrete components, the experimental process, concrete mechanical properties and others, were selected as input variables. Tran [18] presented prediction models based on the extreme learning machine (ELM), SVM, K-nearest neighbors (KNN), light gradient boosting (LGB), extreme gradient boosting (XGB), RF, gradient boosting (GB) and AdaBoost (AdB), together with a database of concrete components and chloride diffusion coefficients. Liu et al. [19] proposed a hybrid intelligent prediction model combining RF and the least squares support vector machine (LSSVM) algorithm to predict the chloride penetration resistance of high-performance concrete based on 100 sets of experimental data considering 12 mix input parameters; this model can also provide a basis for optimizing the concrete mix ratio. Jin et al. [20] established an ANN prediction model considering nine characteristic factors that might affect the chloride penetration resistance of recycled coarse aggregate concrete. The feasibility of the model was verified by analyzing its performance and the simulation results on an out-of-sample data set, and a sensitivity analysis of the input parameters was performed to obtain the influence index of each variable affected by the recycled coarse aggregate characteristics.
The prediction results obtained from ML methods can be significantly affected by the size of the database [21]. Among the described ML methods, ANN and SVM are the most widely used. It is worth noting that MLP is a relatively basic ANN model, which is robust and insensitive to missing data, but it requires a large amount of data and can easily fall into local minima [22,23]. SVM is more accurate and stable and is suitable for small samples with high accuracy, but it is sensitive to noisy data [24,25].
The size and quality of the database have significant influences on the prediction results of ML methods [26]. When predicting chloride diffusivity from the microstructural parameters of concrete using an ML method, a large number of experiments is required, and due to time and cost constraints, the amount of effective data available is very limited. Recently, researchers [21,27] have tended to utilize virtual data generation techniques to generate sufficiently large databases of good quality. The generated virtual database can be used with ML methods to build general small-sample frameworks, and thus the accuracy of the model can be improved [28][29][30].
In most existing literature, the components of concrete were selected as the input variables, so the prediction models were only suitable for specific concretes, and the connection between the microstructure and the macroscopic parameters of concrete was not clear. Therefore, in this paper, data on concrete with fly ash, slag, silica fume and basalt fiber (some of it solid waste concrete) were collected from our published papers [31][32][33][34][35][36][37][38][39] and used for ML. The Gaussian mixture model virtual sample generation algorithm (GMM-VSG) [40] was applied to expand the number of samples, which can improve the accuracy of the prediction results. MLP [41] and SVM [42] models with normalization and standardization preprocessing methods were proposed to predict the chloride diffusion coefficient in concrete.


MLP
MLP is a type of ANN that imitates, in parallel, the connections and signal transmissions among neurons in the biological brain; the smallest unit is called the artificial neuron [41]. As shown in Figure 1, the basic structure of the MLP model contains the input layer, the hidden layer and the output layer, each with a number of neurons, and the neurons are connected with each other through weights. The hyperparameters of the MLP model can be determined by different search strategies, such as grid search [43,44], random search and Bayesian search. In this paper, based on Ref. [45], the selected MLP model has two hidden layers with 16 neurons each. Let the variables of the input layer be x1, ..., xn; the output of one neuron can then be determined by:

y = f(w1x1 + w2x2 + ... + wnxn + b)

where f(x) is the activation function, wi is the weight of input variable xi and b is the deviation. Common activation functions include the Sigmoid, Tanh and ReLU functions, as Figure 2 shows. The Sigmoid function is a threshold unit for microscopic approximation that can be used for binary classification. The Tanh function has a similar shape to the Sigmoid function, but the greater variation of its output makes different patterns easier to distinguish. However, both functions can cause the gradient to vanish and hinder model training. The derivative of the ReLU function is constant, which solves the problem of gradient vanishing; hence, the ReLU function was selected as the model activation function.
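The neuron computation above can be sketched in a few lines; the following is an illustrative pure-Python example with hypothetical weights and inputs, not the trained model from this study:

```python
def relu(x):
    # ReLU activation: its derivative is constant for x > 0,
    # which avoids the vanishing-gradient problem noted above
    return max(0.0, x)

def neuron_output(inputs, weights, bias, activation=relu):
    # Output of one artificial neuron: y = f(sum_i w_i * x_i + b)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# A single neuron with hypothetical weights and bias
y = neuron_output([1.0, 2.0], [0.5, -0.25], 0.1)
print(y)
```

A full MLP layer applies this computation to every neuron in parallel and feeds the resulting vector to the next layer.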


SVM
SVM was first presented by Vapnik based on statistical learning theory [42]; it minimizes the possibility of misclassification, and a schematic diagram of an SVM model can be found in Figure 3. The training set can be defined as T = {(x1, y1), ..., (xN, yN)}, where xi is the feature vector for the classifier and yi is the label belonging to {+1, −1}. The training process of an SVM is a quadratic optimization problem, which is given by Refs. [46,47]:

max_a ∑_{i=1}^{N} ai − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} ai aj yi yj K(xi, xj)
s.t. ∑_{i=1}^{N} ai yi = 0, 0 ≤ ai ≤ C, i = 1, ..., N

where a = (a1, a2, ..., aN) are the Lagrange multipliers, K(·,·) is the kernel matrix, C is a regularization parameter that penalizes the slack variable ξ, and N is the cardinality of the training set. Normally, radial basis functions (RBFs), polynomial functions and Sigmoid functions can be used as kernels, but the selected function needs to meet various criteria, such as simplicity and high generalization capability, and its parameters need to be tuned. In this study, the Gaussian kernel (one type of RBF) was used, that is:

K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
The mathematical model (decision function) of the SVM can be expressed as:

f(x) = ∑_{i=1}^{#SV} ai* yi K(xi, x) + b

where #SV is the cardinality of the support vector (SV) set, ai* is the optimal value of the Lagrange multiplier and b is the deviation, which can be evaluated from the optimal values. According to previous research [48], hyperparameter optimization was used for the SVM method.
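As an illustration of the kernel used above, a minimal Gaussian (RBF) kernel can be written as follows; the gamma parametrization (gamma = 1/(2σ²)) is a common convention, and the sample points are hypothetical:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2),
    # with gamma = 1 / (2 * sigma^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# K(x, x) = 1 for any point; the value decays as the points move apart
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))
```

The kernel matrix K(xi, xj) in the optimization problem above is obtained by evaluating this function over all pairs of training points.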



GMM-VSG
GMM-VSG is a density estimation algorithm [40] based on the Gaussian mixture model, which can approximate arbitrary nonlinear density functions by adjusting its component weights; the probability density function of the mixture model changes accordingly. The steps of GMM-VSG are shown in Figure 4. If the sample X = {x1, ..., xn} is generated by a Gaussian mixture distribution P composed of G components, the mixed log-likelihood function of P is given by:

ln L(θ) = ∑_{i=1}^{n} ln ∑_{k=1}^{G} πk fk(xi; θk)

where θk is the parameter of the k-th component, γik is the probability that sample xi is fitted by component k and πk is the weight parameter.

The density function fk(·) can be further expressed as the Gaussian density:

fk(x; θk) = N(x; µk, Σk)

where µk is the expectation of the sample and Σk is the covariance of the sample. Furthermore, the Gaussian mixture distribution can be represented by the density function, that is:

p(x) = ∑_{k=1}^{G} πk N(x; µk, Σk)

The parameters of the Gaussian mixture model can be solved by the EM algorithm. In the E-step, the posterior probability of the implicit variables is determined from the initial values or the parameters of the last iteration:

γik = πk N(xi; µk, Σk) / ∑_{j=1}^{G} πj N(xi; µj, Σj)

In the M-step, the new parameters are obtained by maximizing the likelihood function:

πk = (1/n) ∑i γik,  µk = ∑i γik xi / ∑i γik,  Σk = ∑i γik (xi − µk)(xi − µk)^T / ∑i γik

When data are limited, it is difficult to balance model complexity and fitting ability during model selection. Based on the concept of entropy, the AIC and BIC provide a standard to weigh the complexity of the estimated model against the goodness of fit of the data, defined in Equations (11) and (12):

AIC = 2k − 2 ln(L̂) (11)

BIC = k ln(n) − 2 ln(L̂) (12)
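The virtual-sample generation step can be sketched as follows: once the mixture parameters have been estimated, new samples are drawn by first picking a component k with probability πk and then sampling from its Gaussian. This one-dimensional pure-Python sketch uses hypothetical parameter values, and non-physical negative draws are discarded:

```python
import random

def sample_gmm(weights, means, stds, n, seed=0):
    # Draw n virtual samples from a 1-D Gaussian mixture:
    # pick component k with probability pi_k, then draw from N(mu_k, sigma_k);
    # non-physical negative values are discarded
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        k = rng.choices(range(len(weights)), weights=weights)[0]
        x = rng.gauss(means[k], stds[k])
        if x > 0:
            out.append(x)
    return out

# Hypothetical 2-component mixture for a porosity-like variable (%)
virtual = sample_gmm([0.6, 0.4], [8.0, 15.0], [1.5, 2.0], 1000)
print(len(virtual), min(virtual) > 0)
```

In GMM-VSG proper, the parameters are first estimated from the real samples by the EM algorithm, with the number of components G selected by the information criterion.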
where k is the number of model parameters and n is the number of samples. The penalty term of the BIC accounts for the number of samples, which makes it more suitable for small samples; hence, the BIC was selected as the information criterion in this paper.

Data Sources
To predict the value of D (the chloride diffusion coefficient), 118 sets of data were selected from our previous tests (including an exposure test in a marine environment, simulated experiments in a laboratory environment and a porosity test using the mercury intrusion method) as the first group of samples, and a database containing 194 sets of data (including the 118 sets in the first group) sorted from the published literature [31][32][33][34][35][36][37][38][39] was created as the second group of samples. In addition, the GMM-VSG method proposed by Shen and Qian [40] was used to expand the second group by 1000 sets of data; after eliminating two negative values, a total of 998 sets of data were obtained for the third group.

Input Variable Selection
Six parameters that may affect the D value are selected as input variables. They can be classified into two categories: (1) concrete microscopic parameters, including porosity and the contribution porosity of pores of different sizes (d < 20 nm, 20 nm < d < 50 nm, 50 nm < d < 200 nm, d > 200 nm); (2) a control variable (i.e., exposure age).
Among them, the contribution porosity is the percentage of pores in each size range multiplied by the total porosity. Exposure age was chosen as the control variable to prevent concretes with different admixtures from exhibiting the same contribution porosity at different ages.
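The contribution porosity defined above is simply the pore-volume fraction of each size range multiplied by the total porosity; a minimal sketch with hypothetical mercury-intrusion values:

```python
def contribution_porosity(total_porosity, fractions):
    # Contribution porosity of each pore-size range:
    # the volume fraction of that range times the total porosity
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    return [total_porosity * f for f in fractions]

# Hypothetical result: 12% total porosity split over the four ranges
# d < 20 nm, 20-50 nm, 50-200 nm, d > 200 nm
cp = contribution_porosity(12.0, [0.30, 0.25, 0.25, 0.20])
print(cp)
```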

Statistical Characteristics of Data
Statistical analysis was conducted on each group of data. Tables 1-3 show the statistical parameters of each data group, including the maximum value, minimum value, average value, median value, standard deviation, kurtosis and skewness. Taking the porosity as an example, the minimum value of the first group is the same as that of the second group, while the maximum value of the second group is larger than that of the first group. Besides, the maximum value of the second group is the same as that of the third group, while the minimum value of the third group is less than that of the second group. The standard deviation of the third group is the largest, followed by the second group and the first group. The main reason is that the data in the first group come from the same authors, resulting in smaller dispersion.

Variable Correlation
Variable correlation analysis was conducted on each group of data, as shown in Figure 5. It can be seen that the correlation of the variables in the second group is partially different from that in the first group. In addition, the data obtained from the VSG algorithm in the third group show a better variable correlation.


Data Preprocessing
It is necessary to preprocess the input and output variables to improve the training accuracy of the model. Standardization and normalization preprocessing methods were used to treat the data and train the model, which avoids numerical problems due to the magnitude differences of the input data. For the MLP model, preprocessing also prevents the saturation of neurons, which improves the convergence speed [49].
The normalization preprocessing method is given by:

y = (x − xmin) / (xmax − xmin) (13)

where xmin is the minimum value, xmax is the maximum value and x is the sample value to be normalized. After normalization preprocessing, the results need to be reverse normalized, that is:

x = y (xmax − xmin) + xmin (14)

The standardization preprocessing method is given by:

y = (x − µ) / σ (15)

where µ is the average value and σ is the standard deviation.
The data treated by the standardization or normalization preprocessing method both change linearly, preserving the ordering of the samples while rescaling them. Standardization rescales the data to a standard deviation of 1 and a mean of 0, while normalization maps the data into the range between 0 and 1. Thus, the standardization preprocessing method better maintains the spatial structure of the sample, and the normalization preprocessing method is more susceptible to outliers.
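The two preprocessing transforms can be sketched directly; this pure-Python example (with hypothetical data) shows that normalization maps the values into [0, 1] while standardization rescales them to zero mean and unit standard deviation:

```python
def normalize(xs):
    # Min-max normalization: y = (x - x_min) / (x_max - x_min)
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def denormalize(ys, lo, hi):
    # Reverse normalization: x = y * (x_max - x_min) + x_min
    return [y * (hi - lo) + lo for y in ys]

def standardize(xs):
    # Standardization: y = (x - mu) / sigma
    mu = sum(xs) / len(xs)
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sigma for x in xs]

data = [2.0, 4.0, 6.0, 8.0]   # hypothetical sample values
print(normalize(data))        # values in [0, 1]
print(standardize(data))      # zero mean, unit standard deviation
```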


Partition of Training and Testing Set
The 10-fold cross-validation method is used for the training data, as it can improve the accuracy [50] and generalization ability of the model [51,52] so that overfitting can be avoided. As shown in Figure 6, all data are divided into ten parts; in each round, one fold is used as the validation set and the others as the training set. The optimal hyperparameters for the training model can then be obtained. In addition, 10% of all data are randomly selected for testing.
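The 10-fold partition can be sketched as follows (pure Python, with a hypothetical sample count matching the third data group):

```python
import random

def k_fold_indices(n, k=10, seed=0):
    # Shuffle sample indices and split them into k folds; each fold is
    # used once as the validation set, the remaining k-1 folds as training
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, val))
    return splits

# 998 samples (the expanded third group) split into ten folds
splits = k_fold_indices(998, k=10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Hyperparameter candidates are then scored by their average validation error over the ten splits, and the best configuration is retrained on the full training data.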


Performance Evaluation
Statistical parameters, including the correlation coefficient R², the mean absolute error MAE and the mean squared error MSE, are evaluated for the reliability analysis. These parameters can be determined by:

R² = 1 − ∑_{i=1}^{m} (yi − ŷi)² / ∑_{i=1}^{m} (yi − ȳ)²

MAE = (1/m) ∑_{i=1}^{m} |yi − ŷi|

MSE = (1/m) ∑_{i=1}^{m} (yi − ŷi)²

where m is the number of samples and ŷi, yi and ȳ represent the predicted value, true value and average value of the sample, respectively. The OBJ indicator proposed by Golafshani et al. [53] is adopted to compare the performance of the different models.
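The three statistical indicators can be computed as follows (pure-Python sketch; the composite OBJ indicator of Ref. [53] is omitted here):

```python
def evaluate(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, MAE = mean |y - y_hat|, MSE = mean (y - y_hat)^2
    m = len(y_true)
    mean = sum(y_true) / m
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / m
    mse = ss_res / m
    return r2, mae, mse

# Perfect predictions give R^2 = 1 and MAE = MSE = 0
print(evaluate([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```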


MLP Model
The measured values of D were taken as the horizontal coordinate and the predicted values as the vertical coordinate. The prediction results obtained by the MLP model with the normalization and standardization preprocessing methods are plotted in Figure 7.
As seen from Figure 7, only a small number of points in the first and second groups fall within the ±15% error bars, while more points in the third group fall within them regardless of the data preprocessing method. Moreover, the mean single-point errors of the three groups are 1.17, 1.42 and 0.95 when using the normalization preprocessing method, indicating that the model trained with the third group of data is more stable than the others, while the prediction accuracy of the second group is worse than that of the first group.
It should be mentioned that the MLP model has good prediction performance if the values of MAE, MSE and OBJ are low or R² is close to 1. The prediction evaluation indicators of the MLP model are listed in Table 4. The data of the third group performed better than the other two groups whether the normalization or standardization preprocessing method was used, which reveals that the GMM-VSG algorithm can help improve the prediction accuracy of the MLP model. Although the prediction effect of the second group is better than that of the first group, its MAE and MSE are larger than those of the first group. The reason may be that the increase in data makes the prediction more accurate, while the second group of data comes from multiple different experiments, so there are more test errors, which increases the proportion of noise. Furthermore, the OBJ value of the MLP model with normalization preprocessing is lower than that with standardization preprocessing in the third group; thus, the normalization preprocessing method is more suitable for MLP. This is mainly because the distribution of the data set becomes uniform after using the GMM-VSG algorithm and the impact of noise values is reduced. After normalization, the gradient descent becomes stable and the convergence of the MLP model becomes faster, so the neurons do not saturate.

The loss function is the deviation between the real value and the predicted value, which can be used to evaluate the performance of the model. Figure 8 shows the relationship between the loss value and the number of iterations. The loss value first decreased rapidly and then leveled off in all three groups. The loss curves oscillated in the first and second groups but remained stable in the third group; the oscillations were caused by outliers in the dataset, which affect the accuracy of the MLP model. In general, the loss value of the data trained by the standardization preprocessing method was larger than that trained by the normalization preprocessing method, and its curves oscillated more. Furthermore, the loss value of the data trained by the normalization preprocessing method oscillated significantly only at the 800th and 2700th iterations in the first two groups and remained stable in the third group, falling below 0.007 after 3000 iterations, whereas the loss value with standardization preprocessing was about 0.091 after 3000 iterations for the third group. This also indicates that normalization preprocessing is more robust and more suitable for the MLP model.

SVM Model
The prediction results obtained by the SVM model with the normalization and standardization preprocessing methods are shown in Figure 9. Similar to the MLP model, few points of the first and second groups fall within the ±15% error bars, while most points of the third group do, regardless of the data preprocessing method. The mean single-point errors of the three groups are 1.31, 1.43 and 1.03 when using the standardization preprocessing method, indicating that the SVM model trained with the third group of data is more reliable than the other two. As the SVM model is more sensitive to noise values, the effect of the data increment is less obvious, and the prediction performance of the second group is much worse than that of the first group.

The prediction evaluation indicators of the SVM model are shown in Table 5. As expected, the data of the third group performed better whether the normalization or standardization preprocessing method was used. It is interesting to find that the OBJ value of the SVM model with standardization preprocessing is lower than that of the MLP model with normalization preprocessing in the third group. This is because SVM is a distance-based algorithm and the standardization preprocessing approach maintains the spatial structure of the sample.

Comparisons
The comparisons of the two machine learning methods with the standardization and normalization preprocessing approaches are shown in Figure 10. In general, the SVM model can make more accurate predictions, while the MLP model is more stable. Because the hyperplane of the SVM is determined by the support vectors, under a fixed regularization, noise values will cause an unreasonable division, so SVM is more sensitive to noise. Since MLP updates its weights through multiple backpropagations, it is less sensitive to noise.

Comparisons
The comparisons of two machine learning methods with standardization and/or normalization preprocessing approaches are shown in Figure 10.In general, the SVM model can make a more accurate prediction while the MLP model is more stable.Because the hyperplane of SVM is determined by the support vector, in the condition of a fixed regularization, noise value will cause unreasonable division.SVM is more sensitive to the noise value.Since MLP updates the weights through multiple backpropagations, it is less sensitive to the noise value.In the first and second group, the OBJ values obtained from the MLP model with standardization and normalization preprocessing method are close, while the OBJ value obtained from the SVM model with normalization preprocessing method is lower than that with the standardization preprocessing method.In the third group, both the MLP and SVM models can produce good predictions, and the SVM model with standardization preprocessing method is the most accurate one.
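The two preprocessing approaches compared above can be sketched with stdlib Python only; the water-cement-ratio column is a hypothetical feature, not a column of the actual database.

```python
import statistics

def min_max_normalize(values):
    """Min-max normalization: maps values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: zero mean, unit (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

water_cement_ratio = [0.35, 0.40, 0.45, 0.50, 0.60]  # hypothetical feature column
print(min_max_normalize(water_cement_ratio))
print(standardize(water_cement_ratio))
```

The paper attributes SVM's preference for standardization to its preservation of sample spacing, and MLP's preference for normalization to the bounded [0, 1] inputs keeping neurons away from saturation.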

Conclusions

In this paper, a reliable and accurate model for predicting the chloride diffusion coefficient in concrete has been established. A total of 194 sets of data were collected from the existing literature, and an expanded virtual database was obtained from the GMM-VSG algorithm. The following conclusions can be drawn:

•
The connection between macroscopic properties and microstructure of concrete can be assessed by machine learning methods.

•
The MLP and SVM models built on the virtual database are capable of predicting the chloride diffusion coefficient in concrete. The R² obtained from the MLP and SVM models is 0.95 (with normalization) and 0.97 (with standardization), respectively. The OBJ values obtained from the MLP and SVM models are 0.68 (with normalization) and 0.30 (with standardization), respectively.

•
The expanded data set produced by the GMM-VSG algorithm helps to improve the accuracy of the MLP and SVM models, and the improvement is greater for the SVM model, probably because the increase in the amount of data weakens the impact of noise values.

•
The normalization preprocessing method is more suitable for the MLP model, while the standardization preprocessing method is better adapted to the SVM model. This may be because standardization preserves the sample spacing better, while normalization prevents oversaturation of neurons. The normalization preprocessing method is also more robust than the standardization preprocessing method.
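A minimal sketch of the virtual-sample idea behind GMM-VSG, assuming the standard scikit-learn `GaussianMixture` API: fit a Gaussian mixture to the real data and draw virtual samples from the fitted distribution. The two-feature toy data, component count, and sample sizes are illustrative, not the paper's actual settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy stand-in for the 194 collected samples: two hypothetical features,
# e.g. water-cement ratio and curing age (values are illustrative only).
real = np.vstack([
    rng.normal([0.40, 28.0], [0.03, 5.0], size=(100, 2)),
    rng.normal([0.55, 90.0], [0.04, 10.0], size=(94, 2)),
])

# Fit a Gaussian mixture to the real data, then sample virtual points from it.
gmm = GaussianMixture(n_components=2, random_state=0).fit(real)
virtual, _ = gmm.sample(400)           # expanded virtual data set
expanded = np.vstack([real, virtual])  # real + virtual training pool

print(real.shape, virtual.shape, expanded.shape)
```

In the workflow described above, an expanded pool of this kind is what the MLP and SVM models are then trained on.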

Figure 1 .
Figure 1.The basic structure of MLP model.


Figure 3 .
Figure 3. Schematic diagram of an SVM model.(The red and blue dots indicate the samples that need to be classified.)


Figure 5 .
Figure 5. Variable correlation of each data group (a) First group; (b) Second group; (c) Third group.

Figure 7 .
Figure 7. Prediction results of the MLP model (a) First group; (b) Second group; (c) Third group.

Figure 8 .
Figure 8. Relationship between loss value and number of iterations (a) First group; (b) Second group; (c) Third group.


Figure 9 .
Figure 9. Prediction results of the SVM model (a) First group; (b) Second group; (c) Third group.

Figure 10 .
Figure 10.Comparisons of OBJ value obtained from MLP and SVM models with standardization and/or normalization preprocessing method.


Table 1 .
Statistical parameters of the first group.

Table 2 .
Statistical parameters of the second group.

Table 3 .
Statistical parameters of the third group.

Table 4 .
Prediction evaluation indicators of the MLP model.

Table 5 .
Prediction evaluation indicators of the SVM model.