Data-Driven Parameter Selection and Modeling for Concrete Carbonation

Concrete carbonation is known as a stochastic process. Its uncertainties mainly result from parameters that are not considered in prediction models. Parameter selection, therefore, is important. In this paper, based on 8204 sets of data, statistical methods and machine learning techniques were applied to choose appropriate influence factors in terms of three aspects: (1) the correlation between factors and concrete carbonation; (2) factors’ influence on the uncertainties of carbonation depth; and (3) the correlation between factors. Both single parameters and parameter groups were evaluated quantitatively. The results showed that compressive strength had the highest correlation with carbonation depth and that using the aggregate–cement ratio as the parameter significantly reduced the dispersion of carbonation depth. Machine learning models showed that the selected parameter groups had a large potential to improve model performance with fewer parameters. This paper also developed machine learning carbonation models and simplified them to propose a practical model. The results showed that this concise model had a high accuracy on both accelerated and natural carbonation test datasets. For natural carbonation datasets, the mean absolute error of the practical model was 1.56 mm.


Introduction
It is well known that carbonation does not normally damage concrete directly [1]. However, the chemical reaction slowly destroys the alkaline environment. CO2 diffuses into the concrete through interconnected pores and reacts with calcium hydroxide (CH) and hydrated calcium silicate (C-S-H) [2,3]. Consequently, it destroys the passive oxide layer of the rebar and ultimately initiates corrosion [4].
Many models, including theoretical formulas [5,6], numerical models [7][8][9], and machine learning models [10,11], have been developed over the last few decades to evaluate the carbonation status of concrete. Parameter research plays an important role in the modeling process. Previous parameter research focused on the mechanism of influence factors. Many experimental tests [12][13][14][15] have been performed to qualitatively analyze the effects that factors have on concrete carbonation and investigate the mechanism in terms of chemical reactions.
For example, Papadakis et al. [16] suggested that replacing cement with fly ash would increase the porosity, while replacing aggregate with fly ash would decrease the porosity. Experiments in [17][18][19] suggest that replacing cement with silica fume reduces porosity since silica fume has very fine particles and a high amorphous silicon dioxide content. Large amounts of active alumina and amorphous SiO2 in fly ash consume the CH, but the ferrous phase in crystalline form does not participate in the pozzolanic reactions [16]. Qiang et al. [20] pointed out that steel slag had a weak reactivity. Han et al. [21] further pointed out that the main active components in steel slag were only C2S and C3S. Li et al. [22] demonstrated that the compressive strength, porosity, and permeability of concrete changed significantly during carbonation. Jiang et al. [23] explored the influence of various binder types and geometrical parameters (i.e., concrete cover thickness) on concrete carbonation and steel corrosion. The effects of supplementary cementitious materials and environmental factors such as relative humidity have been studied by many experimental tests [24,25].
However, present carbonation models have not involved enough work on parameter research. Some models choose parameters subjectively. For example, some empirical models take the water-cement ratio as the main parameter, while others choose concrete strength, but few studies explain these choices quantitatively. It is easy to judge whether a variable affects concrete carbonation, but including every possible factor in a model is superfluous. Therefore, it is important to quantitatively analyze these factors.
Moreover, in terms of concrete mix design, quantitative parameter research is also necessary, since standards are needed for specifying the limitation of several indicators to control the durability of concrete. For instance, the new European Standard EN 206-1 specifies the minimum binder content and maximum water-binder ratio to guarantee the performance of concrete [26]; Chinese code GB 50010-2010 specifies the maximum water-binder ratio and the minimum strength level.
The residual of carbonation models and the uncertainties in controlling the durability mainly result from factors that are not considered in models and concrete mix design. These uncertainties can be decreased by selecting appropriate parameters. For example, Figure 1 illustrates how parameter selection affects the residual of models. The dataset was generated by $y = e^{x_1 + x_2}$. The model in Figure 1b has the best performance as it has the smallest residuals. Moreover, it is better to use $x_1 \times x_2$ (Figure 1d) rather than $x_1$ or $x_2$ (Figure 1a,c) to establish a model, as the former can significantly reduce the uncertainty of the prediction. Therefore, it is important to find appropriate parameters in modeling.
Quantitative analysis requires a large amount of test data, which is difficult to find in previous studies. In this project, many studies were consulted and a dataset including 8204 samples was established. Statistical methods were used for data-driven analysis, as well as machine learning techniques. This paper quantitatively studied influence factors in terms of three aspects: (1) the correlation between factors and concrete carbonation; (2) factors' influence on the fluctuation of carbonation depth; and (3) the correlation between factors, which reflects the redundancy [27]. A total of 29 material-related parameters were involved. Then, we selected some parameter groups in terms of these three aspects and developed several machine learning models to verify the effectiveness. After that, one machine learning model involving a few factors was established and a practical model was proposed by simplifying it. The effectiveness of the practical model was verified via accelerated and natural carbonation datasets.

Data Collection and Description
It is necessary first to provide a concise introduction to the dataset. To make the results more reliable, 8204 sets of data were collected from 161 papers indexed in Web of Science, CNKI, and WAN FANG DATA, as shown in Table 1. All concrete samples used in this study were cured for 28 days before the accelerated carbonation tests, and carbonation depth was determined by phenolphthalein. The experimental environment conditions of all the referred studies were constant. Generally, empirical models based on accelerated carbonation datasets are not appropriate for predicting actual concrete carbonation as they are often obtained under a high CO2 concentration. However, in terms of the relationship between factors and carbonation depth, accelerated carbonation tests are still appropriate. It is noted that the degree of the correlation is important in this part, regardless of whether this influence is positive or negative for carbonation. This implies that short-term accelerated carbonation datasets can work in this research, as some parameters such as fly ash develop a significant proportion of concrete's strength and durability after 28 days. Parameters $p_i$ ($i$ = C, S, A, F, or $\bar{S}$, i.e., CaO, SiO2, Al2O3, Fe2O3, or SO3) can be calculated by:
$$p_i = \sum_{k} p_k \, p_{i,k} + \sum_{j} p_{\mathrm{Cem}j} \, p_{i,\mathrm{Cem}j} \tag{1}$$
where $p_i$ is the weight of $i$ used per unit volume of concrete; $p_k$ is the weight of material $k$ ($k$ = FA, FS, SA, or SS, i.e., fly ash, furnace slag, silica ash, or steel slag) used per unit volume of concrete; $p_{i,k}$ is the content of $i$ in material $k$; $p_{\mathrm{Cem}j}$ is the weight of cement of class $j$; and $p_{i,\mathrm{Cem}j}$ is the content of $i$ in class $j$ cement, as is explained in Table 1. In addition, furnace slag and fly ash were classified by their fineness and specific surface area according to ground granulated blast-furnace slag used for cement, mortar, and concrete GB/T 18046-2017 and fly ash used for cement and concrete GB/T 1596-2017, respectively. Table 1 exhibits detailed information about the dataset.
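The oxide bookkeeping described above can be sketched in a few lines. This is only an illustration that reads the definitions as a weighted sum over supplementary materials and cement classes; the function name, the dictionary layout, and all numbers are hypothetical and are not taken from the dataset.

```python
def oxide_weight(oxide, scm_weights, scm_contents, cement_weights, cement_contents):
    """Weight of one oxide per unit volume of concrete (weighted-sum assumption).

    scm_weights:     {material: weight per unit volume}, e.g. fly ash "FA"
    scm_contents:    {material: {oxide: mass fraction}}
    cement_weights:  {cement class: weight per unit volume}
    cement_contents: {cement class: {oxide: mass fraction}}
    """
    from_scm = sum(
        w * scm_contents[k].get(oxide, 0.0) for k, w in scm_weights.items()
    )
    from_cement = sum(
        w * cement_contents[j].get(oxide, 0.0) for j, w in cement_weights.items()
    )
    return from_scm + from_cement


# Hypothetical mix; every value below is made up for illustration only.
scm_weights = {"FA": 80.0}
scm_contents = {"FA": {"CaO": 0.05, "SiO2": 0.50}}
cement_weights = {"Cem1": 300.0}
cement_contents = {"Cem1": {"CaO": 0.60, "SiO2": 0.20}}

cao = oxide_weight("CaO", scm_weights, scm_contents, cement_weights, cement_contents)
print(f"CaO per unit volume: {cao:.1f}")
```

Keeping the contents as per-material dictionaries means an oxide that is not reported for a material simply contributes zero, which is a design choice of this sketch rather than something stated in the paper.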

Parameter Evaluation and Selection
Parameter selection is always a hot topic in data science, and it has recently been applied in civil engineering. Some studies [28,29] adopted conventional parameter selection methods for solving energy issues in buildings. Conventional selection methods aim to remove irrelevant and redundant information from the dataset according to two criteria: the correlation between the parameters and the object (i.e., remove irrelevant information), and the correlation between the parameters (i.e., find redundant information) [30].
Correlation analysis methods mainly include Pearson's correlation coefficient, Spearman's correlation coefficient, Kendall's correlation coefficient, maximal information coefficient (MIC) [31,32], etc. Selection methods based on them are called filter methods [33]. In addition, wrappers and embedded methods are also widely used for selection [27]. Decision trees, naive Bayes, and support vector machines [34,35] are several popular predictors. For these predictors, the criterion essentially depends on the loss function of the predictors. The selection of approaches for correlation analyses is based on the statistical characteristics of data and the target of the study.
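The two filter criteria described above (relevance to the target and redundancy between parameters) can be illustrated with a small greedy selector. This is a generic sketch, not the selection procedure used in this paper: the function names, the use of Pearson's coefficient, and both thresholds are assumptions chosen only for the example, and the data are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def filter_select(features, target, min_relevance=0.3, max_redundancy=0.9):
    """Greedy filter: keep relevant features that are not redundant.

    features: {name: list of values}; target: list of values.
    Both thresholds are illustrative assumptions.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    for name in sorted(relevance, key=relevance.get, reverse=True):
        if relevance[name] < min_relevance:
            break  # every remaining feature is even less relevant
        redundant = any(
            abs(pearson(features[name], features[s])) >= max_redundancy
            for s in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Toy data (made up): "b" is "a" rescaled, "c" is unrelated to the target.
features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "c": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
selected = filter_select(features, target)
print(selected)
```

Only one of the two rescaled features survives, and the weakly related one is dropped for low relevance.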
In terms of statistics, the dataset used in this study was not appropriate for methods that require the data to follow a Gaussian distribution. In addition, some new attributes should be included:

• Parameter performance in controlling and predicting the durability of concrete under the impact of uncertainties of carbonation depth needs to be evaluated.
• Parameters should reduce the dispersion of carbonation depth.

Method
In addition to the correlation analysis, a quantitative analysis of a parameter's effects on the dispersion of carbonation depth is needed.
For the first aspect, $CORR_k$ was used to denote the correlation between parameter $k$ and carbonation depth. A high $CORR_k$ represented a strong correlation. Spearman's correlation coefficient, Pearson's correlation coefficient, and Kendall's correlation coefficient are common correlation analysis indices. Spearman's correlation coefficient and Kendall's correlation coefficient are copula-based random variable dependency measurement indices. Compared with Pearson's correlation coefficient, they do not require that datasets conform to a special distribution. Generally, Spearman's correlation coefficient is the Pearson correlation coefficient calculated from the vectors of ranks [36]. The Pearson correlation coefficient of vectors $X$ and $Y$ can be calculated by [33]:
$$\rho_{X,Y} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{2}$$
where $N$ is the number of samples and $\bar{X}$ is the average value. Then, the Spearman correlation coefficient of vectors $X$ and $Y$ can be calculated by replacing the values in each vector with their ranks.
For the second aspect, $VARR_k$ depicted $k$'s effect on the dispersion of carbonation depth. The uncertainties of an observed value such as carbonation depth are usually described by its variance. A high variance means a large dispersion; predicting the carbonation depth with its average value is thus unreliable. Therefore, $VARR_k$ was defined as the reduction degree of uncertainties of the observed carbonation depth data after parameter $k$ was used, which can be written as:
$$VARR_k = \frac{Var - Var_k}{Var} \tag{3}$$
where $Var$ represents the standard deviation of carbonation depth $x$ when no parameter is considered and $Var_k$ is the corresponding value after parameter $k$ is adopted. If the usage of parameter $k$ does not affect the dispersion of carbonation depth, $Var$ will be equal to $Var_k$ and $VARR_k$ will be equal to zero. $Var$ can be obtained by Equation (4):
$$Var = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2} \tag{4}$$
where $N$ is the number of samples in the original dataset $D$, $X_i$ is the carbonation depth of the $i$th sample, and $\bar{X}$ is the average value.
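The statement that Spearman's coefficient is Pearson's coefficient computed on ranks can be made concrete with a short sketch. The helper names and the example values are hypothetical; ties are given the mean of their rank positions, which is the usual convention but is an assumption here since the text does not say how ties were treated.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up values: depth falls monotonically but not linearly with strength.
strength = [20.0, 30.0, 40.0, 50.0, 60.0]
depth = [30.0, 16.0, 9.0, 6.0, 5.0]
print(f"Pearson:  {pearson(strength, depth):.3f}")
print(f"Spearman: {spearman(strength, depth):.3f}")
```

Because the relationship in the toy data is monotone but curved, the rank-based coefficient reaches its extreme value while Pearson's does not, which is why a rank-based index suits data that do not follow a particular distribution.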
The uncertainties of $x$ after adopting parameter $k$ can be obtained by Equation (5):
$$Var_k = \sum_{i=1}^{t} \frac{N_i}{N} Var_{i,k} \tag{5}$$
where $Var_{i,k}$ denotes the standard deviation of $x$ at $k = k_i$, $k_i$ is one of the values of $k$, and $N_i$ is the number of samples at $k = k_i$. Let $D_i$ denote the subset containing these $N_i$ samples; $N_i/N$ represents the weight of $D_i$. Suppose that $k$ has $t$ values ($k = k_1, k_2, \ldots, k_t$); then $Var_{i,k}$ can be calculated by Equation (6):
$$Var_{i,k} = \sqrt{\frac{1}{N_i} \sum_{j=1}^{N_i} (X_{ij} - \bar{X}_i)^2} \tag{6}$$
where $X_{ij}$ is the carbonation depth of the $j$th sample in $D_i$ and $\bar{X}_i$ is the mean of $x$ in $D_i$. According to Equation (6), if $k$ does not affect the distribution of carbonation depth, $D$ and $D_i$ have the same distribution (i.e., $Var = Var_{i,k}$). This can be verified by linking an irrelevant variate $k$ produced by random sampling to carbonation depth, and calculating the value of $Var_{i,k}$. Therefore, the importance of parameter $k$ can be evaluated by Equation (7):
$$I_k = \frac{CORR_k}{\max CORR} \times \frac{VARR_k}{\max VARR} \tag{7}$$
where $I_k$ is the importance coefficient of parameter $k$, $\max VARR$ is the maximum value of $VARR$, and $\max CORR$ is the maximum value of $CORR$. As shown in Equation (7), parameters that have a weak correlation with carbonation or that do not reduce the uncertainties will make $I_k$ equal to zero.
In addition, there is still one possible issue in the calculation of $Var_{i,k}$: its sensitivity to outliers. This was handled by identifying and excluding outliers. The interquartile range (IQR) is a common index for finding outliers in statistics. IQR can be calculated by:
$$IQR = Q_3 - Q_1 \tag{8}$$
where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile. Moreover, $x$ is considered an outlier if it meets:
$$x < Q_1 - 1.5\,IQR \quad \text{or} \quad x > Q_3 + 1.5\,IQR \tag{9}$$
Deleting outliers changes the dataset used for calculating $VARR_k$. To maintain the consistency of the dataset, the calculation of $CORR_k$ and $VARR_k$ used the same dataset. In addition, to improve the stability of evaluation results, the original dataset, $D$, was split into three child datasets $D_l$ ($l$ = 1, 2, 3). Simple random sampling without replacement was used for splitting.
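The dispersion measure and the outlier rule described above can be sketched as follows. Several details are assumptions made for this illustration: the standard deviation divides by the sample count, $VARR_k$ is taken as the relative reduction (Var − Var_k)/Var so that it is zero when grouping changes nothing, and quartiles come from Python's `statistics.quantiles` with its default convention. The function names and data are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def std(values):
    """Standard deviation dividing by N (the divisor is an assumption)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def remove_outliers(pairs):
    """Drop (parameter, depth) pairs whose depth fails the 1.5 * IQR rule."""
    depths = [x for _, x in pairs]
    q1, _, q3 = statistics.quantiles(depths, n=4)  # quartile convention assumed
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [(k, x) for k, x in pairs if low <= x <= high]


def varr(pairs):
    """Relative reduction in dispersion after grouping depths by parameter."""
    groups = defaultdict(list)
    for k, x in pairs:
        groups[k].append(x)
    n = len(pairs)
    var_all = std([x for _, x in pairs])
    # Weighted mean of the within-group standard deviations (weights N_i / N).
    var_k = sum(len(g) / n * std(g) for g in groups.values())
    return (var_all - var_k) / var_all


# Made-up (parameter value, carbonation depth in mm) pairs with one outlier.
pairs = [
    ("low", 4.0), ("low", 5.0), ("low", 6.0),
    ("high", 14.0), ("high", 15.0), ("high", 16.0), ("high", 90.0),
]
cleaned = remove_outliers(pairs)
print(len(cleaned), f"{varr(cleaned):.3f}")
```

The same cleaned list feeds both the dispersion measure and any correlation computed afterwards, mirroring the requirement that both indices use one consistent dataset.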
For each child dataset D_l, about 2000 to 2600 sets of valid data were included. The final results of the evaluation are the mean of the results of the three child datasets, as shown in Figure 2.
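The two indices can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, the standard deviation is assumed to use the population form (division by N), Var_k is assumed to be the N_i/N-weighted sum of the group standard deviations, and VARR_k is assumed to be the relative reduction (Var − Var_k)/Var.

```python
import math
from collections import defaultdict


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            result[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    # Spearman's coefficient is Pearson's coefficient of the rank vectors.
    return pearson(ranks(x), ranks(y))


def std(values):
    # Assumption: population standard deviation (divide by N).
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def quantile(sorted_vals, q):
    # Linear interpolation between neighbouring order statistics.
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def iqr_filter(pairs):
    """Keep (k, depth) pairs whose depth is within 1.5 IQR of the quartiles."""
    depths = sorted(d for _, d in pairs)
    q1, q3 = quantile(depths, 0.25), quantile(depths, 0.75)
    iqr = q3 - q1
    return [(k, d) for k, d in pairs if q1 - 1.5 * iqr <= d <= q3 + 1.5 * iqr]


def varr(pairs):
    """Assumed form: (Var - Var_k) / Var with Var_k = sum(N_i / N * Var_ik)."""
    depths = [d for _, d in pairs]
    groups = defaultdict(list)
    for k, d in pairs:
        groups[k].append(d)
    var_k = sum(len(g) / len(depths) * std(g) for g in groups.values())
    return (std(depths) - var_k) / std(depths)  # undefined if all depths are equal
```

In this sketch, a parameter that fully explains the depth yields a VARR_k of one, while a parameter unrelated to the depth yields a value near zero.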

Results and Discussions
Based on the above discussions, the CORR_k, VARR_k, and I_k of each parameter k can be calculated, as shown in Figure 3. The parameters can be divided into three parts (Parts A, B, and C) by the blue and red dashed lines according to their I_k. The compressive strength f was the most important factor. Figure 3 also exhibits the CORR_k and VARR_k of the parameters. The compressive strength f had the highest CORR_k and a high VARR_k, which implied that its correlation with carbonation depth was the highest and that it could also largely reduce the uncertainty of carbonation depth. The aggregate-cement ratio p_a/p_Cem had the highest VARR_k and could thus reduce the uncertainty to the lowest value, while it only had a medium CORR_k.

The parameters in Part A had a high I_k and also had a high CORR_k and VARR_k (Figure 3). In Part B, parameters that scored high on one of the two indices usually scored low on the other, which implied that CORR_k and VARR_k revealed different aspects of the influence of the factors.

In terms of mechanism, it is easy to understand why the compressive strength f had the highest CORR_k. For example, Papadakis et al. [37] proposed a function, Equation (10), to calculate the compressive strength of fly ash concrete in terms of chemical reactions. Equation (10) shows that the water content is negatively related to the concrete strength f, and the strength is in turn negatively related to carbonation depth. Further, it is believed that the effect of the water content on porosity is also the main reason for its influence on f. In addition, Equation (10) demonstrates that replacing cement with fly ash lowers the strength, whereas replacing the aggregate with fly ash increases it. This may not hold in all situations, as fly ash mainly affects the early strength. Fly ash and other supplementary cementitious materials also have a similar correlation with carbonation depth. In sum, the compressive strength f varies consistently with carbonation depth.
The results also showed that p_a/p_Cem had the highest VARR_k, that is, the best performance in reducing the uncertainty of carbonation depth.

As the correlation between factors reflects the possibility that one factor can be replaced by others and I_k denotes the importance, the performance of different factor groups can be estimated through the following steps:

1. Determine the number of parameters included in the group;
2. Assume that one group consists of m factors, sort all factors from largest to smallest according to their I_k, and calculate S_m from R_i and I_i, where R_i implies the possibility that factor i cannot be replaced by previous factors and I_i denotes the I_k of factor i. R_i is calculated from r_{i,j} and H, where r_{i,j} denotes Spearman's correlation coefficient between i and j, and H is used to make sure that 1 − r_{i,j} ≥ 0; i.e., if 1 < r_{i,j}, H = 0; otherwise, H = 1. It is noted that R_1 should be equal to one;
3. Sort all groups from largest to smallest according to their S_m.
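The steps above can be sketched as follows. The exact expressions for S_m and R_i are not reproduced in this text, so the sketch assumes S_m to be the sum of R_i·I_i over the group and R_i to be the product of max(0, 1 − |r_{i,j}|) over all previously ranked factors j; both forms, as well as the function name `group_score`, are assumptions made for illustration only.

```python
def group_score(importance, corr):
    """Score a factor group.

    importance: dict mapping factor name -> I_k
    corr: dict mapping (factor_a, factor_b) -> Spearman's coefficient
    Assumed forms: S_m = sum(R_i * I_i), R_i = prod(max(0, 1 - |r_ij|)).
    """
    # Step 2: sort factors from largest to smallest importance.
    ordered = sorted(importance, key=importance.get, reverse=True)
    score = 0.0
    for pos, factor in enumerate(ordered):
        r_i = 1.0  # R_1 equals one because no factor precedes the first
        for prev in ordered[:pos]:
            r = corr.get((factor, prev), corr.get((prev, factor), 0.0))
            r_i *= max(0.0, 1.0 - abs(r))  # clamp plays the role of H
        score += r_i * importance[factor]
    return score


# Hypothetical importance values and correlation, for illustration only.
example = group_score({"f": 1.0, "a/c": 0.5}, {("f", "a/c"): 0.5})
```

Under these assumptions, a second factor that is strongly correlated with the first adds little to the score, which reflects the idea that it can largely be replaced by the factor already in the group.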
As shown in Table 3, several combinations were provided, and groups containing the same number of factors have similar S_m. In the next part, this paper discusses the effectiveness of these combinations via machine learning methods.

Validation of Suggested Parameters
To verify the validity of the parameter combinations listed in Table 3, machine learning (ML) methods were used in this section. It is noted that environmental factors such as temperature were also included in the machine learning models. To improve the reliability, three ML approaches were used: support vector regression (SVR), XGBoost, and deep neural networks (DNN). Current ML techniques combine datasets and algorithms to find the relationship between parameters and the target. Once appropriate parameters are selected, models usually have a very high accuracy [38]. Therefore, ML models were used in this study to investigate whether the suggested combinations had significant advantages in predicting carbonation depth. As readers might be unfamiliar with these ML methods, brief introductions are given below.

ML Methods
Conventional regression or fitting algorithms such as linear regression first assume a formula with undetermined coefficients, for example, x = w·t + b. An optimization algorithm then adjusts the values of the undetermined coefficients (w and b) to minimize the error between the true carbonation depth x and the predicted value x′ (e.g., by minimizing loss = (x − x′)²). These methods generate linear models. For complex problems, researchers need to guess the basic form of the fitted curves.
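As a minimal sketch of this idea, the coefficients w and b of x = w·t + b can be adjusted by gradient descent on the mean squared error. The function name, learning rate, and number of steps are illustrative choices and are not taken from this study.

```python
def fit_line_gd(t, x, lr=0.01, steps=5000):
    """Fit x = w * t + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(t)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w * t_i + b - x_i)^2) with respect to w and b.
        grad_w = sum(2 * (w * ti + b - xi) * ti for ti, xi in zip(t, x)) / n
        grad_b = sum(2 * (w * ti + b - xi) for ti, xi in zip(t, x)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic data generated from x = 2 t + 1 (illustrative, not measured data).
w, b = fit_line_gd([0, 1, 2, 3], [1, 3, 5, 7])
```

The fitted line recovers the coefficients of the synthetic data, but the functional form itself had to be chosen in advance, which is the limitation noted above.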
SVR uses the kernel function K(t_1, t_2) = φ(t_1)·φ(t_2) to map a low-dimensional space into a high-dimensional space [38]. Nonlinear curves in the low-dimensional space can therefore be fitted by a hyperplane. In addition, the loss function used by SVR is highly robust owing to the relaxation coefficient ε: if the gap between x and x′ is lower than ε, the error is ignored. Detailed discussions and tutorials of SVR can be found in [39]. In the SVR expression, w = ([â] − [a])·φ(t), where t is the vector of the parameters of the dataset, [â] and [a] are the diagonal matrices of undetermined coefficients, and φ(t)·φ(t) denotes the kernel function.
In this study, the radial basis kernel function K(t_1, t_2) = exp(−γ‖t_1 − t_2‖²) was used. With regard to the widely noted overfitting problem in ML models, SVR also has one important characteristic: the ‖w‖² term in its regularized risk function helps it avoid overfitting.
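The two ingredients of SVR described above can be written compactly. The sketch below shows a radial basis kernel and a tolerance-based loss in which errors smaller than the relaxation coefficient (named `eps` here) are ignored; the function names and the numerical values are illustrative.

```python
import math


def rbf_kernel(t1, t2, gamma):
    """Radial basis kernel exp(-gamma * ||t1 - t2||^2) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(t1, t2))
    return math.exp(-gamma * squared_distance)


def eps_insensitive_loss(x_true, x_pred, eps):
    """Zero inside the tolerance band, linear in the excess error outside it."""
    return max(0.0, abs(x_true - x_pred) - eps)


# Identical inputs give a kernel value of one; distant inputs approach zero.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [10.0, 10.0], gamma=0.5)
```

Because small deviations contribute nothing to the loss, isolated noisy measurements inside the tolerance band do not pull the fitted function toward them.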
Different from SVR, XGBoost is a boosting algorithm [40]. XGBoost first develops a weak regression model; in this paper, the regression tree model F_0(t) was used. The next weak model F_1(t) is then built to reduce the error between x and the prediction x′ of the tree model. This step is repeated many times, and after the mth repetition, the predicted carbonation depth x′ is:

$$ x' = F_0(t) + F_1(t) + \cdots + F_m(t) $$

The mean square error (MSE) was used to evaluate the error between x and x′ and is calculated by:

$$ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - x'_i)^2 $$

where N is the number of samples. To avoid overfitting, the L2 regularization term was used in the training of the XGBoost models. Detailed information on XGBoost is given in [40].
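The boosting idea can be illustrated with a minimal sketch that repeatedly fits a one-split regression tree (a stump) to the residuals of the current prediction. This is plain gradient boosting for squared error on a single feature, not the XGBoost algorithm itself: the constant initial model, the shrinkage factor `lr`, and all function names are assumptions made for illustration, and regularization is omitted.

```python
def fit_stump(t, residual):
    """Best single-threshold split of a 1-D feature by squared error.

    Requires at least two distinct values in t.
    """
    best = None
    for thr in sorted(set(t))[:-1]:
        left = [r for ti, r in zip(t, residual) if ti <= thr]
        right = [r for ti, r in zip(t, residual) if ti > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda v: lm if v <= thr else rm


def boost(t, x, rounds=20, lr=0.5):
    """Sum of weak models, each fitted to the residual of the previous sum."""
    base = sum(x) / len(x)  # initial model: the mean of the targets
    models = []
    pred = [base] * len(x)
    for _ in range(rounds):
        residual = [xi - pi for xi, pi in zip(x, pred)]
        stump = fit_stump(t, residual)
        models.append(stump)
        pred = [pi + lr * stump(ti) for pi, ti in zip(pred, t)]
    return lambda v: base + lr * sum(m(v) for m in models)


def mse(x, x_pred):
    return sum((a - b) ** 2 for a, b in zip(x, x_pred)) / len(x)


# Synthetic example (illustrative values, not data from this study).
t_demo = [1, 2, 3, 4, 5, 6]
x_demo = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
model = boost(t_demo, x_demo)
boosted_mse = mse(x_demo, [model(v) for v in t_demo])
baseline_mse = mse(x_demo, [sum(x_demo) / len(x_demo)] * len(x_demo))
```

Each round lowers the training error relative to the previous sum of models, which is why a regularization term is needed in practice to keep the ensemble from fitting noise.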
A DNN is similar to the human nervous system and is composed of many neurons. The neurons are arranged in layers, and neurons in different layers are connected. Every neuron in the network receives signals from the neurons linked to it. After being processed by the activation function of the neuron, the signals are passed to later neurons, as shown in Figure 4. The hyperbolic tangent activation function was used in this study. A DNN can approximate highly complex functions owing to the combination of many neurons and the nonlinearity of the activation function [41]. Figure 5 shows the structure of a DNN with six parameters. To avoid overfitting, the dropout approach was used [42]. Dropout randomly disables neurons with a probability of p during each period of training, and all neurons are used in the final model. To maintain the scale of the predicted value, the weights of the neurons in the final model are multiplied by the retention probability 1 − p. In this study, p = 0.3, meaning that, for each layer, 30% of the neurons were randomly disabled during each period of training. In addition, the L2 regularization term was also used to avoid overfitting.

Verification and Discussions
In this part, 1825 sets of data were used for verification. For each group, ML models were built to compute the MSE of the given combinations. Combinations with a low MSE were believed to be effective. To improve the stability, five-fold cross-validation was used: the dataset was evenly divided into five subsets, four of which were used to train a model while the remaining one was used to test it. This process was repeated five times with a different subset as the testing data each time. The final results were based on the mean results of the five models, as shown in Figure 6. In addition, data normalization was used for preprocessing. Table 4 shows the MSE results of the models. As shown in Table 4, Group 17-1 used all parameters. Parameters in Group 5-3 reflected the mix design of concrete.

For the groups listed in Table 3, the results showed that the MSE of the models decreased with the addition of new parameters. However, this effect declined as more parameters were appended. Compared with Group 17-1, Groups 5-1 and 5-2 had similar MSEs. In this study, five parameters can closely approximate the performance of all parameters. Furthermore, as using too many parameters would increase the complexity of the models, Group 5-2 showed a better accuracy than Group 17-1. Compared with Group 5-3, which reflected the mix design of concrete, Groups 3-1 to 3-3 showed a similar accuracy. All ML models showed that the method proposed in this paper can choose appropriate parameters and reduce the number of parameters required.
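The five-fold procedure can be sketched as follows. The helper names are hypothetical, the shuffling seed is arbitrary, and `fit` stands for any training routine that returns a prediction function; normalization is left out for brevity.

```python
import random


def kfold_indices(n, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_mse(features, targets, fit, k=5, seed=0):
    """Mean test MSE over k folds; each fold is the test set exactly once."""
    scores = []
    for test_idx in kfold_indices(len(targets), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(targets)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [targets[i] for i in train_idx])
        errors = [(targets[i] - model(features[i])) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k


def fit_mean(train_features, train_targets):
    """Placeholder model that always predicts the training mean."""
    mean_value = sum(train_targets) / len(train_targets)
    return lambda feature: mean_value


folds = kfold_indices(10)
score = cross_val_mse(list(range(10)), [3.0] * 10, fit_mean)
```

Replacing `fit_mean` with the training routine of a given model and restricting `features` to one parameter group would give the averaged MSE used to compare groups.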

Practical Carbonation Models for Existing Concrete Structures
In previous sections, factor groups were proposed and several ML models were developed. For existing concrete structures, it is very difficult to obtain parameters, as some original design information may be unavailable. According to Sections 2 and 3, this paper developed a prediction model containing necessary environmental factors and two concrete-related factors (compressive strength and aggregate-cement ratio) via neural network methods.
Neural network methods and the settings of the models were discussed in Section 3.2. This model included six input parameters (humidity, temperature, the concentration of CO2, compressive strength, aggregate-cement ratio, and time), four hidden layers containing 25 units each, and one output value (Figure 5). The dataset used for training and testing was the dataset described in Table 1. A total of 90% of the dataset was used for training and 10% was used for testing the model. It is noted that the concentration of CO2 ranged from 1% to 50%, the temperature ranged from 10 °C to 60 °C, the relative humidity ranged from 35% to 95%, and the carbonation time ranged from 1 day to 364 days. The results are shown in Figure 7. Most points were located in the blue area (±2.65 mm), and the mean error was 2.5 mm. A comparison with other empirical or theoretical models is not presented here, as ML models have generally been found to perform better on the testing dataset in many studies [43,44].
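The layer structure described above (six inputs, four hidden layers of 25 units with the hyperbolic tangent activation, and one output) can be sketched as a plain forward pass. The weights below are random and untrained, the linear output layer is an assumption, and the input values are hypothetical normalized numbers, so the sketch only illustrates the architecture.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Weight matrix (n_out rows of n_in weights) and a zero bias vector."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(seed=0):
    rng = random.Random(seed)
    sizes = [6, 25, 25, 25, 25, 1]  # inputs, four hidden layers, one output
    return [make_layer(sizes[i], sizes[i + 1], rng) for i in range(len(sizes) - 1)]


def forward(layers, inputs):
    """tanh on hidden layers; the last layer is left linear (an assumption)."""
    activation = list(inputs)
    for index, (weights, bias) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, bias)]
        is_output = index == len(layers) - 1
        activation = z if is_output else [math.tanh(v) for v in z]
    return activation[0]


def count_parameters(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


network = build_network()
# Hypothetical normalized inputs: RH, T, [CO2], f, a/c, t.
depth = forward(network, [0.7, 0.3, 0.2, 0.5, 0.4, 0.1])
```

A network of this size has 2151 trainable weights and biases, which is small enough to train on a dataset of a few thousand samples.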
To make the model more convenient, this part also aims to propose a practical model for the carbonation prediction of existing structures based on the ML model.

Establishment of the Practical Model
According to the ML model, six parameters were used to predict the carbonation depth; the practical model can thus be written as a function of these six parameters, where RH is the relative humidity, T denotes the temperature, [CO2] is the concentration of CO2, f is the compressive strength, a/c denotes the aggregate-cement ratio, and t is the carbonation time. A dataset was created to explore the relationship between carbonation depth and the parameters contained in the ML model. The results are shown in Figure 8.

Figure 8a shows the influence of relative humidity, where f denotes the compressive strength and a/c is the aggregate-cement ratio. The influence of the relative humidity had a peak value for two reasons: (1) carbonation reactions occur in the pore solution, and water is thus needed; (2) water can form films on the pore surface and then impede the diffusion of CO2 in pores. Since the data with a relative humidity below 50% were insufficient in comparison with the data with a relative humidity above 50%, only samples with a relative humidity above 50% were involved, and f_1(RH) can be assumed to be linear with an insignificant loss of accuracy. This also largely simplifies the calculation.

Figure 8b shows the influence of temperature on carbonation. High temperatures can raise the rate of diffusion of CO2 gas and accelerate chemical reactions. Arrhenius formulas are usually used to depict the effects of temperature on chemical reactions. As with relative humidity, a linear function was assumed to simplify the calculation and to approximate the effects of temperature on carbonation with an insignificant loss of accuracy.

Figure 8c shows the influence of CO2 concentration on carbonation. In the early stages of carbonation, the carbonation reaction rate increases with the CO2 concentration. However, the CaCO3 generated at excessively high CO2 concentrations will fill pores and impede the contact between Ca(OH)2 and CO2, which finally hinders carbonation. f_3([CO2]) was assumed to be a square root function.

Figure 8d shows the relationship between concrete strength and carbonation depth. The influence of compressive strength has been widely discussed in many studies, and a power function is usually used. Figure 8e shows that the influence of the aggregate-cement ratio can be approximated as a linear function. Considering that both the aggregate-cement ratio and the concrete strength are concrete-related parameters, the relationship between them in f_4(f, a/c) can be further divided into the linear combination form (A·g_1(f) + B·g_2(a/c) + C) or the product form (g_1(f)·g_2(a/c)). Figure 8e shows that with the increase in concrete strength, the effects of the aggregate-cement ratio on carbonation are weakened, which indicates that g_1(f)·g_2(a/c) can better depict this change.
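The reasoning behind the choice of the product form can be checked numerically. In the sketch below, g_1 is taken as a power function and g_2 as a linear function, as described above, but the exponent and all coefficients are hypothetical placeholders rather than the fitted values of the practical model.

```python
def g1(f, exponent=-1.0):
    """Power function of compressive strength (hypothetical exponent)."""
    return f ** exponent


def g2(ac, slope=1.0, intercept=0.5):
    """Linear function of the aggregate-cement ratio (hypothetical coefficients)."""
    return slope * ac + intercept


def additive_form(f, ac, A=1.0, B=1.0, C=0.0):
    return A * g1(f) + B * g2(ac) + C


def product_form(f, ac):
    return g1(f) * g2(ac)


def ac_effect(form, f, ac_low=3.0, ac_high=6.0):
    """Change in the model term when a/c rises from ac_low to ac_high at fixed f."""
    return form(f, ac_high) - form(f, ac_low)


# Effect of a/c at a low strength (20 MPa) and a high strength (60 MPa).
additive_low, additive_high = ac_effect(additive_form, 20.0), ac_effect(additive_form, 60.0)
product_low, product_high = ac_effect(product_form, 20.0), ac_effect(product_form, 60.0)
```

In the linear combination form the effect of a/c is the same at every strength, whereas in the product form it shrinks as the strength increases, which matches the trend reported for Figure 8e.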
The testing dataset and the training dataset used to determine the values of the coefficients of the final function were split from the same dataset given in Table 1, and the testing dataset constituted 10% of the total dataset. The median value of the errors is also listed in Table 5. Since the median value is not affected by outlier samples, it can better reflect the actual performance of the models. The median value was higher than the mean value, which means that there were some outlier experimental results in the dataset. Table 5 also shows that this practical model exhibited a better accuracy at each stage of the carbonation process.
To further verify the effectiveness of this model, this paper collected an extra natural carbonation dataset [49][50][51][52] to explore its accuracy. This dataset included 76 sets of data. The natural carbonation time ranged from 28 to 9125 days, and the locations involved the northern and southern regions as well as the central and western regions of China. It is noted that the curing conditions of the concrete specimens used in natural carbonation tests were not standard curing. Some of the specimens were placed in an indoor environment and others were placed in an outdoor environment. Previous studies suggested the carbonation depth should be multiplied by the coefficients of 2.81 and 1.50 for specimens in the indoor and the outdoor environment, respectively. Table 6 shows the final results. The mean absolute error was 1.56 mm; the practical model thus had a high accuracy.
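The evaluation on natural carbonation data can be sketched as follows. The sketch assumes that the curing coefficients quoted above are applied to the predicted depth before comparison with the measured depth; the function names are hypothetical and the numerical values in the example are illustrative, not taken from Table 6.

```python
# Coefficients quoted above for non-standard curing conditions.
CURING_COEFFICIENT = {"indoor": 2.81, "outdoor": 1.50}


def corrected_depth(predicted_mm, exposure):
    """Scale a predicted depth by the coefficient of its exposure condition."""
    return predicted_mm * CURING_COEFFICIENT[exposure]


def mean_absolute_error(measured, predicted):
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


# Illustrative values only (mm).
predictions = [corrected_depth(2.0, "outdoor"), corrected_depth(4.0, "outdoor")]
mae = mean_absolute_error([3.5, 5.0], predictions)
```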