Application of Novel Machine Learning Techniques for Predicting the Surface Chloride Concentration in Concrete Containing Waste Material

Structures located on the coast are subjected to the long-term influence of chloride ions, which cause the corrosion of steel reinforcements in concrete elements. This corrosion severely affects the performance of the elements and may shorten the lifespan of an entire structure. Even though experimental activities in laboratories might be a solution, they may also be problematic due to time and costs. Thus, the application of individual machine learning (ML) techniques has been investigated to predict surface chloride concentrations (Cc) in marine structures. For this purpose, the values of Cc in tidal, splash, and submerged zones were collected from an extensive literature survey and incorporated into the article. Gene expression programming (GEP), the decision tree (DT), and an artificial neural network (ANN) were used to predict the surface chloride concentrations, and the most accurate algorithm was then selected. The GEP model was the most accurate when compared to ANN and DT, which was confirmed by K-fold cross-validation and by the values of the linear correlation coefficient (R²), mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE). As is shown in the article, the proposed method is an effective and accurate way to predict the surface chloride concentration without the inconveniences of laboratory tests.


Introduction
Reinforced concrete (RC) structures are known for their longevity and resistance, which are very valuable in civil engineering practice [1]. This includes the construction of harbor docks, marine structures, and coastal roads. Concrete structures with both steel and composite reinforcement provide strong stability against corrosive action in an alkaline environment [2]. This is mainly due to the passive oxide film around the reinforcing steel, which can be affected by the presence of chloride ions in sea and coastal structures [3]. These ions first accumulate on the surface of an RC structure and then slowly penetrate into the concrete element [4]. This ultimately destroys the oxide film and provokes the corrosion of steel, which, in turn, leads to spalling and cracking of the concrete and a reduction of the load-carrying capacity of RC structures [5]. This process significantly reduces the serviceability of an RC structure, so that it does not last for the amount of time for which it was designed. The corrosion process of steel bars embedded in porous concrete is presented in Figure 1. Surface chloride concentrations affect the corrosion of steel and the performance of entire buildings and have a devastating effect in civil engineering [6-8]. The durability of a structure is an important parameter in RC buildings for maintaining an adequate service life of concrete structures [9]. Thus, predicting the design service life of concrete structures has become more popular in recent times. However, all prediction models are different and depend on influential factors and the destruction mechanisms of structures [10]. The appearance of chloride ions is a major issue in marine environments and has a very negative effect on RCC structures [11,12].
The possible transmission mechanism of chloride ions into concrete is dependent on the zone where the element is, i.e., the tidal, splash, submerged, and atmospheric zones [13]. The transport of chloride ions in concrete is mainly due to diffusion or the absorption mechanism [14]. The diffusion mechanism that occurs in the submerged zone is due to the saturation of the concrete. However, the absorption mechanism takes place in the tidal and splash zones. This allows the transport of chloride ions into RCC structures. However, the ingress of chloride ions in the atmospheric zone is quite complex when compared to the other zones [15]. This is because of factors associated with the zone, such as the direction and speed of the wind, the salinity level of the water reservoir, and the distance between the sea and the RCC structures. This study focuses on three zones, and omits the atmospheric zone due to its previously described limitations.
The ingress movement of chloride ions in RCC structures is calculated using Fick's second law of diffusion, as shown in Equation (1). This is mostly used when designing the service life of structures located in a marine environment [16].
$$C(x,t) = C_o + (C_s - C_o)\left[1 - \operatorname{erf}\left(\frac{x}{2\sqrt{D\,t}}\right)\right] \qquad (1)$$
where C(x,t) is the chloride concentration at distance x from the surface after the time of exposure t, mol/m³; C_o is the concentration of chloride ions in concrete at the initial stage of their occurrence, mol/m³; x is the depth from the exposed concrete surface, mm; D is the coefficient of apparent chloride diffusion, mm²/s; C_s is the apparent surface chloride amount, mol/m³; and erf is the error function. C_o is constant and is not affected by any type of concrete. However, the movement of chloride ions in the marine environment is determined by C_s and D. The coefficient of apparent chloride diffusion is a material property that depends primarily on time. It can be determined based on information referring to the composition and microstructure of a material. The C_s concentration in the diffusion law has a complex nature, as it depends not only on a material's properties, but also on the environmental conditions and time. This creates ambiguity when making an accurate prediction of chloride ingress in a marine environment. Therefore, studies are needed to build a strong model that uses machine learning approaches and can accurately predict the amount of apparent surface chloride.
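The error-function solution of Fick's second law commonly written with the symbols defined above, C(x,t) = Co + (Cs - Co)[1 - erf(x / (2 sqrt(D t)))], can be evaluated numerically as a quick check of how Cs and D shape a chloride profile. The following is a minimal sketch only: the function name and all numerical values (Cs, Co, D, exposure time) are illustrative assumptions and are not taken from the article's dataset.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def chloride_concentration(x_mm, t_s, c_s, d_mm2_s, c_o=0.0):
    """Chloride concentration at depth x_mm (mm) after exposure time t_s (s).

    Uses the error-function solution of Fick's second law for a
    semi-infinite medium with constant surface concentration c_s and
    constant apparent diffusion coefficient d_mm2_s (mm^2/s).
    """
    if t_s <= 0 or d_mm2_s <= 0:
        raise ValueError("exposure time and diffusion coefficient must be positive")
    z = x_mm / (2.0 * math.sqrt(d_mm2_s * t_s))
    return c_o + (c_s - c_o) * (1.0 - math.erf(z))


# Assumed, purely illustrative inputs: 10 years of exposure,
# D = 1e-6 mm^2/s, Cs = 100 mol/m^3, Co = 5 mol/m^3.
depths_mm = (0, 10, 25, 50)
profile = [
    chloride_concentration(x, 10 * SECONDS_PER_YEAR, c_s=100.0, d_mm2_s=1e-6, c_o=5.0)
    for x in depths_mm
]
```

At the surface (x = 0) the function returns Cs, and far from the surface it tends to Co, which is why an accurate estimate of Cs matters for any service-life calculation based on this law.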
The aim of this study is to predict the surface chloride concentration in marine structures through the application of the gene expression programming algorithm. The levels of accuracy of GEP, DT and an ANN were evaluated and compared in order to choose the most accurate algorithm for the purpose of the study. The most effective algorithm was GEP when compared to DT and ANN. Statistical analyses and K-fold cross validation were used to check the accuracy and validity of all the models. The usability of the proposed algorithm was also compared with other algorithms in the literature, which was important for the purpose of this research.

Apparent Surface Chloride Content
The value of the surface chloride concentration Cs is an important variable when describing the transmission of chloride into structures [17]. It is obtained on site, or during laboratory investigations, as depicted in Figure 2. It can be seen that a convection zone is present in the investigated concrete element of the marine structure; however, there is no such zone in the concrete element investigated during the laboratory tests. The value of Cs was obtained by fitting a curve to the bulk diffusion profile of chloride. This variable is the most significant, as it describes the aggression of chloride, quantitative durability, and the prediction of service life of RCC structures.
It is worth mentioning that the Cs value used in the calculations is considered to be constant for the zone in which an element is located. This creates uncertainty due to the complex nature of chloride ion transmission, as it depends on many factors, such as material properties (cement composition, binder properties, and water-to-cement ratio), and environmental factors (zonation, chloride content, depth, relative humidity, and temperature). Many attempts have been made to predict apparent chloride concentrations using logarithmic and exponential functions, or by correlating a concentration with the binder-to-water ratio, material variables, and environmental effects. However, there is still no accurate prediction model that is based on only a small number of variables.
In contrast, when using machine learning algorithms, prediction models are more accurate and might be successfully used [18]. In this article, 642 data samples obtained from the literature survey [19] were used to predict surface chloride concentrations through the use of machine learning algorithms.

Machine Learning (ML) and Ensemble Learning (EL) Approaches
Machine learning (ML) is used as an efficient way to predict the mechanical properties of concrete, and its exemplary use is illustrated in Table 1. ML algorithms are more effective than simple correlation models due to the fact that they use the values of more than one variable to predict surface chloride concentrations. Artificial neural networks (ANN), the decision tree (DT), support vector machines (SVM), random forests (RF), gene expression programming (GEP), and deep learning (DL) are the most common algorithms for analyzing the mechanical properties of concrete [20]. Behnood et al. [21] used an ANN with a multi-objective grey wolves (MOGW) optimizer for predicting the mechanical response of silica fume concrete. Getahun et al. [22] used an ANN algorithm to very accurately predict the compressive strength and tensile strength of waste concrete. Ling et al. [23] predicted the compressive strength of concrete in marine environments using SVM and then compared the obtained results with ANN and DT models. The SVM was shown to be the most accurate. Zaher et al. [24] predicted the compressive property of lightweight foamed concrete using various machine learning techniques. The authors concluded that the extreme learning machine (ELM) was the most accurate and it was successfully applied for predicting the compressive strength of concrete. Woubish et al. [25] used a machine learning approach for the assessment of the durability of reinforced concrete structures. The authors revealed that machine learning techniques are useful and that they play a substantial role in predicting the durability of structures when compared to functional CO2 and Cl⁻ ingress models. Suguru et al. [26] developed a model to automatically detect cracks in concrete structures with the use of machine learning. Photographs of concrete structures were used as learning data, and then deep learning was used to detect the cracks. Similarly, Wassim et al. [27] indicated that machine learning models have a high level of accuracy.

Description of the Obtained Data
The data used to model the prediction of the surface chloride concentration are taken from published literature. These data were taken from articles describing the surface chloride concentrations in the tidal zone [48-62], splash zone [49,50,54,55,60,63,64] and submerged zone [54,56,60,65-68]. The data consist of 12 inputs (cement, fine and coarse aggregate, silica fume, fly ash, blast furnace slag, superplasticizer, water, exposure time, annual mean temperature, chloride content, and exposure time) and one output (surface chloride concentration). Python in a Jupyter notebook was used to describe the distribution of each input parameter that was applied in the prediction model, which is presented in Figure 3. It is well established that the performance of a model is significantly affected by its variables [69]. The data variables that were used for modeling, with their ranges, are listed in Tables 2 and 3.

Machine Learning Algorithms
This section describes the algorithms used when modeling the prediction of surface chloride concentrations in the concrete elements of marine structures. The prediction of Cs was made using ANN, DT and GEP. A detailed flowchart of the used methodology is presented in Figure 4.
Materials 2021, 14, x FOR PEER REVIEW 6 of 18
The decision tree is a supervised learning algorithm that is used to solve both regression and classification problems. It solves a problem by means of a tree-like structure. In this algorithm, each node is divided into two sub-nodes, and the division continues up to the end nodes, which, in turn, determine the shape of the decision tree. The identification of the attribute to be used at the root node and at each subsequent level is considered to be a challenging task when using the decision tree algorithm. The whole procedure is called "the selection of attributes". The division in the nodes is made on the basis of criteria. In the case of regression issues, the division is made by determining the point of separation, while in the case of classification issues, the criterion for division is the value of one of the classes. Different learning algorithms were used to split the nodes in order to obtain a convergence of the results. Moreover, a convergence of the results can also be obtained by using n trees in the algorithm.
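The "point of separation" used for regression splits can be illustrated with a small sketch. It assumes the common criterion of minimizing the summed squared error of the two child nodes; the article does not state which splitting criterion was used, so this choice, the function name `best_split`, and the toy data are assumptions made for illustration only.

```python
def best_split(xs, ys):
    """Return (threshold, error) for the split x <= threshold that minimizes
    the summed squared error of the two resulting child nodes."""

    def sse(values):
        # Sum of squared deviations from the mean of one child node.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data (assumed): one input feature and a target with two clear levels.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
threshold, error = best_split(xs, ys)
```

A full regression tree would apply the same search recursively to each child node and over every input variable, stopping at the end nodes.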
ANN is considered to be one of the most popular machine learning algorithms. An ANN can be used for learning, predicting and making decisions based on input data. The basic datasets for ANN models include training, testing, and validation. During the training process, the ANN learns the patterns in the data on which the prediction model is based. The validation process evaluates the accuracy of the trained model. According to the literature survey, the feed forward and the feed forward back propagation (FFBP) neural networks are the most commonly used when solving engineering problems [70]. These types of ANN consist of input, hidden and output layers that contain neurons [71]. Each neuron is activated using an activation function, as can be seen in Figure 5.
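The layer structure described above can be sketched as a single forward pass through a network with one hidden layer. This is an illustration under stated assumptions: the sigmoid activation, the linear output neuron, the layer sizes, and all weights are hypothetical and are not the configuration used in the article, and training by back propagation is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation function, assumed here for the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for every neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (sigmoid) -> output layer (linear)."""
    hidden = dense(inputs, hidden_w, hidden_b, sigmoid)
    return dense(hidden, out_w, out_b, lambda z: z)


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output.
prediction = forward(
    [0.3, 0.7],
    hidden_w=[[0.5, -0.2], [0.1, 0.4]],
    hidden_b=[0.0, 0.1],
    out_w=[[1.0, -1.0]],
    out_b=[0.25],
)
```

In a feed forward back propagation network, the prediction error from such a pass is propagated backwards to adjust the weights and biases.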
GEP is an evolutionary algorithm that designs computer programs and models. These programs usually have a tree structure that is capable of modifying its form (size, shape and arrangement), as in the case of chromosomes. For this reason, GEP, being a genotype-phenotype system, can be much more effective when compared to adaptive techniques. The programming language of GEP is known as the Karva language and is similar to the LISP languages. The stages of GEP are shown in Figure 6. GEP has many advantages over classical regression techniques, in which some functions are initially defined and then analyzed. In GEP, however, no predefined function is taken into consideration.
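The genotype-phenotype relation in GEP can be illustrated by decoding a Karva expression (K-expression) into an expression tree: the symbols of the linear gene are read from left to right and placed in the tree level by level, with each function taking as many children as it has arguments. The sketch below is a simplified illustration; the function set, the symbol "Q" for the square root, and the example gene are assumptions, and the evolutionary stages (selection, mutation, recombination) are not shown.

```python
import math
from collections import deque

# Assumed function set: symbol -> (number of arguments, implementation).
OPERATORS = {
    "+": (2, lambda a, b: a + b),
    "-": (2, lambda a, b: a - b),
    "*": (2, lambda a, b: a * b),
    "/": (2, lambda a, b: a / b),
    "Q": (1, math.sqrt),
}


def decode(kexpr):
    """Build the expression tree of a K-expression, filling it breadth-first.

    Each node is a list [symbol, children]. Symbols left over after the
    tree is complete form the non-coding tail and are ignored.
    """
    symbols = iter(kexpr)
    root = [next(symbols), []]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        arity = OPERATORS.get(node[0], (0, None))[0]
        for _ in range(arity):
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
    return root


def evaluate(node, variables):
    """Evaluate a decoded tree for the given variable values."""
    symbol, children = node
    if symbol in OPERATORS:
        args = [evaluate(child, variables) for child in children]
        return OPERATORS[symbol][1](*args)
    return variables[symbol]


# Example gene (assumed): decodes to sqrt((a + b) * (c - d)).
tree = decode("Q*+-abcd")
value = evaluate(tree, {"a": 1.0, "b": 3.0, "c": 6.0, "d": 2.0})
```

Because only the coding part of the gene is expressed, a mutation in the gene can change the size and shape of the resulting tree while the gene itself keeps a fixed length.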
Machine learning algorithms, such as ANNs and ensemble models, are successfully used to predict the concentration of ions in various conditions. It is possible to predict the chloride concentration in columns at different heights above the water level. In such cases, the error of each individual estimation is less than 20% [72]. Moreover, the service life of a concrete element can be modeled based on the chloride concentration at different depths in a sample. Linear regression was used for this purpose. The accuracy expressed by the linear correlation coefficient varied from 0.83 to 1.0 and was dependent on the zone in which the element was located [73]. In turn, ensemble models were used to predict the surface chloride concentration of marine concrete elements with good accuracy. This accuracy was confirmed by a relatively high value of the linear coefficient of correlation, R² equal to 0.83 [74]. Even though algorithms were previously used to predict the chloride concentration in concrete elements, there are no records of using gene expression programming for predicting chloride concentrations on the surface of marine concrete elements.

Statistical Analysis
The results of the statistical analyses (presented as a relation between the measured value of Cs and the value identified by the machine learning algorithms) and the error distribution charts are presented in Figure 7. The ANN gives a strong relation, with R² = 0.84, as can be seen in Figure 7a, with its error distribution shown in Figure 7b. The error distribution in Figure 7b illustrates that the average error of the training set is equal to 0.108 MPa. Moreover, the maximum and minimum error values of the training set were noted as 0.801 MPa and 0.0035 MPa, respectively. In addition, 69.1 percent of the data showed an error of less than 0.10 MPa, and 64.3 percent of the data showed an error between 0.01 MPa and 0.10 MPa, as illustrated in Figure 7b.
The prediction of surface chloride concentration by employing the GEP algorithm yields a strong relationship between the targeted and output values of chloride concentrations, as shown in Figure 7c. It is also clear that this model gives a better response with less variance. GEP, with an R² value of 0.88, had a better accuracy when compared to the ANN (R² equal to 0.84) and DT (R² equal to 0.72), as depicted in Figure 7. In turn, Figure 7d indicates the error distribution of the GEP model. It can be seen that 72.86 percent of the data showed an error between 0.01 MPa and 0.10 MPa, and that the average error of the training set was equal to 0.080 MPa. Moreover, the maximum and minimum errors were equal to 0.76 MPa and 0.004 MPa, respectively.
The prediction of surface chloride concentrations by the DT model, plotted as a linear regression of predicted against measured values, is illustrated in Figure 7e. The algorithm yields a poor correlation when predicting targets, which is indicated by the value of R² being equal to 0.72. In addition, Figure 7f presents the error distribution of this model and shows an average value of error equal to 0.[…]

K-Fold Cross Validation
The actual performance of the models was analyzed using the statistical cross-validation method. In the K-fold validation test, the data are shuffled randomly and split into k groups. In this case, the data were divided into 10 groups, of which nine were used for training, and one was used for validation of the model. This was then repeated ten times, so that each group was used once for validation, in order to obtain the average value of these repetitions. When using the 10-fold cross-validation method, it is possible to obtain a relatively high performance of a model. Moreover, a statistical check was also applied to evaluate the model [72]. This statistical analysis is a check that shows the response of the model towards the prediction, as illustrated in the form of the equations listed below (Equations (2)-(5)).
$$R^2 = \left(\frac{\sum_{i=1}^{n}\left(ex_i - \overline{ex}\right)\left(mo_i - \overline{mo}\right)}{\sqrt{\sum_{i=1}^{n}\left(ex_i - \overline{ex}\right)^2 \sum_{i=1}^{n}\left(mo_i - \overline{mo}\right)^2}}\right)^2 \qquad (2)$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|ex_i - mo_i\right| \qquad (3)$$
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(ex_i - mo_i\right)^2 \qquad (4)$$
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(ex_i - mo_i\right)^2} \qquad (5)$$
where ex_i is the experimental value; mo_i is the predicted value; \overline{ex} is the mean experimental value; \overline{mo} is the mean predicted value obtained by the model; and n is the number of samples.
The correlation coefficient (R²), mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE) were all used to evaluate the result of cross-validation, as can be seen in Figure 8. […] model gave an average R value of 0.82, with maximum and minimum values being 0.93 and 0.68, as illustrated in Figure 8b,c. The values of the errors of all the models were relatively low in the case of the validation process. For GEP, they were: MAE = 7.03 MPa, MSE = 6.12 MPa, and RMSE = 2.46 MPa (Figure 8a). In the case of the ANN, they were: MAE = 7.56 MPa, MSE = 6.60 MPa, and RMSE = 2.54 MPa (Figure 8b); and in the case of the decision tree, they were: MAE = 7.66 MPa, MSE = 6.85 MPa, and RMSE = 2.61 MPa (Figure 8c). Moreover, the K-fold cross validation of all the applied models and statistical checks are listed in Tables 4 and 5, respectively.
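The 10-fold procedure and the four statistical checks can be sketched with the standard library alone. The sketch makes several assumptions that are not confirmed by the article: R² is computed as the squared linear (Pearson) correlation between experimental and predicted values, the folds are formed by a seeded shuffle, and the function names and the sample values at the end are hypothetical.

```python
import math
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.

    The indices are shuffled once and dealt into k groups; each group
    serves exactly once as the validation set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


def regression_metrics(experimental, predicted):
    """Return R2 (squared Pearson correlation), MAE, MSE and RMSE."""
    n = len(experimental)
    mean_ex = sum(experimental) / n
    mean_mo = sum(predicted) / n
    cov = sum((e - mean_ex) * (m - mean_mo) for e, m in zip(experimental, predicted))
    var_ex = sum((e - mean_ex) ** 2 for e in experimental)
    var_mo = sum((m - mean_mo) ** 2 for m in predicted)
    r = cov / math.sqrt(var_ex * var_mo)
    mae = sum(abs(e - m) for e, m in zip(experimental, predicted)) / n
    mse = sum((e - m) ** 2 for e, m in zip(experimental, predicted)) / n
    return {"R2": r ** 2, "MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# 642 samples, as in the dataset size quoted in the article, split into 10 folds.
splits = list(k_fold_indices(642, k=10))

# Hypothetical experimental and predicted values for a single fold.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

In practice, a model would be fitted on each training split, the metrics would be computed on the matching validation split, and the ten results would then be averaged.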

Discussion
This research describes the predictive performance of chloride concentrations on the surface of marine structures using individual supervised machine learning algorithms. The three machine learning algorithms used during the investigation were: artificial neural network, decision tree, and gene expression programming. GEP, with an R² value of 0.88, was the most accurate when compared to the ANN (R² equal to 0.84) and DT (R² equal to 0.72). This algorithm was also compared with those used in [18], and the results of the accuracy comparison are presented in Figure 9.
It can be seen from Figure 9 that the proposed GEP algorithm accurately describes chloride surface concentrations when compared to other algorithms. This is confirmed by the high value of the linear correlation coefficient R², which is on a comparable level to other algorithms used in the literature.

Conclusions
This research describes the predictive performance of chloride concentrations on the surface of marine structures using individual supervised machine learning algorithms. The three algorithms-the artificial neural network, the decision tree, and gene expression programming-were used for the investigations. The most accurate among these three was GEP, which was proved by the fact that it obtained the highest value of the linear correlation coefficient and the lowest values of the parameters describing the errors of prediction. The following conclusions can be drawn:

• The GEP algorithm is very effective for predicting chloride surface concentrations and can be successfully used for this purpose. This was also proved by comparing it with other algorithms used in the literature.
• The presented method does not depend on the zone in which it is used (except the atmospheric zone where the transport of chloride ions is more difficult to describe).
• The high performance of the GEP algorithm was also proved using k-fold validation.
The chloride surface concentration model, which uses gene expression programming, was proposed in this work. It can be successfully used without the need to invest significant time and money, as is the case with long-term experiments. However, there is still room for improvement:
• The dataset can be expanded with laboratory tests, field tests, or numerical analyses using different upsizing methods (e.g., Monte Carlo).
• There is still the possibility of expanding the dataset with the results of surface chloride concentrations obtained for elements located in the atmospheric zone.
• Due to the fact that there is no model in the literature that is 100% accurate, there is still the possibility of using a different, more accurate, algorithm.