Mixture Optimization of Recycled Aggregate Concrete Using Hybrid Machine Learning Model

Recycled aggregate concrete (RAC) contributes to mitigating the depletion of natural aggregates, alleviating the carbon footprint of concrete construction, and averting the landfilling of colossal amounts of construction and demolition waste. However, the complexity of RAC mixture optimization, due to the variability of recycled aggregates and the limited accuracy of compressive strength estimates, requires novel and sophisticated techniques. This paper aims at developing state-of-the-art machine learning models to predict the RAC compressive strength and optimize its mixture design. Results show that the developed models, including Gaussian processes, deep learning, and gradient boosting regression trees, achieved robust predictive performance, with the gradient boosting regression trees yielding the highest prediction accuracy. Furthermore, a particle swarm optimization (PSO) algorithm was coupled with the gradient boosting regression trees model to optimize the mixture design of RAC for various compressive strength classes. The hybrid model achieved cost-saving RAC mixture designs with a lower environmental footprint for the different target compressive strength classes. The model could be further harnessed to achieve sustainable concrete with optimal recycled aggregate content, least cost, and least environmental footprint.


Introduction
Portland cement concrete is the most widely used construction material and the most consumed industrial product in the world. However, increasingly frequent local shortages of natural aggregates (NA) and the enormous environmental footprint of their extraction are global concerns for Portland cement concrete production. The global annual consumption of NA is estimated at 8 to 12 billion tons [1][2][3].
Moreover, there has been a growing worldwide shortage of available landfilling sites to dispose of construction and demolition waste (CDW). In Canada, it is estimated that 9 million tons of CDW are produced every year; despite the country's vast land area, its biggest cities are facing serious CDW disposal challenges [4]. Likewise, several reports forecast that landfills in Hong Kong will be overfilled in about eight years [2]. The use of recycled aggregate (RA) offers a potential solution to overcome these drawbacks of concrete production. Among the most promising advantages of RA are reductions in CO2 emissions and the beneficial reuse of CDW in value-added applications. About 75% of construction waste, including concrete and masonry, can be reused in concrete production [5].
However, the inclusion of RA in concrete has been reported to reduce the compressive strength [6]. Several researchers have explored the factors influencing recycled aggregate concrete (RAC) compressive strength [7][8][9][10]. The moisture content of RA, the replacement level of NA, and the water-to-cement (w/c) ratio were found to be the parameters with the highest effect on RAC compressive strength.

Khademi et al. [15] found that both artificial neural network (ANN) and adaptive neuro-fuzzy inference system (ANFIS) models were powerful in modeling the compressive strength of RAC, with coefficients of determination of 0.9185 and 0.9075, respectively. They also performed a sensitivity analysis in which they concluded that the inclusion of more input features resulted in higher model predictive accuracy. Likewise, Naderpour et al. [1] developed an ANN model to predict the compressive strength of RAC with a coefficient of determination of 0.829 for the testing dataset. They also performed a sensitivity analysis via the weights of the input features and found that the water absorption of aggregates and the water-to-total materials mass ratio had the highest importance. In another study, Deng et al. [16] built a convolutional neural network to predict the compressive strength of RAC, carrying out experimental work alongside the development of the deep learning model. The authors compared the convolutional neural network with a support vector machine and a back propagation neural network, concluding that the convolutional neural network has superior capability to predict the compressive strength of RAC. They used the relative error percentage to measure the performance of the models; the errors for the convolutional neural network, back propagation neural network, and support vector machine were 3.65%, 6.63%, and 4.35%, respectively.

Deshpande et al. [11] compared three different techniques, ANN, model tree, and non-linear regression, and studied the influence of adding non-dimensional parameters as input features. To accomplish this analysis, they created 10 different models for each algorithm, each adding a different non-dimensional input feature to the parameters corresponding to the ingredient contents. The accuracy of the ANN model was at least 2% higher than that of the other techniques, even when the non-dimensional parameters were considered. Using a larger dataset, Gholampour et al. [22] predicted the compressive strength and other mechanical properties of RAC employing three algorithms: multivariate adaptive regression splines, M5 model tree, and least squares support vector regression. They created two different models for each algorithm, corresponding to the cube and cylinder compressive strengths, respectively. For these models, results on 332 cube specimens and 318 cylinder specimens were collected from the open literature. Least squares support vector regression achieved higher performance than the remaining models, with an at least 12.6% better mean absolute percentage error. Duan et al. [2] proposed using the characteristics of the recycled aggregate as input parameters, including the saturated surface dry mass, water absorption, and volume fraction of coarse aggregate, and concluded that the inclusion of these features had a positive effect on model accuracy. Moreover, Topçu and Saridemir [6] found that ANN predicted the RAC compressive and splitting tensile strengths more accurately than fuzzy logic.
Their ANN model proved to be a powerful tool to determine the mechanical properties of RAC, achieving coefficients of determination of 0.9984 and 0.9979 for the prediction of compressive strength and splitting tensile strength, respectively. Dantas et al. [23] gathered the largest dataset and used an ANN to develop an equation describing the compressive strength of RAC. Their model included 17 input features, among which the ratio of recycled concrete, the absorption rate of fine recycled aggregate, the content of dry aggregate, and the fineness modulus of aggregates had the highest effect on the compressive strength of RAC. The reported coefficients of determination for the training and testing sets were 0.928 and 0.971, respectively.
In summary, Khademi et al. [15], Naderpour et al. [1], Deng et al. [16], Deshpande et al. [11], Gholampour et al. [22], Duan et al. [2], Topçu and Saridemir [6], and Dantas et al. [23] used 257, 139, 74, 257, 650, 168, 210, and 1178 data points, respectively, to predict the compressive strength of RAC. In addition to the quality and size of existing datasets, the advent of new and more powerful ML algorithms has stimulated researchers to explore the ability of state-of-the-art methods to enhance the accuracy and robustness of predictive models. Among the various ML techniques applied to predict the compressive strength of RAC, ANNs and fuzzy logic have been the most widely used methods (Table 1).

Research Significance
As elaborated above, there have been various studies on the application of traditional ML techniques to predict the compressive strength of RAC. The present study aims at creating a larger and more comprehensive dataset and deploying it with state-of-the-art ML techniques that have not yet been explored for RAC in the open literature. The models presented herein are implemented in the Python programming language; therefore, to utilize these models, the user can simply reproduce the development steps and the hyperparameters reported in this study. Furthermore, the compressive strength predictive tools developed in this study are complemented with optimization of the mixture proportions using a coupled particle swarm optimization-gradient boosting regression trees (PSO-GBRT) model. The proposed mixture proportions can be used as a reference guideline for designing eco-friendlier and more economical RAC mixtures in practice.

Machine Learning Basis
ML refers to a computer's capacity to analyze data and learn complex patterns within the data without being explicitly programmed [24]. Depending on the nature of the data, ML algorithms are categorized into supervised learning, unsupervised learning, and reinforcement learning [25]. Supervised learning aims at capturing underlying patterns in data with known outputs. Depending on the type of output, it can be further categorized as classification for discrete outputs and regression for continuous outputs. Unsupervised learning, on the other hand, is associated with data whose outputs are unknown and thus clusters the data by finding relationships within the observations [26]. The third type of machine learning, reinforcement learning, bridges the gap between supervised and unsupervised learning since it clusters similar data given the correct answers [27]. Three powerful ML models were developed herein to forecast the compressive strength of RAC: Gaussian processes (GP), recurrent neural networks (RNNs), and gradient boosting regression trees (GBRT). The three algorithms take different approaches to data analysis: whilst GBRT is an ensemble of decision trees, GP builds on the Gaussian distribution, and RNNs are an advanced type of neural network. This diversity of algorithms was chosen to explore the robustness of ML approaches. The fundamentals of GP, RNNs, and GBRT are discussed below.

Gaussian Processes
GPs are stochastic processes that generalize the Gaussian probability distribution [28]. In contrast to a single- or multi-variable probability distribution, in which a scalar or a vector is mapped, a process describes the properties of functions [29]. Therefore, a GP is defined as a probability distribution over functions, P(f), where P(f) has a Gaussian distribution [30]. By analogy with the Gaussian distribution, GPs are parametrized by a mean and a covariance, except that for a GP the mean and covariance are functions [31]. The purpose of training a supervised learning algorithm on the available training dataset is to develop a model capable of predicting the output for unseen input data. There are two common approaches to determine an appropriate function that fits a set of data with promising accuracy [29]. In the first approach, the model is generated by considering only certain types of functions, e.g., exponential functions [29]. However, the prediction accuracy of such models strongly depends on the suitability of the chosen functions. Conversely, the second approach assigns prior probabilities to many types of functions, such that higher probability is given to those that are more likely to predict with higher accuracy [32]. The flexibility of the first approach is limited by the selected functions. By contrast, the second approach appears computationally more costly since there is an infinite number of possible functions to consider [29]. GPs are based on the second approach; nevertheless, their probabilistic formulation remains computationally tractable because the properties of the functions can be inferred without explicitly evaluating the infinitely many candidate functions [33].
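To illustrate the "distribution over functions" view, the following minimal sketch draws sample functions from a zero-mean GP prior using scikit-learn; the RBF covariance and all values here are illustrative assumptions, not the kernel used in this study (which is described in the GP Model section).

```python
# Minimal sketch: sample functions from a zero-mean GP prior with an
# RBF covariance, illustrating a probability distribution over functions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 50).reshape(-1, 1)
gp_prior = GaussianProcessRegressor(kernel=RBF(length_scale=1.5))

# Each column is one function drawn from P(f) before seeing any data.
samples = gp_prior.sample_y(X, n_samples=3, random_state=0)
print(samples.shape)  # (50, 3)
```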

Recurrent Neural Networks
DL models are multiple-level computational algorithms able to learn complex underlying structures within a database [34]. DL models have proven successful in diverse applications such as image recognition, language understanding, and deoxyribonucleic acid (DNA) biological processes prediction [34]. However, applications of recent DL algorithms in civil engineering, including convolutional neural networks (CNNs) and RNNs have been more common in structural health monitoring and crack detection owing to the large data sets available in these fields [35,36]. CNNs and RNNs are among the most popular DL algorithms. In the present study, a novel type of RNN is deployed to predict the compressive strength of RAC.
RNN is a class of neural networks with an internal loop that allows the algorithm to keep memories of past information, commonly referred to as the hidden state [37,38]. In RNNs, the output of a certain step, t, is used as input for the next step, t+1, such that every step builds on the previous one, which enables the network to capture long-term dependencies; see Figure 1 [37]. Simple RNNs have a limitation regarding the contribution of earlier steps to later ones, known as the vanishing gradient problem [37]. Two main variants of layers have been proposed for RNNs to overcome vanishing gradients: long short-term memory (LSTM) and gated recurrent unit (GRU) [38]. The main difference between these RNN algorithms is the gating used for computing data. For example, LSTM layers incorporate a third gate, named the forget gate, in addition to the input and output gates [37]. The forget gate maintains information and includes it in a non-consecutive step [38]. Conversely, GRU layers have only two types of gates: the reset gate and the update gate. In the reset gate, the previous information is combined with the most recent information, whereas the update gate decides how much information is to be passed to the following step. Figure 2 displays the structure of the first GRU layer used in this study [37]. Like LSTMs, GRUs are not affected by vanishing gradients. Nonetheless, GRU is considered a more efficient algorithm due to its simpler structure and formulation [27]. The formulation of a GRU cell is summarized as follows:

$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$
$$\widetilde{h}_t = \mathrm{ReLU}\big(W_h x_t + U_h (r_t \odot h_{t-1})\big)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \widetilde{h}_t$$

where $r_t$ and $z_t$ are the reset and update gates, respectively, $\widetilde{h}_t$ is the candidate output, and $h_t$ is the corresponding output of the cell for time step t. Accordingly, $W_r$, $W_z$, $W_h$, $U_r$, $U_z$, and $U_h$ are the weight matrices that operate on the input vector $x_t$ and the previous state $h_{t-1}$, $\sigma$ is the sigmoid recurrent activation, and ReLU is the rectified linear unit activation function [39,40].
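As a concrete illustration of the formulation above, here is a minimal numpy sketch of a single GRU step, assuming the standard GRU update with sigmoid gates and a ReLU candidate activation as reconstructed above; the dimensions and weights are toy values.

```python
# Minimal numpy sketch of one GRU step, following the equations above
# (sigmoid gates, ReLU candidate activation as in the first GRU layer).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)                      # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)                      # reset gate
    h_cand = np.maximum(0.0, Wh @ x_t + Uh @ (r_t * h_prev))   # candidate (ReLU)
    return (1.0 - z_t) * h_prev + z_t * h_cand                 # blended new state

# Toy dimensions: 9 input features, hidden size 4
rng = np.random.default_rng(0)
dims = [(4, 9), (4, 4)] * 3
Wz, Uz, Wr, Ur, Wh, Uh = [rng.standard_normal(d) * 0.1 for d in dims]
h = gru_step(rng.standard_normal(9), np.zeros(4), Wz, Uz, Wr, Ur, Wh, Uh)
print(h)
```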

Gradient Boosting Decision Trees
GBRT algorithm integrates multiple weak learners using a boosting approach in which additional trees are appended in sequence without model parameters being changed. The objective of the gradient boosting is to find the function ℱ( ) that minimizes the loss function ℒ(ℱ( ), ) (e.g., mean squared error or mean absolute error) using a given dataset, ( , ), ( , ), … , ( , ) [41,42]. The predictions of the GBRT model, for a given input data can be expressed as: where the are referred to as weak learners. The constant M represents the number of weak learners which is known as the n_estimators hyperparameter. The loss function represents to what extent the predicted value is close to the output in the dataset using a specific metric. GBRT approaches the best function using the weighting of weak learner models, ( ), which is the basic decision tree fit by the input variables and the negative gradient of the last model's loss function. GBRT develops the model in a greedy manner considering a constant initial function ( ) as follows [41][42][43][44]:

Gradient Boosting Decision Trees
GBRT algorithm integrates multiple weak learners using a boosting approach in which additional trees are appended in sequence without model parameters being changed. The objective of the gradient boosting is to find the function ℱ( ) that minimizes the loss function ℒ(ℱ( ), ) (e.g., mean squared error or mean absolute error) using a given dataset, ( , ), ( , ), … , ( , ) [41,42]. The predictions of the GBRT model, for a given input data can be expressed as: where the are referred to as weak learners. The constant M represents the number of weak learners which is known as the n_estimators hyperparameter. The loss function represents to what extent the predicted value is close to the output in the dataset using a specific metric. GBRT approaches the best function using the weighting of weak learner models, ( ), which is the basic decision tree fit by the input variables and the negative gradient of the last model's loss function. GBRT develops the model in a greedy manner considering a constant initial function ( ) as follows [41][42][43][44]:

Gradient Boosting Decision Trees
GBRT algorithm integrates multiple weak learners using a boosting approach in which additional trees are appended in sequence without model parameters being changed. The objective of the gradient boosting is to find the function F (X) that minimizes the loss function L(F (X), Y) (e.g., mean squared error or mean absolute error) using a given dataset, (x 1 , y 1 ), (x 2 , y 2 ), . . . , (x N , y N ) [41,42]. The predictions of the GBRT model, y t for a given input data can be expressed as: Figure 1. RNN structure using one GRU hidden layer.

. Gradient Boosting Decision Trees
GBRT algorithm integrates multiple weak learners using a boosting approach in which ditional trees are appended in sequence without model parameters being changed. The objective the gradient boosting is to find the function ℱ( ) that minimizes the loss function ℒ(ℱ( ), ) .g., mean squared error or mean absolute error) using a given dataset, ( , ), ( , ), … , ( , ) 1,42]. The predictions of the GBRT model, for a given input data can be expressed as: here the are referred to as weak learners. The constant M represents the number of weak rners which is known as the n_estimators hyperparameter. The loss function represents to what tent the predicted value is close to the output in the dataset using a specific metric. GBRT proaches the best function using the weighting of weak learner models, ( ), which is the basic cision tree fit by the input variables and the negative gradient of the last model's loss function. RT develops the model in a greedy manner considering a constant initial function ( ) as llows [41][42][43][44]: where the using one GRU hidden layer.
sponding to the first layer of the developed deep k learners using a boosting approach in which ut model parameters being changed. The objective ℱ( ) that minimizes the loss function ℒ(ℱ( ), ) using a given dataset, ( , ), ( , ), … , ( , ) or a given input data can be expressed as: s. The constant M represents the number of weak perparameter. The loss function represents to what put in the dataset using a specific metric. GBRT g of weak learner models, ( ), which is the basic negative gradient of the last model's loss function. r considering a constant initial function ( ) as m are referred to as weak learners. The constant M represents the number of weak learners which is known as the n_estimators hyperparameter. The loss function represents to what extent the predicted value is close to the output in the dataset using a specific metric. GBRT approaches the best function using the weighting of weak learner models, als 2020, 13, x FOR PEER REVIEW 6 of 25

radient Boosting Decision Trees
GBRT algorithm integrates multiple weak learners using a boosting approach in which ional trees are appended in sequence without model parameters being changed. The objective e gradient boosting is to find the function ℱ( ) that minimizes the loss function ℒ(ℱ( ), ) mean squared error or mean absolute error) using a given dataset, ( , ), ( , ), … , ( , ) 2]. The predictions of the GBRT model, for a given input data can be expressed as: e the are referred to as weak learners. The constant M represents the number of weak ers which is known as the n_estimators hyperparameter. The loss function represents to what t the predicted value is close to the output in the dataset using a specific metric. GBRT oaches the best function using the weighting of weak learner models, ( ), which is the basic ion tree fit by the input variables and the negative gradient of the last model's loss function.
develops the model in a greedy manner considering a constant initial function ( ) as s [41][42][43][44]: , which is the basic decision tree fit by the input variables and the negative gradient of the last model's loss function. GBRT develops the model in a greedy manner considering a constant initial function F 0 (X) as follows [41][42][43][44]: Materials 2020, 13, 4331 where the are referred to as weak learners. The constant M represents the number of weak learners which is known as the n_estimators hyperparameter. The loss function represents to what extent the predicted value is close to the output in the dataset using a specific metric. GBRT approaches the best function using the weighting of weak learner models, ( ), which is the basic decision tree fit by the input variables and the negative gradient of the last model's loss function. GBRT develops the model in a greedy manner considering a constant initial function ( ) as follows [41][42][43][44]: where a given input data can be expressed as: he constant M represents the number of weak parameter. The loss function represents to what in the dataset using a specific metric. GBRT f weak learner models, ( ), which is the basic ative gradient of the last model's loss function. onsidering a constant initial function ( ) as m (x) is the mth regression tree and γ m is its weighting coefficient, also called learning rate. In a GBRT model, the number of trees, the learning rate, and the max depth of the tree are amongst the most essential hyperparameters that noticeably affect the predictive performance of the model. Larger number of trees increases the prediction accuracy of the model; however, excessive trees could result in an over-fitted model with lack of generalization for new unseen data. On the other hand, the learning rate controls the contribution of each tree to the predictions, while the max depth indicates the complexity of each tree. Immoderate values of such hyperparameters could bring about either over-fitted or erroneous models [41][42][43][44]. Other parameters of the GBRT model, such as subsample, maximum number of features, etc., also have noticeable effects on the model output and should be considered. Hence, tuning the GBRT hyperparameters is essential to propound robust and reliable performance.
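To make the boosting recipe concrete, the following is a minimal sketch using scikit-learn's GradientBoostingRegressor on synthetic data; the hyperparameter values are illustrative, not the tuned values reported later in this study.

```python
# Minimal sketch of a GBRT regressor with the hyperparameters named above;
# values are illustrative, not the tuned values of this study.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=9, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gbrt = GradientBoostingRegressor(
    n_estimators=300,    # M: number of weak learners (trees)
    learning_rate=0.1,   # gamma_m: scaling of each tree's contribution
    max_depth=3,         # complexity of each tree
    subsample=0.8,       # stochastic gradient boosting
    loss="squared_error",
)
gbrt.fit(X_tr, y_tr)
print("test R^2:", gbrt.score(X_te, y_te))
```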

Data Collection and Preprocessing
The experimental data used in this study were collected from 55 peer-reviewed publications (Table 2). The collected data consist of 1134 RAC mixture design examples, with nine input features and one output. Statistical characteristics of the dataset are given in Table 3. Figure 3 illustrates the Pearson correlation coefficients between the different attributes of the dataset. The Pearson correlation coefficient is an indicator of linear dependence between two random variables; i.e., a correlation coefficient close to one between two variables indicates that an increase in one of those variables is accompanied by a proportional increase in the other [45]. Accordingly, the w/c ratio and superplasticizer dosage were the features with the highest correlation to the compressive strength. Conversely, the aggregates (sand, natural gravel, and RCA) did not have a significant linear correlation with the compressive strength. Furthermore, since gravel is the ingredient replaced by recycled coarse aggregate (RCA), there was a high correlation between these two features.

Feature normalization is commonly applied prior to modeling. Although normalization is not required for all ML algorithms, it has been proven to improve model performance [27]. Linear transformation and statistical standardization are among the most popular normalization techniques [98]. In the linear transformation, values are scaled to the [0, 1] range, whereas in statistical standardization, the mean and the standard deviation of the data are set equal to 0 and 1, respectively [98]. In this study, statistical standardization was applied prior to the GP, GBRT, and RNN modeling. The data were then randomly divided into training and testing sets, using 70% (793 samples) for training and the remaining 30% (341 samples) for model testing.
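A minimal sketch of the preprocessing just described, assuming scikit-learn: statistical standardization (zero mean, unit standard deviation) fitted on the training portion, and a random 70/30 split reproducing the 793/341 partition; the feature matrix here is a random placeholder for the actual dataset.

```python
# Minimal sketch of statistical standardization and the 70/30 split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).random((1134, 9))   # placeholder for the 9 features
y = np.random.default_rng(1).random(1134) * 80   # placeholder strengths (MPa)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)        # 793 train / 341 test samples

scaler = StandardScaler().fit(X_train)           # mean 0, std 1 (fit on train only)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
```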
A common practice to assess the performance of ML models is to divide the entire dataset into three different subsets: training, validation, and testing. Whilst the learning process is accomplished with the training set, the validation set is used to track the performance of the model, and the testing set serves to assess the extrapolation capabilities of the model over unseen samples that are not known to the model [27]. However, the partition of the data into three subsets reduces the number of training samples, which can result in an insufficiently trained model [27]. Thus, cross-validation is a common technique to prevent over-reduction of the training set, especially for small datasets. There are several techniques to perform cross-validation, most of which consist of leaving out random data to validate the model [99]. In this study, K-fold cross-validation was used. K-fold cross-validation is a resampling method that splits the data into k subsets and keeps one subset for validation, while the other k − 1 subsets are used for training [100]. The 5-fold cross-validation employed for hyperparameter selection in this study is schematically depicted in Figure 4.
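The following is a minimal sketch of 5-fold cross-validation with scikit-learn, on placeholder data of the same shape as the training set; the estimator and scoring choices are illustrative.

```python
# Minimal sketch of 5-fold cross-validation for model assessment.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((793, 9))          # placeholder for the standardized training set
y = rng.random(793) * 80          # placeholder strengths (MPa)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingRegressor(random_state=42),
                         X, y, cv=kf, scoring="r2")
print(scores.mean(), scores.std())  # one R^2 per fold; the mean guides tuning
```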

Hyperparameter Tuning
Hyperparameter tuning is a crucial step in developing robust ML models. Tuning an ML model mitigates over-fitting and thus enhances the versatility of the model on unseen data [101]. The selection of optimum hyperparameters is also a determinant factor in increasing the model accuracy [102]. Aiming to avoid manual tuning, different approaches have been proposed to automate the selection of hyperparameters, such as grid search and random search hyperparameter optimization [103]. These approaches are distinguished by the domain of potential values considered in the search. Whilst grid search explores all possible values in a pre-defined domain of hyperparameters, random search algorithms select the different hyperparameter values in a random manner for a specific number of iterations [103]. A randomized search procedure with 5-fold cross-validation was used herein for exploring possible values of hyperparameters using the Scikit-learn package in Python [104].
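As a compact illustration of the two automated strategies, the sketch below contrasts grid search (exhaustive over a pre-defined grid) with randomized search (a fixed number of random draws) on a toy ridge regressor; the estimator and search spaces are illustrative assumptions, not those of this study.

```python
# Minimal contrast of grid search vs. randomized search: grids enumerate
# every combination, randomized search samples n_iter candidates.
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=9, random_state=0)

grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5).fit(X, y)
rand = RandomizedSearchCV(Ridge(), {"alpha": loguniform(1e-3, 1e2)},
                          n_iter=10, cv=5, random_state=0).fit(X, y)
print(grid.best_params_, rand.best_params_)
```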

GP Model
GP is a non-parametric model [29]; thus, the selection of hyperparameters is less challenging, especially compared to DL models. The hyperparameters of GP models are those required by the kernel function. Therefore, the kernel function, also known as the covariance function, is key to creating robust GP models [29]. In this study, a linear combination of several default kernel functions was implemented, as defined in Equation (8). This kernel function includes the periodic kernel, the Matérn kernel, and the dot-product kernel. It is worth mentioning that all available kernels, such as the periodic kernel, the rational quadratic kernel, the white kernel, the Matérn kernel, and the dot-product kernel, were tested when tuning the GP model.
According to the former equation, the parameters associated with the considered kernels were tuned as the hyperparameters of the GP model, including the length scale 1 (l 1 ) and periodicity (p) of the periodic kernel; ν and length scale 2 (l 2 ) of the Matérn kernel; and σ 0 of the dot-product kernel. The optimization of the hyperparameters was carried out using 5-fold cross-validation (CV) as described earlier. The tuned values of the hyperparameters are listed in Table 4. The Scikit-learn library in Python was employed for tuning and executing the GP model [104]. Table 4. Hyperparameters for Gaussian processes model.
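A minimal sketch of a GP with a sum of the three kernel families named above, using scikit-learn's ExpSineSquared (periodic), Matern, and DotProduct kernels; treating the linear combination as a plain sum with unit weights is an assumption here, and the parameter values are illustrative rather than the tuned values of Table 4.

```python
# Minimal sketch of a GP regressor with a combined kernel of the three
# families named above; all parameter values are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, ExpSineSquared, Matern

kernel = (ExpSineSquared(length_scale=1.0, periodicity=3.0)   # periodic kernel
          + Matern(length_scale=1.0, nu=1.5)                  # Matern kernel
          + DotProduct(sigma_0=1.0))                          # dot-product kernel

rng = np.random.default_rng(0)
X, y = rng.random((100, 9)), rng.random(100) * 80             # placeholder data
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
print(gp.kernel_)   # kernel with hyperparameters optimized during fitting
```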

RNN Model
The developed architecture of the RNN model consists of three GRU layers and one dense layer, with 239, 238, 217, and 1 neurons, respectively. In the first layer, the ReLU activation function and the sigmoid recurrent activation function were utilized (Figure 2). In the second layer, the activation and recurrent activation functions were the sigmoid and ReLU, respectively. In the third layer, the scaled exponential linear unit (SELU) and softsign were used as the activation and recurrent activation functions, respectively. For the dense layer, only the softplus activation function was used. Moreover, the kernel initializer and recurrent initializer were tuned for the GRU layers. The kernel initializer was set to random uniform for the first and second layers, whereas a constant initializer was used for the third layer. The recurrent initializer was set to constant for the first layer and to zeros for the second and third layers. Mean squared error (MSE) was used as the model loss function, and the Adam optimization algorithm was employed as the model optimizer with a learning rate of 0.0002. Ultimately, the number of epochs and the batch size were set to 360 and 11, respectively. According to Whang and Matsukawa [105], the performance of GRU models improves when batch normalization is applied, since batch normalization mitigates the so-called internal covariate shift [105]. Internal covariate shift is a frequent problem in the training of deep neural networks in which the distribution of inputs at each layer changes, requiring finer tuning of the models along with smaller learning rates [106]. Hence, batch normalization was implemented in the developed RNN model. Momentum and epsilon are the parameters associated with batch normalization; the optimum momentum and epsilon were found to be 0.95 and 0.0001, respectively. Table 5 summarizes the tuned hyperparameters of the RNN model. The hyperparameter selection for the DL model was performed using a randomized search approach along with 5-fold CV. The Keras API and Scikit-learn packages in Python were utilized for building and tuning the RNN model [104,107].
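A minimal Keras sketch consistent with the described architecture (three GRU layers of 239, 238, and 217 units, a single-unit dense output, batch normalization with the stated momentum and epsilon, and Adam at a learning rate of 0.0002); the placement of the batch normalization layers, the input shape, and the simplified initializers are assumptions, so Table 5 remains the authoritative configuration.

```python
# Minimal Keras sketch of the described GRU architecture; initializer
# details are simplified relative to the tuned configuration (Table 5).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(1, 9)),   # 9 mixture features as a length-1 sequence
    layers.GRU(239, activation="relu", recurrent_activation="sigmoid",
               return_sequences=True),
    layers.BatchNormalization(momentum=0.95, epsilon=0.0001),
    layers.GRU(238, activation="sigmoid", recurrent_activation="relu",
               return_sequences=True),
    layers.BatchNormalization(momentum=0.95, epsilon=0.0001),
    layers.GRU(217, activation="selu", recurrent_activation="softsign"),
    layers.BatchNormalization(momentum=0.95, epsilon=0.0001),
    layers.Dense(1, activation="softplus"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0002), loss="mse")
# model.fit(X_train, y_train, epochs=360, batch_size=11)
```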

GBRT Model
GBRT has multiple hyperparameters that need tuning prior to model training. In the current study, a randomized search procedure alongside 5-fold CV was used to obtain the optimum hyperparameters of the GBRT model. Generally, n_estimators and learning_rate, which indicate the number of weak learners in the model and the weighting of each estimator, respectively, are the most influential hyperparameters of the GBRT model and must be tuned. Additionally, max_depth, max_features, and subsample can greatly affect the prediction performance of the GBRT model [108]. Table 6 presents the tuned values of the seven hyperparameters considered. The mean absolute error (MAE) was monitored as the statistical error to select optimum hyperparameters yielding the highest accuracy while mitigating over-fitting. The Scikit-learn package was used to perform the GBRT modeling and tuning [104].
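A minimal sketch of the tuning setup described above: randomized search with 5-fold CV over the named GBRT hyperparameters, scored by (negative) MAE. The search ranges and placeholder data are illustrative; the resulting tuned values are those reported in Table 6.

```python
# Minimal sketch of GBRT hyperparameter tuning via randomized search,
# monitoring MAE; ranges are illustrative, not the study's search space.
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X, y = rng.random((793, 9)), rng.random(793) * 80   # placeholder data

param_distributions = {
    "n_estimators": randint(100, 2000),    # number of weak learners
    "learning_rate": uniform(0.01, 0.29),  # weighting of each estimator
    "max_depth": randint(2, 12),           # complexity of each tree
    "max_features": uniform(0.3, 0.7),     # fraction of features per split
    "subsample": uniform(0.5, 0.5),        # stochastic boosting fraction
}
search = RandomizedSearchCV(GradientBoostingRegressor(random_state=42),
                            param_distributions, n_iter=100, cv=5,
                            scoring="neg_mean_absolute_error", random_state=42)
search.fit(X, y)
print(search.best_params_)
```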

RAC Mixture Optimization
This section presents the framework adopted for optimizing the mixture design of RAC using the ML model with the best predictive performance. The objective of the optimization is to propose the most economical mixture proportions of RAC for different compressive strength classes. The PSO algorithm, a metaheuristic method that mimics the social interactions of birds or insects (particles) in the search for an optimal solution, was adopted [109]. The particles modify their positions in every iteration based on the individual velocity vector of each particle, which in turn depends on both the best-found particle and swarm positions [110]. The PSO minimizes an objective function while limiting the domain of the solution. Following the optimization procedure proposed by Yeh [19], the function to be minimized herein is the cost of producing a batch of RAC, as defined in Equation (9). The considered unit costs, which are averages of values retrieved from multiple material suppliers across Canada, are presented in Table 7. These values can easily be replaced by costs corresponding to other locations. The unit cost of RCA was considered equal to that of NA, as recommended in ref. [111].

$$P = C_1 I_1 + C_2 I_2 + \cdots + C_n I_n \qquad (9)$$

where $C_i$ represents the unit cost of the ith ingredient of the mixture and $I_i$ is the dosage of the ith ingredient in kg/m³. To limit the domain of the solution, two boundary vectors were defined: an upper limit and a lower limit. The boundary vectors (Table 8) were strategically defined based on a real experiment from the dataset with a known compressive strength, to allow a meaningful comparison and thus better validate the performance of the algorithm. In other words, for sand, cement, and water, the upper and lower boundary limits were defined on average 20% above and below the values given for the base mixture.
To promote the use of recycled aggregate, the values assigned to the lower and upper boundary vectors for RCA were kept high, while the corresponding values for gravel were kept low. Also, due to the high cost of superplasticizer, its assigned boundary values were kept as low as possible. The 28-day compressive strength of a standard 15 × 30 cm cylinder specimen was considered for the sake of comparison. The results of the optimized mixture proportions are given in Table 9.
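A minimal sketch of the PSO-GBRT coupling, assuming a simple penalty formulation: particles are candidate ingredient dosages bounded by the lower and upper boundary vectors, and the objective is the batch cost of Equation (9) plus a penalty whenever the predicted strength falls below the target class. The unit costs, bounds, and the toy strength surrogate are illustrative stand-ins; in practice, the tuned GBRT model supplies the strength prediction.

```python
# Minimal PSO sketch: minimize batch cost (Eq. (9)) subject to a predicted
# strength target. Costs, bounds, and the surrogate are illustrative only.
import numpy as np

# Ingredient order: cement, water, sand, gravel, RCA, superplasticizer
UNIT_COSTS = np.array([0.15, 0.001, 0.010, 0.012, 0.012, 2.50])  # $/kg, illustrative
LOWER = np.array([280.0, 140.0, 600.0, 0.0, 400.0, 0.0])         # lower boundary vector
UPPER = np.array([420.0, 210.0, 900.0, 300.0, 1200.0, 5.0])      # upper boundary vector

def predict_strength(mix):
    # Stand-in for the tuned GBRT predictor; in practice call
    # gbrt.predict(scaler.transform(mix.reshape(1, -1)))[0] instead.
    cement, water = mix[0], mix[1]
    return 100.0 * cement / (cement + 3.0 * water)   # toy w/c-driven surrogate

def objective(mix, target=35.0):
    cost = UNIT_COSTS @ mix                          # Eq. (9): P = sum(C_i * I_i)
    shortfall = max(0.0, target - predict_strength(mix))
    return cost + 1000.0 * shortfall                 # penalize strength shortfall

def pso_minimize(f, lower, upper, n_particles=30, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(lower, upper, size=(n_particles, lower.size))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([f(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5                        # inertia, cognitive, social
    for _ in range(n_iter):
        r1, r2 = rng.random((2, *pos.shape))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lower, upper)       # respect boundary vectors
        vals = np.array([f(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

best_mix, best_cost = pso_minimize(objective, LOWER, UPPER)
print("optimized dosages (kg/m^3):", best_mix.round(1), "cost: $%.2f" % best_cost)
```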
The optimized mixture was subsequently tested using the GBRT model (the best predictive model in this study) and compared to the real concrete sample extracted from the dataset to verify that the required compressive strength criteria were met, as shown in Table 10.

Results, Discussion, and Recommendations
This section presents the results of the ML modeling of RAC. The three models outlined earlier were implemented, and their prediction performance is discussed herein. The root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) were used to assess the performance of each model. Moreover, the best-performing ML model was employed to optimize the RAC mixture design for different ranges of 28-day compressive strength. The optimization results, along with mixture proportion recommendations, are discussed below.
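For reference, the three metrics can be computed with scikit-learn, as in the following minimal sketch on placeholder values.

```python
# Minimal sketch of the evaluation metrics used below (RMSE, MAE, R^2).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([25.0, 35.0, 45.0, 55.0])   # actual strengths (MPa)
y_pred = np.array([27.1, 33.4, 46.2, 52.8])   # model predictions (MPa)

rmse = mean_squared_error(y_true, y_pred) ** 0.5
print(rmse, mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred))
```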

Prediction Performance of ML Models
GP, GRU, and GBRT models were trained using 793 data examples and tested with the remaining 341. The final tuned models were executed over five different seed numbers of the data split to assess the robustness of the models under randomized splits of the data into training and testing sets. The predictive performance of the GP model for the five random seeds is summarized in Table 11. The model predicted the output with average RMSE, MAE, and R² of 7.087 MPa, 4.911 MPa, and 0.844, respectively, for the test dataset. However, the model performance was greatly superior for the training dataset, with average RMSE, MAE, and R² of 0.735, 0.138, and 0.998, respectively. This trend can be further observed in the residual plot of the GP model shown in Figure 5. The residuals for the training data were less than 10 MPa, while they were as high as 40 MPa for some data points in the testing set. The actual versus predicted output of the GP model is illustrated in Figure 6.
The GRU model attained better performance than the GP model (see Table 12). The difference between the GRU statistical errors on the train and test data was smaller than that of the GP model. The RMSE, MAE, and R² values for the test dataset were 6.502 MPa, 4.364 MPa, and 0.868, respectively, while the corresponding values for the train dataset were 3.183 MPa, 2.285 MPa, and 0.968. This demonstrates a more robust predictive performance along with higher accuracy compared to the GP model. The residuals of the predictions varied in a narrower range than those of the GP model, as depicted in Figure 7. The residuals for both the testing and training datasets had similar normal distributions, indicating more robust predictive performance. Figure 8 shows the actual versus predicted compressive strength of the test data for the GRU model. Table 11. Measured performance of Gaussian process model.

The GBRT model achieved superior predictive performance, as indicated in Table 13, with the lowest RMSE and MAE values for the test dataset, along with the highest coefficient of determination compared to those of the GP and GRU models. The RMSE and MAE were 5.074 and 3.396 MPa, respectively, for the GBRT model. Figure 9 depicts the residuals of the predicted compressive strength for the training and testing datasets of the GBRT model. It can be observed that the model captured the trend in the data and demonstrated powerful performance on both the train and test datasets. The model achieved R² values of 0.997 and 0.925 for the training and testing data, respectively. Furthermore, less scatter of the GBRT predicted values of the test dataset was accomplished compared to the GRU and GP models. The actual versus GBRT predicted compressive strength of the test data is displayed in Figure 10.

Comparison of Model Performance
Based on the results discussed above, all developed ML models could predict the compressive strength of RAC with reasonable accuracy. However, the GRU and GBRT models demonstrated higher generalization capacity, since their prediction errors for the training and testing datasets were closely matched, in contrast to the GP model. The prediction accuracy of the GP model was very high for the training dataset but considerably lower for the testing dataset; thus, the GP model suffers from over-fitting and a lack of generalization to new unseen data. Although DL models are recognized to be more accurate on large datasets, the finely tuned GRU model reached outstanding prediction performance with high generalization capacity despite the relatively small dataset. Figure 11 illustrates the Taylor diagram of the GP, GRU, and GBRT models using the RMSE, Pearson correlation, and standard deviation of the predictions. The Taylor diagram suggests that the GBRT model had superior performance in terms of RMSE, whereas the GRU model provided predictions whose standard deviation was highly correlated with the actual observations. It is worth mentioning that the GBRT model required considerably shorter training time than the GRU model; this comparison was performed on the same computer without connecting to a hosted GPU. Ultimately, the GBRT model was deemed to have the best overall performance and was selected for the mixture optimization process.

Comparison with Previous Studies
A prime goal in ML is to create models that can accurately predict the output for new unseen data never presented to the model, i.e., models that can generalize [38]. ML models generalize a phenomenon by learning the underlying principles within the training data; hence, they are capable of generalizing when they predict sensible outputs for inputs different from those of the training dataset [27]. Testing the model on a large number of unseen data samples is the rational way to determine whether the model is generalizing, hence the importance of large datasets [27]. The models proposed in the present study have demonstrated better generalization capability than those in former studies. A major reason for this superior performance is that the test dataset used in this study has more data samples than the entire datasets used in developing previous models, including Khademi et al. [15], Duan et al. [2], Deshpande et al. [11], Topçu and Saridemir [6], Deng et al. [16], and Naderpour et al. [1] (see Table 14). It is important to mention that the Deng et al. [16] model was not included in Table 14 because they reported neither the coefficient of determination nor the root mean squared error. However, they reported relative error percentages of 6.63%, 4.35%, and 3.65% for the back propagation neural network, support vector machine, and convolutional neural network, respectively.

Table 14 shows the coefficient of determination and the root mean squared error of models in previous studies that predicted the compressive strength of RAC. It can be observed that the models in the present study achieved better accuracy than those of Gholampour et al. [22] and Deshpande et al. [11], who used relatively large data samples. As expected, studies with smaller databases reached higher accuracy. For instance, Duan et al. [2] and Khademi et al. [15] used 168 and 257 samples, respectively; the reported coefficients of determination were 0.995 for Duan et al. [2] and 0.919 for Khademi et al. [15], both using ANNs. This indicates that although a higher number of samples might result in a better generalized model, the accuracy can decrease, and thus accuracy metrics alone may not be enough to assess predictive models; the ability to generalize predictions beyond a limited dataset is a more important feature. Also, several models that used smaller datasets than the present study, including Khademi et al. [15], Duan et al. [2], Deshpande et al. [11], Topçu and Saridemir [6], Deng et al. [16], and Naderpour et al. [1], have compromised generalization capability. Furthermore, in the case of Gholampour et al. [22], the authors split the available data and created two different models to predict the compressive strength of cylindrical and cube specimens, respectively. Conversely, the present study considered the specimen type as an input feature, resulting in higher accuracy. Overall, the present study, along with Dantas et al. [23], used the largest amount of data. However, Dantas et al. [23] reported a coefficient of determination higher for the testing set (0.971) than for the training set (0.928), a sign that their model was not sufficiently trained, as suggested by Gulli and Pal [37].

RAC Mixture Proportioning and Optimization
A PSO algorithm was coupled with the GBRT model to optimize the mixture design and predict the compressive strength of RAC, such that the most economical mixture proportions are obtained for a given compressive strength class. The optimization was performed considering the unit costs of materials presented in Table 7. Not only does the optimization process reduce the ingredients with higher unit costs, but it also reduces the cement content of the mixture, providing both economic benefit and more sustainable mixture designs with lower CO2 emissions. A high upper limit for RCA was considered in the optimization to ensure maximum RCA replacement, as presented in Table 8. Although using higher portions of RCA may conflict with compressive strength requirements, the optimization was carried out to maintain the highest possible recycled aggregate content along with the desired compressive strength class. Table 9 presents the optimized mixture designs of RAC for different compressive strength classes as obtained by the PSO model. The mixture proportions were then used to predict the compressive strength using the GBRT model. Silica fume was not considered in the optimization process and was thus set to zero when predicting the compressive strength with the GBRT model. Ultimately, a considerable cost reduction was achieved in all cases, especially for the lower compressive strength range, as outlined in Table 10. For instance, there was a 25% reduction in the cost of the RAC mixture without affecting its compressive strength when the target compressive strength was 35 MPa. The optimization process demonstrated the outstanding capability of the PSO-GBRT model in capturing complex relationships within the data to select the best mixture proportions while maintaining a water-to-cement ratio similar to that of the base mixture. This can be observed, for instance, in the 25 and 30 MPa compressive strength classes, for which a high water-to-cement ratio was proposed together with a high content of RCA, whose water absorption capacity is high, as observed in experimental studies [9].

Conclusions
The present study explored the deployment of state-of-the-art machine learning models to predict the compressive strength of RAC and optimize its mixture design. For this purpose, one of the largest existing experimental datasets, including 1134 mixture design examples and featuring 10 attributes, was built from studies in the open literature. Three advanced machine learning models, namely Gaussian processes (GP), deep learning (DL), and gradient boosting regression trees (GBRT), were tuned, trained, and tested on the dataset. To help ensure that the developed models were able to generalize in predicting the compressive strength of RAC, K-fold cross-validation was used during the tuning process. The results show that the three models successfully captured the underlying principles governing the compressive strength of RAC. Furthermore, the diverse nature of the algorithms used herein demonstrates the robustness of ML algorithms for data analysis despite the complexity of the dataset. The comparison of model performance revealed that the GBRT and DL (recurrent neural network) models had superior predictive performance compared to the GP model in terms of different statistical metrics and performance indicators; the obtained coefficients of determination on the testing set for GBRT, DL, and GP were 0.919, 0.868, and 0.844, respectively. Furthermore, the GBRT model was coupled with PSO to create a hybrid model for optimizing the mixture design of RAC for various compressive strength classes. The GBRT-PSO hybrid model successfully proposed economical mixture designs that fulfill the compressive strength requirement, reduce cost, and mitigate the environmental footprint of concrete production. To build on the high potential of the developed ML models, future work should integrate supplementary cementitious materials, such as fly ash and blast furnace slag, into the dataset and extend the models to also capture durability requirements of RAC.