Prediction of the Compressive Strength for Cement-Based Materials with Metakaolin Based on the Hybrid Machine Learning Method

Cement-based materials are widely used in construction engineering because of their excellent properties. As the functional requirements of building infrastructure continue to rise, so do the performance requirements of cement-based materials. Compressive strength is one of the most important properties of cement-based materials, so predicting it reliably is of great significance. In this study, a hybrid machine learning model combining Random Forest (RF) and the Firefly Algorithm (FA) was proposed to predict the compressive strength of metakaolin cement-based materials. A database of 361 samples with five input parameters (cement grade, water to binder ratio, cement-sand ratio, metakaolin to binder ratio, and superplasticizer) was employed for the prediction. In this model, FA was used to optimize the hyperparameters, and RF was used to predict the compressive strength of metakaolin cement-based materials. The reliability of the hybrid model was verified by comparing the predicted and actual values of the dataset. The importance of the five variables was also evaluated, and the results showed that the cement grade has the greatest influence on the compressive strength of metakaolin cement-based materials, followed by the water-binder ratio.


Introduction
Cement-based materials are composite materials that are composed of cement-based reinforcement, filler, chemical additives, and water through composite technology [1,2]. They are widely used in the construction industry because of their early strength, high strength, high mobility, strong durability, and other characteristics [3][4][5][6][7]. A large amount of CO2 is generated during cement production, which places a great burden on the environment [8][9][10][11]. To reduce resource consumption and ease the burden of carbon emissions on the environment, researchers are looking into replacing some of the cement with active materials such as fly ash and silica fume [5][12][13][14][15][16]. Although the use of active materials is an effective way to reduce resource consumption and mitigate the greenhouse effect, the limited output of these materials restricts how far they can be used to improve the performance of cement-based materials. Metakaolin, a high-performance mineral admixture, is formed by calcining kaolin at 600 to 800 °C. The raw material for metakaolin is abundant, its activity is similar to that of silica fume, and it is more effective at improving the properties of cement-based materials [17][18][19][20].
At present, many scholars at home and abroad have studied the application of metakaolin in cement-based materials and achieved abundant results. He et al. studied the influence of the metakaolin content on the properties of sulfate cement-based materials [20]. The study showed that the initial fluidity, the expansion rate, the 28 d bending strength, and et al. used an evolutionary LS-SVM model to predict the groutability of soil improvement based on micro-cement [47]. Guo et al. proposed an effective model for predicting the initial and final setting time of cement based on a generalized learning system [48]. Yuanni et al. predicted the strength of concrete using the LGBM regression algorithm [49]. Fatih Ozcan et al. compared the performance of neural network and fuzzy logic models in predicting the long-term compressive strength of silica fume concrete [50]. Cheng Yeh used a neural network model to simulate the slump flow of concrete [51]. These machine learning methods achieved good results in predicting the performance of cement-based materials and have been widely used for that purpose [33,34,[52][53][54][55][56][57][58][59]. However, they still have some limitations, such as uncertainty, long computation time, and low efficiency [60][61][62][63][64][65][66][67][68][69][70][71][72]. Therefore, it is necessary to propose a more efficient and simpler machine learning technique to predict the compressive strength of metakaolin cement-based materials. It is difficult for a single machine learning model to overcome these common shortcomings [73][74][75][76][77][78][79][80].
To avoid the common problems of machine learning models and improve their application in the field of cement-based materials, an RF and FA hybrid machine learning model was proposed in this study. This hybrid model was employed to predict the compressive strength of cement-based materials with metakaolin.

Dataset Collection
Accurate prediction results are inseparable from an efficient evaluation model and reliable data. In previous studies, researchers focused on developing simpler and more efficient models to predict the properties of cement-based materials but often ignored the importance of a reliable database for the prediction results. A database with reliable and sufficient data is the basis for verifying the accuracy of the model. In this study, the author collected data from previous studies and established a large and reliable database for predicting the compressive strength of cement-based materials with metakaolin. The specimens collected from the literature are standard 150 mm cubes. In this database, the cement grade, the water to binder ratio, the binder to sand ratio, the metakaolin to binder ratio, and the superplasticizer were the input parameters, while the compressive strength of cement-based materials with metakaolin was the output parameter. The influence of these five parameters on the compressive strength of cement-based materials with metakaolin was confirmed in previous studies, so they were selected as the input variables in the present study. Because compressive strength is regarded as one of the most important parameters for evaluating the performance of cement-based materials, it was selected as the output variable. During data collection, the input variables were strictly screened: only samples with values for all five input variables (i.e., none of them is null) were selected. The database contains 361 samples, which are randomly divided into a training set and a test set (as shown in Appendix A, Table A1). The training set contains about 80% of the data, while the test set contains about 20% of the data.
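The random 80%/20% division described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_train_test`, the seed, and the use of placeholder indices instead of the actual records are hypothetical, and the text does not say which tool or random seed the author used.

```python
import random

def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    rng = random.Random(seed)          # fixed seed (an assumption) for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# 361 placeholder indices stand in for the collected records.
train, test = split_train_test(range(361))
print(len(train), len(test))
```

With 361 samples, rounding 80% gives 289 training samples and 72 test samples, which matches the "about 80%" and "about 20%" stated in the text.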

RF and FA Hybrid Machine Learning Method
The RF model has great advantages over other machine learning models, such as better performance, fast computing speed, strong anti-interference ability, and strong fitting ability [81,82]. However, the RF model is similar to a black box: researchers cannot control its internal operation, so they can only try different parameters and random seeds, which reduces the efficiency of model operation to some extent [34,42]. To solve this problem, the optimal hyperparameters need to be determined before the RF model runs. Finding the optimal hyperparameters is a difficult task in machine learning, and the performance of a model is directly related to them: the better the hyperparameters are tuned, the better the model performs. In this study, the author used FA to tune the hyperparameters of the RF model. In other words, an RF and FA hybrid machine learning model was proposed to predict the compressive strength of cement-based materials. In this hybrid model, FA is used to adjust the hyperparameters of the RF model, and the RF model is used to determine the complex nonlinear relationship between the compressive strength of metakaolin cement-based materials and the cement grade, the water to binder ratio, the binder to sand ratio, the metakaolin to binder ratio, and the superplasticizer.

Random Forest (RF) Model
RF is an ensemble learning method that takes a decision tree as the basic unit and completes learning by integrating multiple decision trees. Intuitively speaking, when used for classification, RF treats each decision tree as a classifier: for an input sample, n trees give n classification results, and RF collects all the votes and outputs the category with the largest number of votes. For a regression task such as the present one, the outputs of the trees are averaged instead. The randomness of RF is reflected in two kinds of random sampling, which make the decision trees in the forest similar to, yet different from, each other: the random selection of data and the random selection of candidate features.
In the machine learning process, a bootstrap sample $S_n^{\Theta}$ of the compressive strength data was drawn randomly from the training dataset $S_n$. Hence, the probability of drawing each sample should be $1/n$. Afterward, $q$ regression trees are built from the $q$ bootstrap samples $(S_n^{\Theta_1}, S_n^{\Theta_2}, \ldots, S_n^{\Theta_q})$, and the $q$ output parameters

$$\hat{Y}_1 = \hat{h}\left(X, S_n^{\Theta_1}\right),\quad \hat{Y}_2 = \hat{h}\left(X, S_n^{\Theta_2}\right),\quad \ldots,\quad \hat{Y}_q = \hat{h}\left(X, S_n^{\Theta_q}\right) \qquad (1)$$

are determined from the $q$ regression trees. Finally, the $q$ output parameters should be averaged to determine the desired variable. The detailed process can be described as follows.
(i) Random sampling of data. A sub-dataset with the same number of samples as the original dataset is first drawn from the original dataset with replacement, so elements can be repeated both within one sub-dataset and across different sub-datasets. A sub-decision tree is then constructed from each sub-dataset. Finally, the data to be tested are fed into every sub-decision tree, each tree outputs a result, and the output of the random forest is obtained by combining the results of the sub-decision trees.
(ii) Random selection of candidate features. Each split of a tree in the random forest uses only part of the candidate features. These are randomly selected from all candidate features, and the optimal feature is then chosen from this random subset. Random selection of candidate features increases the diversity of the trees and thus improves the performance of the forest.
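The two sources of randomness and the final averaging step can be sketched in a few lines. This is a minimal illustration and not the author's implementation: each tree is reduced to a single split (a regression stump) to keep the code short, and the names `fit_stump`, `fit_forest`, `predict_forest`, the parameter `n_sub_features`, and the toy data are all hypothetical.

```python
import random
from statistics import mean

def fit_stump(rows, targets, feature_ids):
    """Fit a one-split regression tree using only the features in `feature_ids`."""
    best = None
    for j in feature_ids:
        for threshold in sorted({r[j] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[j] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[j] > threshold]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    if best is None:                       # no valid split: predict the sample mean
        m = mean(targets)
        return lambda x: m
    _, j, threshold, lm, rm = best
    return lambda x: lm if x[j] <= threshold else rm

def fit_forest(rows, targets, q=25, n_sub_features=2, seed=0):
    """Build q stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    trees = []
    for _ in range(q):
        idx = [rng.randrange(n) for _ in range(n)]      # (i) sampling with replacement
        feats = rng.sample(range(p), n_sub_features)    # (ii) random candidate features
        trees.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feats))
    return trees

def predict_forest(trees, x):
    """Average the q tree outputs, as in the regression setting."""
    return mean(tree(x) for tree in trees)

# Toy data (made up for illustration, not from the database).
rows = [(i % 5, i % 3, i / 10) for i in range(30)]
targets = [2.0 * r[0] + r[2] for r in rows]
trees = fit_forest(rows, targets)
print(predict_forest(trees, rows[0]))
```

Because a stump has only one split, choosing the feature subset once per tree is the same as choosing it once per split here; a full random forest repeats that choice at every split of deeper trees.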

Firefly Algorithm (FA)
FA works by treating each point in space as a firefly and completing the optimization process by taking advantage of the characteristic that fireflies with strong luminescence attract fireflies with weak luminescence [7,83]. The weak firefly moves toward the strong firefly to complete a position iteration, and the search ends when the optimal position is found. FA needs to meet the following conditions:
(i) The attraction between fireflies is only related to luminous intensity and location;
(ii) The strong fireflies move randomly and attract the weak fireflies around them, and the attraction is inversely proportional to the distance between fireflies;
(iii) Luminescence intensity is determined by the objective function and is proportional to the specified function in the specified region.
The search process is related to the luminance and mutual attraction of fireflies, and these two parameters are inversely proportional to the distance. The brighter the firefly is, the better its position is, and the brightest firefly represents the optimal solution of the function. The brighter the firefly is, the more strongly it attracts the surrounding fireflies, and if the fireflies glow at the same intensity, they move randomly.
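The search procedure can be sketched as below. This follows a commonly used formulation of the firefly algorithm, in which attractiveness decays as `beta0 * exp(-gamma * r^2)` with distance `r`; that decay law, the parameter names (`beta0`, `gamma`, `alpha`), their default values, and the toy objective are assumptions for illustration and are not taken from the text. In the hybrid model, the objective would be the RMSE of the RF model for a given set of hyperparameters.

```python
import math
import random

def firefly_minimize(objective, bounds, n_fireflies=15, n_iter=50,
                     beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Minimize `objective` inside `bounds`; a lower cost means a brighter firefly."""
    rng = random.Random(seed)
    swarm = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_fireflies)]
    cost = [objective(x) for x in swarm]
    best_cost, best_x = min(zip(cost, swarm))
    best_x = list(best_x)
    history = []                                   # best cost after each iteration
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:              # firefly j is brighter than i
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)   # attraction fades with distance
                    for d, (lo, hi) in enumerate(bounds):
                        step = beta * (swarm[j][d] - swarm[i][d]) + alpha * (rng.random() - 0.5)
                        swarm[i][d] = min(hi, max(lo, swarm[i][d] + step))
                    cost[i] = objective(swarm[i])
                    if cost[i] < best_cost:
                        best_cost, best_x = cost[i], list(swarm[i])
        alpha *= 0.97                              # gradually reduce the random walk
        history.append(best_cost)
    return best_x, best_cost, history

def sphere(x):
    """Toy objective with its minimum at (3, -1)."""
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

search_bounds = [(-5.0, 5.0), (-5.0, 5.0)]
best_x, best_cost, history = firefly_minimize(sphere, search_bounds)
print(best_x, best_cost)
```

The `history` list never increases, because the best position found so far is stored separately from the moving swarm.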

Correlation Analysis
Correlation analysis refers to the analysis of two or more correlated variables to measure how closely they are related. A high correlation between input parameters, that is, a correlation coefficient with a large negative or large positive value, may lower the efficiency of the model or make it difficult to explain the influence of the input parameters on the output parameter. Therefore, before training the RF model, the correlation between cement grade, metakaolin to binder ratio, water to binder ratio, superplasticizer, and binder to sand ratio should be analyzed. In this study, the author used Statistical Product and Service Solutions (SPSS) to analyze the correlation between input parameters, and the analysis results are shown in Figure 1. Figure 1 shows that the correlation coefficient between each input parameter and itself is 1. The correlations between water to binder ratio and superplasticizer, water to binder ratio and binder to sand ratio, and cement grade and binder to sand ratio were the highest, with a correlation coefficient of 0.5, while the correlations between metakaolin to binder ratio and cement grade, metakaolin to binder ratio and superplasticizer, and metakaolin to binder ratio and binder to sand ratio were the lowest, with a correlation coefficient of 0.2. In summary, the correlation coefficients between the five input parameters were all lower than 0.6, indicating that these parameters are not strongly correlated with each other. Therefore, no multicollinearity problem arises when the cement grade, metakaolin to binder ratio, water to binder ratio, superplasticizer, and binder to sand ratio are employed as the input parameters to predict the compressive strength of cement-based materials with metakaolin.
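The same kind of check can be reproduced outside SPSS with a few lines of code. The sketch below assumes the Pearson correlation coefficient, which the text does not name explicitly; the function names and the toy columns are hypothetical and the values are made up for illustration, not taken from the database.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def highly_correlated_pairs(matrix, threshold=0.6):
    """List the pairs whose absolute correlation reaches the threshold."""
    names = list(matrix)
    return [(a, b, matrix[a][b])
            for i, a in enumerate(names) for b in names[i + 1:]
            if abs(matrix[a][b]) >= threshold]

# Made-up values for three of the five inputs.
toy = {
    "water_to_binder": [0.30, 0.35, 0.40, 0.45, 0.50, 0.55],
    "superplasticizer": [1.2, 0.8, 1.0, 0.5, 0.7, 0.2],
    "cement_grade": [42.5, 52.5, 42.5, 32.5, 52.5, 42.5],
}
matrix = correlation_matrix(toy)
print(highly_correlated_pairs(matrix, threshold=0.6))
```

An empty list from `highly_correlated_pairs` would correspond to the situation reported in the text, where no pair of inputs reaches the 0.6 threshold.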

Figure 1. Correlation coefficients matrix diagram.

Hyperparameter Tuning
Hyperparameters are parameter values set before the machine learning process, not parameters found through training. Hyperparameter tuning refers to the optimization of hyperparameters and the selection of a group of optimal hyperparameters for machine learning. It therefore plays an important role in improving the performance and efficiency of machine learning. In this study, FA is used to tune the hyperparameters of the RF model. In order to select the optimal hyperparameters, 50 iterations were carried out, and the relationship between the RMSE value and the number of iterations is shown in Figure 2. It can be seen clearly from Figure 2 that the RMSE value decreases sharply at first and then tends to be stable as the number of iterations increases, proving that FA can effectively adjust the hyperparameters of the RF model. The minimum RMSE value was obtained before the 10th iteration, after which the RMSE value remained stable. In addition, 10-fold cross-validation was used to obtain the optimal hyperparameters.
Ten-fold cross-validation means that the training set is divided into 10 groups, each group is selected as the test set in turn while the remaining 9 groups are used as the training set, and the optimal hyperparameters are selected from the results of the 10 runs. Before using machine learning models to predict the compressive strength of metakaolin cement-based materials, using 10-fold cross-validation for hyperparameter tuning can effectively avoid over-learning or under-learning and improve the reliability of the final prediction results of the model. The RMSE values of the RF model for the different folds are shown in Figure 3. Figure 3 shows that the minimum RMSE value of the RF model is obtained at the 7th fold; that is, selecting the hyperparameters obtained at this fold as the optimal hyperparameters of the RF model can make the prediction results of the final model more persuasive.
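The fold rotation described above can be sketched as follows. This is a minimal illustration: `fit` stands for any training routine that returns a prediction function, the mean predictor `fit_mean` is only a stand-in for the RF model, and the toy data and all function names are hypothetical.

```python
import random
from math import sqrt

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]           # k groups of nearly equal size
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def cross_validated_rmse(fit, rows, targets, k=10, seed=0):
    """Return the RMSE on the held-out group for each of the k rotations."""
    scores = []
    for train, val in k_fold_indices(len(rows), k, seed):
        model = fit([rows[i] for i in train], [targets[i] for i in train])
        sq_err = [(model(rows[i]) - targets[i]) ** 2 for i in val]
        scores.append(sqrt(sum(sq_err) / len(sq_err)))
    return scores

def fit_mean(rows, targets):
    """Stand-in model (an assumption): always predicts the training mean."""
    m = sum(targets) / len(targets)
    return lambda x: m

# Toy training set of 289 samples, the size of an 80% share of 361.
rows = [(i % 7, i % 4) for i in range(289)]
targets = [30.0 + 2.0 * a - b for a, b in rows]
fold_rmse = cross_validated_rmse(fit_mean, rows, targets)
print(min(range(10), key=lambda i: fold_rmse[i]) + 1)   # fold with the lowest RMSE
```

In a tuning loop, the mean of `fold_rmse` for a candidate set of hyperparameters is one reasonable choice of objective for the optimizer to minimize.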

Model Evaluation
After the RF and FA hybrid machine learning model is established to predict the compressive strength of metakaolin cement-based materials, it is very important to evaluate the model. The evaluation results determine whether the model has practical value, that is, whether it can accurately predict the compressive strength of metakaolin cement-based materials. This study evaluated the accuracy of the model by comparing the predicted and actual values of the training set and the test set. The prediction results for the training dataset and the test dataset are shown in Figure 4. Figure 4A shows the predicted results of the training set, and Figure 4B shows the predicted results of the test set. As can be seen from Figure 4A,B, the predicted values agree well with the measured values: the R values of the training and test sets are 0.8392 and 0.8347, and the RMSE values are 11.143 and 11.6643, respectively. The 361 samples come from different studies, so the raw materials differ greatly in morphology, chemical composition, and other factors. Even so, the predicted values are in good agreement with the measured values; thus, the RF-FA hybrid machine learning model proposed in this study can accurately and effectively predict the compressive strength of metakaolin cement-based materials.
The comparison between the predicted and measured compressive strength values of the training set and the test set is shown in Figure 4C,D. The horizontal line in the figure represents the difference between the predicted value of compressive strength and the actual value. As shown in Figure 4C,D, the predicted values of compressive strength in the training set and the test set are highly consistent with the actual values, with only a few error points. This proves once again that the hybrid machine learning model performs well in predicting the compressive strength of metakaolin cement-based materials.
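For reference, the two indicators quoted in this section are commonly defined as follows, where $y_i$ is the measured compressive strength of sample $i$, $\hat{y}_i$ is the predicted value, $\bar{y}$ and $\bar{\hat{y}}$ are their means, and $n$ is the number of samples. These are the standard definitions and are assumed here, since the formulas themselves are not given in the text.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}},
\qquad
R = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}
         {\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}\,
          \sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}}}
```

A lower RMSE and an R closer to 1 both indicate better agreement between predicted and measured values; RMSE carries the unit of the compressive strength, while R is dimensionless.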

Variable Importance Evaluation
The above analysis shows that the RF and FA hybrid machine learning method provides an efficient and simple way to predict the compressive strength of metakaolin cement-based materials. It is also of great practical significance to determine the importance of cement grade, metakaolin to binder ratio, water to binder ratio, superplasticizer, and binder to sand ratio to the compressive strength of metakaolin cement-based materials. In this study, the machine learning method was used to determine the importance of these five input parameters, and the results are shown in Figure 5. As can be seen from Figure 5, the influence scores of cement grade, metakaolin to binder ratio, water to binder ratio, superplasticizer, and binder to sand ratio on the compressive strength of metakaolin cement-based materials are 1.4400, 1.4155, 0.9970, 0.6422, and 0.5981, respectively; the degree of influence decreases one by one. The influence scores of the five input parameters are all positive; thus, the compressive strength of metakaolin cement-based materials increases with the increase in any one of the five parameters and decreases with the decrease in any one of them. The most important factor affecting the compressive strength of metakaolin cement-based materials is cement grade, followed by water to binder ratio, while the superplasticizer has the least influence. The analysis of the importance of the cement grade, the water to binder ratio, the binder to sand ratio, the metakaolin to binder ratio, and the superplasticizer can provide a reference for engineers designing metakaolin cement-based materials with high compressive strength.
To obtain a higher compressive strength of cement-based materials with metakaolin, engineers can pay more attention to the cement grade and the water to binder ratio when designing the mixture ratio. The superplasticizer content needs less attention, given its small influence on the compressive strength.
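The text does not state how the influence scores in Figure 5 were computed. One common, model-agnostic way to rank input variables is permutation importance, sketched below as an illustration only; it is not claimed to be the method used in this study. The function names, the stand-in model, and the toy data are hypothetical.

```python
import random
from math import sqrt

def rmse(model, rows, targets):
    """Root mean square error of `model` on the given samples."""
    return sqrt(sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows))

def permutation_importance(model, rows, targets, n_repeats=10, seed=0):
    """Mean increase in RMSE when one input column is shuffled (rows must be tuples)."""
    rng = random.Random(seed)
    base = rmse(model, rows, targets)
    scores = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            col = [r[j] for r in rows]
            rng.shuffle(col)                       # break the link between column j and y
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, col)]
            increases.append(rmse(model, permuted, targets) - base)
        scores.append(sum(increases) / n_repeats)
    return scores

# Toy check: the stand-in model depends on the first input only.
rows = [(i % 5, (i * 7) % 11) for i in range(40)]
targets = [3.0 * r[0] for r in rows]
model = lambda x: 3.0 * x[0]
scores = permutation_importance(model, rows, targets)
print(scores)
```

In this toy case the second score is exactly zero, because shuffling an input that the model ignores cannot change its predictions. Note that such a score measures how much the model relies on an input, not the direction of that input's effect on the output.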

Conclusions
In order to overcome the shortcomings of traditional machine learning models in predicting the performance of cement-based materials, such as uncertainty, high time consumption, and low efficiency, and to improve the accuracy of model prediction, this paper proposes an RF and FA hybrid machine learning model to predict the compressive strength of metakaolin cement-based materials. The accuracy of the hybrid machine learning model is verified by comparing the predicted and actual values of the training set and the test set. The following conclusions are drawn:
• Through correlation analysis, it is found that the correlation coefficients between cement grade, water to binder ratio, binder to sand ratio, metakaolin to binder ratio, and superplasticizer are all less than 0.6, so these five parameters are not strongly correlated with each other. Therefore, using these five parameters as input parameters to predict the compressive strength of metakaolin cement-based materials does not cause multicollinearity;
• The results of 50 iterations show that RMSE decreases sharply as the number of iterations increases and then becomes basically stable. Therefore, using FA to adjust the hyperparameters of the RF model can achieve the desired results;
• The RF and FA hybrid machine learning algorithm was used to predict the compressive strength of metakaolin cement-based materials, and the predicted values were highly consistent with the measured values in both the training set and the test set (the RMSE values of the training and testing datasets are 11.143 and 11.6643, respectively; the R values of the training and testing datasets are 0.8392 and 0.8347, respectively), indicating that the hybrid model can accurately predict the compressive strength of metakaolin cement-based materials;
• Among the five input variables (cement grade, water to binder ratio, binder to sand ratio, metakaolin to binder ratio, and superplasticizer), cement grade has the greatest influence on the compressive strength of metakaolin cement-based materials, followed by the water to binder ratio. The superplasticizer has the least effect. Therefore, cement grade and water to binder ratio should be the main considerations in the mix design of metakaolin cement-based materials.
For future development, a comparative study should be carried out based on different algorithms from the perspectives of computing efficiency, reliability, and accuracy. Moreover, more possible data on cement-based materials with metakaolin should be collected to increase the reliability of the prediction model.


Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare that they have no conflict of interest in this work.