Concrete Strength Prediction Using Machine Learning Methods CatBoost, k-Nearest Neighbors, Support Vector Regression

Abstract: Currently, one of the topical areas of application of machine learning methods in the construction industry is the prediction of the mechanical properties of building materials. In the future, algorithms with elements of artificial intelligence will form the basis of systems for predicting the operational properties of products, structures, buildings and facilities depending on the characteristics of the initial components and process parameters. Concrete production can be improved using artificial intelligence methods, in particular through the development, training and application of special algorithms to determine the characteristics of the resulting concrete. The aim of the study was to develop and compare three machine learning algorithms, based on CatBoost gradient boosting, k-nearest neighbors and support vector regression, to predict the compressive strength of concrete using our accumulated empirical database, and ultimately to improve production processes in the construction industry. It has been established that artificial intelligence methods can be applied to determine the compressive strength of self-compacting concrete. Of the three machine learning algorithms, the smallest errors and the highest coefficient of determination were observed for the KNN algorithm: MAE was 1.97; MSE, 6.85; RMSE, 2.62; MAPE, 6.15; and the coefficient of determination R², 0.99. The developed models showed a mean absolute percentage error in the range 6.15-7.89% and can be successfully implemented in the production process and quality control of building materials, since they do not require serious computing resources.


Introduction
The construction industry is currently one of the main engines of the economy. The requirements and levels of responsibility for buildings and structures are increasing; new cities and districts are growing, and densely populated regions continue to develop their urbanized territories. In this regard, the processes of production of building materials, products and structures should be singled out separately. The production of building materials lies at the junction between the manufacturing and construction industries. Concrete is a case in point: a concrete mixture belongs simultaneously to the construction industry, that is, to the factory sector, and to construction technology, for example, in monolithic concreting. Because concrete is the main building material throughout the world, yet at the same time one of the most complex artificial composites created by man, the prediction of its properties is not always fully possible. A huge number of factors and criteria affect the final quality of concrete and, ultimately, the safety of the products, structures, buildings and facilities created from it. Thus, one of the main tasks of process engineers and materials scientists is the search for the most effective mix-design and technological methods aimed at controlling the structure and regulating the properties of concrete and products based on it. It is therefore evident that modern production and construction still involve a high degree of manual labor and a strong human factor. Errors in technologists' calculations, mistakes in the mix design and probable violations of technology often lead to disasters in construction, accidents during the erection of buildings and structures, and premature collapse of load-bearing structures. In addition, enclosing structures made of various types of concrete also suffer significantly. Thus, the problem of the human factor is a relevant one [1][2][3][4][5][6][7].
Currently, the construction industry is on the verge of digitalization, which is overturning traditional ideas about the construction process and opening up many opportunities. The construction industry lags behind other sectors in the implementation of modern information technologies due to its size and heterogeneity, and it will take many more years for it to reach the level of automation already achieved today in, for example, mechanical engineering. However, the movement of the industry toward modern information technologies is inevitable. Companies that do not consider using big data, data analysis and artificial intelligence methods in their work after one crisis risk leaving the market during the next. The prospects for improving the quality of manufactured products and services, and for forming a positive image of modern companies, lie in using artificial intelligence methods for digitalization, for systematizing accumulated and incoming information, and for forecasting cost, time and technological parameters in construction. Artificial intelligence solutions, already successfully used in other industries, are gradually being introduced into the construction process at all stages, including quality control in the production of building materials [8][9][10][11][12][13][14].
Table 1 provides an overview of the application of different machine learning methods to predict various characteristics of concrete and of concrete products and structures. In the production of building materials, researchers generate a large amount of data containing important information about the mechanical properties of the resulting material. Data such as the volumetric content of the various components, together with descriptions of the process and the results of experiments, often have an unstructured and complex form (natural-language texts, tables, graphs) [45]. The introduction of artificial intelligence methods, in particular machine learning, for the analysis of the accumulated data arrays will improve the quality of construction technology and optimize costs by reducing time expenditures [46][47][48][49][50][51][52][53][54]. In this regard, the purpose of our study is the development and comparison of three machine learning algorithms, based on the CatBoost gradient boosting, k-nearest neighbors and support vector regression methods, for predicting the compressive strength of concrete using our accumulated empirical database, and ultimately the improvement of production processes in the construction industry. The objectives of the study were:
- A deep analysis of existing machine learning methods in concrete technology, an analysis and evaluation of the experience of their application, and the identification of scientific and practical gaps from the information received.
- The coupling of empirical results obtained in the course of real physical experiments with the training, on their basis, of special tools that allow the properties of concretes and structures to be controlled and their performance to be predicted using machine learning methods.
- After processing the data of the physical experiment, the development of algorithms based on three machine learning methods (CatBoost gradient boosting, the k-nearest neighbors method and the support vector regression method) for processing the empirical base, with further comparison of the results based on the values of the main metrics.
- An assessment of the prospects for applying the developed methods in practice and of the possibility of translating and projecting the results obtained onto various types of concrete, and the development of specific proposals for construction industry enterprises.
The proposals developed must be tested and substantiated by verifying them against real data. Thus, the scientific novelty of our study lies in new relationships between real physical experimental data, empirical relationships and the values based on them, together with an assessment of the applicability of machine learning methods in predicting the properties of similar concretes for given initial parameters comparable to the main and control ones. The practical significance of the study is the methodology developed for predicting the strength of concrete using machine learning methods, the determination of the rational parameters of such a methodology, and the identification of factors and criteria that affect the effectiveness of the proposed solutions.

CatBoost Algorithm
In gradient boosting, predictions are made by an ensemble of weak learners, with decision trees built sequentially. The previous trees in the model are not changed, and the results of the previous step are used to improve the next one. In gradient boosting, decision trees are trained iteratively in order to minimize the loss function, as shown in Figure 1. In this study, the CatBoost method is used, which is a gradient boosting library created by Yandex. When building decision trees in this method, the same features are used to create the left and right splits at each level of the tree, as shown in Figure 2. Unlike some other machine learning algorithms, CatBoost works well with small datasets; however, in such cases one should be aware of overfitting. To avoid overfitting, the model parameters should be tuned.
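As a minimal sketch of this sequential-tree idea, the snippet below fits a gradient boosting regressor on synthetic mix data. It uses scikit-learn's GradientBoostingRegressor as a stand-in (CatBoost may not be installed everywhere; the CatBoostRegressor call shape is analogous, e.g. CatBoostRegressor(iterations=..., depth=..., learning_rate=...)). The data and coefficients are invented for illustration only.

```python
# Gradient boosting sketch: trees are trained sequentially, each correcting the
# residual error of the ensemble built so far. GradientBoostingRegressor is a
# stand-in for CatBoost; the synthetic data is purely illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 6))  # 6 mix features, as in the dataset
y = 30 + 20 * X[:, 0] - 10 * X[:, 2] + rng.normal(0, 1, 200)  # synthetic strength

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,   # number of sequentially built trees
    max_depth=8,        # tree depth (the depth found best in this study)
    learning_rate=0.1,  # step size of each boosting iteration
    random_state=0,
)
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)  # coefficient of determination on held-out data
```

On data this simple the ensemble recovers the linear signal almost exactly; on real concrete data the small-dataset overfitting caveat above makes parameter tuning essential.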

k-Nearest Neighbors Method
The k-nearest neighbors method is a supervised machine learning algorithm used to solve a regression problem that performs well with a small amount of data.
In practice, the KNN method is more often used in classification problems, but currently the regression version of the k-nearest neighbors algorithm is also common.It is a good basic algorithm to try first before considering more advanced methods.
The algorithm finds the distances between the query and all examples in the data by choosing a certain number of examples (k) closest to the query, then averages the labels in the case of a regression problem.
The k-nearest neighbors algorithm is as follows:

1. Input:
Training examples {x_i, y_i}, where x_i are the attribute values of the training examples and y_i are the actual values of the output characteristic;
A test point x for which we are making a prediction.

2. Forecasting:
Calculate the distance D(x, x_i) to each training example x_i;
Select the k nearest instances and their labels y_i1, . . ., y_ik;
Determine the mean value ȳ of y_i1, . . ., y_ik by Formula (1):

ȳ = (1/k)(y_i1 + . . . + y_ik), (1)

where k is the number of nearest instances and y_ij are the actual values of the output parameter.
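The steps above can be sketched from scratch in a few lines of NumPy; the toy one-dimensional data below is invented purely to illustrate the distance/selection/averaging sequence.

```python
# From-scratch KNN regression mirroring the algorithm above: compute distances
# to all training points, take the k nearest, and average their labels.
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)  # D(x, x_i) for every i
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest
    return y_train[nearest].mean()                     # Formula (1): mean label

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
# Neighbors of 1.2 are 1.0, 2.0 and 0.0, so the prediction is (1 + 2 + 0) / 3.
pred = knn_predict(X_train, y_train, np.array([1.2]), k=3)  # → 1.0
```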

Support Vector Regression (SVR)
Support vector regression (SVR) was proposed on the basis of the support vector machine (SVM) for the standard classification problem.
The SVR algorithm is, on the whole, very similar in its implementation to SVM, but with several distinctive features. SVR has an additional adjustable parameter ε (epsilon). The epsilon value determines the width of the "tube" around the estimated function (hyperplane). Points falling inside this tube are considered correct predictions and are not penalized by the algorithm. The support vectors are the points that lie outside the tube, not only those on its edge, as in classification problems. The value of the additional slack variable (ξ) measures the distance to points outside the tube, which can be controlled by adjusting the regularization parameter C.
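The effect of the ε-tube can be seen directly with scikit-learn's SVR: widening epsilon lets more points fall inside the tube, so fewer of them become support vectors. The sine data below is an illustrative assumption, not the paper's dataset.

```python
# Epsilon-tube behaviour: points inside the tube incur no penalty and are not
# support vectors, so a wider tube leaves fewer support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)  # noisy 1-D target

narrow = SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(X, y)
wide = SVR(kernel="rbf", C=1.0, epsilon=0.5).fit(X, y)

n_narrow = len(narrow.support_)  # nearly every point escapes a very thin tube
n_wide = len(wide.support_)      # far fewer points lie outside the wide tube
```

Raising C instead of ε would tighten the fit to the escaping points rather than ignore them, which is why the two parameters are tuned jointly later in the study.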
The analyzed dataset is presented in the Supplementary Materials. The features of the machine learning models are the contents of cement (kg/m³), slag (kg/m³), water (L), sand (kg/m³), crushed stone (kg/m³) and additives (kg). The predicted parameter is the compressive strength (MPa).
Figure 3 shows the correlation between the variables. It can be seen that the linear correlation between the individual input variables and the output variable is strong (>0.5). There are also negative correlations, in which an increase in one variable is associated with a decrease in another. The statistical characteristics of the dataset are shown in Table 2.

Performance Evaluation Methods
When analyzing regression models, it is important to use several evaluation metrics to assess their performance. This study uses five metrics: mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE) and the coefficient of determination R². These metrics are defined as follows:

MAE = (1/n) Σ |y_i − ŷ_i|
MSE = (1/n) Σ (y_i − ŷ_i)²
RMSE = √[(1/n) Σ (y_i − ŷ_i)²]
MAPE = (100%/n) Σ |y_i − ŷ_i| / |y_i|
R² = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)²

where y_i is the actual measured compressive strength, ŷ_i is the predicted value of the compressive strength, ȳ is the mean of the actual values and n is the number of observations.
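All five metrics can be computed in a few lines of NumPy; the actual/predicted vectors below are invented toy values, not results from the study.

```python
# The five evaluation metrics, computed directly from their definitions for a
# toy actual/predicted pair (illustrative values only).
import numpy as np

y = np.array([40.0, 50.0, 60.0])      # actual compressive strengths, MPa
y_hat = np.array([42.0, 49.0, 57.0])  # predicted values

err = y - y_hat
mae = np.mean(np.abs(err))                               # mean absolute error
mse = np.mean(err ** 2)                                  # mean square error
rmse = np.sqrt(mse)                                      # root mean square error
mape = 100 * np.mean(np.abs(err) / np.abs(y))            # percentage error
r2 = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)  # determination coeff.
```

scikit-learn provides ready-made equivalents (mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score) that match these formulas.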

Model Building and Training
In this study, the algorithms based on machine learning methods were developed on the Jupyter Notebook interactive computing web platform in the high-level Python programming language.
The search for the optimal values of the main model parameters is one of the key points in achieving the best generalization ability. In this study, the grid search method was used in combination with five-fold cross-validation, which allows all combinations of the parameters of interest to be analyzed for each of the implemented models.
The general workflow of the model when using cross-validation and a parameter grid is shown in Figure 4. For the algorithm based on the CatBoost method, the learning rate and tree depth were selected as the adjustable parameters.
The learning rate is a parameter that controls the amount of weight correction at each iteration. In practice, the learning rate is usually selected experimentally; its tuning makes it possible to achieve the highest possible model quality.
The second adjustable parameter is the depth of the tree. In most cases, the optimal value is between 4 and 10, so this range of values is used in the parameter grid. All possible combinations form a grid of model parameter settings, as shown in Table 3. As a result of five-fold cross-validation over all combinations of learning rate and tree depth, 60 models need to be trained (3 × 4 × 5).
Figure 5 shows a heatmap of the cross-validation average R² as a function of the two parameters: tree depth and learning rate. Each heatmap cell corresponds to the R² value for a specific combination of parameters, with light tones corresponding to high values and dark tones to low values. It can be seen from the graph that the implemented CatBoost algorithm is sensitive to the parameter settings, so their optimization is necessary to obtain good generalization ability. Across the various combinations of learning rate and tree depth, R² rises from 87% (learning rate 0.5, tree depth 4) to 98% (learning rate 0.1, tree depth 8).
As a result of the grid search and cross-validation, the best model parameters were determined: a tree depth of 8 and a learning rate of 0.1.
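A grid search with five-fold cross-validation of this kind can be sketched with scikit-learn's GridSearchCV. A GradientBoostingRegressor again stands in for CatBoost, the specific learning-rate values in the grid are assumptions for illustration, and the data is synthetic; the grid mirrors the paper's 3 learning rates × 4 depths = 12 combinations (× 5 folds = 60 fits).

```python
# Grid search over learning rate and tree depth with five-fold CV.
# GradientBoostingRegressor is a stand-in for CatBoost; grid values and data
# are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 6))
y = 30 + 20 * X[:, 0] - 10 * X[:, 2] + rng.normal(0, 1, 150)

param_grid = {
    "learning_rate": [0.01, 0.1, 0.5],  # 3 assumed learning-rate values
    "max_depth": [4, 6, 8, 10],         # depth range recommended in the text
}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, random_state=0),
    param_grid,
    cv=5,           # five-fold cross-validation
    scoring="r2",   # the metric mapped in the Figure 5 heatmap
)
search.fit(X, y)
best = search.best_params_  # the depth/learning-rate pair with the top mean R²
n_combinations = len(search.cv_results_["params"])  # 3 × 4 = 12
```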

Model Building for k-Nearest Neighbors Algorithm
For the k-nearest neighbors algorithm, the following adjustable parameters were selected: the number of neighbors, the leaf size and the weight function (Table 4). As a result of the five-fold cross-validation over all combinations of the variable parameter values, the performance of 60 models needs to be checked (6 × 5 × 2).
An important component of the k-nearest neighbors method is normalization. Different attributes typically have different ranges of values in the sample, so distance values can depend strongly on attributes with larger ranges. Therefore, the data were normalized (z-normalization).
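Z-normalization combined with the KNN grid search can be sketched as a scikit-learn Pipeline, so that the scaling statistics are learned only on the training folds during cross-validation. The data and the neighbor-count values in the grid are illustrative assumptions (the leaf-size dimension of the paper's 6 × 5 × 2 grid is omitted for brevity).

```python
# Z-normalization before KNN: StandardScaler inside a Pipeline, tuned jointly
# with the number of neighbors and the weight function by five-fold CV.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 6))
X[:, 0] *= 500  # one feature with a much larger range (e.g. cement, kg/m³)
y = 0.1 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 0.5, 120)

pipe = Pipeline([
    ("scale", StandardScaler()),      # z-normalization of every feature
    ("knn", KNeighborsRegressor()),
])
grid = {
    "knn__n_neighbors": [2, 3, 4, 5, 6, 7],   # 6 assumed neighbor counts
    "knn__weights": ["uniform", "distance"],  # the 2 weight functions
}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2").fit(X, y)
best_k = search.best_params_["knn__n_neighbors"]
```

Without the scaler, distances would be dominated by the 500-fold-wider first feature, which is exactly the problem z-normalization removes.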

Model Building for SVR Algorithm
For the support vector machine, the following adjustable parameters were selected (Table 5):
- Kernel type: this parameter determines the type of hyperplane used to separate the data; "linear" applies a linear hyperplane, and nonlinear hyperplanes can also be used.
- Regularization parameter C: the strength of the regularization is inversely proportional to C.
- Epsilon (ε): the acceptable margin of error ε allows deviations within some threshold value.
As a result of the five-fold cross-validation over all combinations of the variable parameter values, the performance of 140 models needs to be checked (4 × 5 × 7).
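The SVR parameter grid described above (kernel, C, ε) can be sketched in the same GridSearchCV pattern. The specific grid values and data below are assumptions for illustration; the paper's actual grid yields 4 × 5 × 7 = 140 combinations, while this smaller one yields 2 × 3 × 3 = 18.

```python
# Grid search over the three SVR parameters with five-fold CV. Grid values and
# synthetic data are illustrative assumptions, not the paper's exact settings.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 6))
y = 30 + 20 * X[:, 0] - 10 * X[:, 2] + rng.normal(0, 1, 100)

grid = {
    "kernel": ["linear", "rbf"],  # linear vs nonlinear hyperplane
    "C": [0.1, 1.0, 10.0],        # regularization strength (inverse)
    "epsilon": [0.1, 0.5, 1.0],   # width of the penalty-free tube
}
search = GridSearchCV(SVR(), grid, cv=5, scoring="r2").fit(X, y)
best = search.best_params_
```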

Model Training

Model Training CatBoost
Table 6 shows the parameters of the final CatBoost model: the number of iterations, corresponding to the number of decision trees, is 500; the tree depth and learning rate are as determined in Section 4.1.1; RMSE is used as the loss function; the greedy search algorithm provides sequential deepening of the tree; and training is stopped when the error does not decrease within 30 iterations. Interpretation of the gradient boosting algorithm is facilitated by the ability to represent the decision rules as a visual tree structure. Figure 7 shows part of one of the decision trees. As can be seen from the figure, the same features are used to create the left and right splits at each level of the tree. Owing to the structure of its decision trees, gradient boosting is able to cope with nonlinearities.

Model Training k-Nearest Neighbors
The choice of the number of neighbors k is important for obtaining correct model results. If the value of the parameter is small, an overfitting effect occurs: the decision on the output characteristic is made on the basis of a small number of examples and has low significance, and small values of k also increase the influence of noise on the results. Conversely, if the value of the parameter is too high, objects that poorly reflect the local features of the dataset take part in solving the regression problem. Thus, the choice of the parameter k significantly affects the generalizing ability of the model.
The leaf size parameter is also significant for the model, as it affects the speed of its operation and the amount of memory used by the algorithm.
Under some circumstances, it may be beneficial to weight points so that nearby points contribute more to the regression than distant points.The "uniform" weight function setting assigns equal weights to all points, while "distance" assigns weights proportional to the reciprocal distance from the query point.
As a result of the five-fold cross-validation described in Section 4.1.2, the best parameters for the k-nearest neighbors model were determined (Table 7).

Model Training SVR

One of the main advantages of SVR is that its computational complexity does not depend on the dimension of the input space. In addition, it has excellent generalization capabilities and high predictive accuracy when the parameters are properly tuned.
In practice, the most commonly used kernel for the SVR method, which provides good generalization, is the radial basis function (RBF), also known as the Gaussian kernel.
There is no rule of thumb for choosing the value of C; it depends entirely on the data. The best option is to search over a grid of parameters, as in Section 4.1.3, trying several different values and choosing the one that gives the lowest error in testing.
SVR is a powerful algorithm that allows us to choose how error-tolerant the model is through the acceptable margin of error: the epsilon parameter defines the dead zone.
Adjusting the penalty coefficient C and the error threshold ε significantly affects the mean square error of the regression model. After multiple experiments with cross-validation, better training results were obtained and the optimal values of the model parameters were chosen.
Table 8 presents the parameters of the final SVR model. Because the search for optimal model parameters using parameter grids and cross-validation leads to the creation and training of a large number of models, it is worth evaluating the time spent by the algorithms.
To reduce the time costs, the grid search and cross-validation were parallelized across several processor cores. Tables 9-11 show the values of two characteristics (CPU time and wall time) depending on the number of cores involved. Loading eight processor cores reduces the CPU time by a factor of ~15 and the wall time by a factor of ~3.
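In scikit-learn, this kind of parallelization is a one-parameter change: n_jobs controls how many cores GridSearchCV uses (-1 means all available cores, and the fold/candidate fits run in worker processes). The grid and data below are illustrative assumptions.

```python
# Parallelizing grid search across CPU cores with n_jobs. Candidate fits are
# dispatched to worker processes, reducing wall time on multi-core machines.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 6))
y = 30 + 20 * X[:, 0] + rng.normal(0, 1, 100)

grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.1, 0.5]}
parallel = GridSearchCV(SVR(), grid, cv=5, n_jobs=-1)  # n_jobs=1 → single core
parallel.fit(X, y)
best_C = parallel.best_params_["C"]
```

Because the work moves into subprocesses, the parent's measured CPU time can drop far more than the wall time, which is consistent with the ~15× vs ~3× figures above.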

Comparison of Prediction Results
Prediction error plots (Figure 8) show the actual values from the dataset versus the predicted values generated by our models. This visualization makes it possible to see how large the variance in the model is.
Table 12 presents the values of the metrics selected to evaluate the developed models, and Figure 9 visualizes this table. Considering that the developed machine learning algorithms were applied to a series of experimental data obtained by testing concrete, a heterogeneous material that depends on a large number of factors and differs significantly in properties and structure within its volume, the following should be noted. The scatter of data when measuring the characteristics of such a material exists regardless of our knowledge about it; many of the heterogeneities in concrete are uncontrollable from the point of view of either the recipe or the technology. Therefore, there is always a data error, which is within 10% and is an acceptable norm in the production of concrete.
The results of the study showed that the coefficient of determination of the developed models is quite high, 0.98-0.99; the observed value is higher than that reported in [27], which is explained by the homogeneity of the initial dataset and the tuning of the models' hyperparameters.
The MAE values are in the range from 1.97 to 2.61, MSE from 6.85 to 11.39 and RMSE from 2.62 to 3.37, which is consistent with the results of previous studies by other authors [25,26].
The MAPE value (6.15-7.89%) obtained by testing the developed machine learning models is acceptable; the models can be verified and accepted for use in determining the compressive strength of self-compacting concrete, considering all available data. The accuracy of the models is comparable to that of the normative and technical documents for concrete in global practice. The developed methods can be successfully implemented in the production and quality control of building materials, since they do not require serious computing resources; in the future, an expert system based on artificial intelligence could be created to summarize all of the accumulated experimental data, hosted in a university electronic environment and providing data to interested workers and researchers for the development of the industry.

Figure 1. Iterative training of decision trees in gradient boosting.

Figure 2. CatBoost decision tree for the regression problem.

Figure 4. Parameter selection and model evaluation process using a parameter grid and five-fold cross-validation.

Table 3. Parameter grid for the CatBoost model.

Figure 5. Heatmap of the R² value as a function of two parameters: tree depth and learning rate.

Figure 6 shows the training curve, according to which 65 iterations are sufficient for the model, as determined by the overfitting detector.

Figure 7. Visualization of the tree structure.

Figure 8. Relationship between actual compressive strength and calculated values for (a) the CatBoost model; (b) the k-nearest neighbors model; (c) the SVR model.

(1) Three machine learning algorithms based on CatBoost gradient boosting, k-nearest neighbors (KNN) and support vector regression (SVR) were developed and compared to predict the compressive strength of self-compacting concrete using our accumulated empirical database.
(2) It has been established that artificial intelligence methods can be applied to determine the compressive strength of self-compacting concrete. The developed models showed a mean absolute percentage error (MAPE) in the range 6.15-7.89%.
(3) Of the three machine learning algorithms, the smallest errors and the largest coefficient of determination were observed for the KNN algorithm: MAE was 1.97; MSE, 6.85; RMSE, 2.62; MAPE, 6.15; and the coefficient of determination R², 0.99.
(4) The models can be verified and accepted for use in determining the compressive strength of self-compacting concrete, taking into account all available data.
(5)

Table 1. Overview of the application of various machine learning methods for predicting the characteristics of concrete and of products and structures made from it.

Table 2. Statistical characteristics of the original dataset.

Table 4. Parameters for the k-nearest neighbors model.

Table 5. Parameters for the SVR model.

Table 6. Model parameters based on CatBoost.

Table 7. Parameters of the k-nearest neighbors model.

Table 8. Model parameters based on SVR.

Table 9. Result of parallelizing the learning process across CPU cores for the CatBoost model.

Table 10. Result of parallelizing the learning process across CPU cores for the k-nearest neighbors model.

Table 11. Result of parallelizing the learning process across CPU cores for the SVR model.

Table 12. Metrics of the developed models.