Applications of Decision Tree and Random Forest as Tree-Based Machine Learning Techniques for Analyzing the Ultimate Strain of Spliced and Non-Spliced Reinforcement Bars

Abstract: The performance of both non-spliced and spliced steel bars significantly affects the overall performance of structural reinforced concrete elements. In this context, the mechanical properties of reinforcement bars (i.e., their ultimate strength and strain) should be determined in order to evaluate their reliability prior to construction. In this study, tree-based machine learning techniques are applied to analyze the ultimate strain of non-spliced and spliced steel reinforcement. To this end, a database containing the results of 225 experimental tests was collected from research investigations available in peer-reviewed international publications. The database included the mechanical properties of both non-spliced and mechanically spliced bars. For better accuracy, data on other splicing methods, such as lap and welded splices, were excluded from this research. The database was divided into two sub-databases for training (85%) and testing (15%) the developed models. Various influential parameters (splice technique, steel grade of the bar, diameter of the steel bar, coupler geometry, including length and outer diameter, and testing temperature) were defined as the input variables for analyzing the ultimate strain using tree-based approaches, namely Decision Trees and Random Forest. The predicted outcomes were compared to the actual values, and the precision of the prediction models was assessed via performance metrics, along with a Taylor diagram. Based on the reported results, the reliability of the proposed ML-based methods was acceptable (with R² ≥ 85%), and they were time-saving and cost-effective compared to more complicated, time-consuming, and expensive experimental examinations. More importantly, the models proposed in this study can be further considered as a part of a comprehensive prediction model for estimating the stress-strain behavior of steel bars.


Introduction
Reinforcing steel bars are generally fabricated in limited lengths because of constraints in transportation and production. Splicing bars is therefore unavoidable, and it can influence the behavior of reinforced concrete (RC) elements and structures [1]. Given their importance, splice methods and their effect on the behavior of RC structures have been extensively researched experimentally and numerically [2][3][4][5][6][7].
Splice methods can generally be categorized into three main classes: lap, mechanical, and welded splices. Lap splices are recognized as the most common and inexpensive method for connecting rebars because of their simplicity [8]. However, their shortcomings (e.g., bar congestion), which lead to poor structural performance, have motivated researchers to propose alternative splice approaches [9][10][11]. Welded splices are also rarely applied, as they require specific operating skills to achieve a reliable splice. Moreover, the lack of conclusive investigations and code provisions on welded splices prevents them from being widely used in construction [9,12].
Mechanical splices are therefore claimed to be an appropriate splice method because their implementation (i) is quick and easy, (ii) does not require special skills, (iii) reduces bar congestion, and (iv) leads to better performance compared to other methods [9,10]. Mechanical splices can be categorized into five main groups, namely threaded (the most common type), grouted, swaged, bolted, and headed bar couplers [6,13]. Combinations of these methods (e.g., threaded-screw couplers) have also been proposed by researchers [14][15][16][17][18]. The stress transfer mechanism differs in each method. For example, in threaded, grouted, swaged, bolted, and headed bar couplers, force is transferred through thread locks, grout, interlock between the bar ribs and the swaged sleeve, friction between the bolts and the bar ribs, and the male-female elements, respectively [9]. In all the mechanical methods, the rebars are connected by a sleeve and, therefore, the length and diameter (or thickness) of the sleeve greatly affect the splice performance. It is claimed that longer couplers could decrease the ductility and deformation capacity of RC elements [12,18,19] and cause earlier failure due to a larger concrete-bar slip [19]. The coupler rigidity increases with its outer diameter (or thickness), which might decrease ductility [13,19].
The other notable parameter that can affect the performance of either non-spliced or spliced bars is temperature [20,21]. Bompa and Elghazouli concluded that increasing the temperature up to 400 °C does not notably affect stiffness and strength, whereas the yield plateau becomes less visible at temperatures higher than 300 °C. Furthermore, the ultimate strains of both spliced and non-spliced bars are reduced with increasing temperatures by more than 7.5% of the ultimate strain at ambient temperature [20].
Due to the significant influence of the above-mentioned parameters on the mechanical properties of non-spliced and spliced bars, their stress-strain behavior is determined by tensile testing according to design codes (e.g., ASTM [22]). The test is time-consuming and expensive, particularly when an appropriate coupler geometry with reliable performance needs to be found. As a result, machine learning (ML)-based techniques have been developed by researchers as a reliable, quick, and inexpensive alternative to experimental tests [23][24][25][26][27]. The few ML-based models presented for predicting the mechanical properties of non-spliced and spliced bars [28,29] have demonstrated the accuracy of these techniques. As an example, Dabiri et al. predicted the ultimate tensile strength of both non-spliced and spliced bars using nonlinear regression, ridge regression, and an Artificial Neural Network (ANN). In their study, the proposed nonlinear regression method resulted in more accurate values compared to the other techniques [28].
It is worth mentioning that the mechanical properties (specifically ultimate strength and ultimate strain) of reinforcement bars must be accurately determined and reported before construction begins. These properties are generally determined through experimental tests that can be time-consuming and costly. ML-based methods have thus attracted researchers' attention for determining the properties and performance of steel bars [30][31][32][33]. To the best of the authors' knowledge, the ultimate strain of bars has only been investigated experimentally [10,20]; no ML-based models have been proposed for determining this significant parameter. In this context, the prime goal of this study was to evaluate tree-based ML approaches, namely Decision Tree (DT) and Random Forest (RF), for predicting the ultimate strain of non-spliced and spliced bars. DT and RF models are considered in this study since they return more reliable estimated values in comparison to other prediction approaches. The developed models could be used as a part of a larger prediction model for obtaining the stress-strain behavior of steel bars without the need to perform experimental tests.

Methodology and Data Collection
The main objective of this study was to present and evaluate machine learning models for determining the ultimate strain of non-spliced and spliced steel reinforcement bars. To achieve this objective, 225 data sets were collected from the literature available in peer-reviewed publications [18,20,34-40]. The collected database was then randomly divided into two sub-databases: (a) a training database for learning the relationship between inputs and output, and (b) a testing database for assessing the accuracy of the models. In this study, 85% of the database was used for training and 15% for testing. As reviewed in the first section, the parameters with the highest impact on ultimate strain, namely the bar diameter (mm), steel grade, coupler geometry (length and outer diameter, mm), and temperature (°C), were considered as input parameters to estimate the only output, ultimate strain. To determine the degree to which the input data affected the target output, Pearson correlation coefficients were determined and are given in Table 1. The Pearson coefficient is defined as the ratio of the covariance of two parameters to the product of their standard deviations, ρ = Cov(X, Y)/(σ_X σ_Y), and measures the linear relationship between two variables. |ρ| ≈ 1 indicates a strong linear relationship, while ρ ≈ 0 indicates no linear relationship between the factors; negative coefficients show an inverse relationship between two parameters. To compare the effect of the input variables on ultimate strain more precisely, the Pearson coefficients are illustrated in the bar chart in Figure 1. The values in Table 1 and Figure 1 reveal that coupler geometry and temperature had a higher effect on the ultimate strain of non-spliced and spliced bars than the other input variables. Additionally, all the quantitative inputs (i.e., bar diameter, coupler length, and outer diameter, along with the temperature) had an inverse relationship with ultimate strain.
Regarding data collection, a comprehensive database was collected including various splice methods used for different bar sizes with different steel grades over a large range of temperatures. A total of 134 out of 225 datasets are outcomes of tensile tests on non-spliced steel bars, while the other 91 datasets are the mechanical properties of bars spliced with different methods. In terms of splice techniques, five mechanical techniques, namely headed bars, shear-screw, grouted, taper threaded, swaged and threaded couplers (as displayed in Figure 2), with different lengths and thicknesses were taken into consideration. According to the design code (i.e., ACI 318-19: 18.2.7.1 [41]), regardless of the splice method, bars spliced by mechanical approaches should provide 1.25f_y of a non-spliced bar in both tension and compression. Although the mechanical splices in the database are different, they all meet this provision and therefore could be included in the database for the prediction models. It is noteworthy that this study focused on mechanical couplers and that neither lap nor welded splices (e.g., gas pressure welding, head-to-head) are included in the collected database.
Table 2 provides the statistical properties of the quantitative input and output parameters. Bars with diameters of 12-32 mm were included in the database since designers and researchers consider them in typical structures more often than other sizes. The coupler length and outer diameter, as depicted in Figure 3, were in the ranges of 45.00-490.34 mm and 7.29-64.00 mm, respectively, which are the dimensions leading to acceptable coupler performance. Finally, temperatures in the range of 20-600 °C, which were studied in the literature, were also considered in the database. One of the six inputs (splice method) is not quantitative, and therefore its statistical properties are not given in Table 2. The distribution of each input is also depicted in the histograms provided in Figure 4.
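The Pearson coefficient defined above can be computed directly from its definition. The following is a minimal sketch in plain Python, not the authors' code; the function name `pearson` and the sample values are hypothetical and are not taken from the collected database.

```python
import math

def pearson(x, y):
    """Pearson coefficient: Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only (not from the database).
temperature_c = [20.0, 200.0, 400.0, 600.0]
ultimate_strain = [0.12, 0.11, 0.09, 0.06]
rho = pearson(temperature_c, ultimate_strain)  # negative: inverse relationship
```

A negative `rho` for such a pair corresponds to the inverse relationship between temperature and ultimate strain noted in the text.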

Prediction Models
In this study, two ML-based models, DT and RF, which are recognized as among the simplest prediction models with highly accurate results, were developed using the Python programming language.

Decision Tree (DT)
The Decision Tree (DT) approach is a supervised machine learning technique that can be used for both classification (non-continuous output values) and regression (continuous output values) problems. The name of this method is inspired by the shape of a tree, where the class labels are the leaves and the features (or conditions) are the branches. The most notable asset of the DT approach is that it is simple to understand, interpret, and visualize. Moreover, decision techniques can be incorporated into the Decision Tree. Datasets with a highly nonlinear relationship between the output and the input variables can also be modeled with this method. Its drawbacks, which should be taken into account, are that it is prone to overfitting and has difficulty classifying multiple output classes [42][43][44].
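To make the branch-and-leaf structure described above concrete, the following is a minimal regression-tree sketch in plain Python. It is an illustration under stated assumptions and not the implementation used in this study: it handles numeric features only, chooses splits by minimizing the summed squared error of the two child nodes, and uses hypothetical names (`build_tree`, `predict`, `max_depth`, `min_samples`) and hypothetical sample data.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(rows, targets, max_depth=3, min_samples=2):
    """Recursively grow a regression tree on numeric features."""
    leaf = {"value": sum(targets) / len(targets)}
    if max_depth == 0 or len(rows) < min_samples or len(set(targets)) == 1:
        return leaf
    best = None  # (error, feature index, threshold)
    for j in range(len(rows[0])):
        # Candidate thresholds: every unique value except the largest,
        # so that both child nodes are always non-empty.
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, targets) if r[j] <= t]
            right = [y for r, y in zip(rows, targets) if r[j] > t]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, t)
    if best is None:  # all rows identical: nothing to split on
        return leaf
    _, j, t = best
    lo = [i for i, r in enumerate(rows) if r[j] <= t]
    hi = [i for i, r in enumerate(rows) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([rows[i] for i in lo], [targets[i] for i in lo],
                           max_depth - 1, min_samples),
        "right": build_tree([rows[i] for i in hi], [targets[i] for i in hi],
                            max_depth - 1, min_samples),
    }

def predict(tree, row):
    """Follow the branches until a leaf is reached and return its mean value."""
    while "value" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]

# Hypothetical rows: [bar diameter (mm), temperature (deg C)] -> ultimate strain.
rows = [[16, 20], [16, 400], [25, 20], [25, 600]]
strains = [0.12, 0.09, 0.11, 0.05]
tree = build_tree(rows, strains, max_depth=3)
```

With enough depth, such a tree reproduces every training value exactly, which illustrates the overfitting tendency mentioned above; limiting `max_depth` or raising `min_samples` is one common way to restrain it.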

Random Forest (RF)
Simply stated, RF consists of many Decision Trees, and the target output is predicted by taking either the average of the DTs' predicted values or the most voted value [45,46]. To be more precise, RF combines bagging with the random selection of features across many Decision Trees. Ho first developed this method in 1995 by presenting an algorithm for random decision forests. Breiman later introduced an algorithm combining his bagging idea with the random selection of features proposed by Ho and by Amit and Geman [47][48][49][50]. The RF method has many advantages compared to other ML-based methods; the most remarkable one is its highly accurate results when a large database is used for training. Furthermore, it is simple and fast to implement [43].
The most important parameter that should be chosen carefully in an RF model is the number of trees. Figure 5 plots the number of trees against the R²-score of the predicted values. As can be observed in Figure 5, increasing the number of trees up to approximately 35 increased the accuracy of the model considerably. The accuracy of the models with 35-250 trees increased only insignificantly. The R²-score of the models with more than 250 trees remained almost unchanged. The optimal number of trees was 367, which is shown by a red point in Figure 5.
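The combination of bagging, random feature selection, and averaging described above, together with a sweep over the number of trees, can be sketched in plain Python as follows. This is a simplified illustration and not the authors' implementation: the helper names, the parameters `n_features`, `max_depth`, and `seed`, and the synthetic data are all assumptions made for the example.

```python
import random

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(rows, targets, rng, n_features, max_depth):
    """Regression tree that tries only a random subset of features per split."""
    leaf = {"value": sum(targets) / len(targets)}
    if max_depth == 0 or len(set(targets)) == 1:
        return leaf
    best = None  # (error, feature index, threshold)
    for j in rng.sample(range(len(rows[0])), n_features):
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, targets) if r[j] <= t]
            right = [y for r, y in zip(rows, targets) if r[j] > t]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, t)
    if best is None:  # the sampled features are constant here: keep a leaf
        return leaf
    _, j, t = best
    lo = [i for i, r in enumerate(rows) if r[j] <= t]
    hi = [i for i, r in enumerate(rows) if r[j] > t]
    return {"feature": j, "threshold": t,
            "left": build_tree([rows[i] for i in lo], [targets[i] for i in lo],
                               rng, n_features, max_depth - 1),
            "right": build_tree([rows[i] for i in hi], [targets[i] for i in hi],
                                rng, n_features, max_depth - 1)}

def predict_tree(tree, row):
    while "value" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]

def fit_forest(rows, targets, n_trees, n_features=1, max_depth=4, seed=0):
    """Bagging: each tree is grown on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx],
                                 rng, n_features, max_depth))
    return forest

def predict_forest(forest, row):
    """Regression forest: average the predictions of all trees."""
    return sum(predict_tree(t, row) for t in forest) / len(forest)

def r2_score(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Synthetic, hypothetical data: [bar diameter (mm), temperature (deg C)].
data_rng = random.Random(42)
rows = [[data_rng.choice([12, 16, 20, 25, 32]), data_rng.uniform(20, 600)]
        for _ in range(120)]
strain = [0.14 - 0.0001 * temp - 0.0005 * dia + data_rng.gauss(0, 0.005)
          for dia, temp in rows]
train_rows, test_rows = rows[:100], rows[100:]
train_y, test_y = strain[:100], strain[100:]

# Sweep the number of trees, in the spirit of Figure 5.
scores = {}
for n_trees in (1, 5, 25, 50):
    forest = fit_forest(train_rows, train_y, n_trees)
    preds = [predict_forest(forest, r) for r in test_rows]
    scores[n_trees] = r2_score(test_y, preds)
```

Plotting `scores` against the number of trees gives a curve of the kind shown in Figure 5; the values obtained here depend entirely on the synthetic data and say nothing about the results reported in this study.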

Results
The proposed DT and RF models were trained using 85% of the collected database. The other 15% was used for evaluating the reliability of the prediction models. It is worth explaining that specific optimal percentages have not been suggested for training and testing divisions thus far. The larger the training database, the better the prediction models can learn the relationship between inputs and output. The testing database, on the other hand, should be sufficient for assessing the accuracy of the proposed models. Additionally, similar percentages have generally been used in similar studies of ML-based models, as they return acceptable results.
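The random 85%/15% division can be sketched in plain Python as follows. This is a minimal illustration, not the procedure used in this study; the function name `train_test_split`, the rounding rule, and the `seed` parameter are assumptions. Fixing the seed makes the split reproducible, so the same training and testing subsets can be reused for several models.

```python
import random

def train_test_split(records, test_fraction=0.15, seed=0):
    """Shuffle the record indices and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed: the same split on every call
    indices = list(range(len(records)))
    rng.shuffle(indices)
    n_test = round(len(records) * test_fraction)
    test = [records[i] for i in indices[:n_test]]
    train = [records[i] for i in indices[n_test:]]
    return train, test

# Placeholder records standing in for the 225 collected data sets.
records = list(range(225))
train, test = train_test_split(records)
```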
The correlation between the predicted and actual values of ultimate strain is depicted in Figure 6 for both models. The ideal line (predicted values = actual values) and the lines with 20% lower and higher values are shown by solid red and dashed blue lines, respectively. Figure 7 compares the predicted values to the corresponding real ultimate strains obtained from experimental investigations. The graphical comparison between the predicted and actual values in Figures 6 and 7 reflects the high ability of the DT and RF models to learn the relationship between the input variables and ultimate strain. The DT model exhibited slightly better accuracy compared to the RF model. The prediction models are evaluated in more depth in the next section.

Accuracy Evaluation
The accuracy of the proposed tree-based models is assessed in more depth in this section through performance metrics and Taylor diagrams.
The most common performance metrics utilized for assessing the reliability of prediction models are the R²-score, R, RMSE, MSE, MAPE, and MAE [51]. As can be seen in Equations (1)-(6), these metrics are all based on the difference between the predicted value and the corresponding actual value.
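For reference, the standard textbook definitions of these six metrics are sketched below. The notation is an assumption made here: y_i is the actual value, ŷ_i the predicted value, ȳ and the barred ŷ their respective means, and n the number of samples. The numbering follows the order in which the metrics are listed above and may differ from the original equations.

```latex
\begin{align}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{1} \\
R &= \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
          {\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \, \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \tag{2} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{3} \\
\mathrm{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{4} \\
\mathrm{MAPE} &= \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{5} \\
\mathrm{MAE} &= \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{6}
\end{align}
```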
Table 3 reports the performance metrics of the developed prediction models. As noted previously, the ultimate strain values predicted by the DT model (R² = 0.89) were slightly more accurate than those of the RF model (R² = 0.85). The RF method is generally claimed to give more reliable results than the DT method. However, that conclusion holds when the same database is used for training and testing both models, whereas in this study the training and testing databases were selected randomly for each model. In other words, the training and testing datasets used for the DT and RF models were not the same, which caused the small difference in accuracy [52]. In brief, both the DT and RF models demonstrated acceptable accuracy, with R² scores of 85% or higher. In order to compare the R² scores and RMSE of the proposed prediction models more easily, they are shown in the Taylor diagram in Figure 8. A Taylor diagram is a two-dimensional space in which predicted and actual values are positioned according to their coordinates. The horizontal and vertical axes, circular lines, and radial lines reflect the standard deviation, RMSE, and R²-score, respectively. The distance of each model from the actual value represents the model's accuracy [30]; the closer a prediction model is to the actual database, the more reliable it is. According to Figure 8, the DT model, with the higher R²-score and standard deviation and lower RMSE, was positioned closer to the actual values (blue star) and thus was slightly more reliable than the RF model.
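The quantities that place a model on a Taylor diagram can be computed as in the following sketch. It follows the usual construction of the diagram, in which the plotted distance is the centered root-mean-square difference, which equals the RMSE only when the predictions have no mean bias. The function name `taylor_statistics` and the sample values are hypothetical and are not results of this study.

```python
import math

def taylor_statistics(actual, predicted):
    """Return (std of actual, std of predicted, correlation R, centered RMSD)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / n
    r = cov / (std_a * std_p)
    crmsd = math.sqrt(
        sum(((p - mean_p) - (a - mean_a)) ** 2 for a, p in zip(actual, predicted)) / n
    )
    return std_a, std_p, r, crmsd

# Hypothetical ultimate strain values, for illustration only.
actual = [0.12, 0.10, 0.08, 0.05]
predicted = [0.11, 0.10, 0.09, 0.06]
std_a, std_p, r, crmsd = taylor_statistics(actual, predicted)

# Law-of-cosines relation that lets all three quantities share one plot.
identity_rhs = std_a ** 2 + std_p ** 2 - 2.0 * std_a * std_p * r
```

Because `crmsd ** 2` equals `identity_rhs`, a model with a higher correlation and a standard deviation close to that of the actual data necessarily lies closer to the reference point.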

Summary and Conclusions
The mechanical properties (e.g., ultimate strength and strain) of reinforcement bars (either spliced or non-spliced) must be determined before they are used in construction. These tests are sometimes time-consuming and costly, particularly when an appropriate splice geometry needs to be obtained. This study therefore aimed at developing ML-based models, namely Random Forest and Decision Trees, for estimating the ultimate strain of reinforcement steel bars. To this end, a database containing 225 datasets was collected and divided into training (85%) and testing (15%) databases. The parameters claimed to be the most influential on ultimate strain (bar diameter, steel grade, splice method, length and outer diameter of the coupler, and temperature) were defined as input variables for predicting ultimate strain. The predicted values were compared to the actual values, and the accuracy of the proposed DT and RF models was assessed using performance metrics and a Taylor diagram.

• The results demonstrated the high ability of both DT and RF models to learn the relationship between the input variables and the output. The quick and simple ML-based models proposed in this study could therefore be considered a reliable alternative to time-consuming and costly experimental tests.

• Both the DT and RF models exhibited highly reliable results, with R² scores of 85% or higher. The DT model, however, showed higher accuracy (R² = 89%) than the RF model (R² = 85%). Although RF generally leads to more reliable results, in this study the values predicted by the DT model were more accurate because the training and testing datasets for each model were selected randomly in order to avoid any human effect on the prediction process. In other words, the training and testing databases were not the same, which caused the slight difference between the models' accuracy. In order to compare RF and DT more precisely, the same training and testing datasets should be chosen. This study, however, focused on the reliability assessment of Decision Tree and Random Forest models for estimating the ultimate strain of spliced and non-spliced steel bars, and the results demonstrated their acceptable accuracy.

• Comparing the accuracy of models that estimate the ultimate strain of bars for each splice method could be conducted in further studies. To this end, (a) an extensive database including the results of tests on each splice type should be collected; (b) input variables for each method should be specified (as an example, the strength of grout could be considered as one of the inputs for models estimating the mechanical properties of grouted spliced bars); and (c) the databases should be large enough to train the relationships between inputs (which may be different for each splice technique) and output appropriately.

• The models proposed in this research could be used in a more generalized model for predicting the stress-strain behavior of spliced and non-spliced bars. More specifically, by combining models that predict other parameters (e.g., yield strength and strain) with the models proposed in this study, the stress-strain curve could be estimated without performing experimental tests.

Figure 1. Pearson coefficients of the input variables.

Figure 3. Schematic illustration of length and outer diameter of a mechanical splice.

Figure 4. Histograms illustrating the distribution of the input and output variables.

Figure 5. Choosing the optimal number of trees for the RF model.

Figure 8. Comparing the accuracy of the proposed models through a Taylor diagram.

Table 1. Pearson coefficients determined for the input and output parameters.

Table 2. Statistical properties of the input and output parameters.

Table 3. Performance metrics for the prediction models.