Predicting Mechanical Properties of High-Performance Fiber-Reinforced Cementitious Composites by Integrating Micromechanics and Machine Learning

Current development of high-performance fiber-reinforced cementitious composites (HPFRCC) mainly relies on intensive experiments. The main purpose of this study is to develop a machine learning method for effective and efficient discovery and development of HPFRCC. Specifically, this research develops machine learning models to predict the mechanical properties of HPFRCC through innovative incorporation of micromechanics, aiming to increase the prediction accuracy and generalization performance by enriching and improving the datasets through data cleaning, principal component analysis (PCA), and K-fold cross-validation. This study considers a total of 14 different mix design variables and predicts the ductility of HPFRCC for the first time, in addition to the compressive and tensile strengths. Different types of machine learning methods are investigated and compared, including artificial neural network (ANN), support vector regression (SVR), classification and regression tree (CART), and extreme gradient boosting tree (XGBoost). The results show that the developed machine learning models can reasonably predict the concerned mechanical properties and can be applied to perform parametric studies for the effects of different mix design variables on the mechanical properties. This study is expected to greatly promote efficient discovery and development of HPFRCC.


Introduction
High-performance fiber-reinforced cementitious composites (HPFRCC) feature high tensile strength and ductility, strain-hardening property, and long-term durability [1]. Representative HPFRCC include engineered cementitious composites (ECC) [2][3][4] and ultra-high-performance concretes (UHPC) [5][6][7][8]. ECC features high ductility, dense cracks, and self-control of crack width, and is designed by mechanistically tuning the matrix, fibers, and fiber-matrix interface [1]. Recently, ECC has achieved multifunctionality, such as self-healing, self-sensing, self-cleaning, and air-purifying [9,10]. With self-healing crack width control, ECC possesses extreme durability [11]. Typically, with the use of a medium volume (~2%) of polymer fibers, ECC can achieve a tensile strain capacity of 4% or higher [12,13]. UHPC features high compressive and tensile strengths and is designed by maximizing the particle packing density. Under standard curing conditions, UHPC can achieve compressive strengths higher than 120 MPa. Uncracked UHPC has excellent durability due to the dense microstructure. The superior properties are based on proper mix design. For example, UHPC is designed to densify the microstructures through maximizing the particle strength, respectively. Two strategies are presented to utilize micromechanics, and multiple innovative methods are proposed to improve the dataset. This study attempts to provide an alternative method to promote the development of HPFRCC.

Machine Learning Models
This section introduces the ANN, SVR, CART, and XGBoost methods. ANN links the input variables (e.g., mix design) to the output variables (e.g., mechanical properties). Figure 1a shows a typical ANN consisting of three types of layers: an input layer, one or multiple hidden layers, and an output layer [39]. Each layer has one or multiple variables, and the relationships between the variables in different layers are described by weights and biases. Given a dataset with known mix designs and the corresponding mechanical properties, the weights and biases are determined through an optimization process that minimizes the discrepancy between the predicted and real mechanical properties [40]; this is known as the training process. Once the ANN is trained, the relationships between the layers are determined, so the ANN can be used to predict the mechanical properties from the mix design variables. SVR links the input variables (e.g., mix design) to the output variable (e.g., compressive strength) using a regression relationship [41]. Figure 1b illustrates an application of SVR to predict the compressive strength. Similar to ANN, SVR contains three types of layers [42] and uses weights and biases to relate the layers. SVR has one output variable at a time; to consider multiple mechanical properties, SVR is run once for each property.
In Figure 1, w/c is the water-to-cement ratio; s/c is the sand-to-cement ratio; $V_f$ is the fiber content; $f_c$ is the compressive strength; $f_t$ is the tensile strength; and $\varepsilon_{cu}$ is the tensile strain capacity (i.e., ductility).
Compared with conventional regression methods, SVR employs kernel functions that enable the model to solve complex, non-linear problems in which the relationships between some variables cannot be described by linear functions [43]. CART relates the input variables to the mechanical properties using a tree structure [40], as shown in Figure 1c. The tree is composed of a root node, multiple interior nodes, and multiple leaf nodes [44]. CART describes the relationships between the input and output variables by splitting the values of the input variables into subgroups. The splitting schemes are determined through the training process of the CART model using a given dataset [45], and the splitting stops when the termination criterion is met. Similar to SVR, CART has one output variable at a time. Typically, a single tree model (e.g., a CART model) cannot provide accurate predictions due to its relatively simple architecture and limited prediction capability. Therefore, extreme gradient boosting trees (XGBoost) were introduced to ensemble multiple tree models. The XGBoost method repeatedly adds a new tree that fits the discrepancy between the real value and the value predicted in the previous iteration, as shown in Figure 1d.
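The residual-fitting idea behind boosted trees can be sketched in a few lines. The example below is a simplified illustration and not the XGBoost library itself: it uses one-split regression stumps as minimal CART models, a squared-error loss, and omits the regularization terms of XGBoost. The function names, the learning rate, and the sample data are assumptions made for illustration only.

```python
from statistics import mean


def fit_stump(x, residuals):
    """Fit a one-split regression stump (a minimal CART) to 1-D data."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # every split leaves both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time; each new stump fits the current residuals."""
    base = mean(y)
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)


def training_mse(model, x, y):
    """Mean squared error of a fitted model on the data it was trained on."""
    return mean((model(xi) - yi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: fiber content (%) versus tensile strength (MPa).
fiber_content = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
tensile_strength = [2.0, 3.1, 3.9, 5.2, 5.8, 7.1]
single_tree = boost(fiber_content, tensile_strength, n_trees=1)
ensemble = boost(fiber_content, tensile_strength, n_trees=20)
```

With a squared-error loss, each added stump cannot increase the training error, which is why the ensemble fits the training data better than a single tree; generalization still has to be checked on held-out data.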

Dataset
The development of the machine learning models is based on datasets that relate the input and output variables of the models. The size and quality of the dataset are significant for the accuracy and generalization performance of the machine learning models. Figure 2 shows the proposed flowchart for establishing the datasets used to develop the machine learning models. First, the variables informed by the mix design and the micromechanics model of HPFRCC in published references are preliminarily selected to form a dataset, designated as Dataset 1. Considering that there are limited data on HPFRCC in the published references and that the test results of the tensile strain capacity usually show significant scatter, the micromechanics model is used to generate more results of the tensile strain capacity for data augmentation, forming another dataset, designated as Dataset 2. Then, data cleaning is performed to identify and remove anomalous data in Dataset 1 and Dataset 2. The cleaned datasets are further processed through data normalization. The normalized datasets are tested to check whether multicollinearity occurs. If multicollinearity occurs, a principal component analysis (PCA) is performed to reduce the dimensionality of the datasets and eliminate the multicollinearity problem. The novelties of the procedures include: (i) utilization of the micromechanics model for variable selection and data augmentation; (ii) data cleaning and normalization; and (iii) adoption of PCA.
Currently, there is no consensus on the selection of variables for predicting material properties using machine learning methods. Different scholars selected different variables to predict the same type of properties. For example, in reference [46], the compressive strength was predicted by using the w/c, the aggregate-to-cement ratio, the fine aggregate content, and the superplasticizer content as the input variables, while in reference [25] the compressive strength was predicted by using the w/c, the fly ash content, the aggregate-to-cement ratio, the micro silica content, and the superplasticizer content.

Overview
A micromechanics model [47] was developed to design HPFRCC in order to achieve the desired tensile properties, in particular, the post-cracking strain-hardening properties and the superior ductility and toughness. The micromechanics model informs two criteria that are essential for achieving strain-hardening behavior: the energy criterion and the stress criterion. Figure 3 shows the stress-crack curve for strain-hardening cementitious composites (e.g., ECC) [22].
The energy criterion for steady-state crack propagation is expressed in Equation (1) [20], where $J_{tip}$ is the toughness of the matrix, with $J_{tip} = K_m^2/E_m$; $E_m$ is the modulus of elasticity of the matrix; $K_m$ is the fracture toughness of the matrix, which can be tested using notched beams under three-point bending [22]; $\sigma_{ss}$ is the tensile strength under the steady-state crack propagation process; and $\delta_{ss}$ is the corresponding crack width [48].
The toughness of the matrix must be less than the complementary energy from the fiber bridging [20]. The upper limit for the steady-state crack propagation condition is expressed in Equation (2), where $\sigma_0$ is the peak stress and $\delta_0$ is the corresponding crack opening width. The complementary energy can be calculated by Equation (3) [22], where $V_f$, $L_f$, $d_f$, and $E_f$ are respectively the volume ratio, length, diameter, and elastic modulus of the fibers; $\tau_0$ and $G_d$ are respectively the frictional bond and chemical bond strengths [49].
The micromechanics model shows that the tensile properties of HPFRCC are associated with the following parameters: (1) the properties of the chopped fibers: the volume ratio ($V_f$), the fiber length ($L_f$), the fiber diameter ($d_f$), and the elastic modulus ($E_f$); (2) the properties of the cementitious matrix: the elastic modulus ($E_m$) and the fracture toughness ($K_m$); and (3) the fiber-matrix interface properties: the frictional bond strength ($\tau_0$) and the chemical bond strength ($G_d$) [12]. Therefore, the fiber properties ($V_f$, $L_f$, $d_f$, $E_f$) are also considered as input variables of the machine learning methods, in addition to the variables typically used for conventional concrete.
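The energy criterion can be checked numerically once a fiber-bridging curve is available. The sketch below assumes the form commonly used in the ECC micromechanics literature, in which the complementary energy is $J_b' = \sigma_0 \delta_0 - \int_0^{\delta_0} \sigma(\delta)\,d\delta$ and the criterion reads $J_{tip} \le J_b'$; the exact expressions of Equations (1)-(3) should be taken from [20,22]. The function names, the units, and the sample bridging curve are hypothetical.

```python
def matrix_toughness(k_m, e_m):
    """J_tip = K_m^2 / E_m in J/m^2, with K_m in MPa*m^0.5 and E_m in GPa."""
    return 1000.0 * k_m ** 2 / e_m


def complementary_energy(delta_um, sigma_mpa):
    """Peak stress times peak opening minus the area under sigma(delta).

    The area up to the peak is integrated with the trapezoidal rule.
    Stress in MPa times crack opening in micrometers gives J/m^2 directly.
    """
    peak = max(range(len(sigma_mpa)), key=lambda i: sigma_mpa[i])
    area = sum(
        0.5 * (sigma_mpa[i] + sigma_mpa[i + 1]) * (delta_um[i + 1] - delta_um[i])
        for i in range(peak)
    )
    return sigma_mpa[peak] * delta_um[peak] - area


def satisfies_energy_criterion(k_m, e_m, delta_um, sigma_mpa):
    """True when the matrix toughness does not exceed the complementary energy."""
    return matrix_toughness(k_m, e_m) <= complementary_energy(delta_um, sigma_mpa)


# Hypothetical fiber-bridging curve: crack opening (um) versus bridging stress (MPa).
crack_opening = [0.0, 20.0, 40.0, 60.0, 80.0, 100.0, 120.0]
bridging_stress = [0.0, 2.0, 3.2, 4.0, 4.6, 5.0, 4.0]
j_tip = matrix_toughness(k_m=0.3, e_m=20.0)
j_b = complementary_energy(crack_opening, bridging_stress)
```

A larger margin between the complementary energy and the matrix toughness favors steady-state cracking, which is why the fiber and interface parameters enter the list of input variables.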

Dataset Augmentation
Based on the micromechanics, a semi-empirical model was proposed to predict the tensile strain capacity ($\varepsilon_{cu}$) of HPFRCC by using three fiber parameters, as shown in Equation (4) [23], where $L_f$ is the fiber length; $d_f$ is the fiber diameter; and $V_f$ is the fiber content. The $R^2$ of Equation (4) was 0.95, indicating a strong correlation [23]. Therefore, the semi-empirical model is used to generate more data to enlarge the dataset used to develop the machine learning models. Specifically, Equation (4) is used to generate 70 data points by varying the values of $L_f$, $d_f$, and $V_f$. The generated data supplement the data in Dataset 1, forming a larger dataset for the prediction of tensile strain capacity, designated as Dataset 2. Compared with Dataset 1, Dataset 2 has the same types of variables but is larger.
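The augmentation step amounts to evaluating the semi-empirical model over a grid of fiber parameters. The sketch below shows only that mechanism. The function `placeholder_model` is a hypothetical stand-in and is not Equation (4); the parameter values and units are likewise illustrative assumptions, chosen so that the grid contains 70 combinations.

```python
from itertools import product


def augment(model, lengths_mm, diameters_um, contents_pct):
    """Evaluate a strain-capacity model on every combination of fiber parameters."""
    rows = []
    for l_f, d_f, v_f in product(lengths_mm, diameters_um, contents_pct):
        rows.append({"L_f": l_f, "d_f": d_f, "V_f": v_f,
                     "eps_cu": model(l_f, d_f, v_f)})
    return rows


def placeholder_model(l_f, d_f, v_f):
    """Hypothetical stand-in for Equation (4); replace with the model from [23]."""
    return v_f * l_f / d_f  # arbitrary expression, for illustration only


# Hypothetical parameter grid: 7 lengths x 5 diameters x 2 contents = 70 rows.
generated = augment(
    placeholder_model,
    lengths_mm=[6, 8, 10, 12, 14, 16, 18],
    diameters_um=[20, 30, 40, 50, 60],
    contents_pct=[1.5, 2.0],
)
```

The generated rows carry the same variable names as the experimental records, so they can be appended to Dataset 1 to form Dataset 2.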

Dataset Cleaning
In general, a dataset formed by collecting test data from different sources contains anomalous data due to errors in testing, data documentation, and so on. This study proposes to identify and remove anomalous data from the dataset through a cluster analysis. Specifically, the anomalous data are identified from the analysis of the data distribution, as elaborated in [67]. For each variable, when the data follow a normal distribution, 99.7% of the data should lie within three standard deviations (3σ) of the mean, as shown in Equation (5). In this study, the data outside this range, i.e., data with $|x - \mu| > 3\sigma$, are considered anomalous, where x denotes a data point; µ is the expectation; and σ is the standard deviation.
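The 3σ rule can be applied to one variable at a time, as in the minimal sketch below. It assumes the population standard deviation and uses hypothetical water-to-binder ratios; note that the rule needs a reasonably large sample, because in a very small sample no single value can lie more than 3σ from the mean.

```python
from statistics import mean, pstdev


def three_sigma_mask(values):
    """Return True for values within three standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [abs(v - mu) <= 3 * sigma for v in values]


def remove_anomalies(values):
    """Keep only the values that pass the three-sigma check."""
    return [v for v, keep in zip(values, three_sigma_mask(values)) if keep]


# Hypothetical water-to-binder ratios with one implausible record (0.8).
w_b = [0.20] * 6 + [0.25] * 7 + [0.30] * 6 + [0.80]
cleaned = remove_anomalies(w_b)
```

In a multi-variable dataset, a record would be dropped when any of its variables fails the check.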

Dataset Normalization
The raw data extracted from the literature often have different units and orders of magnitude. For example, the water-to-binder ratio is about 0.25, while the modulus of elasticity of fibers can be up to 100 GPa. Such large differences in the numeric values of different variables may strongly affect the results of machine learning models. Therefore, in this study, all the input data are normalized to the range of −1 to 1, as shown in Equation (6), where $x$ is the original data; $x^*$ is the normalized data; µ is the mean value; and σ is the standard deviation. The distribution of the data is kept the same before and after the normalization [68]. The dataset is divided into training and testing datasets with the same random seed.
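A minimal sketch of the normalization and the seeded split is given below. It assumes that Equation (6) is the usual standardization $x^* = (x - \mu)/\sigma$ suggested by the symbol definitions; the function names, the seed, and the sample values are illustrative assumptions. Standardization centers the data at zero with unit standard deviation, so individual values can still fall outside −1 to 1.

```python
import random
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then split into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Hypothetical fiber elastic moduli in GPa.
e_f = [10.0, 40.0, 100.0, 210.0]
e_f_normalized = standardize(e_f)
train_rows, test_rows = train_test_split(list(range(8)), test_fraction=0.25, seed=42)
```

Reusing the same seed for every model ensures that all machine learning methods are trained and tested on identical subsets, so their accuracies are directly comparable.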

Multicollinearity and Principal Component Analysis
Multicollinearity may occur in high-dimensional analysis and compromise the statistical significance of the independent variables [69]. According to [70], multicollinearity occurs when the absolute value of the Pearson correlation coefficient between two variables is higher than 0.7. When multicollinearity occurs, this study performs a PCA [71], which is an unsupervised learning method that reduces the dimensionality of the dataset and avoids multicollinearity through eigenvalue decomposition. The PCA extracts the main components by evaluating their significance, which is reflected by the variance as defined in Equation (7), where λ is the variance; $X_i$ is the ith sample; and $\bar{X}$ is the average value of all the samples. A cumulative variance ratio is defined in Equation (8) [72], where k is the optimal dimensionality of the input variables, and n is the total dimensionality of the input variables. According to [72], the cumulative variance ratio is the ratio of the sum of the variances of the first k principal components to the total variance of all n components, and it should be greater than 0.99.
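The two checks described above, the 0.7 correlation threshold and the 0.99 cumulative variance ratio, can be sketched as follows. The eigenvalue decomposition itself is omitted; the component variances are taken as given, and the column names and numbers are hypothetical.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def collinear_pairs(columns, threshold=0.7):
    """List variable pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if abs(pearson(columns[p], columns[q])) > threshold
    ]


def choose_dimensionality(variances, threshold=0.99):
    """Smallest k whose cumulative variance ratio is greater than the threshold."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running / total > threshold:
            return k
    return len(ordered)


# Hypothetical input columns and component variances.
columns = {
    "w/b": [0.20, 0.25, 0.30, 0.35],
    "s/b": [0.40, 0.50, 0.60, 0.70],
    "V_f": [2.0, 1.0, 2.0, 1.0],
}
flagged = collinear_pairs(columns)
k = choose_dimensionality([5.0, 3.0, 1.5, 0.45, 0.05])
```

Because principal components are mutually uncorrelated by construction, a dataset rebuilt from the first k components no longer triggers the correlation check.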

Hyperparameter Tuning
Hyperparameters are the key parameters of machine learning methods. For example, the hyperparameters of ANN include the number of variables (neurons) in each hidden layer and the learning rate. This study proposes to combine a grid search method [73] and the K-fold cross-validation method to optimize the hyperparameters and prevent overfitting and underfitting. Figure 4 illustrates the proposed hyperparameter tuning method. For instance, the number of variables in a hidden layer of an ANN is described as H = {20, 21, ..., 100}, and the learning rate is expressed as η = {0.1, 0.01, 0.001}. The grid search method tests the combinations of H and η and selects the values that yield the lowest error. The K-fold cross-validation is used to improve the generalization performance of the machine learning models. A training dataset is divided into K folds (K = 10) of comparable sizes. In each round, one fold is held out as the validation set, and the other folds are used to train the model. By using the K-fold cross-validation method, all data can participate in the training process.
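The combination of grid search and K-fold cross-validation can be sketched generically. In the example below, a one-variable ridge line with a single hyperparameter `alpha` stands in for the ANN and its H and η grids; the model, the grid values, and the data are assumptions made for illustration, and each fold is held out exactly once.

```python
import random
from itertools import product
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of similar size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_mse(fit, x, y, params, k=10, seed=0):
    """Average validation MSE over k folds for one hyperparameter setting."""
    errors = []
    for fold in k_fold_indices(len(x), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        predict = fit([x[i] for i in train], [y[i] for i in train], **params)
        errors.append(mean((predict(x[i]) - y[i]) ** 2 for i in fold))
    return mean(errors)


def grid_search(fit, x, y, grid, k=10):
    """Try every combination in the grid and return the one with lowest CV error."""
    names = list(grid)
    candidates = [dict(zip(names, combo)) for combo in product(*grid.values())]
    return min(candidates, key=lambda params: cross_val_mse(fit, x, y, params, k))


def fit_ridge_line(x, y, alpha):
    """Stand-in model: a straight line whose slope is shrunk by alpha."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / (s_xx + alpha)
    return lambda value: y_bar + slope * (value - x_bar)


# Hypothetical noise-free data, so the unshrunk line (alpha = 0) should win.
x_data = [float(i) for i in range(20)]
y_data = [2.0 * v + 1.0 for v in x_data]
best = grid_search(fit_ridge_line, x_data, y_data, {"alpha": [0.0, 1.0, 10.0]}, k=10)
```

The same loop applies to any model that exposes a fit function taking hyperparameters as keyword arguments; only the grid changes.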

Performance Evaluation
To evaluate the prediction accuracy, three typical performance metrics are used to assess the agreement between the predicted value ($Y_{pre}$) and the actual value ($Y_{actual}$) for the four machine learning models: the mean squared error (MSE), the Pearson correlation coefficient (R), and the coefficient of determination ($R^2$), as defined in Equations (9)-(11) [74][75][76], where n is the number of data.
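The three metrics can be computed as in the sketch below. It assumes the conventional definitions, with the coefficient of determination taken as one minus the ratio of the residual sum of squares to the total sum of squares; the exact forms of Equations (9)-(11) should be checked against [74][75][76]. The sample values are hypothetical.

```python
from statistics import mean


def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return mean((p - a) ** 2 for a, p in zip(actual, predicted))


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    a_bar, p_bar = mean(actual), mean(predicted)
    cov = sum((a - a_bar) * (p - p_bar) for a, p in zip(actual, predicted))
    var_a = sum((a - a_bar) ** 2 for a in actual)
    var_p = sum((p - p_bar) ** 2 for p in predicted)
    return cov / (var_a * var_p) ** 0.5


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    a_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - a_bar) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical tensile strengths in MPa.
y_actual = [2.0, 4.0, 6.0, 8.0]
y_pre = [2.5, 3.5, 6.5, 7.5]
scores = (mse(y_actual, y_pre), pearson_r(y_actual, y_pre), r_squared(y_actual, y_pre))
```

Under these definitions $R^2$ is not simply the square of R, so the two metrics can rank models differently and are reported separately.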

Innovation of the Proposed Methodology
Figure 5 shows the innovations for the prediction of the mechanical properties of HPFRCC. With the challenges identified in the introduction section, novel methods are proposed for improving the dataset used to develop the machine learning models, including data collection, data augmentation, data cleaning, multicollinearity analysis, and variable selection through PCA. Two strategies are proposed to utilize the micromechanics model: (1) Strategy 1 (variable selection): use the theoretical model to screen the variables; and (2) Strategy 2 (data augmentation): use the model to generate more data that supplement the experimental data. These two strategies are elaborated in Section 2.2.1. The PCA method is proposed to finalize the selection of variables and avoid multicollinearity. K-fold cross-validation and grid search are combined to optimize the hyperparameters. Finally, the prediction accuracy is evaluated to select the best machine learning models for the different mechanical properties of HPFRCC.

Table 2 shows the data anomaly detection results. It should be noted that only the items that contained anomalous data are listed. According to the analysis, a total of 23 data are removed from Dataset 1 and Dataset 2. For example, data with a water-to-binder ratio (w/b) of 0.8 were identified as anomalous, consistent with the knowledge that typical HPFRCC has a low w/b (<0.35).
Table 2. Results of data anomaly detection.

Items                       Number of Anomalous Data
Sand-to-binder ratio        7
Water-to-binder ratio       6
Superplasticizer content    6
Fiber length                2
Fiber elastic modulus       2

The dataset sizes for the different mechanical properties differ because different papers reported different properties. For example, a significant number of papers only reported the tensile properties of HPFRCC. After data cleaning, the numbers of data for the compressive strength, the tensile strength, and the tensile strain capacity are respectively 238, 247, and 266 in Dataset 1. Dataset 2 is established by incorporating the data generated by the micromechanics model for data augmentation, and contains 317 data for predicting the tensile strain capacity.

Variable Selection
Figure 6a-c show that the Pearson correlation coefficients off the diagonal can be higher than 0.7, indicating that multicollinearity can occur if all the variables are used. Thus, the PCA is performed to reduce the dimensionality and eliminate multicollinearity for the datasets. Figure 6d-f show the results of the variance and the variance ratio for the datasets used to predict the three mechanical properties. With the threshold (0.99) of the cumulative variance ratio, the dimensionality of the input variables is reduced from 14 to 12 for the three datasets. The first 12 components, which together reach the required cumulative variance ratio, are selected to construct the dataset. The correlation matrix after reducing the dimensionality of the dataset is shown in Figure 6g. Because the correlation of each pair of variables is small (less than 0.01), the correlation matrices of the compressive strength, tensile strength, and tensile strain capacity look the same.
After the datasets are improved by the data cleaning and PCA, they are used to train and test the machine learning models. Specifically, 75% of the data are used for training, and 25% of the data are used for testing. Table 3 lists the optimal hyperparameters of the machine learning models for the different properties. For the same machine learning method, the optimal hyperparameters differ between properties. Therefore, different models must be used to predict the different properties.

Training Process
The optimal hyperparameters listed in Table 3 are used to train the machine learning models. Figure 7 shows how the MSE values of the different machine learning methods change during the training process. As the number of data increases, the MSE of the training dataset increases because it becomes more difficult for the machine learning model to fit all the data; the MSE of the cross-validation decreases, meaning that the generalization performance of the machine learning model continues to improve; and the MSE of the cross-validation approaches but remains larger than the MSE of the training dataset, indicating that neither overfitting nor underfitting occurs.

Prediction Results of Mechanical Properties
Based on the trained machine learning models, the compressive strength, tensile strength, and tensile strain capacity can be predicted. The prediction results are compared with the actual test results, as shown in Table 4.

Prediction accuracy is measured by the R² value, where a larger R² indicates higher accuracy. The results for the training and the testing datasets are compared separately. Among the four machine learning methods, XGBoost shows the highest accuracy for all three investigated mechanical properties, followed by SVR and then ANN; CART shows the lowest accuracy for all three properties. With XGBoost, the R² values of the compressive strength, tensile strength, and tensile strain capacity are 0.984, 0.993, and 0.989, respectively, for the training dataset, and 0.921, 0.957, and 0.896, respectively, for the testing dataset. The high accuracy of the XGBoost model can be attributed to its architecture, shown in Figure 1d, which can better represent the relationship between the input and output variables.
The predicted results of compressive strength, tensile strength, and tensile strain capacity from the ANN, SVR, CART, and XGBoost models are summarized in Table 5. For the compressive strength, the XGBoost model exhibits the highest accuracy: R² = 0.921, R = 0.966, and MSE = 45.57. For the tensile strength, the XGBoost model shows the highest accuracy: R² = 0.957, R = 0.980, and MSE = 0.602. For the tensile strain capacity, the XGBoost model also shows the highest accuracy: R² = 0.896, R = 0.955, and MSE = 0.617. Although XGBoost shows the highest accuracy, its R² for the tensile strain capacity is below 0.90, lower than for the compressive strength and the tensile strength, so further improvement is needed for the tensile strain capacity.
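For readers who want to reproduce the three reported metrics, the following minimal sketch computes MSE, the coefficient of determination R², and the Pearson correlation coefficient R from paired measured and predicted values. It assumes the standard textbook definitions; the exact formulas used to produce Table 5 are not restated here, and the sample values are made up for illustration only.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measurements and predictions."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Made-up values: every prediction is offset by +1 from the measurement.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]

metrics = {
    "MSE": mse(measured, predicted),
    "R2": r2_score(measured, predicted),
    "R": pearson_r(measured, predicted),
}
print(metrics)
```

In this toy case the predictions are perfectly correlated with the measurements (R = 1) but systematically biased, so R² is only about 0.2. Under these definitions, R² computed on predictions is not simply the square of R, which is one reason to report both together with the MSE.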

Effect of Supplemental Data
To further improve the prediction accuracy for the tensile strain capacity, Dataset 2, which includes the supplemental data generated from the semi-empirical model, is used to train the machine learning models. After data augmentation, the number of data points for the prediction of tensile strain capacity increases from 247 to 317. The correlation map for the variables is plotted in Figure 8. In Figure 8a, multicollinearity occurs when the 14 variables are used. In Figure 8b, the dataset is improved by PCA, which reduces the dimensionality from 14 to 12 and removes the multicollinearity.
The improved Dataset 2 is then adopted to train the predictive models using the four machine learning methods, and the evaluation results for the training and the testing datasets are shown in Table 6. Compared with Dataset 1, Dataset 2 improves the R² of the testing dataset from 0.754 to 0.868 for the ANN model, from 0.871 to 0.907 for the SVR model, from 0.703 to 0.817 for the CART model, and from 0.896 to 0.912 for the XGBoost model. Therefore, the prediction performance of all four machine learning models is improved by the proposed dataset augmentation method based on the micromechanics model.
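A correlation map such as the one in Figure 8 can be screened for multicollinearity by flagging variable pairs whose Pearson correlation is close to ±1. The sketch below illustrates only this screening step, not the PCA itself. The variable names, the sample values, and the 0.9 threshold are all hypothetical and are not taken from the dataset of this study.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.9):
    """List (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical columns: var_b is roughly 2 * var_a, var_c is unrelated.
columns = {
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "var_b": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],
    "var_c": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}

flagged = collinear_pairs(columns)
print(flagged)
```

Pairs flagged in this way carry nearly the same information, which is the situation PCA addresses by projecting the variables onto a smaller set of uncorrelated components.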

Implementation of the Predictive Models
In this section, the XGBoost models are used to predict the compressive strength, tensile strength, and tensile strain capacity. In [77], as metakaolin was used to replace 0 to 40% of the fly ash, the compressive strength increased from 55.3 MPa to 72.7 MPa because the metakaolin was more reactive. The trained XGBoost model is used to predict the compressive strength, as shown in Figure 9a, and the results show that the model can reasonably predict it. In [22], as fly ash was used to partially replace cement, the tensile strength of the mixture changed. The trained XGBoost model is used to predict the tensile strength, as shown in Figure 9b, and the results show that the model can reasonably predict it. In [63], as slag was used to replace 0 to 30% of the cement, the tensile strain capacity changed. The XGBoost models trained using Dataset 1 and Dataset 2, respectively, are used to predict the tensile strain capacity, as shown in Figure 9c, and the results show that the models can reasonably predict it. These results show that the developed machine learning models are promising for parametric studies on the effects of the mix design variables on the mechanical properties.
Figure 9. Comparison of the prediction against the test results: (a) compressive strength [77]; (b) tensile strength [22]; and (c) tensile strain capacity [63].

Conclusions
This research develops a new paradigm for predicting the mechanical properties of HPFRCC by integrating micromechanics and machine learning. Two strategies are presented to utilize micromechanics. Multiple methods are proposed to improve the prediction accuracy by improving the datasets. Four machine learning models are compared and used to predict the compressive strength, tensile strength, and tensile strain capacity of HPFRCC.
Based on the above investigations, the following conclusions can be drawn:
• The proposed methods provide reasonable prediction accuracy for the tensile strain capacity (or ductility), as well as the compressive and tensile strengths of HPFRCC. Among the investigated machine learning methods, the XGBoost method shows the highest prediction accuracy for all the investigated mechanical properties. With the training dataset, the R² of the compressive strength, tensile strength, and ductility reached 0.984, 0.993, and 0.989, respectively. With the testing dataset, the R² of the compressive strength, tensile strength, and ductility reached 0.921, 0.957, and 0.896, respectively.
• The prediction accuracy for the tensile strain capacity can be further improved by using the supplemental data generated from the micromechanics model. With the addition of only 70 data points, the R² of the tensile strain capacity predicted by the XGBoost model increases from 0.896 to 0.912 for the testing dataset.
• The predictive models are implemented to predict the mechanical properties of HPFRCC. The comparison of the prediction and test results further supports the prediction accuracy of the developed models. The implementation also demonstrates possible use cases of the predictive models for replacing or supplementing the experimental tests in the development and optimization of HPFRCC.

Future research is needed to investigate the performance of the proposed method for prediction of the other important properties of HPFRCC, such as the fresh properties (e.g., flowability) and the durability, and more research is needed to test the applicability of the method for other composites. It is envisioned that the developed prediction method can be used to facilitate optimization of the mix design of HPFRCC, so as to maximize the mechanical properties, the cost-effectiveness, and the durability, while minimizing the environmental impacts (e.g., carbon footprint and energy consumption).