Eco-Innovation in Construction: Forecasting Natural Fiber-Reinforced Concrete Strength Using Machine Learning

Zghair, Hussein H.; Harith, Iman Kattoof; Hussain, Tholfekar Habeeb

doi:10.3390/buildings16081529


by Hussein H. Zghair ¹,*, Iman Kattoof Harith ²,* and Tholfekar Habeeb Hussain ³

¹ Civil Engineering Department, University of Technology-Iraq, Baghdad 10066, Iraq

² Civil Engineering Department, College of Engineering, Al-Qasim Green University, Babylon 51013, Iraq

³ Water Resources Management Engineering Department, College of Engineering, Al-Qasim Green University, Babylon 51013, Iraq

* Authors to whom correspondence should be addressed.

Buildings 2026, 16(8), 1529; https://doi.org/10.3390/buildings16081529

Submission received: 8 March 2026 / Revised: 31 March 2026 / Accepted: 9 April 2026 / Published: 14 April 2026

(This article belongs to the Section Building Materials, and Repair & Renovation)


Abstract

Traditional concrete faces challenges such as low energy absorption, brittleness and major environmental impacts, attributed to its dependence on natural resources. Integrating natural fibers with recycled coarse aggregates into concrete presents a promising method of enhancing concrete’s sustainability and mechanical performance. Still, accurately predicting the mechanical properties of these innovative concrete mixes remains complex. This research investigates the predictive abilities of two machine learning (ML) models, classification and regression trees (CART) and stepwise polynomial regression (SPR), for estimating the compressive and splitting tensile strengths of NF-reinforced concrete containing recycled coarse aggregates. The CART model showed greater predictive accuracy, reaching R² = 0.91 for compressive strength and R² = 0.89 for splitting tensile strength. Additionally, the model demonstrated consistently lower error metrics (RMSE, MAD, MAPE, MSE) than comparable approaches. For compressive strength, CART achieved R² = 0.91, RMSE = 5.5686, MSE = 31.0098, MAD = 4.1076, and MAPE = 0.1055, while for splitting tensile strength, it achieved R² = 0.89, RMSE = 0.3954, MSE = 0.1563, MAD = 0.2996, and MAPE = 0.0939. These results emphasize the significant potential of ML, particularly CART, to optimize the design of sustainable concrete mixtures, enabling more accurate and effective strength predictions and finally contributing to more resilient and sustainable infrastructure.

Keywords:

sustainable concrete; natural fiber; recycled coarse aggregate; machine learning; prediction model

1. Introduction

The construction industry is under increasing pressure to adopt sustainable practices that mitigate the environmental impact of traditional building materials, which contribute significantly to resource depletion and climate change. This urgency has driven the exploration of eco-friendly alternatives such as natural fibers (e.g., coir, bamboo, sisal) and recycled coarse aggregates (RCAs). Natural fibers, derived from renewable resources (Figure 1), offer advantages such as high specific strength, low density, and biodegradability; they enhance concrete’s tensile strength, energy absorption, and impact resistance while reducing reliance on petroleum-based materials [1,2,3]. At the same time, RCAs recovered from demolished infrastructure help address construction waste, preserving natural resources and easing pressure on landfills [4]. However, integrating recycled materials into concrete introduces difficulties. The old mortar bonded to RCA increases porosity, which weakens the interfacial transition zone and compromises mechanical properties [5,6,7,8,9,10]. Furthermore, natural fibers must be dosed within a specific range to balance workability and mechanical performance [11].

Machine learning (ML) has emerged as a critical tool for optimizing these sustainable composites. Recent studies highlight its potential but also reveal important limitations. For example, gradient boosting achieved high accuracy (R² = 0.91) in predicting the compressive strength of recycled concrete powder (RCP)-modified concrete but focused solely on cement replacement [12]. Artificial neural networks (ANNs) demonstrated strong training performance (R²: 0.93–0.99) for RCA concrete but suffered from overfitting, yielding unreliable testing results (R²: 0.67–0.78) [13]. Random forest (RF) models excelled in dual-target predictions (compressive and tensile strength) but lacked interpretability for engineering applications [14]. Regression approaches such as stepwise polynomial regression (SPR) struggled with multicollinearity among predictors, which inflated the apparent significance of variables [15,16], while hybrid models (e.g., GA-ANN) and neural architectures (e.g., ANNs) were limited by computational cost or data requirements [17].

While the incorporation of sustainable materials in concrete has been widely investigated, achieving an optimal balance between performance and practical applicability remains a key challenge. Predictive modeling approaches for RCA concrete have progressed from early empirical formulations [18,19,20,21,22,23] to sophisticated machine learning techniques.

As shown in Table 1, previous studies display significant limitations. To address these gaps, this research examines the CART and SPR machine learning methods. CART provides transparent decision rules (e.g., RCA ≤ 40% thresholds) and ranks predictors (e.g., the dominance of the water-to-binder ratio, w/b), whereas SPR refines nonlinear terms via backward elimination. Dual-target predictions of splitting tensile and compressive strength capture material interdependencies and are validated rigorously through 10-fold cross-validation. This approach balances accuracy, interpretability, and robustness, advancing the design of sustainable concrete mixtures.

2. Research Methodology

The methodology used a dataset of 406 experimental records to evaluate the splitting tensile strength and compressive strength of natural fiber-reinforced concrete mixtures containing recycled coarse aggregates. Nine input variables were analyzed via descriptive statistics (mean, skewness, kurtosis) and histograms, supported by a Pearson correlation matrix that identifies key relationships. Two machine learning models were developed: stepwise polynomial regression (SPR), using backward elimination (p < 0.05) to retain significant second-order terms, and a classification and regression tree (CART), pruned (α = 0.01) with a minimum node size of three. Generalizability was tested via 10-fold cross-validation using R², RMSE, MAD, MAPE, and MSE. The research methodology flowchart is presented in Figure 1.

2.1. Data Preparation and Distribution

To derive a robust model, a wide-ranging dataset was compiled from previously published studies [4,24,25,26,27,28,29,30,31,32,33]. This dataset includes 406 experimental results, each covering measurements of cement (C), water-to-binder ratio (W/B), fine aggregate (FA), coarse aggregate (CA), recycled coarse aggregate (RCA), supplementary cementitious materials (SCMs), superplasticizer (SP), natural fiber (NF), age, and the corresponding compressive strength ($f_{cu}$) and splitting tensile strength ($f_{t}$) of sustainable concrete. The natural fibers include bamboo, sisal, coir, jute, ramie, and kenaf fibers.

The compiled dataset exhibits intentional heterogeneity, reflecting real-world construction variability. Specimen geometry differences (cubes vs. cylinders) were addressed by converting all strength values to equivalent 150 mm cube strengths using conversion factors (cylinder × 1.25 for slenderness 2.0) per British and European standards for testing the compressive strength of hardened concrete (BS EN 12390-3: 2019 [34]). Curing temperature variations (23–27 °C) fall within standard laboratory tolerance (±2 °C of target).
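The geometry conversion described above is a single multiplication, and a minimal sketch makes the arithmetic explicit. The function name and the specimen labels are illustrative assumptions; only the 1.25 factor for cylinders with slenderness 2.0 comes from the text.

```python
# Factor quoted in the text for cylinders with slenderness 2.0.
CYLINDER_TO_CUBE = 1.25


def to_cube_strength(strength_mpa: float, specimen: str) -> float:
    """Return the equivalent 150 mm cube strength in MPa (hypothetical helper)."""
    if specimen == "cube":
        return strength_mpa
    if specimen == "cylinder":
        return strength_mpa * CYLINDER_TO_CUBE
    raise ValueError(f"unknown specimen type: {specimen!r}")


# A 32 MPa cylinder result maps to a 40 MPa cube-equivalent strength.
print(to_cube_strength(32.0, "cylinder"))
```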

However, fiber treatment heterogeneity introduces uncontrolled variability: alkali-treated jute, untreated coir, and steam-cured bamboo form NF-specific subpopulations that the aggregated ‘NF’ variable cannot distinguish. This is acknowledged as a limitation rather than a controlled variable.

No data preprocessing (normalization, outlier removal, or transformation) was applied to preserve real-world variability. Missing values (3.2% of dataset) were addressed through complete-case analysis (n = 393) rather than imputation to avoid introducing artificial patterns. Sensitivity analysis confirmed no significant difference between complete-case and imputed results (p > 0.05).

Table 2 presents basic statistics for these key variables: the range of values (minimum and maximum), the central tendency (mean), and the variability (coefficient of variation). It also reports kurtosis and skewness, which characterize the shape of each distribution beyond simple measures of central tendency and variability. Notably, the raw dataset (n = 406) was used without preprocessing (e.g., normalization or outlier removal) so that the models reflect real-world variability. The high coefficient of variation for NF (194.84%) reflects a bimodal distribution, with structural zeros for fiber-free mixes alongside active dosages, rather than a measurement error.
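To see why a variable with many structural zeros can have a coefficient of variation well above 100%, the following sketch compares a zero-inflated sample with a sample of active dosages only. The dosage values are hypothetical and are not taken from the dataset; the sample standard deviation is an assumed convention.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, in percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical fiber dosages (%): mostly fiber-free mixes plus a few dosed mixes.
zero_inflated = [0.0] * 8 + [0.5, 1.0]
# Hypothetical dosages for fiber-reinforced mixes only.
dosed_only = [0.4, 0.5, 0.6]

print(round(coefficient_of_variation(zero_inflated), 1))
print(round(coefficient_of_variation(dosed_only), 1))
```

The zero-inflated sample has a small mean but a comparatively large spread, so its ratio exceeds 100%, whereas the dosed-only sample stays far below that level.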

The histograms in Figure 2 show a varied range of distribution shapes. Many variables display approximately normal distributions, characterized by bell-shaped curves, whereas others display skewed distributions, either negatively or positively. The mean values, marked by vertical red lines, indicate the central tendency of each variable. The shapes of the distributions themselves offer valuable insights into the underlying data. For instance, a skewed distribution may indicate the presence of outliers or show that the data are not evenly spread.

2.2. Correlation Matrix

A correlation matrix is a statistical tool that summarizes the relationships between the variables in a dataset. This square, symmetric matrix contains a correlation coefficient (often Pearson’s R) for each pair of variables. Pearson’s R measures the strength and direction of a linear relationship, ranging from −1 (perfect negative) to +1 (perfect positive); a value of 0 indicates no linear correlation. Importantly, Pearson’s R may not accurately reflect nonlinear relationships. Correlation matrices offer a concise overview of interdependencies within a dataset, helping researchers understand which variables are associated and how strongly [35].
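The caveat about nonlinear relationships can be illustrated with a short sketch. It computes Pearson's R from its definition for two toy series (both hypothetical): an exactly linear one and a symmetric parabola. The first gives a coefficient of essentially 1, while the second gives 0 even though y is fully determined by x.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [-2, -1, 0, 1, 2]
r_linear = pearson_r(x, [2 * v + 1 for v in x])   # exactly linear relation
r_parabola = pearson_r(x, [v ** 2 for v in x])    # deterministic but nonlinear

print(r_linear, r_parabola)
```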

Figure 3 shows a Pearson correlation matrix between the input variables and the compressive strength ($f_{cu}$). Important linear relationships are identified, including strong negative correlations for the water-to-binder ratio (W/B, R = −0.76) and the fine aggregate content (FA, R = −0.53), and positive correlations for supplementary cementitious materials (SCM, R = +0.54) and superplasticizer content (SP, R = +0.69). The notably high correlation for SP likely reflects its role in reducing water demand and improving workability, which indirectly improves compaction and strength. In contrast, the moderate correlation for SCM may arise from dosage-dependent effects or from synergistic interactions with other binders, which are explored further through nonlinear modeling.

These strong correlations raise concerns about multicollinearity among the predictors, which may inflate the perceived impact of specific variables in linear analyses. To support the prediction of compressive ($f_{cu}$) and tensile strength ($f_{t}$), we will use machine learning techniques, e.g., regularization and sensitivity analysis, to disentangle the complex relationships and quantify variable contributions more robustly. While $f_{t}$ displays weaker linear relationships with the input variables, the comparative analysis of compressive and tensile behavior will show how material factors govern these two mechanical properties differently. This dual approach supports a balanced interpretation of variable impacts, mainly for SP and SCM, that goes beyond conclusions drawn from correlation alone.

2.3. Regression and ML Model Assumptions and Limits

2.3.1. Stepwise Polynomial Regression Model

Stepwise polynomial regression (SPR) is a statistical method for constructing predictive models. It iteratively improves the model by adding or eliminating predictor terms based on their statistical significance. Starting from a basic model, SPR progressively improves accuracy by identifying the most influential terms. The procedure can use forward selection (adding variables), backward elimination (removing variables), or a combination of both. A possible limitation of these methods is their limited ability to detect and incorporate interactions between different predictor variables [36]. In the SPR method, a second-order polynomial model is used, including nine independent variables and one dependent variable (y).

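y = b_{0} + \sum_{i} b_{i} x_{i} + \sum_{i < j} α_{i j} x_{i} x_{j} + \sum_{i} α_{i i} x_{i}^{2}

(1)

where y denotes the predicted value of the response variable, b₀ is the constant term, b_i is the coefficient of the linear term in the independent variable x_i, α_ij is the coefficient of the interaction term x_i x_j, and α_ii is the coefficient of the squared term x_i².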
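To give a sense of the size of the full second-order model in Equation (1) before any elimination, the sketch below enumerates its candidate terms for the nine inputs listed in Section 2.1. The term labels and the helper name are illustrative assumptions; the sketch only counts terms and does not perform the stepwise selection itself.

```python
from itertools import combinations

# Input variables as abbreviated in Section 2.1.
VARIABLES = ["C", "W/B", "FA", "CA", "RCA", "SCM", "SP", "NF", "Age"]


def second_order_terms(names):
    """List the linear, interaction, and squared terms of Equation (1)."""
    linear = list(names)
    interactions = [f"{a}*{b}" for a, b in combinations(names, 2)]
    squares = [f"{a}*{a}" for a in names]
    return linear + interactions + squares


terms = second_order_terms(VARIABLES)
# 9 linear + 36 interaction + 9 squared terms, plus the intercept b0.
print(len(terms))
```

Backward elimination then starts from these 54 candidate terms and removes those that fail the significance threshold.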

Numerous studies have explored the application of SPR in predicting concrete performance [15,37]. The primary strength of SPR compared to other techniques lies in its automated feature selection and model building process.

The SPR model was designed with quadratic complexity as the maximum polynomial order, striking a balance between capturing curved relationships and avoiding overfitting. Variables, including interaction and squared terms, were retained only if they reached a statistical significance threshold of 0.05 for inclusion in the final specification. To assess robustness, model stability and predictive performance were evaluated across different data subsets using 10-fold cross-validation, ensuring reliable out-of-sample predictions.

2.3.2. Classification and Regression Tree (CART) Model

Classification and regression tree (CART) is a powerful machine learning method used to build predictive models. It employs decision trees to classify categorical outcomes and predict continuous values. The CART algorithm first grows a large tree and then prunes it back to enhance simplicity and prevent overfitting. For categorical target variables, the Gini index measures impurity within each node of the tree. For continuous targets, the least squares deviation method determines the optimal splits [38,39].

Numerous studies have examined the application of CART in predicting concrete performance, as evidenced by [40,41,42]. A key advantage of CART lies in its ability to automatically identify the most influential predictors and determine the optimal threshold values for each predictor to effectively classify the target variable.

The CART model was tuned using the following hyperparameters:

Pruning (α): Cost-complexity pruning with α = 0.01 was applied to balance tree depth and overfitting.
Terminal Node Size: A minimum of three observations per node was enforced to ensure meaningful splits.
Splitting Criterion: The least squares deviation method minimized variance in terminal nodes for regression tasks.
Validation: The optimal subtree was selected via 10-fold cross-validation, which minimized RMSE while preserving predictive power.
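The least-squares splitting criterion with a minimum node size can be sketched for a single predictor as follows. This is a simplified illustration under stated assumptions, not the software used in the study: the function names are hypothetical, the data are invented, and a full CART implementation would repeat this search over all predictors and recurse on the child nodes.

```python
def sse(values):
    """Sum of squared deviations from the mean (least-squares node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y, min_node_size=3):
    """Return (threshold, total_sse) for the best binary split x <= threshold."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # no split: impurity of the parent node
    for i in range(min_node_size, len(pairs) - min_node_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, total)
    return best


# Hypothetical data: strength drops sharply once W/B exceeds about 0.475.
wb = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
fcu = [62.0, 60.0, 58.0, 56.0, 36.0, 34.0, 32.0, 30.0]
threshold, impurity = best_split(wb, fcu)
print(threshold, impurity)
```

Each candidate threshold is scored by the summed squared deviation of the two child nodes, and candidates that would leave fewer than three observations in a node are never considered.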

2.4. K-Fold Cross-Validation Technique

Cross-validation (CV) is an essential technique for evaluating the performance of machine learning (ML) models and ensuring their ability to generalize to unseen data. By splitting the dataset into subsets—training the model on one portion and validating it on the remaining data—CV helps reduce both bias and overfitting [43].

K-fold cross-validation is a widely adopted approach, in which the dataset is divided into ‘k’ equal-sized folds. The model is trained on (k − 1) folds and evaluated on the remaining fold. This process is repeated k times, with each fold serving as the validation set exactly once. Common performance metrics, such as root mean squared error (RMSE) and R-squared, are used to assess model performance [44]. In particular, 10-fold CV is often considered a good trade-off between reliable variance estimation and computational efficiency [45,46].

In this research, 10-fold CV was employed on the 406-instance dataset to assess model performance. The model with the lowest average RMSE across the 10 iterations was selected as the best-performing model (Figure 4).
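The fold construction can be sketched in a few lines. The shuffling seed and the function name are assumptions for illustration; the study does not state how the folds were drawn. Because 406 is not divisible by 10, the folds cannot be exactly equal in size, and the sketch gives the first six folds one extra sample.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % k folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(406, k=10))
print([len(test) for _, test in folds])
```

In each iteration the model would be fitted on `train_idx` and scored on `test_idx`, and the ten scores averaged.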

2.5. Criteria for Evaluating Models

To evaluate the predictive capabilities of the present models, it is crucial to employ a diverse set of metrics [48]. Error metrics, including mean square error (MSE), root mean square error (RMSE), mean absolute deviation (MAD), and mean absolute percentage error (MAPE), quantify the discrepancy between model predictions and actual observations; lower values indicate better model performance. Other metrics, including R squared (R²), adjusted R squared, and predicted R squared, evaluate the model’s ability to capture the trends observed in the real-world data. Together, Equations (2)–(6) provide a complete assessment of the predictive precision of the present models.

Note: y_i = actual (observed) value, ŷ_i = predicted value, \bar{y} and \bar{\hat{y}} = their respective means, n = number of observations

R^{2} = \left( \frac{\sum_{i = 1}^{n} (y_{i} - \bar{y}) (\hat{y}_{i} - \bar{\hat{y}})}{\sqrt{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2} \sum_{i = 1}^{n} (\hat{y}_{i} - \bar{\hat{y}})^{2}}} \right)^{2}

(2)

MSE = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}

(3)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}

(4)

MAD = \frac{1}{n} \sum_{i = 1}^{n} \left| y_{i} - \hat{y}_{i} \right|

(5)

MAPE = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right| \times 100

(6)

Regarding Equation (6), the mean absolute percentage error is undefined or becomes unstable when the actual value (y_i) is zero or close to zero. In the dataset used in this research, compressive strength ranges from 9.75 MPa to 96.7 MPa, and splitting tensile strength ranges from 1.17 MPa to 6.20 MPa. All actual values are positive and reasonably far from zero; consequently, the MAPE calculation remains stable. However, this remains a significant caveat for general applications.
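A compact sketch of Equations (2)–(6) is given below. The function name and the toy values are illustrative assumptions. The sketch follows Equation (6) and multiplies MAPE by 100; the MAPE values reported in this article (e.g., 0.1055) appear to be expressed as fractions, which would correspond to omitting that factor. It also raises an error when an actual value is zero, reflecting the caveat above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MSE, RMSE, MAD, and MAPE following Equations (2)-(6)."""
    if any(a == 0 for a in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    n = len(y_true)
    errors = [a - p for a, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mad = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, y_true)) / n
    # R2 as the squared Pearson correlation between observed and predicted.
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    num = sum((a - mean_t) * (p - mean_p) for a, p in zip(y_true, y_pred))
    den = math.sqrt(sum((a - mean_t) ** 2 for a in y_true)
                    * sum((p - mean_p) ** 2 for p in y_pred))
    return {"R2": (num / den) ** 2, "MSE": mse, "RMSE": math.sqrt(mse),
            "MAD": mad, "MAPE": mape}


# Hypothetical observed and predicted strengths (MPa).
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
print(metrics)
```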

3. Findings and Interpretation

3.1. Stepwise Polynomial Regression Model

3.1.1. Compressive Strength

A stepwise polynomial regression (SPR) model was used to estimate the strength of concrete, beginning with a full second-order polynomial model (Equation (1)) to account for possible nonlinear relationships while avoiding excessive complexity. The model was refined iteratively by backward elimination: terms without a statistically significant influence (p-value > 0.05) were systematically removed until only significant predictors (p-value < 0.05) remained. This approach balances interpretability and predictive power: limiting the polynomial to second-degree terms guards against overfitting, while the p-value threshold streamlines the model by excluding redundant interactions. The final parsimonious model (Equation (7)) retained five of the nine candidate variables (C, W/B, SP, NF, and age) and showed strong generalizability under 10-fold cross-validation (R² = 0.86–0.89). This reduced specification lowers computational complexity while preserving critical relationships between input variables, for example, the interaction between superplasticizer content and water-to-binder ratio, which significantly influenced compressive strength (p < 0.001). By adhering to statistical and domain-driven principles, the SPR model captured the mechanical performance of sustainable concrete while remaining transparent for engineering use.

\begin{aligned} f_{cu} = {} & -3.5 + 0.2212\,C - 52.86\,(W/B) + 2.696\,\mathrm{SP} + 26.09\,\mathrm{NF} + 0.698\,\mathrm{Age} - 0.000218\,C^{2} \\ & + 1.651\,\mathrm{NF}^{2} - 0.002384\,\mathrm{Age}^{2} - 0.0356\,C \cdot \mathrm{NF} - 0.000498\,C \cdot \mathrm{Age} \\ & - 4.39\,(W/B) \cdot \mathrm{SP} - 33.1\,(W/B) \cdot \mathrm{NF} - 1.188\,\mathrm{SP} \cdot \mathrm{NF} + 0.00619\,\mathrm{SP} \cdot \mathrm{Age} - 0.0928\,\mathrm{NF} \cdot \mathrm{Age} \end{aligned}

(7)
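For readers who want to evaluate Equation (7) directly, a minimal transcription is sketched below. The function and argument names are illustrative. The units of each input are an assumption here: they must match those of the original dataset (Table 2), which are not restated in this section, so the example mix is hypothetical and its output should not be read as a validated prediction.

```python
def fcu_spr(c, wb, sp, nf, age):
    """Compressive strength from Equation (7); inputs in the dataset's units."""
    return (
        -3.5
        + 0.2212 * c - 52.86 * wb + 2.696 * sp + 26.09 * nf + 0.698 * age
        - 0.000218 * c * c + 1.651 * nf * nf - 0.002384 * age * age
        - 0.0356 * c * nf - 0.000498 * c * age
        - 4.39 * wb * sp - 33.1 * wb * nf - 1.188 * sp * nf
        + 0.00619 * sp * age - 0.0928 * nf * age
    )


# Hypothetical mix: cement 400, W/B 0.40, no superplasticizer, no fiber, age 28.
print(round(fcu_spr(c=400, wb=0.40, sp=0.0, nf=0.0, age=28), 2))
```

As with any fitted polynomial, the expression should only be evaluated within the range of the data used to fit it.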

Equation (7) defines a nonlinear relationship between the compressive strength and the studied variables, and its structure shows that a majority of the studied variables significantly affect the compressive strength ($f_{cu}$). The C × Age interaction (p = 0.103) was retained for model completeness, although it does not meet the 0.05 significance threshold. The exclusion of the RCA and SCM terms from the SPR model reflects the method’s sensitivity to multicollinearity rather than physical insignificance. RCA’s porosity effects are statistically absorbed by the dominant W/B ratio term (R = −0.76), whereas SCM’s pozzolanic contribution overlaps with the cement quadratic term. This limitation underscores why CART, which handles multicollinearity through recursive partitioning, retains all predictors and identifies SCM as the third-most-important variable (75.8%). NF terms persist in the SPR model because fibers display distinct threshold-dependent behavior (optimal NF ≈ 0.5%) that polynomial terms capture efficiently.

Table 3 shows that the linear terms of variables such as cement, superplasticizer, water-to-binder ratio (W/B), natural fiber, and age significantly affect the response variable, with p-values below 0.05. On the other hand, the quadratic terms of most variables have insignificant effects on the response variable, as indicated by p-values > 0.05, leading to their removal from the model. Notably, the quadratic terms for natural fiber, cement, and age have a statistically significant effect on the response variable, suggesting that these variables individually exert a considerable effect on the response. Also, interactions between most variables are statistically significant for the response variable, with p-values below 0.05 [49,50].

To evaluate the model’s ability to generalize to unseen data, the coefficient of determination (R²) was examined. An R² of 87.34%, with an adjusted R² of 86.85% and a predicted R² of 86.45%, indicates strong predictive performance. The 10-fold cross-validated R² was 86.05%, which likewise indicates good predictive behavior.

Figure 5 displays the normal probability plot for the residuals of the compressive strength model. While the central portion of the residuals aligns closely with the reference line, deviations in the upper and lower tails suggest mild non-normality. These deviations reflect inherent variability in the material system, mainly at extreme strength values influenced by stochastic effects such as localized micro-cracking or non-uniform binder dispersion.

Figure 6 displays a “Versus Fits” plot, which visualizes the residuals (errors) of the model that predicts the compressive strength $f_{cu}$ of sustainable concrete. The horizontal axis shows the fitted values, i.e., the values predicted by the model for each data point. The vertical axis shows the residuals, which are the differences between the observed compressive strength $f_{cu}$ and the corresponding fitted values. The position of each dot therefore relates a fitted value to its residual. Ideally, the residuals should be randomly distributed around the horizontal line at zero, which would suggest that the model is appropriate for the dataset, with no noticeable patterns or trends in the errors. In this plot, the residuals appear to be randomly dispersed around the zero line, with no strong indication of systematic patterns or trends. Based on this visual examination, the “Versus Fits” plot suggests that the model used to predict $f_{cu}$ is likely a good fit for the data, and the random distribution of residuals indicates that its predictions are reasonably accurate.

3.1.2. Splitting Tensile Strength

A stepwise polynomial regression approach was employed, beginning with a full second-order polynomial (Equation (1)). Terms with p > 0.05 were iteratively removed, resulting in a reduced model containing nine variables and, as listed in Equation (8), 35 terms (9 linear, 6 quadratic, and 20 interaction effects).

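\begin{aligned} f_{t} = {} & 113.2 - 0.1655\,C + 204.7\,(W/B) - 0.0816\,\mathrm{FA} - 0.1939\,\mathrm{CA} + 0.2185\,\mathrm{RCA} - 4.332\,\mathrm{SCM} \\ & + 15.24\,\mathrm{SP} - 0.23\,\mathrm{NF} + 0.0269\,\mathrm{Age} - 44.2\,(W/B)^{2} + 0.000072\,\mathrm{FA}^{2} \\ & + 0.000048\,\mathrm{CA}^{2} - 0.000235\,\mathrm{RCA}^{2} - 0.1726\,\mathrm{NF}^{2} - 0.000145\,\mathrm{Age}^{2} \\ & - 0.1787\,C \cdot (W/B) + 0.000128\,C \cdot \mathrm{FA} + 0.000156\,C \cdot \mathrm{CA} - 0.000073\,C \cdot \mathrm{Age} \\ & - 0.1835\,(W/B) \cdot \mathrm{FA} + 0.0355\,(W/B) \cdot \mathrm{CA} - 0.2811\,(W/B) \cdot \mathrm{RCA} + 4.27\,(W/B) \cdot \mathrm{SCM} \\ & - 14.12\,(W/B) \cdot \mathrm{SP} - 7.09\,(W/B) \cdot \mathrm{NF} + 0.000023\,\mathrm{FA} \cdot \mathrm{CA} - 0.000922\,\mathrm{FA} \cdot \mathrm{SCM} \\ & + 0.00571\,\mathrm{FA} \cdot \mathrm{NF} - 0.000062\,\mathrm{CA} \cdot \mathrm{RCA} + 0.003071\,\mathrm{CA} \cdot \mathrm{SCM} - 0.00928\,\mathrm{CA} \cdot \mathrm{SP} \\ & + 0.000030\,\mathrm{CA} \cdot \mathrm{Age} - 0.00377\,\mathrm{RCA} \cdot \mathrm{SCM} + 0.00894\,\mathrm{RCA} \cdot \mathrm{SP} + 0.000048\,\mathrm{RCA} \cdot \mathrm{Age} \end{aligned}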

(8)

Equation (8) defines a nonlinear relationship between the splitting tensile strength and the examined variables, and its structure shows that all of the examined variables influence the response variable. As presented in Table 4, the linear terms for cement, W/B ratio, coarse aggregate, recycled coarse aggregate, supplementary cementitious materials, superplasticizer, and age significantly affect $f_{t}$, with p-values less than 0.05.

Linear terms for FA and NF were retained despite marginal/insignificant p-values (0.058 and 0.832, respectively) because their associated quadratic and interaction terms (FA × SCM, FA × NF, NF², W/B × NF) demonstrated statistical significance (p < 0.05), satisfying the hierarchy principle for polynomial model structure.

The quadratic terms of most variables (W/B, FA, CA, RCA, NF, and age) significantly influence the response variable, as indicated by p-values less than 0.05. This suggests that these variables alone have a substantial impact on the response. Furthermore, interactions between most of these variables are also statistically significant for the response variable, with p-values less than 0.05 [50,51].

To evaluate the model’s capability to generalize to unseen data, the coefficient of determination (R²) was calculated. The R² values of 88.94%, with adjusted R² of 87.89% and predicted R² of 86.66%, indicate strong predictive performance. Ten-fold R² was 86.33%, which indicates good predictive performance.
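The adjusted R² values reported for both SPR models can be cross-checked with the standard adjustment formula. The sketch below assumes n = 406 observations and takes the number of model terms by counting the non-intercept terms in Equations (7) and (8), which gives 15 and 35, respectively; both the sample size used in fitting and these counts are assumptions of this check rather than values stated alongside the R² figures.

```python
def adjusted_r2(r2, n_samples, n_terms):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_terms - 1)


# Compressive strength model: R2 = 87.34%, 15 terms counted in Equation (7).
print(round(100 * adjusted_r2(0.8734, 406, 15), 2))  # 86.85
# Splitting tensile model: R2 = 88.94%, 35 terms counted in Equation (8).
print(round(100 * adjusted_r2(0.8894, 406, 35), 2))  # 87.89
```

Under these assumptions, both results agree with the reported adjusted R² values of 86.85% and 87.89%.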

Figure 7 visually assesses whether the residuals follow a normal distribution, a key assumption in many statistical analyses. The normality plot in Figure 7 suggests that the residuals of the $f_{t}$ model are approximately normally distributed, with some minor deviations.

Figure 8 shows a “Versus Fits” plot, which visualizes the residuals (errors) of the model that predicts the $f_{t}$ of sustainable concrete. In Figure 8, the residuals are scattered randomly around the zero line, with no discernible patterns such as funnels, curves, or clusters. This random distribution indicates that the model’s predictions are unbiased and consistent across the range of fitted values, and it suggests that the model is suitable for the data and that the chosen predictors capture the underlying relationships in the dataset. Residual analysis shows close agreement between observed and predicted values, with residuals falling within a range of −1 to 1. These outcomes suggest that the SPR model captured the patterns within the training data. Nevertheless, evaluating its predictive performance on unseen data requires additional testing with data beyond the training set.

3.2. CART Regression Model

3.2.1. Compressive Strength $f_{c u}$

The classification and regression tree (CART) model included all nine predictors (Table 5 and Figure 9), each of which was identified as important through its contribution to impurity reduction (for these regression targets, the least squares deviation criterion described in Section 2.3.2). The final optimized tree comprised 39 terminal nodes, with a minimum node size of three observations to prevent excessively granular splits. Tree size was selected using the one-standard-error (1-SE) rule, which chooses the simplest tree within one standard error of the maximum R² value to balance generalizability and complexity. Cost-complexity pruning (α = 0.01) was applied to trim branches that contributed little to predictive accuracy, yielding a parsimonious structure that retains the critical relationships, e.g., the dominance of the water-to-binder ratio. Model performance was validated via 10-fold cross-validation, achieving a testing RMSE of 5.57 MPa for compressive strength. These constraints (α = 0.01, node size = 3) helped the model capture nonlinear relationships without fitting noise, as shown by the close agreement between training (R² = 0.95) and testing (R² = 0.91) results. Nevertheless, the CART model displays a gap between training and testing performance, as shown in Table 5 (R²: 95.16% → 91.09%; MSE: 16.85 → 31.01), which indicates modest overfitting despite the pruning constraints. This gap is intrinsic to CART’s recursive partitioning mechanism, which optimizes splits to reduce error on the training data.
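The one-standard-error rule mentioned above can be sketched as a simple selection over candidate subtrees. The candidate values below are hypothetical and are not the cross-validation results of this study; the function name and tuple layout are likewise illustrative assumptions.

```python
def one_se_choice(candidates):
    """Pick the smallest tree whose CV R2 is within one SE of the best.

    candidates: list of (n_terminal_nodes, cv_r2, std_error) tuples.
    """
    best = max(candidates, key=lambda c: c[1])
    threshold = best[1] - best[2]
    eligible = [c for c in candidates if c[1] >= threshold]
    return min(eligible, key=lambda c: c[0])


# Hypothetical pruning sequence: larger trees score slightly higher.
subtrees = [
    (10, 0.840, 0.020),
    (20, 0.890, 0.020),
    (30, 0.900, 0.020),
    (50, 0.905, 0.020),
]
print(one_se_choice(subtrees))
```

In this toy sequence the 50-node tree has the highest score, but the 20-node tree lies within one standard error of it and is therefore preferred as the simpler model.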

The scatter plot in Figure 10 visually confirms the model’s accuracy on the training data, with points clustering closely around the diagonal line. Good training performance alone does not establish generalization, which must be judged on data the model has not seen. The training set displays a tighter fit, while the testing set still follows a robust overall trend, with a slight tendency to underestimate $f_{cu}$ at higher strength levels. This does not necessarily signify poor performance. The model’s effectiveness (Table 6) is further supported by its high R² (91.09%) and low RMSE (5.5686), MAD (4.1076), MAPE (0.1055), and MSE (31.0098). Figure 10 also highlights the importance of evaluating the overall spread and the deviation from the diagonal line, rather than focusing solely on the direction of bias. Despite the variability in the testing set, the model generalizes well to unseen data, indicating its potential for reliable estimates.

Boxplots offer a visual summary of data dispersion, displaying the median, quartiles, and outliers. In Figure 11, both the training and testing sets show roughly symmetric distributions of residuals centered around zero, suggesting unbiased errors. The comparable spread of residuals across the two sets indicates consistent prediction precision, which suggests good model performance and generalization to unseen data. Residuals reveal model errors and how well the model fits the dataset: outliers in the residuals may point to poorly modeled data points, while changes in residual spread with predicted values could indicate heteroscedasticity.

3.2.2. Splitting Tensile Strength

The CART model utilizes nine predictors, all of which were deemed important, as presented in Table 7 and Figure 12. The decision tree has 28 terminal nodes, with a minimum size of three observations per node. Node splitting is determined by the criterion of being within 1 standard error of the maximum R² value. The optimal tree is chosen based on minimizing the least squared error. Model validation is performed using 10-fold cross-validation.

The CART model demonstrates excellent performance on the training data, as evidenced by the high R² value (94.55%) and low RMSE (0.2858), MSE (0.0817), MAD (0.2301), and MAPE (0.0741), as presented in Table 8. Figure 13’s scatter plot visually confirms that the points are tightly clustered around the diagonal line, suggesting a good fit on the training data. This is expected, as the model was trained specifically on these data.

The scatter plot in Figure 13 shows that, whereas the training set has a tighter fit, the model still captures the overall trend on the testing set, but with more variability. There may be a tendency to underestimate $f_{t}$ at higher values. This difference from the training set may be attributed to differing distributions of data points in the training and testing sets, which can affect the model’s behavior. The model might also be too complex for the available data, leading to some overfitting.

The model’s effectiveness is further supported by its high R-squared value (89.56%) and low RMSE (0.3954), MSE (0.1563), MAPE (0.0939) and MAD (0.2996). Figure 13 highlights the importance of evaluating the overall spread and the deviation from the diagonal line, rather than focusing solely on the direction of bias. Despite the minor variability in the testing set, the model generalizes well to unseen data, indicating its potential for reliable estimates.
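The error metrics quoted above can be computed as in the sketch below. It assumes the standard definitions (MAD as the mean absolute deviation of the errors and MAPE as a fraction rather than a percentage, consistent with the values in Table 8); the data and the function name `regression_metrics` are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute R², RMSE, MSE, MAD and MAPE under the standard definitions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mad = sum(abs(e) for e in errors) / n
    # MAPE as a fraction (0.0939 corresponds to 9.39%).
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "rmse": rmse, "mse": mse, "mad": mad, "mape": mape}


# Hypothetical splitting tensile strengths (MPa).
actual = [2.0, 3.0, 4.0, 5.0]
predicted = [2.2, 2.9, 4.3, 4.6]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because RMSE is the square root of MSE, the reported pairs can be cross-checked directly: 0.3954² ≈ 0.1563 for the testing set in Table 8.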

Figure 14 shows that both the training and testing sets have symmetrically distributed residuals centered around zero, suggesting unbiased errors. The similar spread of residuals suggests consistent prediction precision across both sets. This indicates good model behavior and generalization to unseen data. Residuals reveal model errors, while the boxplot visualizes their spread.

3.2.3. Variable Significance

The significance of variables was examined using Gini impurity decreases for compressive and splitting tensile strength. Figure 15 and Figure 16 show the variable impacts on compressive and splitting tensile strength, respectively.

The water-to-binder ratio (W/B, 100% importance) governs compressive strength, as lower ratios improve matrix density by reducing porosity, whereas higher ratios weaken the binder. Superplasticizer (SP, 83.2%) enhances particle packing, and supplementary cementitious materials (SCMs, 75.8%) fill pores via the pozzolanic reaction. Age (34.2%) reflects curing benefits, whereas fine aggregate (FA, 21.2%) and coarse aggregate (CA, 19.7%) play minor roles as filler materials. Natural fibers (NF, 14.8%) balance crack bridging against void risks. Cement content (10%) displays diminished influence, likely due to synergies with SCMs, and the recycled coarse aggregate content (RCA, 7.6%) has only a slight impact, suggesting controlled quality or low replacement percentages.

Coarse aggregate (CA, 100% significance) is the primary driver of tensile strength, as its angularity and interlock improve load transfer. The water-to-binder ratio (W/B, 76.1%) and cement (75.1%) govern matrix adhesion and porosity, whereas FA (53.9%) improves packing and natural fibers (50.3%) delay crack propagation. SCMs (36.6%) and age (32.7%) have limited roles, related to durability and gradual hydration. Superplasticizer (SP, 26.8%) supports fiber distribution indirectly, while RCA (11.7%) has a minor impact, possibly because of controlled RCA quality. Overall, W/B, SP, and SCMs determine compressive strength (83% variability), whereas CA, W/B and cement govern tensile behavior (75%), emphasizing the aggregate-matrix synergy and the role of water content in the design of a sustainable concrete mixture.
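The percentages above appear to be relative importances, scaled so that the most influential predictor equals 100%. A minimal sketch of that scaling is given below. The raw scores are hypothetical values chosen only to reproduce the first four compressive strength percentages, and the function name `relative_importance` is illustrative.

```python
def relative_importance(raw_scores):
    """Scale raw importance scores so that the largest one equals 100%."""
    top = max(raw_scores.values())
    ranked = sorted(raw_scores.items(), key=lambda item: item[1], reverse=True)
    return {name: 100.0 * score / top for name, score in ranked}


# Hypothetical raw improvement scores (total error reduction credited to each predictor).
raw = {"Age": 1778.4, "SCM": 3941.6, "W/B": 5200.0, "SP": 4326.4}
relative = relative_importance(raw)
for name, percent in relative.items():
    print(f"{name}: {percent:.1f}%")
```

Since each value is expressed relative to the top predictor, the percentages do not sum to 100% and should be read as a ranking rather than as shares of explained variance.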

The variation in feature importance points to property-specific optimization strategies:

For compressive strength: Prioritize a strict water-to-binder ratio (e.g., W/B ≤ 0.43 via SP optimization) and SCM incorporation to refine the pore structure.
For tensile strength: Optimize aggregate gradation (well-graded CA) and balance cement content to improve matrix cohesion, supplemented by natural fibers (≤0.5% by volume) for crack resistance.
General guidelines: RCA quality (e.g., pre-treatment to reduce adhered mortar) and curing time are critical for reducing variability. CART-derived thresholds (e.g., W/B ≤ 0.43, NF ≤ 0.5%) suggest actionable rules to reduce trial and error in mixture design.
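The thresholds listed above lend themselves to a simple rule-based screen of candidate mixtures. The sketch below is illustrative only: the function name `screen_mix` and the message wording are hypothetical, and the limits are the CART-derived values quoted in the text (W/B ≤ 0.43, NF ≤ 0.5%).

```python
def screen_mix(wb_ratio, nf_percent, wb_limit=0.43, nf_limit=0.5):
    """Return a list of warnings for a candidate mix; an empty list means no rule is violated."""
    warnings = []
    if wb_ratio > wb_limit:
        warnings.append(
            f"W/B = {wb_ratio:.2f} exceeds {wb_limit:.2f}: expect lower compressive strength"
        )
    if nf_percent > nf_limit:
        warnings.append(
            f"NF = {nf_percent:.2f}% exceeds {nf_limit:.2f}%: void risk may offset crack bridging"
        )
    return warnings


# Two hypothetical candidate mixtures.
print(screen_mix(wb_ratio=0.40, nf_percent=0.3))
print(screen_mix(wb_ratio=0.50, nf_percent=1.0))
```

Such a screen is intended for preliminary design only, since the thresholds are generalized across fiber types and do not replace trial mixes.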

3.3. Model Validation and K-Fold Cross-Validation

To evaluate model generalizability and reliability, 10-fold cross-validation was employed. This robust method splits the dataset into ten subsets, iteratively training the model on nine of them and evaluating it on the remaining one. This approach offers a more reliable performance evaluation than a single train–test split.
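The partitioning step can be sketched in a few lines. The example below assumes a random shuffle with a fixed seed and uses the 406 compiled records mentioned in this article; the function name `k_fold_indices` is hypothetical, and the study's software may assign folds differently.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(406, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Every record is used for testing exactly once, and the cross-validated metric is computed from the predictions made on the held-out folds.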

Under 10-fold cross-validation, the CART model outperformed the SPR model. The 10-fold cross-validated R² values for the SPR model were 86.05% for compressive strength and 86.33% for splitting tensile strength. In contrast, the CART model attained higher 10-fold cross-validated R² values of 91.09% and 89.56%, respectively.

These 10-fold cross-validation outcomes, along with the other statistical metrics (RMSE, MSE, MAPE, MAD), indicate the strong generalization capacity and accuracy of the developed models. This suggests that the models are likely to perform reliably when applied to new, unseen data, making them valuable tools for estimating the strength of sustainable concrete mixtures in real-world engineering cases.

3.4. Assessment of Practical Machine Learning Models

Model performance was evaluated using a combined violin and boxplot analysis (Figure 17). This visualization compares data distributions, with the boxplot representing the interquartile range (IQR) and the violin plot depicting the probability density. The analysis showed that the CART model outperformed the SPR model in capturing the full range of observed results. This was evident in the violin plots, where the CART model more closely reproduced both the minimum and maximum values of both strength properties.

The comparison between CART (nonparametric, recursive partitioning) and SPR (parametric, constrained to a second-order polynomial) is intrinsically asymmetric in functional flexibility. CART’s superior fit metrics (R² = 0.91 vs. 0.87 for compressive strength) partly reflect its capacity to capture threshold effects and higher-order interactions beyond quadratic terms. We do not claim CART is ‘superior’ in absolute algorithmic terms, but rather that it offers more accurate predictions within an interpretable framework for this specific application.
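To make the parametric form of SPR concrete, the sketch below evaluates a second-order polynomial using the coefficients listed in Table 3. It assumes that Table 3 is the compressive strength model (its 10-fold R² of 86.05% matches the value quoted for compressive strength) and that inputs use the same units as the dataset in Table 2. The mixture values and the function name `spr_predict` are hypothetical.

```python
# Coefficients transcribed from Table 3; each key lists the variables multiplied together.
COEFS = {
    (): -3.5,
    ("C",): 0.2212,
    ("W/B",): -52.86,
    ("SP",): 2.696,
    ("NF",): 26.09,
    ("Age",): 0.698,
    ("C", "C"): -0.000218,
    ("NF", "NF"): 1.651,
    ("Age", "Age"): -0.002384,
    ("C", "NF"): -0.0356,
    ("C", "Age"): -0.000498,
    ("W/B", "SP"): -4.39,
    ("W/B", "NF"): -33.1,
    ("SP", "NF"): -1.188,
    ("SP", "Age"): 0.00619,
    ("NF", "Age"): -0.0928,
}


def spr_predict(mix):
    """Sum coefficient * product(variables) over all linear, quadratic and interaction terms."""
    total = 0.0
    for term, coef in COEFS.items():
        value = coef
        for name in term:
            value *= mix[name]
        total += value
    return total


# Hypothetical mixture close to the dataset means in Table 2, at 28 days.
mix = {"C": 445.0, "W/B": 0.42, "SP": 4.2, "NF": 0.33, "Age": 28.0}
predicted_strength = spr_predict(mix)
print(round(predicted_strength, 1))
```

Every term is a fixed smooth function of the inputs, so a sharp change in strength at a particular W/B value can only be approximated by curvature, whereas a tree represents it with a single split.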

Taylor diagrams offer a compact framework for assessing model behavior by comparing predictions with experimental data. They evaluate three key characteristics: correlation coefficient, root mean square error and standard deviation. Models positioned closer to the experimental point (shown as a red point) display higher correlation, closer agreement in variability, and lower overall error, signifying better performance [52]. Points lying along the correlation axis suggest models with comparable variability but weaker linear relationships. On the other hand, points along the standard deviation axis indicate models with good correlation but different levels of variability relative to the experimental data [17].
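The three statistics summarized by a Taylor diagram can be computed as sketched below. In Taylor's formulation, the error shown is the centered root mean square difference, which is tied to the two standard deviations and the correlation coefficient by a law-of-cosines relation; this is what allows all three to be plotted in a single diagram. The data and the function name `taylor_statistics` are hypothetical.

```python
import math


def taylor_statistics(observed, predicted):
    """Return (std_observed, std_predicted, correlation, centered_rms_difference)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Population standard deviations (divide by n), as in Taylor's formulation.
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    corr = cov / (std_o * std_p)
    # Centered RMS difference: the means are removed before differencing.
    crmsd = math.sqrt(
        sum(((p - mean_p) - (o - mean_o)) ** 2 for o, p in zip(observed, predicted)) / n
    )
    return std_o, std_p, corr, crmsd


# Hypothetical measured and predicted compressive strengths (MPa).
observed = [25.0, 32.0, 41.0, 48.0, 60.0, 72.0]
predicted = [27.0, 30.0, 43.0, 45.0, 63.0, 68.0]
std_o, std_p, corr, crmsd = taylor_statistics(observed, predicted)

# Law-of-cosines identity underlying the diagram geometry.
identity_rhs = std_o ** 2 + std_p ** 2 - 2.0 * std_o * std_p * corr
print(round(corr, 4), round(crmsd, 4), round(math.sqrt(identity_rhs), 4))
```

A model plots close to the experimental point when its standard deviation matches the observed one and its correlation is near one, which by the identity also implies a small centered RMS difference.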

Figure 18 displays Taylor diagrams for compressive and tensile strength ($f_{c u}$ and $f_{t}$) predictions. The results clearly establish the superior predictive accuracy of the CART model over the SPR model, as shown by higher correlation coefficients and lower RMSE values.

CART’s superiority stems from its capability to do the following:

Handle Nonlinearity and Multicollinearity: Unlike SPR, which relies on predefined polynomial terms (Equations (7) and (8)), CART intrinsically captures nonlinear interactions (e.g., threshold effects like W/B ≤ 0.43 for compressive strength) without the inflation of variable significance that multicollinearity can cause.
Offer Actionable Interpretability: CART’s hierarchical splits (e.g., NF ≤ 0.5% for optimal fiber content) offer engineers direct, rule-based guidelines for mixture design, bypassing SPR’s complex equations.
Generalize Robustly: Cross-validation confirmed CART’s stability (testing R² = 0.91 vs. SPR’s 0.86), as SPR’s backward elimination often discarded critical interactions (e.g., SP-SCM synergy).

4. Future Work

4.1. Statistical Robustness and Error Distribution

While the present SPR and CART models work within deterministic frameworks, the observed residual non-normality suggests that stochastic modelling approaches (probabilistic approaches with broader confidence intervals) might better quantify uncertainty in extreme strength predictions. Future work will explore robust regression procedures, as well as data transformations (e.g., Box–Cox), to accommodate distributional skewness, potentially enhancing reliability for safety-critical uses.
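As an indication of what such a transformation involves, the sketch below implements the standard Box–Cox transform and its inverse for a single positive value. The choice of λ values is arbitrary here. In practice λ would be estimated from the data, which this sketch does not attempt, and the function names are illustrative.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value: (y**lam - 1) / lam, or ln(y) when lam is 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original strength scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


# Hypothetical compressive strength (MPa) transformed with three arbitrary lambdas.
strength = 42.4
for lam in (0, 0.5, -0.5):
    z = box_cox(strength, lam)
    print(lam, round(z, 4), round(inverse_box_cox(z, lam), 4))
```

A model fitted on the transformed scale must be back-transformed before its predictions and error metrics are compared with those reported in MPa.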

4.2. Fiber-Type Heterogeneity and Model Limitations

A critical limitation of the present modelling approach is the grouping of six distinct natural fibers (coir, bamboo, sisal, jute, ramie, and kenaf) into a single continuous parameter (NF), representing fiber content. This simplification was required by dataset constraints, as individual fiber types were not consistently labeled across the 406 compiled experimental records from varied literature sources. Still, this aggregation intrinsically masks substantial mechanical differences between fiber types, which significantly influence the performance of concrete mixtures.

Research shows that these fibers display markedly different tensile strengths: ramie (400–1000 MPa), jute (320–800 MPa), sisal (540–720 MPa), bamboo (287–800 MPa), and coir (~200 MPa). Correspondingly, the elastic moduli (E) vary noticeably: ramie (24.5–128 GPa), jute (8.0–30.0 GPa), coir (3.0–6.0 GPa), and bamboo (5.5–12.6 GPa). Cellulose contents also differ substantially: coir (32–43%), jute (59–70%), ramie (70–83%), bamboo (80–94%) and sisal (65–76%). These mechanical and compositional differences translate to distinct behaviors in cementitious matrices, where ramie fibers can raise compressive strength by up to 27%, whereas cotton fibers may decrease the strength by 8%. Surface characteristics also differ: coir fibers possess rough, porous surfaces, which increase mechanical bonding with cement paste, whereas bamboo fibers have smoother surfaces and need different treatment protocols. Accordingly, the current model cannot distinguish between high-elastic-modulus ramie fibers and low-modulus coir fibers at equivalent contents, nor can it account for the superior alkali resistance of sisal compared to the rapid degradation of jute in high-pH conditions. This limitation decreases the model’s applicability for fiber-specific mix design optimization and likely contributes to the observed residual variability in predictions (testing MAPE: 10.55% for compressive strength, 9.39% for tensile strength).

4.3. Practical Engineering Implications

Although the CART model offers actionable thresholds (e.g., NF ≤ 0.5%), engineers should recognize that these represent generalized fiber content limits across heterogeneous fiber categories. The model is well suited for preliminary mixture design where fiber-type flexibility exists or where standardized fiber blends are used. For projects requiring specific performance optimization (e.g., high-strength applications containing ramie or impact-resistant elements with coir), fiber-specific characterization remains critical, and the model predictions should be treated as conservative estimates pending experimental validation.

4.4. Recommended Future Improvements

Future model iterations should include fiber type as a categorical variable (coir = 1, bamboo = 2, sisal = 3, ramie = 4, jute = 5, kenaf = 6) or group fibers by mechanical property clusters: (1) high-modulus group (ramie, bamboo), (2) medium-modulus group (sisal, jute), and (3) low-modulus group (coir, kenaf). This stratification would enable fiber-type-specific decision rules (“IF fiber type = ramie AND dosage ≤ 0.3% THEN compressive strength = high”), but it requires targeted experimental campaigns with consistent fiber labeling. Furthermore, using fiber physical properties (tensile strength, elastic modulus, aspect ratio, surface roughness) as direct input variables could provide a more mechanistically grounded approach that transcends categorical classifications.
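One possible encoding of fiber type is sketched below. It uses one-hot indicators instead of the integer codes listed above, on the assumption that the ordering implied by integer codes (coir = 1 through kenaf = 6) carries no physical meaning, and it attaches the modulus groups proposed in the text. All function and field names are hypothetical.

```python
FIBER_TYPES = ["coir", "bamboo", "sisal", "ramie", "jute", "kenaf"]

# Modulus groups as proposed in the text.
MODULUS_GROUP = {
    "ramie": "high",
    "bamboo": "high",
    "sisal": "medium",
    "jute": "medium",
    "coir": "low",
    "kenaf": "low",
}


def one_hot_fiber(fiber):
    """Return a 0/1 indicator vector with a single 1 at the fiber's position."""
    if fiber not in FIBER_TYPES:
        raise ValueError(f"unknown fiber type: {fiber}")
    return [1 if fiber == name else 0 for name in FIBER_TYPES]


def encode_record(fiber, dosage_percent):
    """Build model inputs for one record: fiber indicators, modulus group and dosage."""
    return {
        "fiber_onehot": one_hot_fiber(fiber),
        "modulus_group": MODULUS_GROUP[fiber],
        "NF": dosage_percent,
    }


record = encode_record("ramie", 0.3)
print(record)
```

With either encoding, a tree can split on fiber identity before splitting on dosage, which is what would make fiber-specific rules of the kind quoted above possible.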

5. Conclusions

This research developed a machine learning framework to predict the strength of natural fiber-reinforced recycled aggregate concrete mixtures, addressing critical gaps in interpretability and multicollinearity inherent to prior models. By using classification and regression trees (CART) and stepwise polynomial regression (SPR) as distinct methodologies, the trade-off between accuracy and interpretability was examined. The framework achieves dual-target predictions, compressive strength (R² = 0.91) and tensile strength (R² = 0.89), while independently elucidating material interactions. CART’s hierarchical splits exposed mechanistic insights: the water-to-binder ratio (W/B ≤ 0.43) controls compressive strength by densifying the cement paste matrix and countering RCA’s porosity, while the angularity of coarse aggregate controls tensile strength through interlock-driven crack resistance. Superplasticizer (SP)–fiber synergies enhance workability without compromising strength, as validated through SPR interaction terms (p < 0.001). These results connect ML predictions with materials science principles, contributing actionable mixture design limits for engineers (RCA ≤ 40%, NF ≤ 0.5%), thus reducing trial and error and supporting the adoption of sustainable concrete mixtures for construction.

The framework’s 10-fold cross-validated performance (R² = 0.86 to 0.91) confirms robustness across variable RCA qualities and fiber types, making it adaptable to real-world construction scenarios, where material consistency is challenging. By embedding materials science into predictive analytics, this work allows engineers to optimize eco-friendly concrete mixtures without sacrificing structural resilience, directly supporting circular economies through reduced reliance on virgin aggregate and less landfill waste. The future integration of durability metrics (e.g., chloride ingress) and hybrid ML approaches (e.g., XGBoost) will extend its utility to severe environments such as marine infrastructure, aligning with global goals for sustainable construction. Consequently, the CART-SPR framework not only improves predictive accuracy but also reshapes practical design models, converting theoretical insights into codifiable guidelines for next-generation sustainable infrastructure.

Author Contributions

H.H.Z. and I.K.H. wrote the main manuscript text; T.H.H. reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ghavami, K. Bamboo as reinforcement in structural concrete elements. Cem. Concr. Compos. 2005, 27, 637–649.
Rao, K.M.M.; Rao, K.M. Extraction and tensile properties of natural fibers: Vakka, date and bamboo. Compos. Struct. 2007, 77, 288–295.
Sivaraja, M.; Kandasamy; Velmani, N.; Pillai, M.S. Study on durability of natural fibre concrete composites using mechanical strength and microstructural properties. Bull. Mater. Sci. 2010, 33, 719–729.
Ajdukiewicz, A.; Kliszczewicz, A. Influence of recycled aggregates on mechanical properties of HS/HPC. Cem. Concr. Compos. 2002, 24, 269–279.
Silva, R.V.; De Brito, J.; Dhir, R.K. The influence of the use of recycled aggregates on the compressive strength of concrete: A review. Eur. J. Environ. Civ. Eng. 2014, 19, 825–849.
Silva, R.V.; de Brito, J.; Dhir, R.K. Establishing a relationship between modulus of elasticity and compressive strength of recycled aggregate concrete. J. Clean. Prod. 2016, 112, 2171–2186.
Silva, R.V.; De Brito, J.; Dhir, R.K. Tensile strength behaviour of recycled aggregate concrete. Constr. Build. Mater. 2015, 83, 108–118.
Silva, R.V. Use of recycled aggregates from construction and demolition waste in the production of structural concrete. Res. Net 2015, 10.
Kisku, N.; Joshi, H.; Ansari, M.; Panda, S.K.; Nayak, S.; Dutta, S.C. A critical review and assessment for usage of recycled aggregate as sustainable construction material. Constr. Build. Mater. 2017, 131, 721–740.
Johnson, U.; Rahman, M. Use of Recycled Concrete Aggregate in Concrete: A Review. J. Civ. Eng. Manag. 2013, 19, 796–810.
Guo, Y.; Zhang, J.; Chen, G.; Xie, Z. Compressive behaviour of concrete structures incorporating recycled concrete aggregates, rubber crumb and reinforced with steel fibre, subjected to elevated temperatures. J. Clean. Prod. 2014, 72, 193–203.
Manan, A.; Pu, Z.; Weiyi, C.; Ahmad, J.; Alattyih, W.; Umar, M.; Almujibah, H. Machine learning prediction of recycled concrete powder with experimental validation and life cycle assessment study. Case Stud. Constr. Mater. 2024, 21, e04053.
Manan, A.; Pu, Z.; Majdi, A.; Alattyih, W.; Elagan, S.K.; Ahmad, J. Sustainable optimization of concrete strength properties using artificial neural networks: A focus on mechanical performance. Mater. Res. Express 2025, 12, 025504.
Manan, A.; Pu, Z.; Ahmad, J.; Umar, M. Multi-targeted strength properties of recycled aggregate concrete through a machine learning approach. Eng. Comput. 2024, 42, 388–430.
Kattoof, I.; Ahmed, H.; Abdulhadi, M.; Hussien, M.L. Prediction of concrete strength by hierarchical stepwise regression using ultrasonic pulse velocity and Schmidt rebound hammer. Innov. Infrastruct. Solut. 2024, 9, 487.
Kattoof, I.; Ehsan, H.; Salman, E.; Hussien, M.L.; Mohammed, A.Y.; Nadir, W. Optimizing compressive strength of foamed concrete using stepwise regression. Discov. Appl. Sci. 2025, 7, 607.
Hashem, T.; Kattoof, I.; Hassan, N.; Mohammed, A.Y.; Hussien, M.L. Unlocking precision in hydraulic engineering: Machine learning insights into labyrinth sluice gate discharge coefficients. J. Hydroinform. 2024, 26, 2883–2901.
González-Taboada, I.; González-Fonteboa, B.; Martínez-Abella, F.; Pérez-Ordóñez, J.L. Prediction of the mechanical properties of structural recycled concrete using multivariable regression and genetic programming. Constr. Build. Mater. 2016, 106, 480–499.
Naderpour, H.; Hossein, A.; Fakharian, P. Compressive strength prediction of environmentally friendly concrete using artificial neural networks. J. Build. Eng. 2018, 16, 213–219.
Duan, Z.; Kou, S.; Poon, C. Prediction of compressive strength of recycled aggregate concrete using artificial neural networks. Constr. Build. Mater. 2013, 40, 1200–1206.
Trocoli, A.; Dantas, A.; Leite, M.B.; Nagahama, K.D.J. Prediction of compressive strength of concrete containing construction and demolition waste using artificial neural networks. Constr. Build. Mater. 2013, 38, 717–722.
Behnood, A.; Olek, J.; Glinicki, M.A. Predicting modulus elasticity of recycled aggregate concrete using M5′ model tree algorithm. Constr. Build. Mater. 2015, 94, 137–147.
Gholampour, A.; Gandomi, A.H.; Ozbakkaloglu, T. New formulations for mechanical properties of recycled aggregate concrete using gene expression programming. Constr. Build. Mater. 2017, 130, 122–145.
Thomas, J.; Thaickavil, N.N.; Wilson, P.M. Strength and durability of concrete containing recycled concrete aggregates. J. Build. Eng. 2018, 19, 349–365.
Gomez-Soberon, J.M.V. Shrinkage of Concrete with Replacement of Aggregate with Recycled Concrete Aggregate. Available online: https://upcommons.upc.edu/server/api/core/bitstreams/fb39d5f3-eb56-4a78-a591-73f32be14c77/content (accessed on 8 April 2026).
Al Rawi, K.H.; Al Khafagy, M.A.S. Effect of Adding Sisal Fiber and Iraqi Bauxite on Some Properties of Concrete; Technical Institute of Babylon: Griq, Iraq, 2009.
Lam, T.F.; Yatim, J.M. Mechanical properties of kenaf fiber reinforced concrete with different fiber content and fiber length. J. Asian Concr. Fed. 2015, 1, 11–21.
Jalal Khoshnaw, G.; Haidar Ali, B. Experimental Study on Performance of Recycled Aggregate Concrete: Effect of Reactive Mineral Admixtures. Ijciet Int. J. Civ. Eng. Technol. 2019, 10, 2566–2576.
Ogunbode, E.B.; Egba, E.I.; Olaiju, O.A.; Elnafaty, A.S.; Kawuwa, S.A. Microstructure and Mechanical Properties of Green Concrete Composites Containing Coir Fibre. Chem. Eng. Trans. 2017, 61, 1879–1884.
Islam, M.S.; Ahmed, S.J.U. Influence of jute fiber on concrete properties. Constr. Build. Mater. 2018, 189, 768–776.
Babafemi, A.J.; Kolawole, J.T.; Olalusi, O.B. Mechanical and Durability Properties of Coir Fibre Reinforced Concrete. J. Eng. Sci. Technol. 2019, 14, 1482–1496.
Okeola, A.A.; Abuodha, S.O.; Mwero, J. Experimental Investigation of the Physical and Mechanical Properties of Sisal Fiber-Reinforced Concrete. Fibers 2018, 6, 53.
Kumarasamy, K.; Shyamala, G.; Gebreyowhanse, H.; Kumarasamy. Strength Properties of Bamboo Fiber Reinforced Concrete. IOP Conf. Ser. Mater. Sci. Eng. 2020, 981, 032063.
BS EN 12390-3: 2019; Testing Hardened Concrete—Compressive Strength of Test Specimens. British Standards Institution: London, UK, 2019.
Bas, Y.J.; Kakrasul, J.I.; Ismail, K.S.; Hamad, S.M. Advanced predictive techniques for estimating compressive strength in recycled aggregate concrete: Exploring interaction, quadratic models, ANN, and M5P across strength classes. Multiscale Multidiscip. Model. Exp. Des. 2024, 8, 122.
Tohidi, S.; Sharifi, Y. Empirical modeling of distortional buckling strength of half-through bridge girders via stepwise regression method. Adv. Struct. Eng. 2015, 18, 1383–1398.
Harith, I.K.; Nadir, W.; Salah, M.S.; Hussien, M.L. Prediction of high-performance concrete strength using machine learning with hierarchical regression. Multiscale Multidiscip. Model. Exp. Des. 2024, 7, 4911–4922.
Loh, W. Classification and Regression Trees; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2011; pp. 1–17.
Breiman, L. Classification and Regression Trees; Routledge: Oxford, UK, 2017.
Cha, G.W.; Hong, W.H.; Lee, K.G. A Study on Estimation of Demolition Waste Generation Index by using CART (Classification and Regression Tree) Analysis. In Proceedings of the 2015 International Conference on Structural, Mechanical and Material Engineering; Atlantis Press: Dordrecht, The Netherlands, 2015; pp. 90–93.
Chou, J.; Tsai, C.; Pham, A.; Lu, Y. Machine learning in concrete strength simulations: Multi-nation data analytics. Constr. Build. Mater. 2014, 73, 771–780.
Alam, S.; Gazder, U.; Arifuzzaman, M. Classification and regression tree (CART) modelling for analysis of shear strength of FRP-RC members. Arab J. Basic Appl. Sci. 2021, 28, 397–405.
Saud, S.; Jamil, B.; Upadhyay, Y.; Irshad, K. Performance improvement of empirical models for estimation of global solar radiation in India: A k-fold cross-validation approach. Sustain. Energy Technol. Assess. 2020, 40, 100768.
Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. 1995. Available online: https://www.researchgate.net/profile/Ron-Kohavi/publication/2352264_A_Study_of_Cross-Validation_and_Bootstrap_for_Accuracy_Estimation_and_Model_Selection/links/02e7e51bcc14c5e91c000000/A-Study-of-Cross-Validation-and-Bootstrap-for-Accuracy-Estimation-and-Model-Selection.pdf (accessed on 8 April 2026).
Harith, I.K. Predicting frost resistance of rubberized concrete for sustainable design: A comparative analysis of regression trees and stepwise regression models. Adv. Bridg. Eng. 2026, 7, 2.
Abbas, M.K.; Harith, I.K. Deep beam shear prediction via K-fold cross-validated stepwise regression and a graphical user interface: A comparative analysis with state-of-the-art models. Asian J. Civ. Eng. 2026, 27, 185–209.
Khan, M.A.; Zafar, A.; Farooq, F.; Javed, M.F.; Alyousef, R.; Alabduljabbar, H.; Khan, M.I. Geopolymer Concrete Compressive Strength via Artificial Neural Network, Adaptive Neuro Fuzzy Interface System, and Gene Expression Programming with K-Fold Cross Validation. Front. Mater. 2021, 8, 1–19.
Harith, I.K.; Abdulhadi, A.M.; Hussien, M.L. Harnessing machine learning for accurate estimation of compressive strength of high-performance self-compacting concrete from non-destructive tests: A comparative study. Constr. Build. Mater. 2024, 451, 138779.
Harith, I.K.; Al-Rubaye, M.M.; Abdulhadi, A.M.; Hussien, M.L. Harnessing machine learning for accurate estimation of concrete strength using non-destructive tests: A comparative study. Multiscale Multidiscip. Model. Exp. Des. 2025, 8, 27.
Nadir, W.; Harith, I.K.; Ali, A.Y. Optimization of ultra-high-performance concrete properties cured with ponding water. Int. J. Sustain. Build. Technol. Urban Dev. 2022, 13, 454–471.
Harith, I.K.; Hassan, M.S.; Hasan, S.S.; Majdi, A. Optimization of liquid nitrogen dosage to cool concrete made with hybrid blends of nanosilica and fly ash using response surface method. Innov. Infrastruct. Solut. 2023, 8, 138.
Taylor, K.E. Summarizing multiple aspects of model performance in a single diagram. J. Geophys. Res. Atmos. 2001, 106, 7183–7192.

Figure 1. Research methodology.

Figure 2. Histograms of input and output variables of sustainable concrete mixture.

Figure 3. Correlation matrix of input and output variables.

Figure 4. K-fold cross-validation technique procedures [47].

Figure 5. Normal probability plot for compressive strength of sustainable concrete.

Figure 6. Versus Fits plot on compressive strength of sustainable concrete.

Figure 7. Normal probability plot of residuals for splitting strength.

Figure 8. Versus Fits plot of splitting strength of sustainable concrete mixture.

Figure 9. Optimal tree diagram for compressive strength prediction.

Figure 10. Scatter plots for training and testing sets for compressive strength of sustainable concrete.

Figure 11. Residuals of compressive strength of sustainable concrete.

Figure 12. Optimum tree diagram of splitting tensile strength estimate.

Figure 13. Scatter plots for training and testing sets of splitting tensile strength of sustainable concrete mixture.

Figure 14. Boxplot of residuals of splitting tensile strength for sustainable concrete mixture.

Figure 15. Variable significance on compressive strength $f_{c u}$ of sustainable concrete mixture.

Figure 16. Variable significance on tensile strength $f_{t}$ of sustainable concrete mixture.

Figure 17. Violin plots with embedded boxplots of (a) compressive strength ($f_{c u}$) and (b) splitting tensile strength ($f_{t}$) for sustainable concrete mixture.

Figure 18. Taylor diagram of (a) compressive strength $f_{c u}$ and (b) splitting tensile strength $f_{t}$ of sustainable concrete mixtures.

Table 1. Key findings and limitations of previous studies.

Study	Methodology	Key Findings	Limitations
Manan et al. [12]	Gradient Boosting	Achieved R² = 0.91 for RCP-modified concrete	Ignored fiber-reinforced systems
Manan et al. [13]	Artificial Neural Networks (ANNs)	High training accuracy (R²: 0.93–0.99)	Overfitting led to poor testing performance (R²: 0.67–0.78)
Manan et al. [14]	Random Forest (RF)	Accurate dual-target predictions	Lack of interpretability for engineering use
Kattoof et al. [15]	Stepwise Polynomial Regression (SPR)	Managed nonlinear interactions	Struggled with multicollinearity among predictors
González-Taboada et al. [18]	Genetic Programming	Predicted mechanical properties of RCA concrete	Limited to single-property predictions
Behnood et al. [22]	M5P Model Tree	Predicted modulus of elasticity for RCA concrete	Narrow focus on RCA, excludes NF synergy
Current Study	Natural fibre (coir, bamboo, sisal) + RCA concrete	CART + SPR	Dual-target predictions (R² = 0.91 compressive, 0.89 tensile); transparent variable importance via CART; resolved multicollinearity via SPR.

Table 2. Statistics of input and output variables.

Variable	Abbr.	Mean	CoefVar	Minimum	Maximum	Skewness	Kurtosis
Cement	C	445.025	13.25	243	513.7	−1.27	1.42
W/B Ratio	W/B	0.417827	20.63	0.277228	0.55	−0.10	−1.09
Fine Aggregate	FA	597.192	16.69	470.3	784	0.30	−1.49
Coarse Aggregate	CA	1126.48	11.19	898	1420	0.09	−0.56
RCA	RCA	39.6429	119.50	0	100	0.44	−1.75
SCM	SCM	13.7744	180.76	0	153.9	2.15	6.69
Superplasticizer	SP	4.17523	159.79	0	20.52	1.07	−0.78
Natural fibre	NF	0.332512	194.84	0	3	2.13	4.07
Age	Age	30.1675	108.59	1	180	1.99	4.88
Cube-compressive strength	$f_{c u}$	42.3838	44.06	9.74726	96.7	0.94	0.31
Splitting tensile strength	$f_{t}$	3.48685	35.14	1.16813	6.2	0.25	−0.74

Table 3. ANOVA results of SPR model for compressive strength.

Term	Coef	p-Value
Constant	−3.5	0.806
C	0.2212	0.002
W/B	−52.86	0.000
SP	2.696	0.000
NF	26.09	0.007
Age	0.698	0.000
C*C	−0.000218	0.019
NF*NF	1.651	0.040
Age*Age	−0.002384	0.000
C*NF	−0.0356	0.007
C*Age	−0.000498	0.103
W/B*SP	−4.39	0.000
W/B*NF	−33.1	0.029
SP*NF	−1.188	0.000
SP*Age	0.00619	0.000
NF*Age	−0.0928	0.003
Statistical metrics errors
R²	87.34%
Adj. R²	86.85%
Pred. R²	86.45%
10-fold R²	86.05%

Table 4. ANOVA results of SPR model for splitting tensile strength.

Term	Coef	p-Value	Significance
Constant	113.2	0.000	Significant
C	−0.1655	0.000	Significant
W/B	204.7	0.000	Significant
FA	−0.0816	0.058	insignificant
CA	−0.1939	0.000	Significant
RCA	0.2185	0.000	Significant
SCM	−4.332	0.000	Significant
SP	15.24	0.000	Significant
NF	−0.23	0.832	insignificant
Age	0.0269	0.050	Significant
W/B*W/B	−44.2	0.006	Significant
FA*FA	0.000072	0.000	Significant
CA*CA	0.000048	0.000	Significant
RCA*RCA	−0.000235	0.000	Significant
NF*NF	−0.1726	0.006	Significant
Age*Age	−0.000145	0.000	Significant
C*W/B	−0.1787	0.000	Significant
C*FA	0.000128	0.001	Significant
C*CA	0.000156	0.000	Significant
C*Age	−0.000073	0.000	Significant
W/B*FA	−0.1835	0.000	Significant
W/B*CA	0.0355	0.008	Significant
W/B*RCA	−0.2811	0.000	Significant
W/B*SCM	4.27	0.000	Significant
W/B*SP	−14.12	0.000	Significant
W/B*NF	−7.09	0.000	Significant
FA*CA	0.000023	0.095	insignificant
FA*SCM	−0.000922	0.000	Significant
FA*NF	0.00571	0.002	Significant
CA*RCA	−0.000062	0.014	Significant
CA*SCM	0.003071	0.000	Significant
CA*SP	−0.00928	0.000	Significant
CA*Age	0.000030	0.000	Significant
RCA*SCM	−0.00377	0.001	Significant
RCA*SP	0.00894	0.015	Significant
RCA*Age	0.000048	0.002	Significant
Statistical metrics errors
R²	88.94%
Adj. R²	87.89%
Pred. R²	86.66%
10-fold R²	86.33%

Table 5. Compressive strength model summary.

Total predictors	9
Important predictors	9
Number of terminal nodes	39
Minimum terminal node size	3
Node splitting	Least squared error
Optimal tree	Within 1 standard error of maximum R-squared
Model validation	10-fold cross-validation
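
The "within 1 standard error of maximum R-squared" criterion in Table 5 is commonly read as: among all candidate trees, take the smallest one whose cross-validated R² is no more than one standard error below the best R². The sketch below illustrates that selection rule under this reading. The function name `pick_tree_one_se` and every number in `candidates` are invented for illustration; they are not the authors' cross-validation results.

```python
def pick_tree_one_se(candidates):
    """Pick the smallest tree within one standard error of the best R-squared.

    candidates: list of (terminal_nodes, cv_r2, cv_r2_std_error) tuples.
    """
    # Tree with the highest cross-validated R-squared.
    best = max(candidates, key=lambda c: c[1])
    # Anything at or above this threshold counts as "as good as the best".
    threshold = best[1] - best[2]
    eligible = [c for c in candidates if c[1] >= threshold]
    # Prefer the simplest eligible tree (fewest terminal nodes).
    return min(eligible, key=lambda c: c[0])


# Hypothetical pruning sequence: (terminal nodes, 10-fold R-squared, std. error).
candidates = [
    (10, 0.840, 0.015),
    (25, 0.895, 0.012),
    (39, 0.905, 0.012),
    (60, 0.912, 0.012),
    (90, 0.908, 0.013),
]
chosen = pick_tree_one_se(candidates)
print(chosen[0])
```

The rule trades a small loss in cross-validated R² for a simpler tree, which tends to limit overfitting when the largest trees differ only by noise.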

Table 6. Statistical error metrics of the CART model for compressive strength.

Statistics	Training	Test
R-squared	95.16%	91.09%
Root mean squared error (RMSE)	4.1052	5.5686
Mean squared error (MSE)	16.8529	31.0098
Mean absolute deviation (MAD)	2.9850	4.1076
Mean absolute percent error (MAPE)	0.0757	0.1055
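
The five statistics in Table 6 can be reproduced from paired measured and predicted strengths. The sketch below uses the standard textbook definitions, with MAPE expressed as a fraction to match the scale reported in the table; it assumes these match the definitions the authors applied. The function name `regression_metrics` and the sample values are illustrative, not data from the study.

```python
import math


def regression_metrics(actual, predicted):
    """Return R-squared, RMSE, MSE, MAD and MAPE (as a fraction)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)          # sum of squared errors
    mse = sse / n
    mad = sum(abs(e) for e in errors) / n     # mean absolute deviation
    # MAPE needs non-zero measured values.
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "R2": 1.0 - sse / sst,
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "MAD": mad,
        "MAPE": mape,
    }


# Hypothetical measured and predicted strengths, for illustration only.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(measured, predicted)
print(metrics)

# Consistency check on the reported values: RMSE should equal sqrt(MSE).
assert math.isclose(math.sqrt(31.0098), 5.5686, abs_tol=1e-3)  # test set
assert math.isclose(math.sqrt(16.8529), 4.1052, abs_tol=1e-3)  # training set
```

The last two lines confirm that the RMSE and MSE entries in Table 6 agree with each other to within rounding.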

Table 7. Splitting tensile strength model summary.

Total predictors	9
Important predictors	9
Number of terminal nodes	28
Minimum terminal node size	3
Node splitting	Least squared error
Optimal tree	Within 1 standard error of maximum R-squared
Model validation	10-fold cross-validation

Table 8. Statistical error metrics of the CART model for splitting tensile strength.

Statistics	Training	Test
R²	94.55%	89.56%
RMSE	0.2858	0.3954
MSE	0.0817	0.1563
MAD	0.2301	0.2996
MAPE	0.0741	0.0939

