Article

AI-Based Inference System for Concrete Compressive Strength: Multi-Dataset Analysis of Optimized Machine Learning Algorithms

by Carlos Eduardo Olvera-Mayorga 1, Manuel de Jesús López-Martínez 1,*, José A. Rodríguez-Rodríguez 2, Sodel Vázquez-Reyes 3,4, Luis O. Solís-Sánchez 4, José I. de la Rosa-Vargas 3, David Duarte-Correa 5, José Vidal González-Aviña 6 and Carlos A. Olvera-Olvera 1,*
1 Laboratorio de Invenciones Aplicadas a la Industria (LIAI), Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Zacatecas 98600, Zacatecas, Mexico
2 Laboratorio de Resistencia de Materiales y Mecánica de Suelos, Unidad Académica de Ingeniería I, Universidad Autónoma de Zacatecas, Zacatecas 98600, Zacatecas, Mexico
3 Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Zacatecas 98600, Zacatecas, Mexico
4 Posgrados de Ingeniería y Tecnología Aplicada, Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Zacatecas 98600, Zacatecas, Mexico
5 Tlachia Systems, Av. Felipe Carrillo Puerto 1001, Querétaro 76120, Querétaro, Mexico
6 Escuela de Ingeniería y Ciencias, Tecnologico de Monterrey, Av. General Ramon Corona 2514, Zapopan 45138, Jalisco, Mexico
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12383; https://doi.org/10.3390/app152312383
Submission received: 21 October 2025 / Revised: 17 November 2025 / Accepted: 20 November 2025 / Published: 21 November 2025

Abstract

The prediction of concrete compressive strength (CS, MPa) is fundamental in experimental civil engineering, as it enables the optimization of mix design and complements laboratory testing through predictive tools. This study presents a systematic and reproducible methodology for comparing eight regression algorithms—including linear models, neural networks, and boosting methods—applied to three experimental datasets that represent different types of concrete: high-performance concrete (HPC), conventional concrete, and recycled-aggregate concrete (RAC). To enable this comparison, performance metrics (RMSE, MAE, MAPE, R², and nRMSE) were computed after hyperparameter optimization with RandomizedSearchCV under homogeneous cross-validation. The boosting methods achieved the best performance, with CatBoost standing out by reaching R² values between 0.92 and 0.95 and RMSE between 3.4 and 4.4 MPa, confirming its inter-dataset stability and generalization capability. These results indicate consistent predictive accuracy across concretes of different compositions and production contexts. As an applied contribution, three interactive inference systems were developed in Google Colab to estimate CS from mix parameters, promoting reproducibility, open access, and practical use in quality-control processes.

1. Introduction

Concrete compressive strength (CS, MPa) is the most representative parameter of material performance as it governs the design and safety verification of structural elements such as columns, beams, and slabs [1].
In laboratory practice, CS is determined through standardized destructive tests according to ASTM C39/C39M [2] and ASTM C31/C31M [3], which, respectively, define the test method for determining CS and regulate the preparation, curing, and handling of specimens to ensure reliable and reproducible conditions.
Although these methods provide reliable and reproducible results, they require long curing periods, specialized equipment, and significant laboratory effort, therefore limiting their applicability during mix design and early-age quality assessment. Alternative methods have been standardized to overcome these limitations, such as accelerated strength testing (ASTM C684/C684M [4]) or non-destructive tests based on ultrasonic pulse velocity and rebound hammer techniques (ASTM C597 [5] and ASTM C805 [6]). However, their application depends on specific experimental conditions, and their accuracy is not always comparable to destructive tests.
From a mechanistic perspective, the development of CS in concrete is governed by the hydration kinetics of cement, which drive the formation of calcium silicate hydrates (C–S–H) and the progressive refinement of the pore structure [7,8]. As hydration proceeds, the capillary porosity decreases while the solid gel fraction increases, enhancing the material’s stiffness and load-bearing capacity [9]. These microstructural transformations depend strongly on parameters such as curing age, water-to-binder ratio (W/B), and cement content, which respectively reflect the degree of hydration, the initial porosity, and the amount of reactive binder available [10]. Therefore, even though the predictive models in this study are purely data-driven, their input variables implicitly encode these physical mechanisms, allowing the algorithms to learn the underlying relationships between mixture composition, microstructure evolution, and CS.
In this context, machine learning (ML) has emerged as a powerful tool for estimating CS from easily measurable mix and curing parameters, such as the water-to-cement ratio (W/C), aggregate proportions, or curing age. These techniques can capture complex nonlinear relationships among mixture components, offering a complementary alternative to conventional laboratory testing. Thus, ML-based models can support preliminary mix design, optimization of quality control, and reduction of exclusive dependence on destructive methods [11].
Artificial neural networks were the first methods successfully applied to this problem in the late 1990s [12], laying the foundation for more advanced methods such as boosting algorithms (CatBoost, XGBoost, and LightGBM), which have been widely adopted in civil engineering for their ability to improve accuracy and generalization in regression tasks [13,14].
Recent reviews highlight the trend toward more robust and interpretable models. Gamil [15] emphasized the importance of combining ensemble techniques with interpretability methods such as SHAP (SHapley Additive exPlanations) [16], while Zhang et al. [17] and Shaaban et al. [18] reported outstanding performance (R² between 0.91 and 0.94) in predicting CS using boosting-based approaches. Complementarily, Olaye et al. [19] identified through SHAP interpretability that curing age and the water-to-cement ratio are the most influential variables affecting final strength. In addition, Cakiroglu et al. (2022) [20] applied interpretable ML models to predict the axial capacity of FRP-reinforced concrete columns, demonstrating that ensemble and tree-based algorithms, coupled with SHAP analysis, can provide both accurate predictions and mechanistic insight into variable influence within structural concrete systems.
Zhang et al. (2024) [21] also developed machine learning regression models to predict the mechanical strength of concrete and mortar using experimental datasets. Although their study evaluates single datasets rather than multi-dataset benchmarking, it reinforces the growing adoption of supervised ML methods—particularly ensemble and tree-based algorithms—for estimating strength from mixture proportions and curing parameters.
Furthermore, recent studies have extended ML applications to High-Performance Concrete (HPC) and additional properties such as slump flow. Wang et al. [22] developed hybrid models optimized with metaheuristics (GWO and QPSO), achieving R² > 0.99, while Xu and Afzal [23] and Zhao et al. [24] used optimized neural network ensembles with comparable results. Nevertheless, the diversity of these methodologies and validation criteria reported in the literature makes it difficult to objectively compare model performance and limits their practical applicability in real-world environments [25,26].
Therefore, there is a clear need to establish a homogeneous, systematic, and reproducible methodological framework to compare the performance of different machine learning algorithms in predicting CS under equivalent experimental conditions and normalized metrics. This study addresses this need through the development of a comprehensive comparative methodology aimed not only at the objective evaluation of models but also at the construction of an inference system capable of estimating concrete strength from its mix variables, facilitating its practical application in mix design and quality control.
A comparative synthesis of the state of the art is presented in Table 1, summarizing the main experimental studies that have applied machine learning to predict CS. The table highlights the models used, the optimization algorithms employed, the characteristics of the datasets, the reported performance metrics, and the main limitations identified in each case.
The main purpose of this work is to develop a homogeneous, systematic, and reproducible methodological framework that enables the evaluation, comparison, and implementation of machine learning models for predicting concrete compressive strength (CS, MPa). The performance of eight supervised algorithms was analyzed—Linear Regression, SVR, MLP, KNN, Random Forest, XGBoost, LightGBM, and CatBoost—each trained and tested on three independent public datasets that represent distinct engineering contexts: (1) a high-performance concrete (HPC) dataset with low water-to-binder ratios and high strength levels [11]; (2) a conventional concrete dataset including compressive strength and slump measurements [27]; and (3) a recycled-aggregate concrete (RAC) dataset incorporating supplementary cementitious materials and partial replacement of coarse aggregates [28]. This experimental diversity enables an assessment of the generalization capability and stability of the models under varying mix designs, curing regimes, and material compositions. Hyperparameter tuning was performed with RandomizedSearchCV, and model performance was evaluated through RMSE, MAE, MAPE, R², and nRMSE metrics, ensuring statistical comparability across algorithms.
This study introduces a unified and reproducible benchmarking framework that standardizes data preprocessing, hyperparameter tuning, and cross-validation procedures across eight regression algorithms for predicting CS. The framework ensures statistical comparability among models and datasets, enabling a fair and consistent evaluation of performance. Through this approach, the research demonstrates the methodological reliability of gradient boosting algorithms—particularly CatBoost, XGBoost, and LightGBM—as the most stable and accurate predictors across heterogeneous concrete systems. Furthermore, the study provides an applied outcome by developing three interactive inference systems implemented in Google Colab, which allow engineers to estimate CS from mixture parameters as a complementary tool for quality control. The prospective laboratory validation of the best-performing system (CatBoost–Yeh) is outlined in the Future Work section.
Table 1. Summary of previous experimental research employing machine learning techniques for the prediction of concrete mechanical properties.
Ref. | Models | Optimization Algorithms | Data | Predicted Properties | Performance Metrics | Limitations
[17] | DeepForest (ensemble of 12 regression models) | Internal optimization (DeepForest framework) | 200 HPC mixes | CS (MPa) | R² = 0.91 | Small dataset; no external validation
[18] | XGBoost, Random Forest, SVR, ANN, Linear Regression, Ridge, Lasso, KNN | Default parameters; Grid Search (basic tuning) | 180 high-strength concrete mixes | CS (MPa) | R² ≈ 0.94 | Small dataset; no interpretability; no external validation
[29] | ML models combined with fuzzy logic and simulated annealing | Simulated Annealing (SA) | Experimental engineering datasets | Decision-making in civil engineering | Relative error reduction (no R²) | High implementation complexity
[19] | Gradient Boosting, XGBoost, Random Forest, SVR, ANN, MLP, Lasso, KNN | Grid Search + k-fold cross-validation | 150 experimental mixes | CS (MPa) | R²: XGB = 0.9349, GBR = 0.9209; MAE, RMSE also reported | Small dataset; limited generalization
[20] | KRR, Lasso, SVR, GBM, AdaBoost, RF, CatBoost, XGBoost | Cross-validated tuning (no metaheuristics); SHAP interpretability | 117 experimental tests of FRP–RC columns | Axial load-carrying capacity (kN) | Best: XGBoost (test) R² = 0.982, RMSE ≈ 153.8 kN, MAE ≈ 112.7 kN, MAPE ≈ 0.054; GBM and RF also high | Small dataset; only concentric loading; no external validation; limited applicability range; potential extrapolation issues; CatBoost slower
[22] | Gradient Boosting + GWO; T-SFIS + QPSO; DGT | Metaheuristics (GWO, QPSO) | 191 HPC mixes | CS (MPa), slump flow | R² = 0.998 (CS), 0.984 (slump); RMSE = 1.226 MPa (CS), 3.233 mm (slump) | Risk of overfitting; limited extrapolation; high computational cost
[23] | Ensemble ML models (RF, XGB, GBR, ANN hybrids) | Ensemble hybridization + k-fold cross-validation | 200 HPC mixes | CS (MPa) | R² ≈ 0.96 | Dependent on lab data; no uncertainty analysis; no external validation
[24] | RBFNN optimized with IGWO and Dragonfly | Metaheuristics (IGWO, DA) | 180 HPC mixes | CS (MPa) | R² = 0.96; RMSE = 2.5 MPa | Sensitive to data quality; complex hyperparameter tuning
[30] | RF, XGB, ANN, SVR, Linear Regression, Decision Tree, KNN | Cross-validated tuning (no metaheuristics) | 220 conventional concrete mixes | CS (MPa), flexural strength, slump | R² ≈ 0.92 | No external validation; limited generalization
This work | Linear Regression, SVR, MLP, KNN, RF, XGBoost, LightGBM, CatBoost | RandomizedSearchCV | 1030 UCI concrete mixes | CS (MPa) | R² = 0.950; RMSE = 3.48 MPa; MAE = 2.54 MPa; MAPE = 8.61% | No uncertainty analysis (PICP); experimental validation in progress

2. Materials and Methods

2.1. Research Design and Workflow

The research was conducted following a structured, systematic, and reproducible methodology comprising seven main stages: (1) collection and integration of three public experimental concrete datasets; (2) review and selection of the regression algorithms most widely used for predicting concrete properties, based on their frequency of use and reported performance in recent literature; (3) preprocessing, normalization, and statistical analysis of the input variables; (4) training and optimization of eight supervised regression algorithms through random hyperparameter search (RandomizedSearchCV); (5) comparative evaluation of model performance within each dataset using standardized metrics (RMSE, MAE, MAPE, R², nRMSE) and graphical analyses (scatter plots, feature-importance distributions), where the feature-importance results provide a feedback link to the physical interpretation of mechanisms; (6) stability and global ranking analysis across datasets to identify the most consistent and generalizable models; and (7) implementation of an interactive inference system based on the best-performing models.
The workflow also includes a feedback link from Stage 5, where model metrics and feature–importance analyses are interpreted in the context of known physicochemical mechanisms. This connection allows the statistical evaluation to inform the physical understanding of concrete behavior—relating algorithmic sensitivity to hydration kinetics, water–binder interactions, and pozzolanic activity—thus bridging data-driven modeling with material science.
Figure 1 summarizes the proposed workflow, highlighting the main stages of the study. Blue lines group the iterative steps applied individually to each dataset, black lines indicate the global stages performed once for the entire study, and red lines correspond to selective preprocessing applied only for SVR, MLP, and KNN. This methodological design ensured result traceability, algorithm and dataset comparability, and the practical transfer of knowledge toward the experimental validation of the developed inference system.
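Stage (4) of the workflow can be illustrated with a short scikit-learn sketch; the estimator, synthetic data, search ranges, and fold count below are illustrative placeholders, not the study's exact configuration:

```python
# Sketch of stage (4): random hyperparameter search with cross-validation.
# Search ranges and data are illustrative, not the study's actual grids.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # random draws from the grid
    cv=5,                                   # homogeneous cross-validation
    scoring="neg_root_mean_squared_error",  # RMSE-based model selection
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The same pattern applies to every one of the eight algorithms, which is what makes the comparison homogeneous: only the estimator and its parameter grid change.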

2.2. Data Acquisition

Three independent public datasets were used. In all cases, the original structures and values of the databases were preserved; only in the Yeh (1998) dataset were the column names simplified to ease the visualization of figures and tables, shortening lengthy labels such as "Cement (component 1) (kg in a m3 mixture)" to "Cement". This modification was purely nominal and did not affect the information or the data format.
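As a minimal illustration of this nominal renaming (the long label is the original UCI header quoted above; the short name is the one used in the study's figures):

```python
# Purely nominal renaming of the Yeh dataset headers; values are untouched.
import pandas as pd

df = pd.DataFrame(columns=["Cement (component 1) (kg in a m3 mixture)"])
df = df.rename(columns={"Cement (component 1) (kg in a m3 mixture)": "Cement"})
```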

2.2.1. Dataset 1: Yeh (1998)

This is a classic reference dataset with n = 1030 High-Performance Concrete (HPC) mixes [11,12]. The tests originated from an experimental program conducted at the National Chiao Tung University (Taiwan) and reported by Yeh (1998). After removing 25 duplicate records, a total of 1005 unique samples were retained and used for all subsequent analyses. The mixes were prepared with ordinary Portland cement (OPC) and mineral additives (ground granulated blast-furnace slag and fly ash), under varying water-to-binder ratios and superplasticizer dosages. Curing was performed in water at 20 °C until the testing age, and CS was determined on 100 × 200 mm cylindrical specimens following procedures equivalent to those established in ASTM C39. Although the author did not explicitly mention a curing standard, the described conditions are consistent with the standardized practices of ASTM C31 for the preparation and curing of concrete specimens.
It includes 8 predictor variables (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, and Age) and one target variable (CS, MPa), for a total of 9 attributes. The f_c values range from 2.33 to 82.6 MPa, covering concretes from low to very high strength.
The variables included in this dataset are described in Table 2.

2.2.2. Dataset 2: Ke–Qiu (2024)

This dataset was compiled from the mix proportion ledger of the Guangxi Road and Bridge Group Pavement Branch (China, November 2022) [27]. It includes n = 1670 mixes of conventional concrete (NC) used in pavements and highway engineering structures. After removing 52 duplicate records, a total of 1618 unique samples were retained and used for subsequent analyses. Each mix was tested under real production conditions, with moist curing at controlled ambient temperature. It includes 9 predictor variables (Cement, Fine Aggregates, Coarse Aggregates, Water, Water Reducing Admixture, Fly Ash, Accelerating Agent, Silica Fume, and Time) and one target variable (Compressive Strength, MPa), totaling 10 attributes, with measured f_c values ranging from 4.3 to 76.3 MPa. The repository also includes a complementary slump (workability) file not used in this study, but relevant for rheological behavior and consistency control analyses.
The contents of this dataset are described in Table 3.

2.2.3. Dataset 3: Biswal (2022)

An experimental dataset with n = 188 mixes tested in the laboratory at the Department of Civil Engineering, IIT Bhubaneswar (India) [28,31]. After removing three duplicate records, a total of 185 unique samples were retained and used for subsequent analyses. It focuses on Recycled Aggregate Concrete (RAC), assessing its mechanical behavior through 150 mm cubic specimens molded and water-cured at ambient temperature, with compression tests performed at 7, 28, and 56 days. Although the article does not explicitly mention a reference standard, the described procedure is consistent with the standard practices defined by ASTM C31 for curing and ASTM C39 for determining CS. The experimental program investigates the water-to-binder ratio (0.25–0.75) and the influence of supplementary cementitious materials (Fly Ash, GGBS, Metakaolin) combined with chemical admixtures (Superplasticizer, Viscosity Modifying Agent, Accelerator). The target variable was obtained at different curing ages, with f_c values ranging from 6.0 to 73.76 MPa. This dataset represents a sustainable concrete scenario, featuring a more complex composition and a larger number of experimental variables (15 predictor variables and one target variable, totaling 16 attributes).
The variables included in this dataset are described in Table 4.

2.3. Regression Models

For the prediction of CS, eight supervised regression algorithms were selected, representing different machine learning approaches—from classical linear models to ensemble techniques and neural networks. Each algorithm is briefly described below, along with its basic mathematical formulation and relevant characteristics.
  • Linear Regression: A statistical model that assumes a linear relationship between the independent variables $X = [x_1, x_2, \ldots, x_p]$ and the target variable $y$ (CS, MPa). Its formulation is given in Equation (1):

    $$\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon, \tag{1}$$

    where $\beta_j$ are the coefficients to be estimated and $\varepsilon$ is an error term. It is easy to interpret and computationally efficient, but its predictive capacity decreases when the relationship between variables is nonlinear [32].
  • Random Forest: An ensemble method composed of $B$ decision trees $\{T_b\}_{b=1}^{B}$, trained on bootstrap samples and random subsets of features. The final prediction, shown in Equation (2), is the average of all trees:

    $$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(X). \tag{2}$$

    It reduces variance and captures nonlinear interactions. It is robust to outliers but may require more computational resources for very large datasets [33].
  • Support Vector Regression (SVR): Extends support vector machines to regression, seeking a function $f(x) = \langle w, x \rangle + b$ that minimizes the objective in Equation (3):

    $$\min_{w, b, \xi, \xi^*} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*), \tag{3}$$

    subject to $|y_i - f(x_i)| \le \varepsilon + \xi_i$, with $\xi_i, \xi_i^* \ge 0$. Nonlinearities are handled through kernel functions. It performs well in high-dimensional spaces, although training can be slower for large datasets [34].
  • Multilayer Perceptron (MLP): A feedforward neural network with $L$ layers, where each neuron computes its activation as in Equation (4):

    $$a^{(l)} = g\left(W^{(l)} a^{(l-1)} + b^{(l)}\right), \tag{4}$$

    where $g(\cdot)$ denotes the activation function (e.g., ReLU or sigmoid), $W^{(l)}$ and $b^{(l)}$ represent the weight matrix and bias vector of layer $l$, and $a^{(l)}$ is the corresponding activation vector. The network output $\hat{y}$ is obtained after propagating through all layers and minimizing a loss function such as the mean squared error. This architecture can capture highly nonlinear relationships but remains sensitive to hyperparameter choices and data normalization [35].
  • K-Nearest Neighbors (KNN): Predicts $\hat{y}$ as the average of the $k$ nearest observations $N_k(x)$ according to a distance metric $d(\cdot, \cdot)$, as shown in Equation (5):

    $$\hat{y} = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i. \tag{5}$$

    It is simple and does not require explicit training, but performance depends on the chosen distance metric, the value of $k$, and the scale of variables [36].
  • XGBoost: An optimized implementation of gradient boosting that builds the model additively, as expressed in Equation (6):

    $$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i), \tag{6}$$

    where $f_t$ is a regression tree and $\eta$ is the learning rate. The overall objective combines a differentiable loss $L$ with a regularization term $\Omega$, as shown in Equation (7):

    $$\mathrm{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t). \tag{7}$$

    Each new tree is fitted to minimize the gradient of the loss function with respect to previous predictions. It incorporates $L_1/L_2$ regularization, efficient handling of missing values, and parallel computation [37].
  • LightGBM: A histogram-based gradient boosting method with a leaf-wise growth strategy that enhances computational efficiency. Its general mathematical formulation follows Equations (6) and (7), but it differs by using histograms to accelerate split point search and a leaf-wise growth strategy that selects the leaf with the highest information gain. It also includes memory reduction techniques and support for distributed training [38].
  • CatBoost: A gradient boosting algorithm that shares the general formulation described in XGBoost (Equations (6) and (7)) but introduces key differences in the structure and training of base models. It employs symmetric decision trees (oblivious trees) and an ordered boosting scheme that prevents the use of future data in gradient calculation, thereby reducing overfitting. It also handles categorical variables natively through target statistics with smoothing schemes to mitigate noise [39].
Including these models enables a representative comparison across linear, distance-based, neural network, and ensemble approaches, both with and without boosting. Subsequently, each algorithm was tuned through hyperparameter optimization to maximize its predictive capability.
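Since all eight algorithms expose the same scikit-learn-style fit/predict interface, the comparison loop is uniform across models. A minimal sketch with illustrative hyperparameters (the three boosting libraries live in separate packages, so they are indicated as comments):

```python
# Illustrative instantiation of the compared model families; hyperparameters
# shown here are placeholders, later replaced by RandomizedSearchCV results.
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "Linear Regression": LinearRegression(),
    "SVR": SVR(kernel="rbf"),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    # The boosting models follow the same interface via their own packages:
    # "XGBoost": xgboost.XGBRegressor(...),
    # "LightGBM": lightgbm.LGBMRegressor(...),
    # "CatBoost": catboost.CatBoostRegressor(..., verbose=0),
}

# Because every estimator shares fit(X, y) / predict(X), the evaluation loop
# is identical for all of them:
# for name, model in models.items():
#     model.fit(X_train, y_train)
#     y_pred = model.predict(X_test)
```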

2.4. Evaluation Metrics

The selected evaluation metrics quantify the prediction error and explanatory capability of the regression models. The root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R²) were considered, all of which are widely used in the literature [32]. Additionally, the normalized root mean square error (nRMSE) was included to facilitate the comparison of model performance and accuracy across the three datasets. The metrics are defined in Equations (8)–(12).
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \tag{8}$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|, \tag{9}$$
$$\mathrm{MAPE}\,(\%) = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \tag{10}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \tag{11}$$
$$\mathrm{nRMSE}\,(\%) = \frac{\mathrm{RMSE}}{\sigma_{y,\mathrm{test}}} \times 100. \tag{12}$$
  • RMSE. Measures the typical error in the same units as the target variable and quadratically penalizes large deviations (Equation (8)). It is useful when it is desirable to penalize substantial errors and is standard in regression analysis [40,41]. In interpretative terms, lower RMSE values indicate better performance.
  • MAE. Summarizes the average error with a direct interpretation and lower sensitivity to outliers compared to RMSE (Equation (9)). Several studies recommend MAE when a more robust measure of average error is required [42]. In this metric, lower values represent more accurate predictions.
  • MAPE. Expresses error as a percentage, facilitating comparison across ranges and communication to non-technical audiences (Equation (10)). However, it is known that MAPE can become unstable when $y_i$ is (near) zero; therefore, following the literature [43], terms with $y_i = 0$ (if any) are excluded to avoid division by zero. Lower MAPE values reflect lower relative error.
  • R². Indicates the proportion of explained variance and is widely reported as a measure of overall fit (Equation (11)); however, it requires careful interpretation outside the linear model with intercept and may lead to misleading conclusions if used in isolation [44,45]. In this metric, higher values indicate better fit (upper-bounded by 1), whereas negative values indicate that the model performs worse than simply predicting the mean.
  • nRMSE. Corresponds to the normalized version of RMSE, calculated by dividing it by the standard deviation of the target variable in the test set (Equation (12)). When expressed as a percentage, it enables direct comparison of model performance and accuracy across datasets with different scales or CS ranges. This normalization approach is consistent with the concept of Normalized Root Mean Square Error commonly used in the literature to evaluate and compare models across different units or scales [46,47]. Lower nRMSE values indicate better relative performance and greater consistency across datasets.
Together, these metrics provide a clearer and more balanced view of performance: RMSE and MAE indicate the model’s deviation in absolute terms, MAPE expresses that deviation as an easily interpretable percentage, and R² reflects how much of the data behavior is explained by the model. Using them jointly avoids relying on a single metric that could overlook specific weaknesses; therefore, the adopted criterion is to minimize RMSE/MAE/MAPE and maximize R², in accordance with common recommendations in the literature [41,43].
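The five metrics of Equations (8)–(12) can be computed directly with NumPy. The sketch below is illustrative; the zero-target guard implements the MAPE exclusion rule described above, and nRMSE is normalized by the standard deviation of the test targets:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, MAPE (%), R2, and nRMSE (%) per Equations (8)-(12)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    nonzero = y_true != 0                   # exclude y_i = 0 terms (MAPE note)
    mape = 100.0 * np.mean(np.abs(err[nonzero] / y_true[nonzero]))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    nrmse = 100.0 * rmse / y_true.std()     # normalized by sigma of test targets
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2, "nRMSE": nrmse}

# Toy values in MPa, for illustration only:
m = regression_metrics([30.0, 45.0, 60.0], [32.0, 44.0, 58.0])
```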

2.5. Data Preprocessing

Each dataset was individually inspected to ensure its integrity and consistency prior to modeling. In all cases, duplicate records were removed to ensure that the models operated only with representative and redundancy-free information, preventing the results from being affected by repeated patterns or measurement inconsistencies.
In the Yeh dataset, 25 duplicate records were identified and removed, resulting in a total of 1005 unique samples. In the Ke–Qiu dataset, 52 duplicates were removed, yielding a final total of 1618 valid samples, while in the Biswal dataset, 3 duplicates were detected, resulting in 185 unique samples. All subsequent statistical analyses, model training, and performance evaluations were conducted using the final datasets after duplicate removal (Yeh: n = 1005 , Ke–Qiu: n = 1618 , Biswal: n = 185 ).
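This duplicate-removal step maps directly to pandas; the toy frame below is illustrative and not the actual data:

```python
# Sketch of the duplicate-removal step; column names mirror the Yeh dataset,
# but the three rows are synthetic placeholders.
import pandas as pd

df = pd.DataFrame({
    "Cement": [540.0, 540.0, 332.5],
    "Water":  [162.0, 162.0, 228.0],
    "Age":    [28, 28, 28],
    "CS":     [79.99, 79.99, 39.29],
})

n_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(f"Removed {n_before - len(df)} duplicates; {len(df)} unique samples remain")
```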
In addition to duplicate removal, each dataset was also visually and statistically inspected to identify potential outliers or anomalous records. Descriptive statistics, histograms, and correlation heatmaps were used to verify the consistency and physical plausibility of all variables. Although a few extreme values were observed (e.g., in cement content or curing age), they were retained because they fall within realistic experimental ranges and correspond to valid concrete mix designs reported in the original studies. These ranges were further cross-checked against reference literature on concrete materials and mix design [1,8], confirming their physical plausibility. Consequently, no artificial trimming or winsorization was applied, ensuring that the models learned from the full experimental variability of the data.
Subsequently, the predictive variables (X) and the dependent variable (y) were separated, the latter corresponding to the CS (MPa). The distribution of y was evaluated using histograms with density fitting and the calculation of the skewness coefficient for each dataset. The obtained values were 0.395 for Yeh, 0.007 for Ke–Qiu, and 0.203 for Biswal, all classified as slightly or nearly symmetric. According to common statistical criteria [48,49,50], values close to zero indicate symmetry; between ± 0.5 and ± 1 indicate moderate skewness; and greater than ± 1 indicate high skewness. Since the three datasets showed approximately symmetric distributions, no transformations such as logarithmic, square root, or Box–Cox were applied. This decision is supported both by the statistical analysis and the visual inspection of the target variable distributions, shown in Figure 2, which presents the histograms corresponding to the three datasets: (1) Yeh, (2) Ke–Qiu, and (3) Biswal.
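The symmetry check on the target variable can be sketched as a population skewness coefficient plus the thresholds cited above; this is an illustrative implementation, not the study's exact code:

```python
import numpy as np

def classify_skewness(y):
    """Population skewness and a label per the thresholds cited in the text."""
    y = np.asarray(y, dtype=float)
    s = np.mean(((y - y.mean()) / y.std()) ** 3)   # third standardized moment
    if abs(s) < 0.5:
        label = "approximately symmetric"          # no transformation needed
    elif abs(s) <= 1.0:
        label = "moderately skewed"
    else:
        label = "highly skewed"
    return s, label
```

Applied to the three targets, this classification yields the reported values (0.395, 0.007, 0.203), all falling in the "approximately symmetric" band, which is why no logarithmic, square-root, or Box–Cox transformation was applied.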
To assess the predictive performance of the models, the data from the three datasets were independently split into two subsets: 70% for training and 30% for testing.
A key aspect of preprocessing was feature standardization (zero mean, unit variance) for models sensitive to input magnitudes (SVR, MLP, KNN); operationally, this scaling was implemented via scikit-learn Pipelines fitted inside each training fold to avoid data leakage. This decision is based on the fact that certain algorithms are sensitive to variable scales, which may cause a feature with larger numerical values to exert greater influence on the result than another with smaller values, even if their true importance is the same [41].
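A minimal sketch of this leakage-free setup, with synthetic data and an illustrative SVR configuration: the StandardScaler sits inside the Pipeline, so it is re-fitted on the training portion of every cross-validation fold rather than on the full dataset.

```python
# Leakage-free scaling: the scaler is part of the Pipeline, so each CV fold
# fits it only on that fold's training data. Applied only to SVR/MLP/KNN.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0   # the 70/30 split used in the study
)

svr_pipeline = Pipeline([
    ("scaler", StandardScaler()),  # zero mean, unit variance, fitted per fold
    ("model", SVR(kernel="rbf", C=10.0)),
])

scores = cross_val_score(svr_pipeline, X_train, y_train, cv=5, scoring="r2")
```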
It is important to note that the output variable (CS, MPa) was kept in its original scale, as it represents a physical magnitude whose value must be preserved to allow direct interpretation of the error metrics. Thus, the best test root mean square errors were 3.49 MPa (Yeh), 3.88 MPa (Ke–Qiu), and 3.76 MPa (Biswal), values that directly reflect the magnitude of the prediction error with respect to the experimental tests. Such direct interpretation would not be possible if the target variable had been normalized.
Figure 3 illustrates the before/after distributions for visualization only; during model training and tuning, standardization was performed fold-wise within cross-validation through Pipelines (no global pre-scaling on the full dataset) and only for the scale-sensitive models (SVR, MLP, KNN), thus avoiding data leakage. This transformation homogenizes the numerical magnitudes of the features without altering their general distribution shape, preserving the physical interpretability of the data. The same fold-wise procedure was applied identically to the Ke–Qiu and Biswal datasets.
From a physical standpoint, this standardization procedure preserves the proportional relationships among mix constituents—such as the water-to-binder and aggregate-to-cement ratios—because all predictors are transformed linearly and independently without altering their relative magnitudes. Thus, the physical balance of the concrete mix is retained, and the learned interactions remain consistent with the material’s actual compositional ratios. The goal of scaling is therefore numerical stabilization, not physical rescaling, ensuring that the model’s learning process reflects the true physicochemical proportions of the mixtures.
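The fold-wise standardization described above can be sketched as follows; SVR stands in for any of the scale-sensitive models, and the data are synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for a mix-design dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# The StandardScaler inside the Pipeline is re-fitted on each training fold
# only, so the validation fold never influences the scaling parameters
model = Pipeline([
    ("scaler", StandardScaler()),
    ("svr", SVR(kernel="rbf")),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
```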
This scaling effect is particularly relevant in certain machine learning algorithms whose mathematical formulations depend directly on the magnitude of the input variables. Understanding this behavior helps justify why some models require prior normalization to achieve optimal performance. The most representative cases used in this study are described below:
  • SVR: This method seeks to find a function that remains within an optimal tolerance margin around the actual data, maximizing the distance between the support vectors and the regression hyperplane. If the variables have very different scales, the internal metric (dot product or kernels) becomes distorted, affecting the correct placement of the margin and, consequently, the quality of the prediction [34].
  • MLP: Feedforward neural networks use activation functions such as ReLU or tanh to transform inputs into internal signals. When input features are on very different scales, some neurons may receive excessively large or small values, causing activation function saturation and hindering gradient propagation during training, which slows down or even prevents convergence [35].
  • KNN: This algorithm assigns predictions based on the distances between observations, typically using the Euclidean metric. If one variable has a much larger numerical range than the others, it will dominate the distance calculation, diminishing the influence of other variables that may be more relevant to the studied phenomenon [36].
In contrast, tree-based models—such as Random Forest, XGBoost, LightGBM, and CatBoost—do not require prior normalization, as they generate splits through partition rules based on absolute thresholds and are therefore invariant to the scale of the variables [32].
Linear Regression was also kept unscaled, since both its mathematical formulation and an empirical verification performed in this study confirmed that feature standardization leaves the model predictions and the resulting metrics (RMSE, MAE, MAPE, R 2 , nRMSE) unchanged; the coefficients are merely rescaled. The results obtained with and without scaling were identical for the Ke–Qiu and Biswal datasets and showed only negligible numerical differences in the Yeh dataset (attributed to rounding). This behavior is consistent with the theoretical scale invariance of predictions from least-squares linear models; therefore, Linear Regression was maintained in its original physical units to preserve coefficient interpretability.
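This invariance can be verified empirically in a few lines: fitting ordinary least squares on raw and standardized features yields identical predictions (and therefore identical error metrics), while only the coefficient values are rescaled. The data below are synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features with deliberately mixed numerical scales
rng = np.random.default_rng(42)
X = rng.random((200, 5)) * np.array([100.0, 10.0, 1.0, 0.1, 1000.0])
y = X @ np.array([0.3, -1.2, 5.0, 40.0, 0.01]) + rng.normal(0.0, 1.0, 200)

raw = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Predictions coincide up to floating-point error, so RMSE, MAE, MAPE,
# R^2 and nRMSE are all unchanged by standardization
max_diff = np.max(np.abs(raw.predict(X) - scaled.predict(X)))
```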
This selective preprocessing criterion, applied consistently across the three datasets, ensures that each algorithm receives the variables in the most suitable representation to maximize predictive performance while maintaining methodological homogeneity within the overall modeling pipeline.

2.6. Hyperparameter Optimization

Hyperparameter optimization was performed independently for each dataset using a standardized procedure with RandomizedSearchCV. This random search method efficiently explores large hyperparameter spaces at a lower computational cost compared to exhaustive search (Grid Search) [51]. The technique samples random combinations of hyperparameters from predefined distributions, evaluating performance with K-fold cross-validation (K = 3–5, shuffle = True, random_state = 42). For SVR, MLP, and KNN, models were wrapped in Pipelines with StandardScaler so that scaling parameters were fitted only on each training fold and applied to the corresponding validation fold, eliminating data leakage and ensuring a fair comparison. The use of this homogeneous approach across the three datasets allowed balancing search space coverage and computational time, while ensuring reproducibility by fixing random seeds.
To guarantee the reproducibility of the experiments, a common random seed (42)—widely used in the scientific community and in machine learning libraries for its ease of replication—was adopted for all algorithms. This unified configuration ensures that the reported performance differences strictly reflect algorithmic and data-related effects, enhancing methodological reproducibility and eliminating potential variability caused by stochastic initialization.
Each hyperparameter configuration was evaluated through 5-fold CV to ensure statistical robustness. Scaling for SVR, MLP, and KNN was performed inside the Pipeline within each fold to prevent data leakage, while tree-based models and Linear Regression were evaluated without scaling. A total of 50 random combinations were tested for all algorithms, except for the computationally intensive MLP model, which used 30 iterations and 3-fold CV to maintain efficiency without compromising reliability.
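A condensed sketch of this search procedure, here using the SVR pipeline with the grid values listed in Section 2.6.2 and a reduced n_iter for brevity (the study used 50 iterations):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data standing in for one of the three datasets
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Pipeline step hyperparameters are addressed as "<step>__<param>"
param_distributions = {
    "svr__C": [0.1, 1, 3, 10, 30, 100, 300, 1000],
    "svr__gamma": ["scale", "auto", 1, 0.3, 0.1, 0.03, 0.01],
    "svr__epsilon": [0.01, 0.05, 0.1, 0.2, 0.5],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=20,  # 50 in the study; reduced here for speed
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)
```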
The search intervals for each hyperparameter were defined based on recommendations from specialized literature and official algorithm documentation, complemented by prior empirical experience. The same approach was applied across all three datasets to maximize performance and generalization capability while maintaining a reasonable computational cost.
The optimized hyperparameters for each model are detailed below, along with their theoretical justification, selected range, and bibliographic support.

2.6.1. Random Forest Regressor

  • n_estimators [100, 200, 300, 400, 500]: Number of trees in the ensemble. A higher number reduces variance and improves prediction stability, although it increases computational cost [33].
  • max_depth [5, 10, 15, 20, None]: Maximum depth of the trees. Limiting depth decreases overfitting risk by controlling model complexity [41].
  • min_samples_split [2, 5, 10]: Minimum number of samples required to split a node. Higher values create more general partitions and reduce variance.
  • min_samples_leaf [1, 2, 4]: Minimum number of samples per terminal leaf. Prevents splits based on very few samples, improving generalization [32].

2.6.2. Support Vector Regressor (SVR)

  • kernel [‘rbf’]: Chosen for its effectiveness in modeling nonlinear patterns in physical datasets.
  • C [0.1, 1, 3, 10, 30, 100, 300, 1000]: Controls regularization strength; smaller values increase bias, larger values reduce bias but may increase variance.
  • gamma [‘scale’, ‘auto’, 1, 0.3, 0.1, 0.03, 0.01]: Defines the kernel width; smaller values produce smoother fits, larger values more localized boundaries.
  • epsilon [0.01, 0.05, 0.1, 0.2, 0.5]: Sets the tolerance for error margins, adjusting robustness to noise.
Parameter ranges were selected based on established recommendations for regression with RBF kernels [34,52].

2.6.3. Multilayer Perceptron (MLP)

  • hidden_layer_sizes [(64,), (128,), (64,32), (128,64)]: Architectures with one or two hidden layers, balancing complexity and convergence.
  • activation [‘relu’, ‘tanh’]: Nonlinear activation functions; ReLU mitigates vanishing gradients, tanh favors centered data.
  • alpha [ 10 5 , 10 4 , 10 3 , 10 2 ]: L2 penalty coefficient for weight regularization, reducing overfitting risk.
  • learning_rate [‘adaptive’]: Dynamically adjusts step size during training for faster convergence.
  • max_iter [1000, 2000]: Ensures full convergence under 3-fold CV given stochastic initialization.
The tested architectures and regularization levels follow best practices for small- to medium-scale tabular datasets [35,53,54].

2.6.4. K-Nearest Neighbors (KNN)

  • n_neighbors [3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]: Controls the number of reference points for local interpolation.
  • weights [‘uniform’, ‘distance’]: Distance-based weighting reduces bias in heterogeneous datasets.
  • p [1, 2]: Defines the Minkowski metric order; p = 1 (Manhattan) increases robustness to outliers, p = 2 (Euclidean) emphasizes global smoothness.
The search intervals and metrics were selected following prior comparative studies on concrete strength prediction and standard practices for distance-based regressors [36,55].

2.6.5. XGBoost

  • n_estimators [100, 200, 300]: Number of sequential trees in the boosting process.
  • learning_rate [0.01, 0.05, 0.1]: Controls the contribution of each tree. Lower rates favor generalization but require more iterations [37].
  • max_depth [3, 5, 7]: Controls model complexity and its ability to capture nonlinear interactions.
  • subsample [0.6, 0.8, 1.0]: Proportion of data used in each iteration, useful for reducing overfitting [56].

2.6.6. LightGBM

  • n_estimators, learning_rate, max_depth: Same principles as in XGBoost.
  • num_leaves [31, 50, 100]: Controls the maximum number of leaves in each tree. Higher values increase complexity and reduce bias but may increase variance [38].
  • feature_fraction [0.6, 0.8, 1.0]: Fraction of features used by each tree, helping to reduce overfitting and accelerate training [38].

2.6.7. CatBoost

  • iterations [100, 300, 500]: Total number of trees in the model.
  • learning_rate [0.01, 0.05, 0.1]: Defines the magnitude of the update at each iteration.
  • depth [4, 6, 8, 10]: Depth of the symmetric (oblivious) trees used in CatBoost [39].
  • l2_leaf_reg [1, 3, 5, 7]: L2 Regularization applied to leaf values, reducing overfitting.
  • bagging_temperature [0.2, 0.5, 1.0]: Controls the randomness of weighted sampling; lower values reduce variance at the cost of increased bias [39].
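For reference, the boosting search spaces listed in Sections 2.6.5–2.6.7 can be transcribed into plain dictionaries keyed by each library's own parameter names; the LightGBM n_estimators, learning_rate, and max_depth ranges are assumed identical to XGBoost's, as the text indicates:

```python
# Search spaces transcribed from the ranges listed above; keys follow the
# parameter names of XGBRegressor, LGBMRegressor, and CatBoostRegressor
boosting_grids = {
    "XGBoost": {
        "n_estimators": [100, 200, 300],
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [3, 5, 7],
        "subsample": [0.6, 0.8, 1.0],
    },
    "LightGBM": {
        "n_estimators": [100, 200, 300],   # assumed same as XGBoost
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [3, 5, 7],
        "num_leaves": [31, 50, 100],
        "feature_fraction": [0.6, 0.8, 1.0],
    },
    "CatBoost": {
        "iterations": [100, 300, 500],
        "learning_rate": [0.01, 0.05, 0.1],
        "depth": [4, 6, 8, 10],
        "l2_leaf_reg": [1, 3, 5, 7],
        "bagging_temperature": [0.2, 0.5, 1.0],
    },
}
```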

2.6.8. Linear Regression

Ordinary linear regression does not include regularization hyperparameters by default, so its standard implementation was used as a baseline model. This model served as a reference for evaluating the relative improvement achieved by nonlinear and more complex methods.
This optimization process was designed to fit each model to the specific characteristics of each dataset, ensuring a fair comparison under homogeneous training conditions and preventing exposure of the test set during tuning.

2.7. Inference System Implementation

As the applied component of this study, three independent interactive inference systems were developed, each based on the best-performing model for its corresponding dataset: CatBoost for Yeh, XGBoost for Ke–Qiu, and LightGBM for Biswal. These systems enable the estimation of CS from the mix parameters available in each dataset, translating the predictive models into practical tools for concrete design and evaluation.
The main function, named launch_dataset_inference(), was developed and executed in the cloud computing environment Google Colab. It is designed to dynamically generate the inference interface corresponding to each dataset. Its modular structure allows the reuse of a single code block across the three systems, changing only the input parameters: the dataset name, the pre-trained model, and the ordered list of variables (feature configuration) associated with it. Based on this information, the function constructs numerical input controls, collects user-provided values, organizes them into a data frame, and sends them to the corresponding model to generate a prediction. The result is immediately displayed in the interface, expressed in megapascals (MPa) and rounded to two decimal places.
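The prediction step that launch_dataset_inference() performs behind the widgets can be illustrated with the following simplified sketch (not the authors' code): a stand-in GradientBoostingRegressor replaces the pre-trained model, and the feature names mimic the Yeh configuration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in for one of the pre-trained boosting models (synthetic data)
rng = np.random.default_rng(42)
features = ["cement", "slag", "fly_ash", "water",
            "superplasticizer", "coarse_agg", "fine_agg", "age"]
X_train = pd.DataFrame(rng.random((200, len(features))) * 100, columns=features)
y_train = rng.random(200) * 80
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

def predict_cs(model, feature_order, user_values):
    """Arrange user inputs in the model's expected column order and predict CS."""
    row = pd.DataFrame([[user_values[f] for f in feature_order]],
                       columns=feature_order)
    # MPa, rounded to two decimals as in the interface
    return round(float(model.predict(row)[0]), 2)

inputs = {f: 50.0 for f in features}  # hypothetical mix parameters
cs_pred = predict_cs(model, features, inputs)
```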
In System 1 (Yeh), based on CatBoost, the user inputs eight predictive variables: cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and concrete age. In System 2 (Ke–Qiu), implemented with XGBoost, nine parameters are included: cement, fine and coarse aggregates, water, water-reducing admixture, fly ash, accelerating agent, silica fume, and curing time. Finally, System 3 (Biswal), built with LightGBM, considers fifteen variables related to recycled concretes and supplementary admixtures, including cement, GGBS, metakaolin, water-to-total cementitious materials ratio (w/TCM), natural and recycled aggregates, superplasticizer, VMA, sand, and age.
In general, the inference performed by each system can be represented by the following function:
f_c = f(x_1, x_2, …, x_n),
where x_i corresponds to the specific mix parameters of each dataset, and f(·) represents the trained predictive model. The systems can be executed directly in Google Colab from any web browser without requiring advanced technical knowledge. Their modular architecture allows future integration into web platforms or mobile devices, facilitating use in laboratory or on-site applications.
Currently, the systems are presented as proofs of concept that demonstrate the feasibility of integrating machine learning models into interactive inference environments for predicting concrete compressive strength (f_c) from mix parameters. Their modular implementation in Google Colab facilitates adaptation to standardized laboratory frameworks, such as those established by ASTM C39, enabling their potential use alongside experimental procedures.
The characteristics and performance of the three developed systems are summarized in Table 5, which also presents the lowest RMSE value obtained on the test set.

2.8. Implementation and Computational Environment

All computational procedures were implemented in Python 3.12 using the Google Colab cloud platform (Google LLC, Mountain View, CA, USA). Data manipulation and visualization relied on widely adopted scientific libraries, including NumPy (NumPy Developers, global open-source project) [57], Pandas (Pandas Development Team, global open-source project) [58], Matplotlib (Matplotlib Development Team, USA) [59], Seaborn (Seaborn Developers, global open-source project) [60], and SciPy (SciPy Community, global open-source project) [61].
Model training, hyperparameter optimization, and evaluation were conducted using Scikit-learn (INRIA/Scikit-learn Developers, Paris, France) [62], XGBoost (Distributed Machine Learning Community, global open-source project) [37], LightGBM (Microsoft Corporation, Redmond, WA, USA) [38], and CatBoost (Yandex, Moscow, Russia) [39].
Interactive user interfaces were developed using ipywidgets and the IPython display framework (Jupyter/IPython Projects, San Francisco, CA, USA) [63]. All software versions used in the study are documented in the requirements.txt file included in the repository referenced in the Data Availability Statement.

3. Results

3.1. Model Performance per Dataset

All metrics reported in this and subsequent sections correspond to the test subset (30% of the total data), ensuring an unbiased evaluation of generalization performance.
The performance of the eight regression algorithms was evaluated using the RMSE, MAE, MAPE, R 2 , and nRMSE metrics. This analysis was applied independently to the three datasets in order to compare both absolute accuracy and relative consistency among them. The obtained results are presented in Table 6, Table 7 and Table 8.
Across the three datasets, a consistent performance pattern was observed among the models. In the Yeh dataset, the CatBoost algorithm achieved the best results (RMSE = 3.71 MPa, R 2 = 0.944, nRMSE = 23.69%), slightly outperforming LightGBM and XGBoost. In the Ke–Qiu dataset, the best-performing model was XGBoost (RMSE = 3.88 MPa, R 2 = 0.907, nRMSE = 30.47%), followed closely by CatBoost and Random Forest. Finally, in the Biswal dataset, the top performance corresponded to LightGBM (RMSE = 3.83 MPa, R 2 = 0.958, nRMSE = 20.55%). Overall, these results confirm that the gradient boosting algorithms achieved the best balance between accuracy and generalization capability across the three analyzed scenarios.
When comparing the minimum nRMSE values obtained by the best model for each dataset, the Biswal dataset exhibited the lowest relative error (20.55%), followed by Yeh (23.69%) and Ke–Qiu (30.47%). This suggests that the predictability of CS was higher in the Biswal dataset, likely due to its narrower experimental dispersion and well-controlled curing conditions, whereas the Ke–Qiu dataset presented greater variability among field-produced samples.
A consistent trend was also observed among the lower-performing algorithms. Linear and distance-based methods (Linear Regression, KNN, and SVR) showed clear limitations in capturing nonlinear relationships and complex interactions between mix composition and curing age, reflected in their higher nRMSE values (ranging from 35% to 62%) and lower R 2 . In contrast, gradient boosting algorithms demonstrated greater stability and adaptability under different experimental conditions, confirming their robustness for modeling the nonlinear behavior of concrete compressive strength.
This behavior highlights the inability of linear and proximity-based approaches to adequately represent the complexity of the problem, in contrast to the robustness of gradient boosting models, which demonstrated greater stability and generalization capability under different experimental conditions and CS ranges.
The performance results were consolidated into a single comparative visualization (Figure 4, Figure 5 and Figure 6), integrating the outcomes obtained for the three datasets.
This visualization displays the performance of the eight regression algorithms applied to datasets (1) Yeh, (2) Ke–Qiu, and (3) Biswal, distinguished by numerical labels in the charts. Each bar chart summarizes the five evaluation metrics—nRMSE, RMSE, MAE, MAPE, and R 2 —used to assess model performance. For the error metrics (nRMSE, RMSE, MAE, and MAPE), lower values indicate better model fit, whereas for the coefficient of determination ( R 2 ), higher values reflect greater predictive accuracy.

3.2. Prediction Scatter Analysis

To complement the quantitative metrics presented in Table 6, Table 7 and Table 8 and the visual comparisons in Figure 4, Figure 5 and Figure 6, scatter plots were generated between the actual and predicted values for the eight evaluated models in each dataset. Figure 7, Figure 8 and Figure 9 group these results by dataset, showing the degree of alignment of the predictions with respect to the identity line ( y = x ).
Across all three datasets, the observed patterns remain consistent with the quantitative metrics: the gradient boosting models (CatBoost, XGBoost, and LightGBM) exhibit the closest alignment to the identity line ( y = x ), reflecting their superior predictive accuracy and low residual dispersion. In particular, CatBoost achieved the tightest clustering in the Yeh dataset, XGBoost in Ke–Qiu, and LightGBM in Biswal, matching the ranking of the best-performing models reported in Table 6, Table 7 and Table 8. Conversely, the Linear Regression and KNN models consistently exhibited the poorest performance across datasets. These algorithms showed wider scatter and systematic bias, especially at higher strength ranges, which correspond to their lower R 2 values and higher RMSE. Their performance contrast highlights the inability of linear and distance-based methods to capture the nonlinear and interaction effects present in the mix composition and curing processes.
Furthermore, these scatter plots also allow examining whether the prediction accuracy varies across the strength spectrum. Following conventional classifications for concrete strength [64], low-strength concretes were considered as those with CS < 30 MPa, whereas high-strength concretes correspond to CS > 60 MPa. The visual inspection of the residual distribution revealed no systematic bias or heteroscedasticity between these ranges: the dispersion of points remained approximately uniform across both low- and high-strength regions. This indicates that the models—particularly the gradient boosting algorithms—maintained comparable accuracy throughout the entire range of CS values. Such behavior demonstrates that the inference systems preserve stable performance across concretes of different quality levels, from conventional to high-performance mixes.
Overall, these visualizations confirm the superiority of boosting methods in accurately estimating CS and their ability to maintain stable performance under different experimental conditions represented in the three datasets.

3.3. Feature Importance Analysis

The feature importance analysis is presented in Figure 10, Figure 11 and Figure 12 for the models that provide this measure directly: Linear Regression, Random Forest, XGBoost, LightGBM, and CatBoost. The KNN, MLP, and SVR models do not produce native importance estimates and were therefore excluded from this comparison. For each algorithm, the importance values were derived using its native internal criterion: absolute standardized coefficients in Linear Regression; mean decrease in impurity (MDI) for Random Forest; and the built-in gain-based importance metric for XGBoost, LightGBM, and CatBoost. All importance scores were normalized to sum to 100%, allowing a consistent cross-model comparison. This analysis allows identifying the factors that most influence the prediction of CS and verifying the physical consistency of the relationships learned by the models.
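The normalization to a common percentage scale is straightforward; with hypothetical raw gain values:

```python
import numpy as np

# Hypothetical raw gain-based importances from a tree ensemble
raw_importance = np.array([120.0, 45.0, 30.0, 5.0])

# Normalize so the scores sum to 100%, enabling cross-model comparison
pct = 100.0 * raw_importance / raw_importance.sum()
```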

3.3.1. Dataset 1 (Yeh, 1998)

In the Yeh dataset, the feature–importance patterns were highly consistent among the ensemble models. CatBoost—identified as the best performer—highlighted age (35.2%), cement (22.6%), and water (15.5%) as the dominant predictors, together explaining over 70% of the total variance. XGBoost and Random Forest reproduced this structure with slight variations, confirming the prevalence of curing time and binder content on strength development. This hierarchy reflects the fundamental principles of cement hydration, where strength increases with age and cement content but decreases with higher water–binder ratios—an empirical relationship consistent with the classical Abrams’ law [1,65].
Figure 10. Relative importance of the predictive variables in Dataset 1 (Yeh). Subfigures show the results for: (a) Linear Regression, (b) Random Forest, (c) CatBoost, (d) XGBoost, and (e) LightGBM.
Figure 10. Relative importance of the predictive variables in Dataset 1 (Yeh). Subfigures show the results for: (a) Linear Regression, (b) Random Forest, (c) CatBoost, (d) XGBoost, and (e) LightGBM.
Applsci 15 12383 g010

3.3.2. Dataset 2 (Ke–Qiu, 2024)

In the Ke–Qiu dataset, representing conventional concretes produced under field conditions, XGBoost achieved the best predictive accuracy and revealed cement (31.3%), water (17.0%), and water-reducing admixture (15.6%) as the most influential features. Fine and coarse aggregates also contributed moderately (≈8%), whereas chemical additives such as silica fume and accelerating agent showed minor effects (<2%). CatBoost and LightGBM confirmed this variable hierarchy, indicating that strength depends mainly on the joint balance between binder quantity, effective water content, and admixture dosage—factors directly governing workability, hydration, and porosity in normal-strength concretes.
Figure 11. Relative importance of the predictive variables in Dataset 2 (Ke–Qiu). Subfigures show the results for: (a) Linear Regression, (b) Random Forest, (c) CatBoost, (d) XGBoost and (e) LightGBM.
Figure 11. Relative importance of the predictive variables in Dataset 2 (Ke–Qiu). Subfigures show the results for: (a) Linear Regression, (b) Random Forest, (c) CatBoost, (d) XGBoost and (e) LightGBM.
Applsci 15 12383 g011

3.3.3. Dataset 3 (Biswal, 2022)

For the Biswal dataset, corresponding to recycled-aggregate concretes with supplementary cementitious materials, LightGBM yielded the most accurate and interpretable feature rankings. The variables age (26.7%), cement (16.5%), and fly ash (6.8%) dominated the model, followed by water/TCM, TCM, and GGBS. These parameters collectively describe the main physico-chemical mechanisms of strength gain in blended and recycled concretes, where pozzolanic reactions between SCMs (fly ash, GGBS, metakaolin) and portlandite progressively refine pore structure and enhance matrix densification. CatBoost and XGBoost reproduced the same hierarchy with minimal variation, reinforcing the robustness of boosting ensembles in capturing these nonlinear, synergistic effects.
Figure 12. Relative importance of the predictive variables in Dataset 3 (Biswal). Subfigures show the results for: (a) Linear Regression, (b) Random Forest, (c) CatBoost, (d) XGBoost and (e) LightGBM.
Figure 12. Relative importance of the predictive variables in Dataset 3 (Biswal). Subfigures show the results for: (a) Linear Regression, (b) Random Forest, (c) CatBoost, (d) XGBoost and (e) LightGBM.
Applsci 15 12383 g012
Taken together, these results confirm the physical consistency of the learned relationships across datasets and models, highlighting the dominant influence of cement content, water–binder ratio, and curing age on CS development.

3.4. Ranking Stability Across Datasets

To analyze the consistency of algorithm performance across different experimental contexts, the individual rankings for each dataset were computed based on the nRMSE (%) metric. Subsequently, the average ranking was obtained to identify the models with the highest stability and generalization capability across datasets.
Figure 13 summarizes these results: panel (a) shows the relative order of the eight algorithms within each dataset, whereas panel (b) presents the overall average ranking across the three datasets. In panel (a), rank 1 denotes the best-performing model within each dataset, visualized using a color scale in which lighter tones indicate better ranks and darker tones denote worse ranks. In panel (b), a lower average rank indicates better performance and greater inter-dataset stability.
The average ranking again confirmed the dominance of the gradient boosting family, with CatBoost (1.67), LightGBM (2.00), and XGBoost (2.33) achieving the top positions. The specific optimal model varied slightly across datasets—CatBoost led in Yeh, LightGBM in Biswal, and XGBoost in Ke–Qiu—indicating that while all three share high generalization capacity, their relative advantages depend on dataset characteristics such as binder composition and curing regime. Intermediate performance was observed for MLP and Random Forest (4.33 and 5.00), whereas SVR, KNN, and Linear Regression ranked lowest, reflecting their limited ability to capture nonlinear mix–strength interactions.
The Spearman correlation coefficients between dataset rankings remained high—0.88 (Yeh–Ke–Qiu), 0.93 (Yeh–Biswal), and 0.74 (Ke–Qiu–Biswal)—demonstrating a strong positive association ( ρ > 0.7 ) in the relative performance order of the models. This confirms that the algorithms exhibit coherent ranking behavior under different experimental conditions, reinforcing the robustness of boosting ensembles for predicting CS across concretes of varying compositions.
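The rank agreement reported above can be computed with scipy; the rank vectors below are hypothetical examples, not the study's actual rankings:

```python
from scipy.stats import spearmanr

# Hypothetical nRMSE-based ranks of the eight algorithms in two datasets
# (1 = best); the actual ranks come from Tables 6-8
ranks_a = [1, 2, 3, 5, 4, 6, 7, 8]
ranks_b = [2, 1, 3, 4, 5, 7, 6, 8]

rho, p_value = spearmanr(ranks_a, ranks_b)  # rho > 0.7: strong agreement
```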
Overall, these findings evidence a statistically consistent inter-dataset ranking structure, supporting the reliability and transferability of boosting-based regressors as the most stable and generalizable family of models.

3.5. Inference Systems

Three interactive inference systems based on machine learning algorithms were developed for predicting CS, each built from the best-performing model obtained for its corresponding dataset. Figure 14 shows the general interface of the three systems, labeled 1, 2, and 3, corresponding to the Yeh, Ke–Qiu, and Biswal datasets, respectively.
Each system allows users to enter mix parameters—such as cement, water, aggregates, and curing age—and obtain the predicted CS in real time. These tools enable immediate estimation of CS from mix proportions and curing age, providing users with an interactive and reproducible experience through Google Colab. All systems were implemented in the Python programming language using open-source machine learning frameworks and scientific libraries. Further technical details, including the computational environment, dependencies, and source code, are provided in the Data Availability Statement.
The development of these systems represents an applied phase of technological transfer, aimed at bridging predictive modeling from theoretical research to practical applications. Rather than replacing experimental testing, the proposed tools are intended to complement standardized laboratory procedures, such as ASTM C31/C31M and ASTM C39/C39M, by providing rapid and reproducible estimates of CS that can support decision-making during mix design and quality control.
Overall, the developed systems represent a first step toward the intelligent automation of concrete design, providing an accessible, reproducible, and low-cost tool that complements traditional physical testing. These platforms can support preliminary mix design, experimental planning, and quality control, helping to reduce time, cost, and reliance on destructive testing.
The graphical interfaces shown in Figure 14 are fully functional and can be used directly by executing the accompanying Google Colab notebook. The notebook, together with the datasets required for activation of each interface, is provided in the repository referenced in the Data Availability Statement. Once the notebook is executed in its entirety, the three interfaces become available for entering mix parameters and obtaining real-time predictions of compressive strength (MPa).

4. Discussion

This section interprets the comparative results across three experimental contexts, focusing on the physical meaning of variable sensitivities and the nonlinear structures learned by the models.

4.1. Performance Overview

Gradient boosting algorithms (CatBoost, XGBoost, and LightGBM) consistently outperformed linear, neighbor-based, and neural network models, confirming their superior capacity to capture nonlinear relationships and threshold effects among mix parameters and curing conditions [66,67].

4.2. Mechanistic Interpretation I: Inter-Dataset Sensitivity

Across datasets, variable sensitivity revealed physically coherent trends. In Yeh’s high-performance concrete, age, cement, and water dominated, reproducing the inverse relationship of Abrams’ law [65]. In Ke–Qiu’s conventional concretes, cement, water, and admixtures controlled strength via hydration and porosity. In Biswal’s recycled concretes, age and SCMs (fly ash, GGBS, metakaolin) reflected pozzolanic reactions and microstructural densification [68]. These dataset-dependent patterns indicate that boosting models adapt to material composition and curing microclimate, capturing hydration kinetics and binder–aggregate interactions beyond simple statistical fitting.

4.3. Mechanistic Interpretation II: Physical Thresholds and Nonlinear Transitions

The tree-based boosting algorithms capture nonlinear behavior through decision thresholds that align with physical transition zones. Splits around w/b ≈ 0.40–0.45 correspond to percolation limits of capillary porosity [69,70], while divisions near 7–14–28 days match hydration kinetics and microstructural refinement [7,68]. In blended and recycled concretes, threshold interactions among SCMs reproduce the onset of pozzolanic activity [71]. Thus, the internal tree structure encodes microstructural transitions—porosity percolation, C–S–H connectivity, and phase evolution—demonstrating that predictive accuracy arises from physically meaningful representations of strength development.

4.4. Inter-Dataset Stability and Generalization

The ranking and correlation analyses (Section 3.4) demonstrated that the boosting algorithms maintained a consistent performance hierarchy across all datasets, confirming their high inter-dataset stability. Rather than being dataset-specific, their predictive behavior generalized effectively to concretes of different compositions and strength ranges. This statistical robustness supports the interpretation that boosting models capture transferable patterns related to hydration and microstructural development, providing a physically grounded form of generalization beyond mere numerical consistency.
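The inter-dataset stability claim rests on Spearman rank correlations between per-dataset algorithm rankings. The sketch below shows the computation; the rank vectors are illustrative placeholders (1 = best by nRMSE), not the study’s actual Section 3.4 rankings.

```python
# Sketch: checking ranking stability across datasets with Spearman's rho.
from scipy.stats import spearmanr

algorithms   = ["Linear", "RF", "SVR", "MLP", "KNN", "CatBoost", "XGB", "LGBM"]
ranks_yeh    = [8, 4, 6, 5, 7, 1, 2, 3]   # hypothetical example ranks
ranks_keqiu  = [8, 5, 6, 4, 7, 2, 1, 3]
ranks_biswal = [7, 4, 6, 5, 8, 2, 3, 1]

rho_12, _ = spearmanr(ranks_yeh, ranks_keqiu)
rho_13, _ = spearmanr(ranks_yeh, ranks_biswal)
print(f"rho(Yeh, Ke-Qiu) = {rho_12:.2f}")
print(f"rho(Yeh, Biswal) = {rho_13:.2f}")
```

Values of rho above 0.8, as reported in the Conclusions, indicate that the performance hierarchy is largely preserved across datasets.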

4.5. Engineering Relevance of Prediction Errors

From an engineering perspective, the predictive errors obtained in this study are within the range of experimental variability expected for standardized compression tests. When normalized with respect to the average CS of each dataset, the best-performing models (CatBoost, XGBoost, and LightGBM) yielded relative RMSE values of approximately 6–7% for Yeh, 9–10% for Ke–Qiu, and 7–8% for Biswal. These magnitudes are comparable to the inherent variability observed in laboratory testing according to ASTM C39/C39M, Section 11.1 (“Precision and Bias”) [2], which reports typical deviations between replicate specimens of about 5–10% depending on specimen geometry, curing, and operator influence.
This correspondence indicates that the residual uncertainty of the models is of the same order as that of the experimental procedure itself, implying that the data-driven predictions are not only statistically accurate but also engineering-equivalent in reliability to physical testing. Therefore, the developed inference systems can be used as practical tools for preliminary mix design, rapid estimation of CS, or verification of laboratory results, supporting decision-making in quality control and material optimization without replacing standardized testing protocols.
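The relative (normalized) RMSE used in the comparison above is simply the RMSE divided by the mean measured strength. A minimal sketch, with hypothetical strength values rather than the study’s data:

```python
# Sketch: RMSE normalized by the mean measured strength, i.e. the
# relative error compared against ASTM C39 replicate variability.
import numpy as np

def relative_rmse(y_true, y_pred):
    """RMSE expressed as a fraction of the mean measured strength."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / y_true.mean()

measured  = [32.1, 45.0, 28.7, 51.3, 38.4]   # MPa, hypothetical lab values
predicted = [30.8, 46.9, 29.5, 49.0, 40.1]   # MPa, hypothetical model output
print(f"relative RMSE: {100 * relative_rmse(measured, predicted):.1f}%")
```

A model whose relative RMSE falls in the 5–10% band is, by this criterion, operating within the precision envelope of the physical test itself.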

4.6. Practical Implications and Inference Systems

The three inference systems developed in Google Colab translate the modeling results into practical, user-oriented applications. Each system enables the rapid estimation of CS from input parameters such as cement content, water dosage, aggregates, and curing age, allowing engineers to perform reproducible “what-if” analyses without the need for local infrastructure or advanced programming skills. These interactive tools complement standardized laboratory procedures (ASTM C31/C39) [2,3] by providing immediate, non-destructive strength estimations that can support mix proportioning, quality control, and early decision-making during construction. The open-source implementation aligns with FAIR principles—Findable, Accessible, Interoperable, and Reusable—and promotes the transparent adoption of AI-based predictive systems in civil engineering practice.
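The core of such an inference system is a thin prediction wrapper around a trained model. The sketch below is a minimal stand-in: a scikit-learn GradientBoostingRegressor trained on synthetic data replaces the study’s trained CatBoost models, and the feature names and toy strength law are illustrative assumptions only.

```python
# Sketch: a minimal "what-if" inference helper in the spirit of the
# Colab systems. Synthetic stand-in model; not the study's trained model.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ["cement", "water", "coarse_agg", "fine_agg", "age"]

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.uniform([100, 120, 800, 600, 1],
                             [550, 250, 1200, 900, 365],
                             size=(400, 5)), columns=FEATURES)
# Toy strength law used only to give the stand-in model something to learn.
y = 0.07 * X["cement"] - 0.12 * X["water"] + 6 * np.log(X["age"])
model = GradientBoostingRegressor(random_state=2).fit(X, y)

def estimate_cs(mix: dict) -> float:
    """Estimate compressive strength (MPa) from a mix-design dictionary."""
    row = pd.DataFrame([[mix[f] for f in FEATURES]], columns=FEATURES)
    return float(model.predict(row)[0])

cs = estimate_cs({"cement": 380, "water": 180, "coarse_agg": 1000,
                  "fine_agg": 750, "age": 28})
print(f"estimated CS: {cs:.1f} MPa")
```

In the deployed notebooks the same pattern applies, with the serialized best-performing model per dataset and interactive input widgets replacing the dictionary literal.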

4.7. Limitations and Threats to Validity

Despite the statistical consistency and mechanistic coherence of the results, certain limitations must be acknowledged. The datasets used encompass bounded composition and strength ranges and do not include environmental parameters such as temperature, humidity, or curing regime, all of which can influence hydration kinetics and mechanical development. Consequently, model extrapolation beyond the observed domain should be approached with caution. The use of normalized metrics (nRMSE) effectively allowed cross-dataset comparison, but true generalization to extreme concretes—such as ultra-high-performance or lightweight types—requires empirical validation.
Although each dataset was modeled independently, the variable Age embodies different hydration kinetics depending on the binder chemistry. In Portland-cement systems (Yeh, Ke–Qiu), strength gain is mainly governed by primary C–S–H formation, whereas in blended concretes containing fly ash, GGBS, or metakaolin (Biswal), pozzolanic reactions evolve more slowly and continue beyond 28 days. Thus, “Age” functions as an empirical proxy for hydration rather than a physically normalized parameter. Future work should incorporate a binder-type categorical descriptor or normalize curing time relative to the degree of hydration to ensure mechanistic comparability across datasets.
Building on these results, the next phase of this research will focus on the experimental validation of the developed inference systems under controlled laboratory conditions. This step will assess whether the predictive accuracy demonstrated in silico can be reproduced in real mix designs and curing environments, thus verifying the models’ generalization and practical reliability. The experimental design and validation procedures are detailed in Section 5.

5. Future Work

Future work will focus on implementing the CatBoost-based inference system in real laboratory environments to validate its predictive performance under controlled experimental conditions. The selection of the CatBoost–Yeh model is justified not only by its high predictive accuracy but also by its ease of experimental replication, since the methodology reported by Yeh (1998) is well documented and involves a reasonable number of measurable variables suitable for laboratory implementation.
The experimental procedure will replicate the methodology of Yeh (1998) for the preparation and testing of concrete cylinders, using 100 × 200 mm specimens cast and cured according to the practices established in ASTM C31/C31M [3] and tested in compression following ASTM C39/C39M [2] standards. Several new concrete mixtures will be designed within the empirical ranges represented in the Yeh dataset, guided by the proportions suggested by the inference system but without duplicating specific records from the original databases, in order to evaluate the model’s true generalization rather than memorization capability.
This experimental phase will enable a direct comparison between the model’s predicted and measured strengths, providing an assessment of the accuracy, robustness, and practical applicability of the developed system. The insights obtained from this validation will guide the future integration of AI-based predictive systems into standard laboratory workflows and industrial mix-design practices, reinforcing the bridge between computational modeling and experimental civil engineering.
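The planned predicted-versus-measured comparison reduces to residual statistics and a per-mix acceptance check. A sketch with hypothetical placeholder values (the validation data do not yet exist); the 10% flagging threshold is an illustrative assumption, not a criterion from the standards:

```python
# Sketch: comparing inference-system predictions against lab-measured
# strengths in the planned validation phase. Values are hypothetical.
import numpy as np

predicted = np.array([34.2, 41.5, 52.8, 29.6])   # MPa, from inference system
measured  = np.array([32.9, 43.0, 50.1, 31.2])   # MPa, ASTM C39 tests

residuals = predicted - measured
mae  = np.mean(np.abs(residuals))
rmse = np.sqrt(np.mean(residuals ** 2))
# Flag mixes whose absolute error exceeds 10% of the measured value.
flagged = np.abs(residuals) > 0.10 * measured
print(f"MAE = {mae:.2f} MPa, RMSE = {rmse:.2f} MPa, flagged = {flagged.sum()}")
```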

6. Conclusions

This study developed and applied a systematic and reproducible methodology to compare the performance of eight supervised regression algorithms in predicting the compressive strength of concrete (CS, MPa), using three experimental datasets representative of different material and production contexts. The optimization strategy based on RandomizedSearchCV, combined with normalized performance metrics, ensured a fair and statistically coherent comparison among models and datasets, providing a comprehensive evaluation of their stability and generalization capability.
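The tuning strategy summarized above (RandomizedSearchCV with a homogeneous cross-validation scheme) can be sketched as follows. A scikit-learn GradientBoostingRegressor and synthetic data stand in for the study’s CatBoost/XGBoost/LightGBM models and datasets, and the grid values are illustrative.

```python
# Sketch: RandomizedSearchCV over a boosting regressor with a fixed,
# homogeneous CV scheme, as in the methodology described above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=10, random_state=0)

param_dist = {                      # illustrative search space
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # same folds for all models
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=10, cv=cv, scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best CV RMSE: {-search.best_score_:.2f}")
```

Fixing the CV splitter and the random seeds is what makes the cross-model, cross-dataset comparison statistically coherent.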
The results consistently demonstrate that gradient boosting algorithms—particularly CatBoost, XGBoost, and LightGBM—outperform linear, neighbor-based, and neural network methods across all analyzed scenarios.
Within this family, the optimal model varied slightly across datasets: CatBoost achieved the highest accuracy in the Yeh dataset, XGBoost in Ke–Qiu, and LightGBM in Biswal. Across all cases, the boosting models reached R² values between 0.93 and 0.96 and RMSE values between 3.7 and 3.9 MPa, confirming their robustness, statistical stability, and engineering reliability for predicting CS in diverse concrete systems.
The inter-dataset analysis revealed a strong consistency in performance rankings (Spearman correlation coefficients ρ > 0.8), confirming the statistical stability of boosting models even under substantial variations in concrete composition, aggregate type, and the presence of supplementary cementitious materials.
Beyond their statistical performance, the boosting models exhibited mechanistic coherence with established empirical and physicochemical principles of concrete behavior, as detailed in the Discussion section.
Despite these advances, certain limitations remain related to the composition and strength ranges of the datasets, which do not include environmental variables such as temperature, humidity, or curing regime. These omissions constrain extrapolation beyond the observed domain, particularly for extreme concretes (ultra-high-performance or lightweight types).
The implementation of three interactive inference systems—one for each dataset—represents a significant practical contribution, translating predictive modeling outcomes into an accessible environment via Google Colab. These tools enable real-time estimations without requiring local infrastructure or advanced programming skills and align with the FAIR principles of openness and reproducibility. Their potential applications in mix design, quality control, and early-age strength prediction mark an important step toward integrating artificial intelligence into concrete engineering practice.
In summary, this research provides a robust, comparative, and transparent methodological framework for applying machine learning to the prediction of concrete mechanical properties. By integrating a multi-dataset approach, normalized metrics, stability analysis, and practical inference tools, this work lays the foundation for developing more generalizable, reproducible, and transferable models, contributing to advances in materials engineering and the digitalization of construction processes.

Author Contributions

Conceptualization, C.E.O.-M.; methodology, C.E.O.-M.; software, C.E.O.-M., S.V.-R. and M.d.J.L.-M.; validation, C.E.O.-M.; formal analysis, C.E.O.-M. and C.A.O.-O.; investigation, C.E.O.-M., J.I.d.l.R.-V. and J.V.G.-A.; resources, L.O.S.-S., S.V.-R., D.D.-C. and J.V.G.-A.; data curation, C.E.O.-M., L.O.S.-S., S.V.-R. and J.I.d.l.R.-V.; writing—original draft preparation, C.E.O.-M. and J.A.R.-R.; writing—review and editing, C.A.O.-O., D.D.-C. and J.I.d.l.R.-V.; visualization, C.E.O.-M., L.O.S.-S. and D.D.-C.; supervision, C.A.O.-O. and M.d.J.L.-M.; project administration, C.A.O.-O. and M.d.J.L.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets analyzed in this study are publicly available and can be freely accessed as follows: Dataset 1—Concrete Compressive Strength (Yeh, 1998), available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength (accessed on 19 November 2025). Dataset 2—Concrete Compressive Strength and Slump Dataset (Ke & Qiu, 2024), available at Mendeley Data: https://data.mendeley.com/datasets/zrsbhndz9f/1 (accessed on 19 November 2025). Dataset 3—Recycled Aggregate Concrete with Fly Ash, GGBS, and Metakaolin (Biswal et al., 2022), available at Mendeley Data: https://data.mendeley.com/datasets/5wkxzmzwnz/2 (accessed on 19 November 2025). All code, Colab notebooks, trained models, and figures generated during this study are openly available in the Supplementary GitHub repository: https://github.com/1carloso1/ai-inference-system-concrete-strength (accessed on 19 November 2025). All scripts were implemented in Python 3.12.12 within the Google Colab environment, using open-source machine learning libraries (scikit-learn 1.6.1, XGBoost 3.1.1, LightGBM 4.6.0, CatBoost 1.2.8, NumPy 2.0.2, Pandas 2.2.2, Matplotlib 3.10.0, and Seaborn 0.13.2). A complete list of dependencies is provided in the requirements.txt file within the repository to ensure full reproducibility. No new experimental data were generated in this study.

Conflicts of Interest

David Duarte-Correa is employed by Tlachia Systems. All other authors declare that they have no competing financial interests or personal relationships that could have influenced the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
ML: Machine Learning
CS: Compressive Strength
MPa: Megapascals
CV: Cross Validation
W/C: Water-to-Cement Ratio
HPC: High-Performance Concrete
RAC: Recycled Aggregate Concrete
SCM: Supplementary Cementitious Materials
TCM: Total Cementitious Materials
GGBS: Ground-Granulated Blast-Furnace Slag
SP: Superplasticizer
VMA: Viscosity-Modifying Agent
NCA: Natural Coarse Aggregate
RCA: Recycled Coarse Aggregate
SVR: Support Vector Regression
MLP: Multilayer Perceptron
KNN: k-Nearest Neighbors
RF: Random Forest
ANN: Artificial Neural Network
SHAP: SHapley Additive exPlanations
GWO: Grey Wolf Optimizer
IGWO: Improved Grey Wolf Optimizer
QPSO: Quantum-behaved Particle Swarm Optimization
RBFNN: Radial Basis Function Neural Network
DA: Dragonfly Algorithm
SA: Simulated Annealing
RMSE: Root Mean Square Error
MAE: Mean Absolute Error
MAPE: Mean Absolute Percentage Error
nRMSE: Normalized Root Mean Square Error
R²: Coefficient of Determination
PICP: Prediction Interval Coverage Probability
FAIR: Findable, Accessible, Interoperable, Reusable
UCI: UCI Machine Learning Repository
IIT: Indian Institute of Technology
ASTM: ASTM International (Standards Organization)
MDI: Mean Decrease in Impurity

References

  1. Neville, A.M. Properties of Concrete, 5th ed.; Pearson Education Limited: Harlow, UK, 2011. [Google Scholar]
  2. ASTM C39/C39M; Standard Test Method for Compressive Strength of Cylindrical Concrete Specimens. ASTM International: West Conshohocken, PA, USA, 2023. [CrossRef]
  3. ASTM C31/C31M; Standard Practice for Making and Curing Concrete Test Specimens in the Field. ASTM International: West Conshohocken, PA, USA, 2022.
  4. ASTM C684/C684M; Standard Test Method for Compressive Strength of Hydraulic Cement Mortars (Using Portions of Prisms Broken in Flexure). ASTM International: West Conshohocken, PA, USA, 2020.
  5. ASTM C597; Standard Test Method for Pulse Velocity Through Concrete. ASTM International: West Conshohocken, PA, USA, 2016. [CrossRef]
  6. ASTM C805/C805M; Standard Test Method for Rebound Number of Hardened Concrete. ASTM International: West Conshohocken, PA, USA, 2018. [CrossRef]
  7. Taylor, H.F.W. Cement Chemistry, 2nd ed.; Thomas Telford Publishing: London, UK, 1997. [Google Scholar]
  8. Mehta, P.K.; Monteiro, P.J.M. Concrete: Microstructure, Properties, and Materials, 4th ed.; McGraw–Hill Education: New York, NY, USA, 2013. [Google Scholar]
  9. Scrivener, K.L.; Snellings, R.; Lothenbach, B. A Practical Guide to Microstructural Analysis of Cementitious Materials; CRC Press: Boca Raton, FL, USA, 2015. [Google Scholar]
  10. Kosmatka, S.H.; Kerkhoff, B.; Panarese, W.C.; Tanesi, J. Diseño y Control de Mezclas de Concreto, 14th ed.; Traducción al español del manual clásico de PCA; Portland Cement Association: Skokie, IL, USA, 2004. [Google Scholar]
  11. Dua, D.; Graff, C. UCI Machine Learning Repository: Concrete Compressive Strength Data Set. 2017. Available online: https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength (accessed on 16 July 2025).
  12. Yeh, I.-C. Modeling of strength of high-performance concrete using artificial neural networks. Cem. Concr. Res. 1998, 28, 1797–1808. [Google Scholar] [CrossRef]
  13. Rathakrishnan, M.; Lavanya, C.; Ganesan, K. Comparative analysis of machine learning models to predict compressive strength of slag-based concrete. Mater. Today Proc. 2022, 62, 5286–5293. [Google Scholar] [CrossRef]
  14. Xu, Y.; Zhang, S.; Liu, C. Prediction of concrete strength using machine learning with ensemble methods. Eng. Struct. 2020, 223, 111136. [Google Scholar] [CrossRef]
  15. Gamil, M. Machine learning techniques for predicting the compressive strength of concrete: A review. Front. Built Environ. 2023, 9, 1145591. [Google Scholar] [CrossRef]
  16. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems; Curran Associates: Red Hook, NY, USA, 2017; Volume 30, Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf (accessed on 11 October 2025).
  17. Zhang, C.; Li, B.; Wang, J.; Zhou, Y. DeepForest-based ensemble learning model for predicting high-performance concrete compressive strength. Sci. Rep. 2024, 14, 69616. [Google Scholar] [CrossRef]
  18. Shaaban, M.; Amin, M.; Selim, S.; Riad, I.M. Machine learning approaches for forecasting compressive strength of high-strength concrete. Sci. Rep. 2025, 15, 25567. [Google Scholar] [CrossRef]
  19. Olawale, O.A.; Akinosho, T.D.; Oyedele, L.O.; Ajayi, A.O.; Akanbi, L.A.; Delgado, J.M.D. Explainable artificial intelligence framework for predicting compressive strength of concrete with SHAP-based model interpretation. AI Civ. Eng. 2025, 1, 4. [Google Scholar] [CrossRef]
  20. Cakiroglu, C.; Islam, K.; Bekdaş, G.; Kim, S.; Geem, Z.W. Interpretable Machine Learning Algorithms to Predict the Axial Capacity of FRP-Reinforced Concrete Columns. Materials 2022, 15, 2742. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Liu, Z.; Wang, Q.; Li, H. Machine Learning Modelling for Prediction of Concrete/Mortar Strength Using Experimental Datasets. J. Build. Eng. 2024, 71, 110912. [Google Scholar] [CrossRef]
  22. Wank, Q.; Zhang, X.; Li, H.; Zhou, Y. Hybrid machine learning models for predicting compressive strength and slump flow of high-performance concrete. Sci. Rep. 2025, 15, 10860. [Google Scholar] [CrossRef]
  23. Xu, Y.; Afzal, A.; Li, Z.; Wang, Y. Applying machine learning techniques in the form of ensemble and hybrid models to appraise hardness/strength properties of high-performance concrete. J. Intell. Fuzzy Syst. 2024, 46, 2749–2764. [Google Scholar] [CrossRef]
  24. Zhao, P.; Li, Y.; Chen, H. Predicting the compressive strength of high-performance concrete by using Radial Basis Function with optimization (Improved Grey Wolf optimizer and Dragonfly algorithm). J. Intell. Fuzzy Syst. 2023, 45, 7917–7932. [Google Scholar] [CrossRef]
  25. Cheng, Y.; Liu, J.; Wu, T. Data-driven modeling and uncertainty analysis for concrete compressive strength prediction using ensemble learning. Eng. Appl. Artif. Intell. 2021, 100, 104214. [Google Scholar] [CrossRef]
  26. Bitencourt, L.; Barbosa, F.; Monteiro, D. A reproducibility analysis of ML models for concrete strength: Challenges and recommendations. J. Build. Eng. 2024, 87, 107142. [Google Scholar]
  27. Ke, L.; Ming, Q. Dataset of compressive strength and slump of normal concrete. Mendeley Data 2024, V1. [Google Scholar] [CrossRef]
  28. Biswal, U.S.; Pasla, D.; Mishra, M. Experimental dataset for concrete compressive strength prediction from IIT Bhubaneswar concrete laboratory. Mendeley Data 2022, V2. [Google Scholar] [CrossRef]
  29. Chou, J.-S.; Pham, A.-D. Enhanced artificial intelligence for ensemble approach to predicting civil infrastructure costs. Constr. Build. Mater. 2013, 49, 554–563. [Google Scholar] [CrossRef]
  30. Vargas, A.; Martínez, L.; López, J. Machine-learning-based predictive models for compressive strength, flexural strength, and slump of concrete. Appl. Sci. 2024, 14, 4426. [Google Scholar] [CrossRef]
  31. Biswal, U.S.; Mishra, M.; Singh, M.K.; Pasla, D. Experimental investigation and comparative machine learning prediction of the compressive strength of recycled aggregate concrete incorporated with fly ash, GGBS, and metakaolin. Innov. Infrastruct. Solut. 2022, 7, 242. [Google Scholar] [CrossRef]
  32. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
  33. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  34. Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
  35. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), Chia Laguna Resort, Italy, 13–15 May 2010; pp. 249–256. Available online: https://proceedings.mlr.press/v9/glorot10a.html (accessed on 11 October 2025).
  36. Zhang, Z. Introduction to machine learning: K-nearest neighbors. Ann. Transl. Med. 2016, 4, 218. [Google Scholar] [CrossRef] [PubMed]
  37. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  38. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Available online: https://papers.nips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html (accessed on 11 October 2025).
  39. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; Available online: https://papers.nips.cc/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html (accessed on 11 October 2025).
  40. Montgomery, D.C.; Peck, E.A.; Vining, G.G. Introduction to Linear Regression Analysis, 5th ed.; Wiley: Hoboken, NJ, USA, 2012. [Google Scholar]
  41. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R, 2nd ed.; Springer: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  42. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  43. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688. [Google Scholar] [CrossRef]
  44. Kvalseth, T.O. Cautionary note about R2. Am. Stat. 1985, 39, 279–285. [Google Scholar] [CrossRef]
  45. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination (R2) is more informative than SMAPE, MAE, MAPE, MSE, and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
  46. ScienceDirect Topics. Root Mean Square Error—Overview. 2024. Available online: https://www.sciencedirect.com/topics/engineering/mean-square-error (accessed on 11 October 2025).
  47. Lightning AI. Normalized Root Mean Squared Error (NRMSE). 2024. Available online: https://lightning.ai/docs/torchmetrics/latest/regression/normalized_root_mean_squared_error.html (accessed on 11 October 2025).
  48. Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611. [Google Scholar] [CrossRef]
  49. Kim, H.-Y. Statistical notes for clinical researchers: Assessing normal distribution (2) using skewness and kurtosis. Restor. Dent. Endod. 2013, 38, 52–54. [Google Scholar] [CrossRef]
  50. Doane, D.P.; Seward, L.E. Measuring skewness: A forgotten statistic? J. Stat. Educ. 2011, 19, 2. [Google Scholar] [CrossRef]
  51. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  52. Drucker, H.; Burges, C.J.C.; Kaufman, L.; Smola, A.; Vapnik, V. Support vector regression machines. In Proceedings of the Advances in Neural Information Processing Systems 9 (NeurIPS), Denver, CO, USA, 2–5 December 1996; pp. 155–161. [Google Scholar]
  53. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  54. Ng, A.Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the 21st International Conference on Machine Learning (ICML), Banff, AB, Canada, 4–8 July 2004; p. 78. [Google Scholar] [CrossRef]
  55. Tan, P.-N.; Steinbach, M.; Kumar, V. Introduction to Data Mining, 2nd ed.; Pearson Education: London, UK, 2018. [Google Scholar]
  56. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  57. Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array Programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef] [PubMed]
  58. Reback, J.; McKinney, W.; Jbrockmendel; Van Den Bossche, J.; Augspurger, T.; Cloud, P.; Gfyoung; Sinhrks; Klein, A.; Hawkins, S.; et al. pandas-dev/pandas: Pandas; Zenodo: Geneva, Switzerland, 2020. [Google Scholar] [CrossRef]
  59. Hunter, J.D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 2007, 9, 90–95. [Google Scholar] [CrossRef]
  60. Waskom, M.L. Seaborn: Statistical Data Visualization. J. Open Source Softw. 2021, 6, 3021. [Google Scholar] [CrossRef]
  61. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberl, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272. [Google Scholar] [CrossRef]
  62. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Rrettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  63. Pérez, F.; Granger, B.E. IPython: A System for Interactive Scientific Computing. Comput. Sci. Eng. 2007, 9, 21–29. [Google Scholar] [CrossRef]
  64. American Concrete Institute (ACI). ACI 363R-10: Report on High-Strength Concrete; ACI Committee 363: Farmington Hills, MI, USA, 2010. [Google Scholar]
  65. Abrams, D.A. Design of Concrete Mixtures; Bulletin No. 1, Structural Materials Research Laboratory; Lewis Institute: Chicago, IL, USA, 1918; Available online: https://babel.hathitrust.org/cgi/pt?id=mdp.39015011944555 (accessed on 4 November 2025).
  66. Ahmad, A.; Farooq, F.; Ostrowski, K.A.; Malik, M. Comparative Study of Ensemble Learning Algorithms for Predicting the Compressive Strength of High-Performance Concrete. Materials 2023, 16, 2549. [Google Scholar] [CrossRef]
  67. Thomas, M.; Yang, Y.; Kim, S. Interpretable Gradient Boosting Models for Predicting Concrete Properties: Insights into Nonlinear Interactions and Mix Design Variables. J. Build. Eng. 2023, 74, 106731. [Google Scholar] [CrossRef]
  68. Thomas, M.D.A. Supplementary Cementing Materials in Concrete, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2015; ISBN 978-1466592630. [Google Scholar]
  69. Powers, T.C.; Brownyard, T.L. Studies of the Physical Properties of Hardened Portland Cement Paste. J. Am. Concr. Inst. 1958, 18, 101–132. [Google Scholar]
  70. Jennings, H.M. A Model for the Microstructure of Calcium Silicate Hydrate in Cement Paste. Cem. Concr. Res. 2000, 30, 101–116. [Google Scholar] [CrossRef]
  71. Scrivener, K.L.; John, V.M.; Gartner, E.M. Eco-efficient Cements: Potential, Economically Viable Solutions for a Low-CO2, Cement-based Materials Industry. Cem. Concr. Res. 2015, 114, 2–26. [Google Scholar] [CrossRef]
Figure 1. General workflow of the proposed methodology comprising seven stages: data collection, algorithm selection, preprocessing, optimized training, performance evaluation, ranking stability analysis, and inference system development.
Figure 2. Histograms of the target variable (y, CS) for the three datasets: (1) Yeh, (2) Ke–Qiu, and (3) Biswal.
Figure 3. Predictor variables of the Yeh dataset: (a) Cement, (b) Slag, (c) Fly ash, (d) Water, (e) Superplasticizer, (f) Coarse aggregate, (g) Fine aggregate, and (h) Age.
Figure 4. Comparative performance of the regression algorithms for Dataset 1 (Yeh).
Figure 5. Comparative performance of the regression algorithms for Dataset 2 (Ke–Qiu).
Figure 6. Comparative performance of the regression algorithms for Dataset 3 (Biswal).
Figure 7. Scatter plots between actual and predicted values for the eight regression models in Dataset 1 (Yeh). The subfigures show: (a) Linear Regression, (b) Random Forest, (c) SVR, (d) MLP, (e) KNN, (f) CatBoost, (g) XGBoost, and (h) LightGBM.
Figure 8. Scatter plots between actual and predicted values for the eight regression models in Dataset 2 (Ke–Qiu). The subfigures show: (a) Linear Regression, (b) Random Forest, (c) SVR, (d) MLP, (e) KNN, (f) CatBoost, (g) XGBoost, and (h) LightGBM.
Figure 9. Scatter plots between actual and predicted values for the eight regression models in Dataset 3 (Biswal). The subfigures show: (a) Linear Regression, (b) Random Forest, (c) SVR, (d) MLP, (e) KNN, (f) CatBoost, (g) XGBoost, and (h) LightGBM.
Figure 13. Ranking stability of the regression algorithms across the three datasets: (a) heatmap of the individual ranks per dataset based on nRMSE; (b) bar chart of the average ranks computed across datasets.
Figure 14. General interface of the three developed inference systems based on the best-performing models: (1)—Yeh, (2)—Ke–Qiu, and (3)—Biswal.
Table 2. Description of variables in Dataset 1 (Yeh, n = 1005, after duplicate removal).
| Name | Data Type | Unit | Description |
| --- | --- | --- | --- |
| Cement | Quantitative | kg/m³ | Mass of Portland cement per cubic meter of concrete |
| Blast Furnace Slag | Quantitative | kg/m³ | Mass of ground granulated blast-furnace slag |
| Fly Ash | Quantitative | kg/m³ | Mass of fly ash used as supplementary cementitious material |
| Water | Quantitative | kg/m³ | Mixing water mass per cubic meter of concrete |
| Superplasticizer | Quantitative | kg/m³ | High-range water-reducing admixture dosage |
| Coarse Aggregate | Quantitative | kg/m³ | Mass of coarse aggregate per cubic meter |
| Fine Aggregate | Quantitative | kg/m³ | Mass of fine aggregate (sand) per cubic meter |
| Age | Quantitative | Days (1–365) | Curing time at testing |
| CS | Quantitative | MPa | Uniaxial compressive strength measured at test age |
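The sample sizes quoted in the table captions reflect duplicate removal prior to modeling (e.g., Yeh shrinks from 1030 to 1005 rows). A toy pandas sketch of that preprocessing step, using the Table 2 column names; the three example rows are illustrative only, not a claim about the actual records:

```python
import pandas as pd

# Columns follow Table 2 (Yeh dataset); rows are illustrative placeholders.
cols = ["Cement", "Blast Furnace Slag", "Fly Ash", "Water",
        "Superplasticizer", "Coarse Aggregate", "Fine Aggregate", "Age", "CS"]
rows = [
    [540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 28, 79.99],
    [540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 28, 79.99],  # exact duplicate
    [332.5, 142.5, 0.0, 228.0, 0.0, 932.0, 594.0, 270, 40.27],
]
df = pd.DataFrame(rows, columns=cols)

# Drop exact-duplicate mixes before the train/test split, then split
# the frame into features X and the compressive-strength target y.
df = df.drop_duplicates().reset_index(drop=True)
X, y = df.drop(columns="CS"), df["CS"]
print(len(df))  # 2 rows remain in this toy example
```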
Table 3. Description of variables in Dataset 2 (Ke–Qiu, n = 1618 , after duplicate removal).
| Name | Data Type | Unit | Description |
| --- | --- | --- | --- |
| Cement | Quantitative | kg/m³ | Portland cement dosage per cubic meter |
| Fine_Aggregates | Quantitative | kg/m³ | Fine aggregate (sand) mass per cubic meter |
| Coarse_Aggregates | Quantitative | kg/m³ | Coarse aggregate mass per cubic meter |
| Water | Quantitative | kg/m³ | Mixing water dosage per cubic meter |
| Water_reducing_Admixture | Quantitative | kg/m³ | Water-reducing admixture/superplasticizer dosage |
| Fly_Ash | Quantitative | kg/m³ | Fly ash dosage used as supplementary cementitious material (SCM) |
| Accelerating_Agent | Quantitative | kg/m³ | Set-accelerating admixture dosage |
| Silica_Fume | Quantitative | kg/m³ | Silica fume dosage used as SCM |
| Time | Quantitative | Days | Curing time (age) at compressive testing |
| Strength | Quantitative | MPa | Compressive strength measured at test age |
Table 4. Description of variables in Dataset 3 (Biswal, n = 185 , after duplicate removal).
| Name | Data Type | Unit | Description |
| --- | --- | --- | --- |
| cement | Quantitative | kg/m³ | Portland cement dosage per cubic meter |
| flyash | Quantitative | kg/m³ | Fly ash dosage used as SCM |
| GGBS | Quantitative | kg/m³ | Ground-granulated blast-furnace slag dosage (SCM) |
| MK | Quantitative | kg/m³ | Metakaolin dosage used as SCM |
| TCM | Quantitative | kg/m³ | Total cementitious materials (cement + SCMs) per cubic meter |
| water | Quantitative | kg/m³ | Mixing water mass per cubic meter |
| water_TCM | Quantitative | – (ratio) | Water-to-binder ratio (water/TCM) |
| SP | Quantitative | kg/m³ | Superplasticizer dosage |
| VMA | Quantitative | kg/m³ | Viscosity-modifying admixture dosage |
| NCA_20_DOWN | Quantitative | kg/m³ | Natural coarse aggregate < 20 mm (mass per m³) |
| NCA_10_DOWN | Quantitative | kg/m³ | Natural coarse aggregate < 10 mm (mass per m³) |
| RCA_20_DOWN | Quantitative | kg/m³ | Recycled coarse aggregate < 20 mm (mass per m³) |
| RCA_10_DOWN | Quantitative | kg/m³ | Recycled coarse aggregate < 10 mm (mass per m³) |
| SAND | Quantitative | kg/m³ | Fine aggregate (sand) mass per cubic meter |
| AGE | Quantitative | Days | Curing time at testing |
| CS | Quantitative | MPa | Compressive strength of recycled-aggregate concrete at test age |
Table 5. Summary of the developed inference systems, indicating the best-performing model, number of input variables, concrete type, and lowest RMSE obtained on the test set. All systems were implemented in Google Colab.
| System | Dataset | Model | # Variables | Concrete Type | RMSE (MPa) |
| --- | --- | --- | --- | --- | --- |
| 1 | Yeh | CatBoost | 8 | Conventional/HPC | 3.71 |
| 2 | Ke–Qiu | XGBoost | 9 | Normal concrete | 3.88 |
| 3 | Biswal | LightGBM | 15 | Recycled concrete | 3.83 |
Table 6. Performance metrics of the eight regression algorithms for Dataset 1 (Yeh, n = 1005 ).
| Model | RMSE (MPa) | MAE (MPa) | MAPE (%) | R² | nRMSE (%) |
| --- | --- | --- | --- | --- | --- |
| Linear Regression | 9.68 | 7.59 | 30.72 | 0.618 | 61.81 |
| Random Forest | 4.36 | 3.18 | 11.83 | 0.922 | 27.84 |
| SVR | 5.51 | 4.00 | 13.79 | 0.876 | 35.20 |
| MLP | 4.23 | 2.97 | 9.67 | 0.927 | 27.02 |
| KNN | 7.66 | 5.63 | 21.99 | 0.761 | 48.91 |
| CatBoost | 3.71 | 2.74 | 9.85 | 0.944 | 23.69 |
| XGBoost | 3.95 | 2.97 | 10.54 | 0.936 | 25.20 |
| LightGBM | 3.73 | 2.64 | 8.98 | 0.943 | 23.83 |
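The five metrics in Tables 6–8 can be computed as follows. The normalization behind nRMSE is not restated in this section, but dividing RMSE by the standard deviation of the observed test values reproduces the relation R² ≈ 1 − (nRMSE/100)² that holds throughout the tables; that assumption is flagged in the code:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """RMSE, MAE, MAPE (%), R2, and nRMSE (%) as reported in Tables 6-8.

    Assumption: nRMSE divides RMSE by the (population) standard deviation
    of the observed values, which implies R2 = 1 - (nRMSE/100)**2 and
    matches the published numbers.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
    r2 = r2_score(y_true, y_pred)
    nrmse = 100.0 * rmse / y_true.std()
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2, "nRMSE": nrmse}

# Small worked example with made-up strengths (MPa).
report = regression_report([30.0, 40.0, 50.0, 60.0],
                           [32.0, 38.0, 51.0, 59.0])
```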
Table 7. Performance metrics of the eight regression algorithms for Dataset 2 (Ke–Qiu, n = 1618 ).
| Model | RMSE (MPa) | MAE (MPa) | MAPE (%) | R² | nRMSE (%) |
| --- | --- | --- | --- | --- | --- |
| Linear Regression | 6.84 | 4.34 | 21.88 | 0.711 | 53.75 |
| Random Forest | 4.34 | 3.05 | 12.55 | 0.884 | 34.09 |
| SVR | 5.71 | 3.33 | 16.92 | 0.799 | 44.84 |
| MLP | 4.85 | 3.33 | 13.35 | 0.855 | 38.11 |
| KNN | 5.38 | 3.41 | 16.03 | 0.822 | 42.23 |
| CatBoost | 4.13 | 2.76 | 11.83 | 0.895 | 32.41 |
| XGBoost | 3.88 | 2.66 | 10.61 | 0.907 | 30.47 |
| LightGBM | 4.17 | 2.83 | 11.03 | 0.893 | 32.72 |
Table 8. Performance metrics of the eight regression algorithms for Dataset 3 (Biswal, n = 185 ).
| Model | RMSE (MPa) | MAE (MPa) | MAPE (%) | R² | nRMSE (%) |
| --- | --- | --- | --- | --- | --- |
| Linear Regression | 7.40 | 6.20 | 23.67 | 0.843 | 39.64 |
| Random Forest | 6.57 | 5.41 | 24.10 | 0.876 | 35.19 |
| SVR | 5.45 | 3.93 | 16.48 | 0.915 | 29.22 |
| MLP | 5.06 | 3.93 | 16.56 | 0.926 | 27.14 |
| KNN | 8.39 | 6.52 | 27.50 | 0.798 | 44.94 |
| CatBoost | 3.90 | 3.01 | 12.73 | 0.956 | 20.89 |
| XGBoost | 4.29 | 3.42 | 14.17 | 0.947 | 22.99 |
| LightGBM | 3.83 | 2.97 | 12.32 | 0.958 | 20.55 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Olvera-Mayorga, C.E.; López-Martínez, M.d.J.; Rodríguez-Rodríguez, J.A.; Vázquez-Reyes, S.; Solís-Sánchez, L.O.; de la Rosa-Vargas, J.I.; Duarte-Correa, D.; González-Aviña, J.V.; Olvera-Olvera, C.A. AI-Based Inference System for Concrete Compressive Strength: Multi-Dataset Analysis of Optimized Machine Learning Algorithms. Appl. Sci. 2025, 15, 12383. https://doi.org/10.3390/app152312383

