Influence of Data Structure on Prediction Error in Machine Learning-Based Concrete Compressive Strength Models

Mo, Yelan; Li, Bixiong; Yan, Chengcheng; Hu, Xiangxin

doi:10.3390/buildings16081537

by Yelan Mo, Bixiong Li, Chengcheng Yan * and Xiangxin Hu

College of Architecture and Environment, Sichuan University, Chengdu 610065, China

* Author to whom correspondence should be addressed.

Buildings 2026, 16(8), 1537; https://doi.org/10.3390/buildings16081537

Submission received: 3 March 2026 / Revised: 27 March 2026 / Accepted: 31 March 2026 / Published: 14 April 2026

(This article belongs to the Special Issue Development and Structural Applications of Green High-Performance Fiber-Reinforced Concrete)

Abstract

Machine learning has been widely used for concrete compressive strength prediction, yet previous studies have focused mainly on algorithm comparison and isolated feature-processing strategies. The coupled influence of dataset characteristics on prediction error has received less systematic attention. This study investigates concrete strength prediction from a data structure perspective by examining three structural variables, namely, sample size, feature size, and compressive strength range. A unified experimental framework was constructed using 15 concrete datasets. Correlation, partial correlation, information entropy, and relief were employed to reorganize feature subsets, and the resulting error trends were evaluated using artificial neural network (ANN), support vector regression (SVR), and random forest (RF) models. The results show that prediction error generally decreases first and then becomes stable as feature size increases, although the location of the low-error region depends on the dataset and the filtering method. Larger sample size is associated with improved prediction stability, whereas wider strength range tends to increase prediction difficulty. Based on these observations, an empirical relationship was established to describe the joint effect of sample size, feature size, and strength range on prediction error. The findings indicate that the attainable error level in concrete strength prediction is controlled not only by model form but also by dataset organization and feature configuration. Within the present framework, the study provides a practical basis for designing feature systems and interpreting model performance across datasets with different structural characteristics.

Keywords: feature system; data structure analysis; concrete compressive strength prediction; artificial intelligence; machine learning

1. Introduction

Concrete compressive strength (CS) prediction is a typical nonlinear regression problem in civil engineering. With the development of artificial intelligence, machine learning methods have gradually become important tools for concrete strength prediction and mixture-related decision support [1,2,3]. Early studies can be traced back to the perceptron-based engineering applications of Adeli and Yeh [4] and the ANN model for high-performance concrete proposed by Yeh [5], which laid the foundation for subsequent data-driven prediction studies. Since then, ANN, SVR, RF, and other models have been widely applied to predict the mechanical properties of concrete and related cementitious materials [5,6,7,8,9,10,11,12]. These studies generally show that machine learning can achieve satisfactory predictive accuracy when the dataset quality and feature system are appropriate [2,13].

However, recent studies also indicate that increasing algorithmic complexity does not necessarily lead to a proportional reduction in prediction error. Comparative analyses have shown that, under ordinary dataset sizes, the performance differences among ANN, SVR, and RF are often limited [14]. In parallel, dimensionality-reduction studies have reported that the blind introduction of PCA-like transformations may fail to improve model performance and may even reduce accuracy in some cases [13,15,16,17]. These findings suggest that attainable prediction accuracy is constrained not only by algorithm selection but also by the way information is organized within the dataset.

In machine learning, feature engineering is widely regarded as an important step in shaping an effective data space [18,19]. Earlier studies showed that correlation structure, redundancy, and nonlinear projection of input variables can affect model stability and generalization [16,20,21,22,23,24]. Around this issue, feature selection has developed into a relatively mature methodological system, including filter, wrapper, embedded, ensemble, unsupervised, weighting, and evolutionary approaches [25,26,27,28,29,30,31]. Classical ranking strategies such as correlation-based methods, information-based methods, and relief-family algorithms remain widely used because they can characterize variable relevance while preserving the physical meaning of engineering parameters [22,32,33,34]. In concrete and structural engineering, these methods have been applied not only to compressive strength prediction [35,36,37,38,39] but also to shear capacity, bond strength, and other resistance problems [40,41,42,43,44]. Nevertheless, the existing literature still tends to discuss these methods mainly as preprocessing tools, and less attention has been given to their role in revealing dataset-dependent error behavior.

For concrete compressive strength prediction, most studies still rely primarily on mixture proportion variables [45], whereas some have extended the input system to chemical composition or higher-dimensional descriptors [46,47]. Recent results suggest that moderate feature expansion may be beneficial, but redundant variables do not continuously improve performance [48]. Interpretable machine learning studies have further shown that the contribution of input variables depends on data distribution and variable organization [49,50]. At the same time, published concrete datasets usually contain from several hundred to about one thousand samples, with feature dimensions commonly ranging from approximately 5 to 15 [15,42,43,51,52,53,54,55,56,57]. Within this scale range, different feature-ranking strategies can produce different optimized subsets even on the same dataset [15,53], and such differences become more pronounced across datasets. Similar structure-dependent behavior has also been reported in concrete creep modeling [57,58]. These observations indicate that prediction error is jointly affected by dataset composition, target value distribution, and feature space organization rather than by model form alone. Yet previous studies have not systematically examined the coupled effects of sample size, feature size, and strength range under a unified framework.

Accordingly, this study investigates concrete strength prediction from a data structure perspective using 15 datasets under a unified framework. Here, data structure refers to the organization of the dataset as reflected by sample size, feature size, and compressive strength range, together with the resulting variation in feature subset composition and target value distribution. The study compares error responses under different feature organizations and model settings and further establishes an empirical relationship between prediction error and these three structural variables. The novelty of this work lies not in proposing another prediction algorithm but in quantifying how key dataset characteristics jointly delimit the attainable error level in concrete strength prediction. This provides a basis for interpreting prediction error from the viewpoint of dataset organization. The overall experimental framework, including feature set optimization and model-based prediction procedures, is illustrated in Figure 1.

2. Data and Feature Structure Analysis

2.1. Dataset and Structural Characteristic Analysis

To analyze how data structure influences the prediction performance of 28-day concrete compressive strength, this study selected 15 concrete datasets from different sources. In this study, data structure is described from two related levels. The first level is dataset-level structure, including sample size, feature size, and strength range. The second level is feature space structure, including dependency, redundancy, and information distribution among input variables. Section 2.1 focuses on the first level, while Section 2.2 further analyzes the second level. Among the 15 datasets, data0 comes from actual engineering production data. It contains 7083 groups of material mix proportions and corresponding 28-day compressive strength values of 100 mm × 100 mm × 100 mm concrete specimens. There are 17 input features in this dataset. The other datasets, data1–data14, are collected from the published literature. They include different types of concrete, such as normal concrete, high-performance concrete, self-compacting concrete, ultra-high-performance concrete, and recycled concrete. The sample size, number of features, and strength range are clearly different among these datasets. Detailed information is shown in Table 1.

As shown in Table 1, there are large differences in sample size among the datasets. The smallest dataset contains only several dozen samples, while the largest one includes more than 500 samples. In terms of feature dimension, most datasets contain 5–11 features, while some datasets include more material composition variables. Regarding strength distribution, the minimum strength is about 3 MPa, and the maximum strength reaches 240 MPa. The strength range varies greatly. Sample size determines the statistical stability of model parameter estimation. Feature dimension affects the model’s expression ability and complexity. The strength range directly influences the dispersion level of the prediction problem. Therefore, different datasets have natural structural differences. These differences may have an important influence on prediction error.

It should be noted that the purpose of this study is not to treat all 15 datasets as one fully homogeneous concrete population. Instead, they are regarded as structurally different cases analyzed under the same feature-ranking framework and the same prediction protocol in later sections. In this sense, dataset heterogeneity is not simply noise to be removed but part of the research object itself, because the study aims to examine how prediction error responds to different combinations of sample size, feature size, and target strength range. At the same time, since data1–data14 are collected from different literature sources, differences in raw-material systems, specimen preparation, curing conditions, and testing procedures may introduce additional variability. This source-related variability may affect absolute comparability across datasets and should, therefore, be regarded as a limitation when interpreting the results.

To further illustrate the internal statistical characteristics of concrete data in feature space, data0 was taken as an example for detailed analysis. This dataset was selected because it contains a relatively large sample size and a richer feature system, which makes it suitable for showing typical distribution patterns and variable coupling characteristics in engineering concrete data. Statistical analysis was carried out on its 17 input features and the target variable. Figure 2 shows the kernel density distribution of each feature. Table 2 lists the statistical parameters of each feature, including mean, standard deviation, variance, skewness, kurtosis, minimum, maximum, and range. From Figure 2, it can be observed that most features show clear asymmetry, and some variables have long-tail distributions. The skewness and kurtosis values in Table 2 further indicate that several material dosage variables have obvious skewed distributions. In particular, some admixture or special material variables have zero values in most samples and appear as non-zero only in a small number of samples. This shows a typical sparse distribution pattern.
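
The statistical parameters listed for each feature can be computed along the lines of the following minimal sketch. It assumes population (divide-by-n) moment estimators and non-excess kurtosis, since the text does not state which estimators were used, and the `admixture` values are hypothetical and only mimic the sparse pattern described above.

```python
import math

def describe(values):
    """Return basic distribution statistics for one feature column."""
    n = len(values)
    mean = sum(values) / n
    # Central moments in population form (dividing by n): an assumption,
    # the source does not say whether bias-corrected estimators were used.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "std": std,
        "variance": m2,
        "skewness": m3 / std ** 3 if std > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,  # non-excess form
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

# Hypothetical sparse admixture dosages: zero in most samples.
admixture = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 20.0]
stats = describe(admixture)
```

For a column of this kind the skewness is positive, which is consistent with the long right tail expected when a material is used in only a few mixes.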

Such distribution characteristics indicate that concrete material data are not ideally independent and identically distributed. Instead, they show clear engineering mix constraints and material coupling characteristics. For example, under a fixed water-to-binder ratio, there is a definite relationship between water content and binder content. There is also a proportional substitution relationship between mineral admixtures and cement content. These observations suggest that the feature space of concrete datasets is organized by both physical constraints and mixture design rules rather than by independent variation in each variable. As a result, strong correlations, redundancy, and local sparsity may naturally appear among features, which can influence the stability of parameter estimation and the final prediction performance. Accordingly, the cross-dataset comparison in this study should be interpreted as a structural comparison of different cases under a unified protocol rather than as a comparison within one perfectly homogeneous population.

2.2. Methods for Feature Space Structure Analysis

After confirming that the datasets have clear differences in sample size, number of features, and strength range, it is necessary to further analyze the structural relationships of input features in high-dimensional space. The compressive strength of concrete is influenced by the coupling effect of many material components. There are usually proportional constraints and combination relationships among features, so they are not independent from each other. The correlation, redundancy, and information contribution among features will directly affect the stability of model parameter estimation and prediction accuracy.

In earlier studies, correlation analysis, partial correlation analysis, information entropy, and relief were mainly used as feature selection or feature-ranking methods. In this study, these four methods are used from a different perspective: they are treated as complementary descriptors of feature space structure. Correlation and partial correlation mainly reflect dependency and net dependency among variables. Information entropy reflects the information contribution and uncertainty reduction ability of variables. Relief reflects the local discriminative ability of variables in the neighborhood of samples. Although these four methods do not exhaust all possible structural properties of high-dimensional data, together they provide interpretable information from the perspectives of linear association, conditional association, information contribution, and local sample sensitivity. Therefore, they are sufficient for the comparative structural analysis intended in this study. Accordingly, the following subsections briefly introduce the four analytical methods used in this study.

2.2.1. Correlation

The correlation coefficient is a statistical metric that expresses how strongly two variables are related. In compressive strength analysis, the Pearson correlation coefficient, as shown in Equation (1), is usually used. In Equation (1), $x_{i}$ and $y_{i}$ are the paired observations of the two variables, $\bar{x}$ and $\bar{y}$ are their respective means, and $r$ is the correlation coefficient, which ranges from −1 to 1. A positive value indicates that the two variables are positively related, and a negative value indicates that they are negatively related. A larger absolute value indicates a stronger linear relationship.

r = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2}} \sqrt{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}}

(1)
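
Equation (1) can be evaluated directly, as in the following minimal sketch. The water-to-binder ratios and strength values are hypothetical illustration data, not values taken from any of the 15 datasets.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, term by term as in Equation (1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

# Hypothetical mixes: strength falls as the water-to-binder ratio rises.
wb_ratio = [0.30, 0.35, 0.40, 0.45, 0.50]
strength = [62.0, 55.0, 47.0, 41.0, 35.0]
r = pearson_r(wb_ratio, strength)
```

In this toy case `r` is close to −1, which reflects a strong negative linear association between the two variables.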

Correlation analysis is used here to describe the direct linear association between each feature and compressive strength as well as the pairwise dependency among input variables. For concrete datasets, this is meaningful because many mix-design variables are linked by engineering constraints. For example, under a relatively fixed water-to-binder ratio, water content and binder content often change in a coordinated manner. In such cases, two variables may both be related to compressive strength, while also showing strong mutual dependence. Therefore, correlation analysis helps identify whether the feature space is dominated by a few strongly associated variables or contains obvious redundant information. From the perspective of feature space structure, it mainly reflects the degree of linear dependency and concentration of information among variables.

2.2.2. Partial Correlation

The strength limit of concrete results from many combined influencing factors, and CS is a manifestation of the compressive bearing capacity of concrete under the coupling of these factors. Because the relationships among variables are complex and each variable may be affected by more than one other variable, the ordinary correlation coefficient may not adequately describe the association between factors in the analysis of compressive strength. The partial correlation coefficient is a preferable option: it measures the correlation between two variables while controlling for the effects of the other variables, as shown in Equation (2). Here, $r_{i j \cdot l_{1} l_{2} l_{3} \dots l_{g}}$ denotes the partial correlation coefficient between variables $i$ and $j$ under the control variables $l_{1}, l_{2}, l_{3}, \dots, l_{g}$.

r_{i j \cdot l_{1} l_{2} l_{3} \dots l_{g}} = \frac{r_{i j \cdot l_{1} l_{2} \dots l_{g - 1}} - r_{i l_{g} \cdot l_{1} l_{2} \dots l_{g - 1}} r_{j l_{g} \cdot l_{1} l_{2} \dots l_{g - 1}}}{\sqrt{(1 - r_{i l_{g} \cdot l_{1} l_{2} \dots l_{g - 1}}^{2}) (1 - r_{j l_{g} \cdot l_{1} l_{2} \dots l_{g - 1}}^{2})}}

(2)
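
The recursion in Equation (2) can be sketched as follows: the coefficient with g control variables is built from coefficients with g − 1 control variables, down to the ordinary Pearson coefficient. The column layout and the synthetic data (two variables driven by a shared third variable) are assumptions made only for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Ordinary Pearson correlation, used as the base case of the recursion."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

def partial_corr(columns, i, j, controls):
    """Partial correlation of columns i and j given the control columns."""
    if not controls:
        return pearson_r(columns[i], columns[j])
    *rest, g = controls  # peel off the last control variable l_g
    r_ij = partial_corr(columns, i, j, rest)
    r_ig = partial_corr(columns, i, g, rest)
    r_jg = partial_corr(columns, j, g, rest)
    return (r_ij - r_ig * r_jg) / math.sqrt((1 - r_ig ** 2) * (1 - r_jg ** 2))

# Synthetic example: x and y are both driven by the shared variable z.
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(200)]
x = [zi + rng.gauss(0.0, 0.5) for zi in z]
y = [zi + rng.gauss(0.0, 0.5) for zi in z]
columns = [x, y, z]
raw = partial_corr(columns, 0, 1, [])   # ordinary correlation of x and y
net = partial_corr(columns, 0, 1, [2])  # correlation after controlling for z
```

Here `raw` is large because both variables follow `z`, whereas `net` is much smaller once `z` is controlled for. The recursion is exponential in the number of control variables, so it is suitable only for small illustrations.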

Compared with ordinary correlation, partial correlation is more suitable for describing the net relationship between variables in multivariate concrete systems. When many material factors act together, the observed correlation between one variable and compressive strength may partly come from the indirect influence of other variables. Partial correlation reduces this interference by controlling the effects of the remaining variables. Therefore, it is useful for identifying whether an apparently important variable still keeps a strong direct association with the target after the common dependency structure is removed. From the perspective of feature space structure, partial correlation provides information about conditional dependency and helps distinguish direct contribution from indirect correlation. However, because its calculation depends on the invertibility of the correlation matrix, its stability may decrease for datasets in which the feature dimension is large relative to the sample size or in which strong collinearity is present.

2.2.3. Information Entropy

Information entropy can be used to evaluate how much an influencing factor reduces the uncertainty of the concrete compressive strength prediction system. If a factor has a high regression utility for the system, it provides a considerable amount of information and should be included as a key feature. Information entropy is used in many decision tree-based algorithms to evaluate the contribution of a feature, and it is calculated by Equation (3), where $t$ denotes a decision tree node at which the information gain is evaluated, $p(i|t)$ represents the fraction of samples of category $i$ at node $t$, and $c$ represents the number of categories at node $t$.

\mathrm{Entropy}(t) = - \sum_{i = 1}^{c} p(i|t) \log p(i|t)

(3)

The Gini index, given in Equation (4), can be derived from Equation (3) by replacing $\log p(i|t)$ with its first-order Taylor expansion about $p(i|t) = 1$, that is, $-\log p(i|t) \approx 1 - p(i|t)$, and neglecting the higher-order terms. The Gini index is easier to calculate and has a meaning comparable to that of information entropy, so it can be used to measure the contribution of attributes as an approximation to information entropy.

\mathrm{Gini}(t) = 1 - \sum_{i = 1}^{c} {[p(i|t)]}^{2}

(4)

The contribution of each influencing factor to concrete compressive strength is evaluated by the variation in the Gini index produced when that factor is used to split a node. The greater the variation, the larger the information gain the factor provides to the system and the greater its importance. The contribution of an influencing factor at node $t$ is calculated using Equation (5), where $\Delta \mathrm{Gini}(t)$ is the variation in the Gini index, and $\mathrm{Gini}(t)_{l}$ and $\mathrm{Gini}(t)_{r}$ are the Gini indices of the two new nodes after the bifurcation.

\Delta \mathrm{Gini}(t) = \mathrm{Gini}(t) - \mathrm{Gini}(t)_{l} - \mathrm{Gini}(t)_{r}

(5)
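
Equations (3)–(5) can be illustrated with the short sketch below. It assumes the natural logarithm (the text does not specify the base) and uses hypothetical strength-class labels. The split criterion follows Equation (5) exactly as written, that is, without weighting the child nodes by their sample share, although tree implementations commonly apply such weights.

```python
import math
from collections import Counter

def class_fractions(labels):
    """Fractions p(i|t) of each category among the samples at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    """Equation (3), using the natural logarithm (an assumption)."""
    return -sum(p * math.log(p) for p in class_fractions(labels))

def gini(labels):
    """Equation (4)."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

def gini_decrease(parent, left, right):
    """Equation (5) as written in the text (unweighted child nodes)."""
    return gini(parent) - gini(left) - gini(right)

# Hypothetical node with two strength classes, split into two pure children.
parent = ["low", "low", "high", "high"]
left, right = ["low", "low"], ["high", "high"]
delta = gini_decrease(parent, left, right)
```

For this perfectly separating split, the parent node has a Gini index of 0.5 and both children have a Gini index of 0, so the whole impurity of the parent is removed.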

2.2.4. Relief

Relief is a family of algorithms. The original Relief algorithm was designed to determine feature weights for binary classification tasks. The RReliefF algorithm, developed subsequently, was designed specifically for regression problems with continuous target variables. RReliefF, which is widely used in the field of feature selection, computes feature weights based on the ability of features to differentiate neighboring samples [33]. According to the literature surveyed above, many researchers apply RReliefF for feature selection before establishing compressive strength prediction systems. The algorithm first randomly selects a sample point in the sample space, updates the feature weights according to the response values of its nearest neighbors, and then repeats this update over many iterations. As a result, the final feature weights depend on the number of iterations. A full implementation of the RReliefF algorithm is available in the literature [34]; hence, it is not discussed further here.
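
For orientation only, the sampling-and-update loop described above can be sketched as follows. This is a simplified variant and not the implementation referenced in [34]: it assumes numeric features, a Manhattan distance on range-scaled features, and equal weighting of the k nearest neighbors instead of rank-based weighting. The function name, parameters, and synthetic data are hypothetical.

```python
import random

def rrelieff(X, y, n_iter=100, k=10, seed=0):
    """Simplified RReliefF weights (uniform neighbor weighting)."""
    rng = random.Random(seed)
    n, n_feat = len(X), len(X[0])
    # Value ranges used to scale every difference into [0, 1].
    f_range = [(max(r[a] for r in X) - min(r[a] for r in X)) or 1.0
               for a in range(n_feat)]
    y_range = (max(y) - min(y)) or 1.0

    def diff_feat(a, i, j):
        return abs(X[i][a] - X[j][a]) / f_range[a]

    def distance(i, j):
        return sum(diff_feat(a, i, j) for a in range(n_feat))

    n_dc = 0.0                  # accumulated difference in the target
    n_da = [0.0] * n_feat       # accumulated difference in each feature
    n_dcda = [0.0] * n_feat     # accumulated joint difference
    for _ in range(n_iter):
        i = rng.randrange(n)    # randomly selected sample
        others = [j for j in range(n) if j != i]
        neighbors = sorted(others, key=lambda j: distance(i, j))[:k]
        for j in neighbors:
            d_y = abs(y[i] - y[j]) / y_range
            n_dc += d_y / k
            for a in range(n_feat):
                d_a = diff_feat(a, i, j)
                n_da[a] += d_a / k
                n_dcda[a] += d_y * d_a / k
    # Assumes the target is not constant, so n_dc > 0.
    return [n_dcda[a] / n_dc - (n_da[a] - n_dcda[a]) / (n_iter - n_dc)
            for a in range(n_feat)]

# Synthetic check: the target depends on feature 0 only; feature 1 is noise.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [20.0 + 60.0 * row[0] for row in X]
weights = rrelieff(X, y)
```

In this synthetic case the informative feature receives a larger weight than the noise feature. Changing `n_iter` or `seed` changes the weights slightly, which illustrates the dependence on the number of iterations noted above.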

2.3. Analysis of Feature Structure Differences Among Different Datasets

After introducing the four analytical perspectives of feature space structure, the feature weight results of each dataset were further used to compare the structural differences among datasets. In this section, the ranking results are not interpreted merely as a basis for feature subset generation but as evidence of how information concentration, dependency pattern, and structural complexity vary across datasets. The feature weights of data0 under the four analysis methods are shown in Table 3, and the feature weight distributions of different datasets under different methods are shown in Figure 3.

From Table 3, it can be seen that in the data0 dataset, the ranking of feature weights obtained by different methods shows some differences, but the overall trend is consistent. For example, cement content, water–binder ratio, and some mineral admixture variables have relatively high weights under several methods. This indicates that these variables have a stable influence on compressive strength prediction. At the same time, some variables show different importance rankings under different methods. For example, a feature with a high weight in correlation analysis may have a lower weight in information entropy or relief analysis. This difference indicates that different methods characterize the feature space structure from distinct perspectives. Correlation and partial correlation focus more on linear dependency, while information entropy and relief reflect the role of features in distribution changes or local structure.

Figure 3 further shows the stacked feature weight results of data1–data14 under the four methods. From the overall trend, there are clear differences in feature weight distribution among different datasets. For datasets with large sample size and relatively concentrated strength distribution, the feature weights usually show a more concentrated pattern. A few key variables occupy a high proportion of weight, while the other variables have low weights. This indicates that in such datasets, strength variation is mainly controlled by a few core factors.

For datasets with a wide strength range or clear differences in material type, the feature weight distribution is more dispersed. The differences among feature weights are relatively small. This means that strength response is affected by multiple factors together, and the feature space structure is more complex.

In addition, there are structural differences among different types of concrete datasets. In normal concrete datasets, the water–binder ratio and binder-related variables usually have higher weights. In high-strength or ultra-high-performance concrete datasets, mineral admixtures and steel fiber content show higher importance. This difference reflects that changes in material system will lead to changes in feature structure.

It can also be observed from Figure 3 that when the number of features in a dataset is large, the weight distribution tends to be more dispersed, and the differences among feature weights become smaller. When the number of features is small, the weight distribution is more concentrated. This phenomenon shows that feature dimension influences the weight distribution pattern in feature space. When the number of features increases, information is distributed among more variables, and the weight of each single feature becomes relatively smaller. When the number of features is small, a few variables carry more information.

Based on the results of Table 3 and Figure 3, it can be concluded that there are clear differences in feature space structure among different datasets. These differences are mainly reflected in the degree of weight concentration, the type of key variables, and the consistency of ranking under different methods. These structural differences provide a basis for further analysis of prediction error under different data conditions.

3. Modeling Methods

After completing the analysis of dataset-level structure and feature space structure, prediction models were established under a unified experimental protocol. The purpose of introducing machine learning models in this study is not to identify the best algorithm for concrete strength prediction but to examine whether the influence of data structure on prediction error remains stable across different learning mechanisms. In this way, the variation in model error under different data feature structure conditions can be analyzed more clearly.

The 28-day compressive strength of concrete is influenced by the coupling effect of many material factors. The input variables usually have correlations and structural constraints. Different models may respond differently to such feature structures. Therefore, comparing the prediction performance of different types of models under the same data conditions helps to analyze the role of data structure factors in the formation of prediction error.

To ensure comparability among models while avoiding excessive algorithmic complexity, this study selected three representative models: artificial neural network (ANN), support vector regression (SVR), and random forest (RF). These three models represent parameterized nonlinear mapping, kernel-based regression, and tree ensemble learning, respectively. They were chosen because they differ clearly in learning mechanism while remaining interpretable and widely used in concrete strength prediction. More complex stacked or hybrid ensemble models were not introduced in this study because the objective here is not algorithm development but controlled comparison of model responses to data structure. Introducing additional high-complexity models may increase predictive flexibility but it may also weaken the interpretability of the structural analysis.

The ANN establishes the mapping between input and output through weighted connections and activation functions among multiple neurons. Because its first layer forms a linear weighted combination of the input variables, it remains sensitive to feature structure while gaining nonlinear expressive ability through the activation functions.

SVR constructs an optimal regression hyperplane and uses a kernel function to map input variables into a high-dimensional feature space. In this way, it can describe nonlinear relationships while controlling model complexity.

RF is based on a decision tree ensemble mechanism. It builds multiple tree models using random feature subsets and random sample subsets. The final result is obtained by averaging the outputs. This can improve prediction stability and reduce the dependence of a single model on feature correlation.

The three models have differences in structure and expression mechanism. ANN focuses on parameterized function mapping. SVR emphasizes optimal margin and linear representation in kernel space. RF relies on tree splitting rules and ensemble averaging.

By comparing the prediction error trends of the three models on different datasets under the same data division and evaluation criteria, the influence of sample size, feature number, and strength range on model performance can be analyzed. This provides a basis for further analysis of error patterns.

3.1. Artificial Neural Network Model

An artificial neural network (ANN) is a feedforward nonlinear model built from interconnected neurons. Its basic structure includes an input layer, a hidden layer, and an output layer. In this study, a single hidden layer network is adopted. The number of nodes in the input layer is equal to the number of features. The output layer has one node, which is used to predict the 28-day compressive strength of concrete.

The output of the hidden layer neuron can be written as

h_{j} = f \left( \sum_{i} w_{i j} x_{i} + b_{j} \right)

(6)

The output layer is calculated as

\hat{y} = \sum_{j} v_{j} h_{j} + c

(7)

where $x_{i}$ is the input feature, $w_{i j}$ is the weight from the input layer to the hidden layer, $b_{j}$ is the bias of the hidden layer, $v_{j}$ is the weight from the hidden layer to the output layer, $c$ is the bias of the output layer, and $f(\cdot)$ is the activation function.
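
The forward pass defined by Equations (6) and (7) can be written out as in the sketch below. The `tanh` default activation and all weight values are assumptions for illustration, since the text does not specify the activation function at this point.

```python
import math

def ann_forward(x, W, b, v, c, f=math.tanh):
    """Single-hidden-layer forward pass.

    W[i][j] is the weight from input i to hidden node j (Equation (6));
    v[j] is the weight from hidden node j to the output (Equation (7)).
    """
    n_hidden = len(b)
    hidden = [
        f(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
        for j in range(n_hidden)
    ]
    return sum(v[j] * hidden[j] for j in range(n_hidden)) + c

# Hypothetical network with two input features and two hidden nodes.
x = [1.0, 2.0]
W = [[1.0, 0.0],
     [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]
c = 0.5
y_linear = ann_forward(x, W, b, v, c, f=lambda s: s)  # identity activation
y_tanh = ann_forward(x, W, b, v, c)                   # tanh activation
```

With the identity activation the network reduces to a linear weighted sum of the inputs, which makes the linear first layer discussed in the text explicit.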

The network is trained by the backpropagation algorithm. The parameters are updated by minimizing the mean squared error. Although the neural network has nonlinear expression ability, its first layer is still a linear weighted combination of input variables. Therefore, the model is sensitive to the correlation structure among features. When there is multicollinearity among features or when the sample size is small, the weight updating process may become unstable, which may affect prediction performance.

In this study, ANN is used as a representative model of parameterized nonlinear mapping. Its role is not only to provide a benchmark of predictive accuracy but also to observe how a gradient-based nonlinear model responds to changes in feature size and feature dependency. The model parameters, including hidden layer size, learning rate, and maximum iteration number, were determined under the same cross-validation principle used for the other models. The final hyperparameter values are listed in Table 4. Data preprocessing and dataset division were kept consistent across all datasets and all models to ensure that the observed error differences mainly reflect data structure conditions rather than inconsistent training settings.

To maintain methodological consistency, all three models were trained and evaluated under the same experimental logic. For each dataset and each feature subset, model hyperparameters were determined by cross-validation within the training procedure, and the final parameter settings were selected according to prediction stability and error performance. The purpose of this strategy was not to obtain the absolute optimum configuration for every possible case but to ensure that all models were tuned under a comparable optimization principle. Therefore, the final comparison in this study focuses on the response of the three model types to data structure variables rather than on aggressive model-specific optimization.
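
The cross-validation principle described above can be sketched generically as follows. The number of folds, the error metric (RMSE), the candidate grid, and the nearest-neighbor stand-in model are all assumptions for illustration; they are not the settings reported in Table 4, and the stand-in is not one of the three models used in this study.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Return (train, validation) index lists for k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    return [([i for i in idx if i not in set(fold)], fold) for fold in folds]

def cv_rmse(fit, predict, X, y, params, k=5):
    """Mean validation RMSE of one hyperparameter setting over k folds."""
    errors = []
    for train, val in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        squared = [(predict(model, X[i]) - y[i]) ** 2 for i in val]
        errors.append((sum(squared) / len(squared)) ** 0.5)
    return sum(errors) / len(errors)

def select_params(fit, predict, X, y, grid, k=5):
    """Pick the candidate setting with the lowest cross-validated RMSE."""
    return min(grid, key=lambda p: cv_rmse(fit, predict, X, y, p, k))

# Stand-in model: one-dimensional nearest-neighbor regression.
def knn_fit(X, y, n_neighbors):
    return (X, y, n_neighbors)

def knn_predict(model, x):
    X, y, n_neighbors = model
    nearest = sorted(range(len(X)), key=lambda i: abs(X[i] - x))[:n_neighbors]
    return sum(y[i] for i in nearest) / len(nearest)

# Hypothetical data: strength decreasing with water-to-binder ratio.
X = [0.30 + 0.01 * i for i in range(30)]
y = [80.0 - 100.0 * xi for xi in X]
grid = [{"n_neighbors": 1}, {"n_neighbors": 3}, {"n_neighbors": 5}]
best = select_params(knn_fit, knn_predict, X, y, grid)
```

Because every candidate is scored on the same folds, the same selection routine can be reused for any model that exposes comparable `fit` and `predict` functions, which is the sense in which the tuning principle is kept comparable across models.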

3.2. Support Vector Machine Model

Support vector regression (SVR) constructs an optimal regression hyperplane to build the function mapping between input variables and output variables. Its optimization problem can be written as

\min_{w, b, ξ_{i}, ξ_{i}^{*}} \frac{1}{2} ∥ w ∥^{2} + C \sum_{i = 1}^{n} (ξ_{i} + ξ_{i}^{*})

(8)

and is subject to the following constraints:

\begin{array}{l} y_{i} - (w \cdot x_{i} + b) \leq ε + ξ_{i} \\ (w \cdot x_{i} + b) - y_{i} \leq ε + ξ_{i}^{*} \\ ξ_{i} \geq 0, ξ_{i}^{*} \geq 0 \end{array}

(9)

Here, w is the weight vector, b is the bias term, C is the penalty parameter, ε is the error tolerance interval, and ξ_{i} and ξ_{i}^{*} are the slack variables.

The parameters C and ε are determined by cross-validation. SVR maps the input variables into a high-dimensional space through a kernel function. Its regression function is still linear in the high-dimensional space. Therefore, its performance is still influenced by the structural characteristics of the data, especially the adequacy of sample size, the concentration of feature information, and the complexity of the target distribution.
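To make the roles of C, ε, and the slack variables concrete, the following Python sketch evaluates the objective of Equation (8) for a given linear model, using the smallest slack values that satisfy the constraints in Equation (9). It is a sketch under stated assumptions: it works in the input space without a kernel, it does not solve the optimization problem, and the function name and numbers are illustrative only.

```python
def svr_primal_objective(w, b, X, y, C, epsilon):
    """Value of 0.5 * ||w||^2 + C * sum(xi_i + xi_i*) for a linear model.

    For each sample, the slack is the part of the residual that lies
    outside the epsilon-wide tolerance band (zero if inside the band).
    """
    slack_total = 0.0
    for x_i, y_i in zip(X, y):
        pred = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        xi = max(0.0, y_i - pred - epsilon)       # under-prediction slack
        xi_star = max(0.0, pred - y_i - epsilon)  # over-prediction slack
        slack_total += xi + xi_star
    return 0.5 * sum(w_j * w_j for w_j in w) + C * slack_total

# Illustrative numbers only: predictions are 3, 5, 7 for targets 3, 6, 6.5.
objective = svr_primal_objective(
    w=[2.0], b=1.0,
    X=[[1.0], [2.0], [3.0]], y=[3.0, 6.0, 6.5],
    C=10.0, epsilon=0.5,
)
```

In this example only the second sample falls outside the tolerance band (by 0.5), so the objective is 0.5 × 4 + 10 × 0.5 = 7. A larger C penalizes such violations more strongly, whereas a larger ε tolerates larger residuals without penalty.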

In this study, SVR is used as a representative model of kernel-based regression. Compared with ANN, SVR places stronger emphasis on margin control and functional regularization. The final SVR settings were selected under the same optimization principle used for ANN and RF so that the observed error differences could be interpreted primarily from the perspective of data structure rather than model-specific tuning intensity.

3.3. Random Forest Model

Random forest (RF) is an ensemble learning model based on decision trees. It builds multiple decision trees and combines their results to improve prediction stability. The RF model generates several subsets by random sampling. A decision tree is trained on each subset. The final prediction result is obtained by averaging the outputs of all trees.

For regression problems, the prediction value of random forest is the average value of all decision tree outputs:

\hat{y} = \frac{1}{T} \sum_{t = 1}^{T} f_{t} (x)

(10)

Here, T is the number of decision trees, and f_{t}(x) is the prediction output of the t-th decision tree. During training, random forest selects feature subsets randomly to build node-splitting rules. This reduces the influence of feature correlation on a single tree and improves model stability. Compared with ANN and SVR, RF is generally less sensitive to feature scale and some forms of global linear dependence. However, when the sample size is small, the target strength range is wide, or the effective information is scattered across too many variables, the prediction error is still influenced by data structure conditions.
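The averaging rule in Equation (10) can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the RF configuration used in the study: each "tree" is a depth-1 stump on a single input, the subsets are drawn with replacement (an assumption), random feature selection at the splits is omitted because there is only one input, and the data are invented.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Depth-1 regression tree on one input: choose the split with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # guarantees both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((y - l_mean) ** 2 for y in left)
               + sum((y - r_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    if best is None:  # all inputs identical in this sample: constant prediction
        const = mean(ys)
        return lambda x: const
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x < threshold else r_mean

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train one stump per randomly resampled subset of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def forest_predict(trees, x):
    """Equation (10): average the outputs f_t(x) of all T trees."""
    return sum(tree(x) for tree in trees) / len(trees)

# Invented illustrative data (not taken from the 15 datasets).
xs = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
ys = [62.0, 57.0, 51.0, 46.0, 41.0, 37.0, 33.0, 29.0]
trees = fit_forest(xs, ys, n_trees=10, seed=1)
prediction = forest_predict(trees, 0.5)
```

Each stump sees a slightly different sample, so individual predictions differ; averaging them is what dampens the effect of any single unfavourable split.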

In this study, RF is used as a representative model of tree ensemble learning. Its introduction allows the comparison to include a model that is relatively robust to feature scaling and partial correlation so that the sensitivity of different learning mechanisms to data structure can be contrasted more clearly. The main RF parameters were also determined under the same cross-validation principle, and the final settings are listed in Table 4.

3.4. Evaluation Indicators and Experimental Procedure

The performance of the concrete compressive strength prediction system is evaluated using statistical indicators, which describe how closely the predicted values agree with the true values of the samples. In this study, these indicators are used to compare prediction performance under the different feature optimization methods.

The selected indicator is the mean absolute error (MAE), as shown in Equation (11). In the formula, y_{i} is the actual value, y_{i}^{'} is the predicted value, and n is the number of samples.

The mean absolute error is calculated as

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} \left| y_{i}^{'} - y_{i} \right|

(11)

By using unified evaluation indicators and experimental procedures, it is possible to compare the prediction error trends of different models under different data structure conditions. This provides a basis for further analysis of error patterns. MAE is used as the primary indicator in the subsequent structural comparison because it directly reflects the attainable error level across datasets and feature subset configurations.
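For reference, Equation (11) corresponds to the following short Python function. The function name and the sample values are illustrative assumptions and are not taken from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Equation (11): average absolute difference between predicted and actual values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only: absolute errors are 2, 1, and 3.
mae = mean_absolute_error([30.0, 45.0, 60.0], [32.0, 44.0, 57.0])
```

Because MAE is expressed in the same unit as the target variable, values computed in this way can be compared directly across feature subsets of the same dataset.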

4. Influence of Data Feature Structure on Prediction Error

4.1. Relationship Between Feature Size and Prediction Error

After establishing the three prediction models under the unified protocol described above, prediction analysis was carried out on each dataset under different feature subset sizes. Based on the four feature-filtering methods, feature-ranking results were obtained for the 15 datasets. According to the model framework described in Algorithm 1, the ranked features were sequentially introduced into the ANN, SVR, and RF models through the SFS procedure so that prediction performance could be examined under different datasets, feature subset sizes, and model types. To improve interpretability in the main text, the overall results are summarized in Figure 4, Figure 5 and Figure 6, while the complete numerical values are provided in Appendix A Table A1. Figure 4 presents the lowest MAE obtained at each feature size for the three models, Figure 5 reorganizes the same results from the perspective of feature-filtering methods, and Figure 6 further illustrates representative dataset-level MAE trajectories.

Algorithm 1 Feature filtering model

function Algorithm 1 (T_{k}: data set, where k is the index of the data set; T_{k}^{i}: feature subset, where k is the index of the data set and i is the size of the feature subset)
  for each data set T_{k} (k = 0, ..., 14) do
    Rank the features by Correlation, Partial correlation, Information entropy and Relief
    Generate the ranking lists of Correlation, Partial correlation, Information entropy and Relief
    for each ranking list do
      for i = 2 to the number of features in T_{k} do
        Get the top i features and generate the feature subset T_{k}^{i}
      end for
    end for
  end for
  return T_{k}^{i}
end function
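The subset-generation step of Algorithm 1 can be sketched in Python as follows. The sketch assumes that the ranking lists have already been produced by the four filtering methods, and that the subset size runs from 2 up to the full number of features, as the feature sizes listed in Appendix A Table A1 suggest. The function name, feature names, and orderings are hypothetical placeholders.

```python
def build_feature_subsets(rankings, min_size=2):
    """Return {(method, i): top-i features} for every ranking list.

    rankings maps a filtering-method name to a list of feature names
    ordered from most to least relevant.
    """
    subsets = {}
    for method, ranked in rankings.items():
        for i in range(min_size, len(ranked) + 1):
            subsets[(method, i)] = ranked[:i]
    return subsets

# Hypothetical ranking lists for one dataset with four features.
rankings = {
    "correlation": ["cement", "water", "age", "fly_ash"],
    "relief": ["age", "cement", "fly_ash", "water"],
}
subsets = build_feature_subsets(rankings)
```

Each generated subset would then be passed to the ANN, SVR, and RF models in turn, so that the MAE can be recorded for every combination of filtering method and feature size.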

As shown in Figure 4, the relationship between feature size and prediction error exhibits a clear common pattern across ANN, RF, and SVR. For most datasets, MAE is relatively high when only a small number of features is retained. As feature size increases, the prediction error decreases noticeably at first and then gradually approaches a relatively stable level. This indicates that the initial increase in feature number helps the model retain the main structural information of the concrete material system, whereas the marginal benefit becomes weaker once the dominant informative variables have already been included. Therefore, the influence of feature size is evident, but the relationship is not strictly monotonic.

Figure 4 also shows that the magnitude and location of MAE reduction are dataset-dependent. In some datasets, the low-error region appears after only a moderate increase in feature number, suggesting that the main predictive information is concentrated in a relatively compact subset. In other datasets, the low-error region appears later, implying that more variables are required before the model can adequately represent the underlying strength system. This result indicates that the optimal feature size is not fixed across datasets but is jointly influenced by the internal organization of variables, sample size, and target value distribution.
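As a small illustration of how a lowest-MAE curve of this kind can be derived, the Python sketch below takes the ANN results for data2 from Appendix A Table A1 and selects, for each feature size, the lowest MAE across the four filtering methods. The helper names are illustrative, and the way Figure 4 was actually produced is not specified here, so this should be read as one plausible reconstruction.

```python
# ANN MAE values for data2, copied from Appendix A Table A1.
# Column order: correlation, partial correlation, relief, information entropy.
METHODS = ["correlation", "partial correlation", "relief", "information entropy"]
ann_mae_data2 = {
    2: [10.24, 10.24, 12.92, 10.87],
    3: [11.98, 10.29, 12.64, 9.209],
    4: [11.57, 10.75, 12.94, 10.46],
    5: [9.389, 8.96, 10.24, 8.405],
    6: [7.443, 7.195, 7.18, 8.468],
}

def lowest_mae_by_size(table):
    """For each feature size, return (lowest MAE, method that achieved it)."""
    result = {}
    for size, row in table.items():
        best = min(range(len(row)), key=row.__getitem__)  # first index on ties
        result[size] = (row[best], METHODS[best])
    return result

def lower_bound(table):
    """Return (feature size, MAE, method) of the overall lowest MAE."""
    per_size = lowest_mae_by_size(table)
    size = min(per_size, key=lambda s: per_size[s][0])
    mae, method = per_size[size]
    return size, mae, method

best_per_size = lowest_mae_by_size(ann_mae_data2)
best_overall = lower_bound(ann_mae_data2)
```

For this dataset the per-size minima are 10.24, 9.209, 10.46, 8.405, and 7.18 for feature sizes 2 to 6. The value at size 4 is higher than at size 3, which is a concrete instance of the error decreasing overall without being strictly monotonic.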

Figure 5 provides a complementary interpretation from the perspective of feature-filtering methods. Although correlation, partial correlation, relief, and information entropy produce different ranking sequences, the global pattern remains similar: for most datasets, MAE decreases during the early stage of feature expansion and then enters a relatively stable interval. This means that the existence of a reasonable feature size range is not caused by a single ranking strategy but reflects a more general interaction between dataset structure and model response. At the same time, the positions of the local minima vary across methods. In some datasets, correlation-based methods reach low-error regions earlier, whereas in other cases, relief or information entropy gives lower MAE at intermediate feature sizes. Thus, prediction performance is controlled not only by how many features are retained but also by how variable relevance is ordered.

To make the trend more directly visible, Figure 6 plots several representative datasets. The representative curves show that the MAE–feature size relationship often exhibits an early decrease or the formation of a low-error region, followed by stabilization or local rebound depending on the dataset and model. Such behavior indicates that, once the principal informative variables have already been included, adding lower-ranked variables may mainly introduce redundancy, weak relevance, or stronger collinearity rather than additional effective information. As a result, the prediction error no longer decreases continuously and may even increase slightly in some cases.

A comparison among the three models further shows that the broad pattern is preserved, although the sensitivity differs. ANN exhibits a larger variation range in several datasets, indicating stronger sensitivity to feature configuration. RF is comparatively smoother in many cases, suggesting greater robustness to local changes in the retained subset. SVR generally lies between the two. Despite these differences in response intensity, all three models show the same structural trend: an overly small feature subset cannot adequately represent the dataset, whereas an excessively large subset does not necessarily improve prediction accuracy further.

The datasets associated with higher error are also not random. When Figure 4, Figure 5 and Figure 6 are interpreted together with Table 1, comparatively high MAE is more likely to appear under three conditions: limited sample size, wide compressive strength range, and feature subsets that remain structurally mismatched to the dataset even after feature expansion. This pattern is especially visible in datasets such as ‘data3’ and ‘data10’, where the material system is more complex and the target value range is broader, so the attainable error level remains relatively high even when the feature subset is enlarged. Therefore, the observed trend should not be interpreted simply as a numerical effect of adding variables but as the result of whether the retained variables can adequately match the intrinsic structure of the dataset.

Based on Figure 4, Figure 5 and Figure 6, it can therefore be concluded that feature size is an important structural factor influencing prediction error in concrete compressive strength prediction. Its effect is significant and stable across different datasets and different models, but it is not strictly linear or strictly decreasing. Instead, there exists a relatively reasonable feature size range, within which useful physical and statistical information can be retained while excessive redundancy is avoided. This conclusion provides the basis for the next section, in which feature size is further discussed together with dataset size and strength range as coupled structural variables.

4.2. Influence of Data Size and Strength Range on Prediction Error

Based on the above analysis of feature size, this section further examines the effects of dataset size and strength range on model performance. The 15 datasets show clear differences in sample number and target value distribution, as summarized in Table 1. These structural differences provide the context for interpreting the error patterns observed in Figure 4, Figure 5 and Figure 6.

From the perspective of dataset size, larger datasets generally correspond to lower and more stable prediction errors. When more samples are available, the models can observe a broader range of variable combinations during training and estimate the mapping between input variables and compressive strength more reliably. In contrast, datasets with limited sample size tend to show stronger MAE fluctuation when the feature subset changes. This phenomenon can be observed by comparing the smoother low-error regions in some larger datasets with the more irregular trajectories shown by several smaller datasets in Figure 6. Therefore, dataset size provides the basic support for the stability of the prediction system.

From the perspective of strength range, wider compressive strength distribution generally corresponds to higher prediction errors. A broader target range usually implies greater heterogeneity of material systems and a more complex mapping between input variables and strength response. Under such conditions, even when feature size is increased, the lower-bound MAE may remain relatively high. This pattern is particularly evident in datasets such as ‘data3’ and ‘data10’, in which error levels remain comparatively high and whose variation with feature size is more pronounced. Combined with the information in Table 1, this suggests that wide target value dispersion increases the intrinsic difficulty of prediction.

When Figure 6 and Figure 8 are interpreted together with Table 1, it becomes clear that feature size does not act independently. Its effect depends on the broader structural context of the dataset. For datasets with relatively sufficient sample size and a more concentrated strength range, the error tends to stabilize after a moderate increase in feature number. For datasets with fewer samples or wider strength ranges, the response to feature expansion is more sensitive, and local fluctuation or delayed stabilization is more likely to occur. Therefore, feature size, dataset size, and strength range should be understood as jointly acting structural variables rather than isolated factors.

4.3. Three-Factor Coupling Mechanism Analysis

To further analyze the combined influence of dataset size, feature size, and strength range on prediction error in a quantitative way, this study established an empirical error model. The lower-bound MAE values extracted from the feature size experiments summarized in Figure 4 and Figure 5, with the full numerical values retained in Appendix A Table A1, were used together with dataset size and compressive strength range to fit the relationship among the three structural variables and prediction error. Multiple linear regression on polynomial terms of the three variables was then used to obtain the explicit expression linking prediction error, dataset size, feature size, and compressive strength range, as given in Equation (12), where D is the dataset size, F is the feature set size, R is the strength range, and f(D,F,R) is the prediction error (MAE) under the coupled effect of the three structural variables.

\begin{matrix} f (D, F, R) = 0.7000 F + 6.8531 R + 0.2517 F^{2} - 9.7657 R^{3} + 0.2844 F^{4} + 0.0127 F R^{3} \\ + 0.6725 R^{4} - 0.0334 D^{5} - 0.0245 D^{4} R - 0.0191 D^{3} F^{2} - 0.1194 D F^{4} + 0.0190 F^{4} R \\ - 0.0314 F R^{4} + 3.3344 R^{5} + 8.1182 \end{matrix}

(12)

Equation (12) does not imply a universal law for all concrete datasets; rather, it provides an empirical relationship within the investigated data conditions. Its role is to summarize the coupled trend observed in the 15 datasets under the unified experimental framework. The fitted model shows that MAE is jointly influenced by dataset size, feature size, and target value range, which is consistent with the visual patterns already observed in Figure 4, Figure 5 and Figure 6. The fitting quality indicates that the three structural variables explain a substantial part of the variation in prediction error, while the validation results suggest that the empirical relationship is reasonably stable within the present dataset collection.
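For readers who wish to evaluate Equation (12) numerically, a direct Python transcription is given below. The units and any scaling or normalization applied to D, F, and R before fitting are not stated in this section, so the inputs used here are placeholders and the function should be treated as a transcription of the published coefficients rather than a ready-to-use predictor.

```python
def empirical_mae(D, F, R):
    """Direct transcription of Equation (12).

    D: dataset size, F: feature set size, R: strength range.
    Assumption: inputs are on whatever scale was used when the
    coefficients were fitted; that scale is not specified here.
    """
    return (0.7000 * F + 6.8531 * R + 0.2517 * F**2 - 9.7657 * R**3
            + 0.2844 * F**4 + 0.0127 * F * R**3 + 0.6725 * R**4
            - 0.0334 * D**5 - 0.0245 * D**4 * R - 0.0191 * D**3 * F**2
            - 0.1194 * D * F**4 + 0.0190 * F**4 * R - 0.0314 * F * R**4
            + 3.3344 * R**5 + 8.1182)

# With all inputs at zero, only the constant term remains.
baseline = empirical_mae(0, 0, 0)
```

Every term that contains D carries a negative coefficient, so for positive D, F, and R the fitted expression decreases as D increases, which matches the reported tendency for larger datasets to yield lower error. The high polynomial orders also mean the expression can change rapidly outside the fitted range, in line with the caution that it is not a general predictive law.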

Based on Equation (12), a three-dimensional sectional plot of MAE for the 28-day concrete compressive strength prediction system was generated, as shown in Figure 7. Figure 7 visualizes the joint relationship between prediction error and the three structural variables, whereas Figure 8 further decomposes this coupling mechanism into several two-factor views. In this sense, Figure 4, Figure 5 and Figure 6 provide the experimental basis for identifying the lower-error regions associated with feature size variation, while Figure 7 and Figure 8 summarize the broader structural mechanism in a more compact form.

Based on the previous analysis, the coupling mechanism can be discussed from three aspects: data size, feature number, and strength range. Figure 8 shows the variation trend of prediction error under different combinations of these factors. It presents the interaction among the three factors more clearly.

From Figure 8a, it can be seen that when feature size is fixed, prediction error is highly sensitive to the strength range of the target variable. Figure 8b shows that when the strength range is controlled, prediction error decreases with increasing dataset size and also exhibits an optimal feature size interval under a given sample condition. Figure 8c further indicates that when dataset size is fixed, the combined effect of feature size and strength range still determines the attainable error level. These results support the interpretation derived from Figure 4, Figure 5 and Figure 6, namely, that the change in prediction error is not caused by a single factor but by the coupled action of structural variables.

Therefore, the empirical model should be interpreted as a structural summary of the current dataset collection rather than as a fully general predictive law. Its practical implication is that reducing prediction error requires coordinated consideration of dataset size, feature subset size, and target value range. Removing redundant variables, moderately increasing sample size, and avoiding unnecessarily broad strength mixtures can all improve the stability of the prediction system under the investigated conditions.

5. Discussion

5.1. Relationship Between Feature Structure Differences and Concrete Material System

The feature weight analysis results show that there are clear differences in the distribution of variable importance among different datasets. In normal concrete datasets, cement content, water–binder ratio, and some binder-related variables have high weights under different feature evaluation methods. This result is consistent with the strength formation mechanism of concrete. The water–binder ratio controls the degree of hydration reaction, and the amount of cement and binder directly determines the amount of hydration products.

In contrast, in high-strength and ultra-high-performance concrete datasets, mineral admixtures and chemical admixture variables show higher importance. This means that when the material system changes, the key factors controlling strength also change. A single variable is not enough to explain strength differences, and the feature weight distribution becomes more dispersed. This result indicates that feature structure is closely related to material type. Feature engineering should be optimized according to the specific concrete system.

In addition, different feature-filtering algorithms produce different ranking results for different datasets. Correlation and partial correlation focus more on linear dependency. Information entropy and relief reflect the discrimination ability of variables in local sample distribution. Therefore, in datasets with mixed information or strong variable correlation, a single filtering method cannot fully describe feature contribution.

5.2. Influence of Data Structure on Model Error

The results in Section 4 show that ANN, SVR, and RF have similar error trends across different datasets. Although the three models have different learning mechanisms, Figure 4, Figure 5 and Figure 6 show that they respond to feature size variation in a broadly consistent way, while Table 1 and Figure 8 indicate that this response is further constrained by dataset size and strength range. This means that the overall variation in error is not determined only by model form. Instead, it is first shaped by data structure and then reflected through different models with different degrees of sensitivity.

In terms of feature size, the error generally decreases first and then becomes stable, with local fluctuation in some datasets. The visual evidence in Figure 6 and Figure 8 suggests that this pattern results from a transition from information insufficiency to structural redundancy. When too few features are retained, the model cannot adequately represent the material system. When more informative variables are gradually included, the prediction error decreases because the effective structure of the input space becomes more complete. However, after the main informative variables have already been retained, adding further lower-ranked features tends to contribute weak or redundant information, and in some cases may intensify collinearity or local instability. This explains why the error no longer decreases continuously and why a slight rebound can be observed in several datasets.

The response pattern also differs among ANN, SVR, and RF. ANN is usually more sensitive to changes in feature configuration, which suggests that parameterized nonlinear mapping is more strongly affected by the way information is organized in the input space. RF is comparatively more robust in many datasets, indicating that tree-based ensemble learning can absorb part of the disturbance caused by feature correlation and local subset perturbation. SVR generally lies between the two. However, these differences do not overturn the global pattern shown in Figure 4, Figure 5 and Figure 6. Instead, they indicate that different learning paradigms reveal the same structural constraints with different response intensities.

Figure 5 further shows that the feature-filtering mechanism changes the location and width of the low-error region but not the existence of that region itself. This means that prediction error is controlled not only by the number of retained variables but also by whether the retained subset matches the actual structure of the dataset. As a result, the most effective ranking strategy is dataset-dependent, which is consistent with the view that feature space structure differs across concrete systems.

Taken together, the results suggest that prediction error increases mainly under three kinds of conditions: when the sample size is insufficient, when the target strength range is excessively wide, and when the retained feature subset is structurally mismatched to the dataset. Therefore, improving model performance in concrete compressive strength prediction does not rely only on selecting a more complex algorithm. More importantly, it requires controlling the structure of the input system, including dataset size, target value distribution, and feature subset configuration. In this sense, data organization and feature configuration should be regarded as the primary layer of error control, whereas model type is the secondary layer through which these structural effects are expressed.

This interpretation also helps define the scope of the present conclusions. The results support a structural explanation of error variation under the current unified framework, but they do not imply that any single model, feature subset size, or filtering method will remain optimal for all future datasets. Instead, the practical implication is that new datasets should first be examined in terms of their structural characteristics, including sample support, target range complexity, and feature space organization, before attempting further gains through model complexity alone.

5.3. Interpretation of the Empirical Error Model

Based on sample size, feature size, and strength range, this study established an empirical error model. The model shows that prediction error has a functional relationship with data structure variables. It can quantitatively describe the error variation trend under different data conditions.

Through fitting analysis of 15 datasets, it can be seen that increasing data size helps reduce error, while enlarging the strength range increases prediction error. The effect of feature size shows nonlinear characteristics, and its influence depends on sample size and the distribution of the target variable.

The purpose of this empirical model is to explain the error differences among different datasets from the perspective of data structure rather than only comparing model performance. The model form has some influence on error level, but it does not change the overall trend of error variation with data structure. This indicates that in concrete strength prediction, data organization and feature configuration play a fundamental role in prediction accuracy.

6. Conclusions

This study focused on the prediction of 28-day concrete compressive strength. From the perspective of data structure and feature engineering, the influence of feature system, dataset size, and strength range on prediction error was systematically analyzed under three representative models, namely ANN, SVR, and RF. By comparing 15 concrete datasets from different sources, the structural reasons for the differences in prediction performance were identified. An empirical model describing the relationship between prediction error and data structure variables was further established. The results show that, in concrete strength prediction, data organization and feature configuration have a fundamental influence on model performance.

Based on the above analysis, the following conclusions can be drawn:

(1): There exists a reasonable feature size range for concrete strength prediction systems, but this range is dataset-dependent rather than universally fixed. For most normal concrete and high-performance concrete datasets, a relatively small, optimized feature subset can achieve stable prediction accuracy. For ultra-high-performance concrete and structurally more complex datasets, increasing feature number appropriately can improve prediction ability, although larger feature size does not continuously improve performance.
(2): The distribution of feature importance is closely related to the material system. In normal concrete datasets, cement content, water–binder ratio, and binder-related variables usually play dominant roles. In high-strength or ultra-high-performance concrete datasets, mineral admixtures and chemical admixture variables become more important. This indicates that feature engineering should be designed according to the specific concrete system rather than transferred mechanically across datasets.
(3): The four feature-filtering methods, including correlation, partial correlation, information entropy, and relief, can all be used to characterize feature space structure, but their effectiveness depends on feature distribution and variable correlation in the dataset. There is no single filtering method that performs best for all concrete datasets. For datasets with mixed information or unbalanced variable contribution, information entropy and relief may show relative advantages, whereas in structurally simpler datasets, correlation-based methods may reach low-error regions earlier.
(4): ANN, SVR, and RF show consistent global trends in error variation, which indicates that prediction error is primarily constrained by data structure rather than determined only by model type. ANN is generally more sensitive to changes in feature configuration, whereas RF and SVR are relatively more stable. However, the model type does not overturn the overall trend that prediction error usually decreases first and then stabilizes as the retained feature subset becomes structurally more adequate.
(5): Dataset size, strength range, and feature size jointly determine the attainable performance of the prediction system. The empirical error model established in this study can quantitatively describe the coupled relationship among these three structural variables and prediction error within the investigated dataset collection. Its role is to summarize the structural trend observed under the unified framework of this study rather than to serve as a universal predictive law for all future concrete datasets.
(6): From a practical perspective, the findings suggest that improving concrete compressive strength prediction does not rely only on selecting a more complex algorithm. Researchers and engineers should first examine the structural characteristics of a dataset, especially sample support, target strength range, and feature space organization, before attempting additional gains through model complexity alone. In this sense, data organization and feature configuration should be regarded as the first layer of prediction error control.

At the same time, the conclusions of this study should be interpreted within the scope of the present data conditions. The 15 datasets were analyzed under a unified protocol, but they originate from different concrete systems and literature sources. Therefore, the identified feature size range and the empirical error relationship should be understood as structurally informative rather than universally transferable. Further validation on unseen datasets is still needed to assess the broader generalizability of the empirical model.

Author Contributions

Conceptualization, Y.M. and B.L.; methodology, Y.M., B.L. and C.Y.; data curation, Y.M., C.Y. and X.H.; formal analysis, Y.M., B.L. and C.Y.; investigation, Y.M., B.L., C.Y. and X.H.; resources, B.L.; software, B.L.; supervision, B.L.; validation, Y.M., B.L., C.Y. and X.H.; visualization, Y.M. and B.L.; funding acquisition, Y.M. and B.L.; writing—original draft preparation, Y.M. and C.Y.; writing—review and editing, Y.M., B.L., C.Y. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the financial support from the Sichuan Provincial Natural Science Foundation (2026NSFSC0353) for this study.

Data Availability Statement

The data used in this study are available upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Prediction performance (green-highlighted values indicate the minimum prediction error within each group).

Data Set	Feature Size	ANN				RF				SVR
Data Set	Feature Size	Correlation	Partial Correlation	Relief	Information Entropy	Correlation	Partial Correlation	Relief	Information Entropy	Correlation	Partial Correlation	Relief	Information Entropy
data0	2	4.05	3.801	7.268	5.794	3.832	3.824	7.226	6.182	4.118	3.698	7.299	5.701
	3	3.955	3.773	5.799	5.608	4.036	3.822	6.1	6.24	4.134	3.73	5.747	5.362
	4	3.934	3.653	5.509	4.1	4.086	3.844	5.877	4.155	4.062	3.497	5.326	4.03
	5	3.939	3.648	4.142	4.048	4.088	3.879	4.178	4.02	4.059	3.616	3.989	3.924
	6	3.968	3.762	4.113	4.026	4.106	4.027	4.176	3.926	4.166	3.68	3.99	3.963
	7	3.854	3.728	4.158	4.095	4.011	3.928	4.176	3.931	3.98	3.828	4.008	4.016
	8	3.765	3.712	4.138	3.629	3.989	3.929	4.17	3.702	4.027	3.838	4.031	3.711
	9	3.84	3.706	4.084	3.696	3.867	3.918	4.058	3.694	4.069	3.832	4.03	3.74
	10	3.732	3.757	4.176	3.758	3.739	3.921	3.981	3.692	4.015	3.944	3.96	3.897
	11	3.725	3.593	4.262	3.674	3.726	3.78	3.981	3.68	4.053	3.959	3.978	3.993
	12	3.722	3.685	3.888	3.718	3.731	3.777	3.763	3.677	4.067	4.005	3.738	4.011
	13	3.705	3.711	3.782	3.696	3.73	3.774	3.77	3.672	4.106	4.021	3.795	3.932
	14	3.76	3.656	3.779	3.705	3.731	3.691	3.754	3.679	4.104	3.984	3.891	3.968
	15	3.845	3.667	3.73	3.7	3.733	3.673	3.734	3.667	4.132	3.958	3.95	3.967
	16	3.813	3.752	3.828	3.825	3.673	3.691	3.739	3.673	4.096	3.999	4.056	3.999
	17	3.771	3.771	3.771	3.771	3.671	3.671	3.671	3.671	4.018	4.018	4.018	4.018
data1	2	17.45	16.6	21.23	17.37	10.15	18.21	19.17	16.11	16.33	16.48	23.08	14.15
	3	17.88	26.56	22.21	16.51	9.852	11.82	19.49	15.9	15.92	11.61	21.4	13.27
	4	18.53	11.87	42.87	24.96	9.16	9.419	18.85	10.02	14.79	9.95	20.48	11.58
	5	28.12	12.59	25.43	24.16	9.165	9.496	18.73	9.688	12.78	10.1	20.4	12.64
	6	20.69	13.93	23.79	21.55	8.19	8.539	18.81	9.174	9.009	8.881	19.16	13.5
	7	19.39	21.47	22.96	15.4	8.103	7.998	18.55	9.332	9.036	9.245	18.83	13.65
	8	16.17	14.57	31.57	28.62	8.241	8.154	18.25	8.945	7.948	8.342	18.37	13.1
	9	16.59	15.71	16.52	14.45	8.22	8.059	8.037	8.023	8.448	8.448	8.448	8.448
data2	2	10.24	10.24	12.92	10.87	8.336	8.336	11.19	7.221	10.26	10.26	12.89	10.2
	3	11.98	10.29	12.64	9.209	7.351	7.277	9.747	6.769	8.998	9.233	12.79	8.912
	4	11.57	10.75	12.94	10.46	6.752	7.495	8.919	6.75	7.291	7.362	10.32	7.291
	5	9.389	8.96	10.24	8.405	6.825	7.118	8.192	6.445	7.195	6.87	10.37	6.657
	6	7.443	7.195	7.18	8.468	6.64	6.621	6.627	6.581	6.476	6.476	6.476	6.476
data3	2	10.81	-	32.91	10.81	10.04	-	34.75	10.04	10.46	-	34.14	10.46
	3	18.03	-	30.27	18.03	8.803	-	29.65	8.803	7.027	-	31.23	7.027
	4	24.29	-	36.34	26.51	8.938	-	29.65	8.429	9.167	-	31.65	8.431
	5	19.8	-	24.75	22.44	8.462	-	18.08	8.511	9.489	-	17.71	9.489
	6	25.54	-	34.93	22.09	8.694	-	18.1	8.697	10.21	-	19.5	10.21
	7	28.97	-	19.99	28.82	8.689	-	8.509	8.509	10.49	-	10.49	10.49
data4	2	16.56	13.2	19.61	16.33	15.76	15.71	16.7	16.19	14.42	11.95	16.5	13.95
	3	16.66	15.67	23.41	22.21	15.3	15.34	17.15	15.19	12.89	12.89	17.43	18.26
	4	19.93	18.11	21.6	30.08	14.22	15.08	16.95	15.14	10.61	12.21	18.67	17.99
	5	34.17	35.88	25	29.39	14.41	15.18	15.02	15.37	11.44	11.7	17.61	16.49
	6	36.72	30.21	36.99	19.94	15.19	16.42	15.19	15.35	13.09	12.61	16.65	15.12
	7	44.59	31.94	41.15	27.27	16.62	16.86	15.17	15.99	15.12	12.88	15.34	15.9
	8	38.54	35.72	50.87	31.42	16.86	15.82	18.32	15.85	15	14.79	16.28	14.68
	9	40.44	37.43	54.37	30.57	17.22	16.02	17.02	15.51	15.35	14.46	13.98	14
	10	27.37	37.36	33.5	41.29	17.07	17.4	16.16	16.94	14.52	14.52	14.05	14.48
	11	32.12	24.67	43.52	31	16.61	16.49	16.48	16.58	14.23	14.23	14.23	14.23
data5	2	4.614	3.985	7.478	5.499	2.796	3.399	4.403	2.773	3.008	3.426	4.883	3.008
	3	5.387	5.73	9.231	9.56	3.443	3.708	4.517	3.678	3.785	4.02	4.419	3.726
	4	9.296	6.969	8.223	9.81	3.447	3.617	3.824	3.859	3.361	3.047	4.095	3.683
	5	13.51	6.432	12.18	12.16	3.731	3.725	3.837	3.953	3.309	3.004	3.929	3.929
	6	9.014	12.27	14.1	10.86	3.88	3.852	3.857	3.824	3.726	3.726	3.726	3.726
data6	2	12.44	12.44	9.926	14.37	7.991	7.991	9.728	7.962	9.197	9.197	9.659	10.44
	3	10.27	10.27	11.43	12.71	8.072	8.072	9.952	7.887	9.478	9.478	9.737	9.771
	4	11.78	11.78	11.61	9.286	8.128	8.128	10.08	7.707	8.873	8.873	9.659	9.935
	5	10.97	10.97	12.24	9.282	8.075	8.075	10.11	7.453	9.277	9.277	9.767	9.223
	6	11.36	11.36	12.39	9.78	7.983	7.983	10.18	7.653	9.782	9.782	9.902	8.896
	7	12.51	12.51	16.08	10.52	7.721	7.721	11.46	7.564	9.278	9.278	10.97	9.381
	8	15.32	15.32	12.56	12.04	7.592	7.592	7.62	7.671	9.057	9.057	9.057	9.057
data7	2	17.3	13.72	16.87	22	17.04	16.84	16.8	16.67	15.31	13.53	17.15	17.1
	3	17.22	17.41	18.38	17.57	16.93	16.33	16.72	14.19	14.81	13.43	17.92	14.17
	4	21.56	30.37	23.22	24.15	16.9	15.35	17.08	16.79	14.29	13.1	18.83	15.21
	5	33.58	39.28	23.73	30.83	16.53	16.3	16.92	16.81	14.66	11.7	19.11	16.01
	6	33.32	34.62	63.4	23.56	16.24	16.29	17.72	16.77	13.92	13.92	17.82	15.57
	7	30.29	22.24	51.4	27.79	16.97	17.14	19.8	17.05	13.48	13.48	19.2	16.04
	8	30.23	37.98	48.82	29.29	17.15	17.08	18.22	16.83	12.5	13.53	16.84	14.9
	9	27.6	31.73	27.64	31.12	16.52	17.23	15.26	16.94	13.63	12.13	13.6	14.49
	10	25.67	23.82	29.44	32.29	16.65	18.04	15.13	16.47	12.99	14.08	13.94	12.99
	11	31.77	26.83	38.02	24.64	17.43	17.43	17.83	17.56	13.41	13.41	13.41	13.41
data8	2	10.26	10.26	8.277	7.553	9.598	8.569	6.278	7.939	9.419	7.912	7.526	7.616
	3	9.331	9.288	7.588	8.476	8.535	6.629	6.594	6.571	8.211	5.592	7.089	5.592
	4	9.099	10.05	10.7	8.355	8.73	7.066	7.205	7.461	8.403	6.361	7.49	6.756
	5	8.362	8.885	6.981	9.035	7.394	7.397	7.529	7.879	6.49	6.49	6.49	7.851
	6	8.951	7.187	6.995	8.006	7.941	7.931	8.001	7.863	7.836	7.836	7.836	7.836
data9	2	16.17	13.58	10.66	17.19	11.82	12.21	11.24	12.03	13.86	14.93	10.86	13.06
	3	25.35	15.65	20.49	28.96	12.12	12.1	13.03	11.92	14.8	6.703	11.51	13.78
	4	12.65	11.33	21.61	28.7	12.18	11.2	12.46	12.06	7.937	8.199	12.83	12.36
	5	11.84	10.61	21.9	22.88	12.61	11.84	12.74	11.57	7.723	7.647	13.86	11.16
	6	11.92	13.22	9.841	11.65	11.76	11.63	11.78	11.51	8.499	8.499	8.499	8.499
data10	2	32.14	30.13	25.96	30.47	27.84	25.34	30.2	28.6	30.14	27.71	29.78	31.83
	3	31.27	37.74	37.7	43.2	28.3	27.61	29.41	27.72	29.44	30.91	31.86	33.74
	4	36.91	41.74	110.5	57.46	26	28.35	29.69	27.73	31.55	32.74	32.86	29.7
	5	87.53	70.8	64	65.66	25.83	28.42	28.5	27.86	26.62	31.92	32.08	30.97
	6	58.12	109.3	73.69	70.89	27.07	27.94	27.36	28.7	28.34	29.66	28.68	32.72
	7	51.65	73.03	79.89	131.1	27.84	27.79	27.85	28.28	28.26	28.26	30.09	28.26
	8	119.2	59.68	164	152.5	27.69	27.69	27.66	27.71	29.86	29.86	29.86	29.86
data11	2	8.346	7.481	11	8.346	6.88	6.17	10.39	6.88	8.277	7.54	11.42	8.277
	3	8.396	6.597	11.13	6.097	6.545	5.277	9.212	4.861	8.344	6.37	10.56	6.321
	4	7.488	5.862	9.489	5.973	5.985	5.099	8.338	4.833	8.257	6.095	10.14	5.835
	5	5.313	5.595	11.6	5.604	4.4	5.371	7.698	4.708	5.507	5.477	10.08	5.857
	6	5.811	5.117	6.838	6.057	4.648	4.766	6.491	4.819	5.689	5.716	7.097	5.717
	7	5.154	5.85	4.885	5.613	4.699	4.736	4.679	4.73	5.534	5.534	5.534	5.534
data12	2	3.331	3.331	3.331	3.395	3.019	3.019	3.019	2.986	3.197	3.197	3.197	3.197
	3	3.235	2.893	2.893	2.953	2.79	2.395	2.395	2.386	3.221	2.802	2.802	2.802
	4	3.22	2.696	2.696	2.763	2.64	2.083	2.083	2.097	3.341	2.668	2.668	2.668
	5	2.16	2.136	2.136	2.202	2.031	2.035	2.035	2.026	2.124	2.124	2.124	2.124
data13	2	7.721	-	7.379	6.556	2.869	-	6.862	3.945	3.568	-	5.544	5.214
	3	5	-	8.685	6.277	3.947	-	6.879	3.947	5.13	-	5.513	5.13
	4	7.615	-	7.089	6.629	3.991	-	2.859	2.929	5.283	-	3.159	3.147
	5	12.07	-	7.48	6.057	3.999	-	2.837	2.838	5.204	-	3.112	3.112
	6	13.17	-	5.768	8.832	2.983	-	2.897	2.917	3.225	-	3.245	3.245
	7	13.22	-	13.63	15.82	2.882	-	2.893	2.912	3.141	-	3.141	3.141
data14	2	7.771	7.446	9.941	7.771	6.112	6.176	8.58	6.112	7.552	7.342	9.889	7.552
	3	7.141	5.993	10.04	6.621	5.626	4.746	7.94	4.918	7.041	5.641	9.745	6.317
	4	6.741	5.776	9.562	6.05	5.272	4.645	7.563	4.701	6.424	5.579	9.45	5.803
	5	6.283	5.097	8.557	5.867	4.892	4.731	6.401	4.58	5.938	4.815	7.866	5.62
	6	5.538	4.548	6.322	5.508	4.633	4.499	5.445	4.47	4.994	4.487	5.431	4.934
	7	4.636	4.662	4.71	4.472	4.105	4.111	4.108	4.11	4.081	4.081	4.081	4.081
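
The group minima referred to in the caption of Table A1 can also be recovered programmatically. The following minimal Python sketch does this for the data2 rows, assuming that a "group" means one dataset-model block (all feature sizes and all four filtering methods); the study may group the values differently. The names `DATA2` and `group_minima` are illustrative and do not come from the study.

```python
# Column layout of Table A1: three model blocks, each with four filtering methods.
MODELS = ["ANN", "RF", "SVR"]
METHODS = ["Correlation", "Partial Correlation", "Relief", "Information Entropy"]

# data2 rows copied from Table A1: feature size -> 12 error values.
# A missing entry ("-" in the table) would be written as None.
DATA2 = {
    2: [10.24, 10.24, 12.92, 10.87, 8.336, 8.336, 11.19, 7.221, 10.26, 10.26, 12.89, 10.2],
    3: [11.98, 10.29, 12.64, 9.209, 7.351, 7.277, 9.747, 6.769, 8.998, 9.233, 12.79, 8.912],
    4: [11.57, 10.75, 12.94, 10.46, 6.752, 7.495, 8.919, 6.75, 7.291, 7.362, 10.32, 7.291],
    5: [9.389, 8.96, 10.24, 8.405, 6.825, 7.118, 8.192, 6.445, 7.195, 6.87, 10.37, 6.657],
    6: [7.443, 7.195, 7.18, 8.468, 6.64, 6.621, 6.627, 6.581, 6.476, 6.476, 6.476, 6.476],
}


def group_minima(rows):
    """Return {model: (min_error, feature_size, method)} for one dataset.

    Assumption: a group is one dataset-model block. Ties keep the first
    occurrence in table order.
    """
    best = {}
    for size, values in rows.items():
        for i, value in enumerate(values):
            if value is None:  # skip entries reported as "-"
                continue
            model, method = MODELS[i // 4], METHODS[i % 4]
            if model not in best or value < best[model][0]:
                best[model] = (value, size, method)
    return best


minima = group_minima(DATA2)
for model in MODELS:
    error, size, method = minima[model]
    print(f"{model}: minimum error {error} at feature size {size} ({method})")
```

For data2 the SVR minimum occurs at the full feature set, where all four filtering methods select the same features and therefore give the same value.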

Figure 1. Experiment flow.

Figure 2. Kernel density distribution of data0 features.

Figure 3. Feature weight stacked graph.

Figure 4. Heat maps of minimum MAE across feature subset sizes for three prediction models.

Figure 5. Heat maps of minimum MAE across feature subset sizes for four feature-filtering methods.

Figure 6. Representative MAE variation curves with feature subset size under three prediction models.

Figure 7. 3D sectional view.

Figure 8. Two-factor performance impact.

Table 1. Sample size, number of features, and strength range of data0–data14.

Data Set	Type	Sample Size	Number of Features	Min of CS (MPa)	Max of CS (MPa)
data0	normal concrete	7083	17	12.9	79.9
data1	high-performance concrete	207	9	3	133.6
data2	self-compacting concrete	127	6	10.2	73.5
data3	ultra-high-performance concrete	58	9	77	211
data4	self-compacting concrete	169	11	10.2	117.03
data5	slump-free concrete	32	6	52.2	76.7
data6	masonry aggregate concrete	147	8	8.7	80.5
data7	self-compacting concrete	205	11	10.2	122
data8	rice husk ash concrete	60	6	42.47	92.21
data9	self-compacting concrete	80	6	10.2	73.5
data10	ultra-high-performance concrete	110	8	95	240
data11	high-performance concrete	425	7	8.536	81.751
data12	high-performance concrete	357	5	4.096	91.3
data13	recycled concrete	74	7	25.8	52.4
data14	high-performance concrete	528	7	8.54	81.75

Table 2. Feature statistics of data0.

Feature	Mean	Std	Var	Skewness	Kurtosis	Mode	Min	Median	Max	Range
ShengWei_P·O42.5	0.96	15.10	227.96	16.84	305.82	0.00	0.00	0.00	420.00	420.00
ShengTai_P·O42.5	232.20	58.26	3394.48	0.08	0.90	220.00	0.00	230.00	430.00	430.00
Fly_Ash	81.40	14.69	215.86	0.78	9.68	80.00	0.00	80.00	200.00	200.00
Fiber_expansion_agent	0.65	4.67	21.77	7.01	47.13	0.00	0.00	0.00	34.00	34.00
Siliceous_compacting_agent	0.20	2.58	6.63	13.03	167.91	0.00	0.00	0.00	34.00	34.00
Expansion_agent	0.04	1.21	1.47	28.01	782.55	0.00	0.00	0.00	34.00	34.00
Datang_S95_mineral_powder	60.64	23.55	554.56	−1.61	2.11	60.00	0.00	70.00	110.00	110.00
Shangluo_medium_sand	383.24	164.72	27,133.86	0.51	0.87	300.00	0.00	350.00	970.00	970.00
Tongchuan_medium_sand	380.68	164.72	27,133.64	0.55	0.90	300.00	0.00	350.00	970.00	970.00
Coarse_sand	167.36	145.53	21,180.04	0.17	−0.96	0.00	0.00	200.00	780.00	780.00
Commercial_coarse_sand	347.41	154.69	23,929.07	0.72	1.44	300.00	0.00	300.00	950.00	950.00
5–25 mm_crushed_stone	752.23	217.99	47,520.97	−2.55	6.51	750.00	0.00	770.00	1050.00	1050.00
Brick_slag	0.67	21.55	464.44	32.17	1041.71	0.00	0.00	0.00	750.00	750.00
5–10 mm_fine_stone	188.97	172.61	29,793.39	2.38	7.19	200.00	0.00	200.00	950.00	950.00
Water_reducing_agent	8.78	1.64	2.68	0.13	0.65	8.00	0.00	8.80	16.20	16.20
water	77.12	19.79	391.63	1.57	2.20	70.00	0.00	70.00	200.00	200.00
Sewage	59.01	24.58	604.37	−1.83	1.72	70.00	0.00	70.00	100.00	100.00
28d_compressive_strength	44.84	10.38	107.82	−0.38	0.76	45.47	12.93	45.47	79.93	67.00
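
The summary statistics in Table 2 can be reproduced for any feature column with a few lines of Python. The sketch below is one possible implementation and rests on assumptions: it uses the sample variance (divisor n − 1) and moment-based skewness and excess kurtosis, whereas the study does not state which estimators were used. The function name `describe` and the example values are hypothetical.

```python
import statistics as st


def describe(values):
    """Summarise one numeric column with the statistics listed in Table 2."""
    n = len(values)
    mean = st.fmean(values)
    var = st.variance(values)  # sample variance (n - 1); an assumption
    # Central moments for skewness and excess kurtosis; also an assumption.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurtosis = m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0
    return {
        "mean": mean,
        "std": var ** 0.5,
        "var": var,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "mode": st.mode(values),
        "min": min(values),
        "median": st.median(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


# Hypothetical dosage values for a single feature (not taken from data0).
example = describe([0.0, 0.0, 70.0, 70.0, 70.0, 100.0])
print(example)
```

The range column is simply the maximum minus the minimum, which is consistent with the compressive strength row of Table 2 (79.93 − 12.93 = 67.00).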

Table 3. Feature weight of data0.

Correlation		Partial Correlation		Information Entropy		Relief
ShengTai_P·O42.5	0.799	ShengTai_P·O42.5	0.722	ShengTai_P·O42.5	0.525	Shangluo_medium_sand	−3.07 × 10⁻³
Fly_Ash	0.536	Datang_S95_mineral_powder	0.517	Water_reducing_agent	0.084	ShengWei_P·O42.5	−8.09 × 10⁻³
Water_reducing_agent	0.513	ShengWei_P·O42.5	0.386	Datang_S95_mineral_powder	0.063	Water_reducing_agent	−1.55 × 10⁻²
Sewage	0.487	water	0.141	Commercial_coarse_sand	0.054	water	−3.08 × 10⁻²
5–10 mm_fine_stone	0.425	Sewage	0.137	water	0.052	ShengTai_P·O42.5	−3.75 × 10⁻²
5–25 mm_crushed_stone	0.403	Water_reducing_agent	0.131	5–25 mm_crushed_stone	0.051	Expansion_agent	−1.44 × 10⁻¹
Datang_S95_mineral_powder	0.388	Coarse_sand	0.097	Shangluo_medium_sand	0.035	Brick_slag	−2.85 × 10⁻¹
water	0.199	Expansion_agent	0.065	Tongchuan_medium_sand	0.035	Fiber_expansion_agent	−6.56 × 10⁻¹
Coarse_sand	0.136	5–10 mm_fine_stone	0.064	Coarse_sand	0.027	5–25 mm_crushed_stone	−8.35 × 10⁻¹
Shangluo_medium_sand	0.100	Fly_Ash	0.045	Sewage	0.026	Coarse_sand	−9.32 × 10⁻¹
Tongchuan_medium_sand	0.096	Tongchuan_medium_sand	0.036	Fly_Ash	0.022	Siliceous_compacting_agent	−9.50 × 10⁻¹
Brick_slag	0.063	Shangluo_medium_sand	0.024	5–10 mm_fine_stone	0.019	Datang_S95_mineral_powder	−1.04 × 10⁰
Fiber_expansion_agent	0.056	Siliceous_compacting_agent	0.019	ShengWei_P·O42.5	0.006	Tongchuan_medium_sand	−1.05 × 10⁰
Expansion_agent	0.044	Commercial_coarse_sand	0.018	Fiber_expansion_agent	0.000	Fly_Ash	−1.12 × 10⁰
Siliceous_compacting_agent	0.034	5–25 mm_crushed_stone	0.016	Expansion_agent	0.000	5–10 mm_fine_stone	−1.40 × 10⁰
Commercial_coarse_sand	0.031	Fiber_expansion_agent	0.014	Siliceous_compacting_agent	0.000312	Sewage	−2.40 × 10⁰
ShengWei_P·O42.5	0.01	Brick_slag	0.001	Brick_slag	0.00011	Commercial_coarse_sand	−3.08 × 10⁰
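
A ranking such as the one in Table 3 can be turned into nested feature subsets of increasing size. The sketch below assumes that a subset of size k keeps the k highest-weighted features; this is an interpretation, not a procedure confirmed by the study. It is consistent with Table A1, where all four methods give identical errors at the full feature size. The weights are an excerpt of the Correlation column, and the names `top_k` and `nested_subsets` are illustrative.

```python
# Excerpt of the Correlation column of Table 3 (feature -> weight).
CORRELATION_WEIGHTS = {
    "ShengTai_P·O42.5": 0.799,
    "Fly_Ash": 0.536,
    "Water_reducing_agent": 0.513,
    "Sewage": 0.487,
    "Datang_S95_mineral_powder": 0.388,
    "water": 0.199,
    "Coarse_sand": 0.136,
}


def top_k(weights, k):
    """Return the k features with the largest weights, in descending order."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]


def nested_subsets(weights, min_size=2):
    """Build subsets of size min_size..n (assumed top-k selection rule)."""
    return [top_k(weights, k) for k in range(min_size, len(weights) + 1)]


subsets = nested_subsets(CORRELATION_WEIGHTS)
for subset in subsets:
    print(len(subset), subset)
```

Sorting by the raw weight in descending order also reproduces the listed order of the Relief column, whose weights are all negative.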

Table 4. Model hyperparameters.

ANN		SVR		RF
Model parameter	Value	Model parameter	Value	Model parameter	Value
hidden_layer_sizes	30	kernel	rbf	n_estimators	100
random_state	0	C	100	min_samples_split	2
max_iter	2000	gamma	0.1	min_samples_leaf	1
activation	relu	epsilon	0.1	random_state	0
solver	adam	degree	3	min_impurity_decrease	0
alpha	0.0001	coef0	0	min_weight_fraction_leaf	0
learning_rate_init	0.001	tol	1 × 10⁻³	ccp_alpha	0
power_t	0.5	max_iter	−1	max_features	auto
tol	0.0001	-	-	max_samples	None
validation_fraction	0.1	-	-	-	-
beta_1	0.9	-	-	-	-
beta_2	0.999	-	-	-	-
epsilon	1 × 10⁻⁸	-	-	-	-
n_iter_no_change	10	-	-	-	-
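
The parameter names in Table 4 match those of scikit-learn's `MLPRegressor`, `SVR`, and `RandomForestRegressor`. Assuming that library was used (the table itself does not say so), the settings could be written as the following configuration sketch. The dictionary names and the `build_models` helper are hypothetical. Two values are interpreted rather than copied: the hidden layer size of 30 is read as a single hidden layer of 30 units, and `max_features` is set to 1.0 because the "auto" option was removed in scikit-learn 1.3 and 1.0 is its equivalent for regressors.

```python
# Hyperparameters transcribed from Table 4 (scikit-learn naming is an assumption).
ANN_PARAMS = {
    "hidden_layer_sizes": (30,),  # assumed: one hidden layer with 30 units
    "random_state": 0,
    "max_iter": 2000,
    "activation": "relu",
    "solver": "adam",
    "alpha": 1e-4,
    "learning_rate_init": 1e-3,
    "power_t": 0.5,
    "tol": 1e-4,
    "validation_fraction": 0.1,
    "beta_1": 0.9,
    "beta_2": 0.999,
    "epsilon": 1e-8,
    "n_iter_no_change": 10,
}

SVR_PARAMS = {
    "kernel": "rbf",
    "C": 100,
    "gamma": 0.1,
    "epsilon": 0.1,
    "degree": 3,
    "coef0": 0.0,
    "tol": 1e-3,
    "max_iter": -1,
}

RF_PARAMS = {
    "n_estimators": 100,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "random_state": 0,
    "min_impurity_decrease": 0.0,
    "min_weight_fraction_leaf": 0.0,
    "ccp_alpha": 0.0,
    "max_features": 1.0,  # stands in for the table's "auto" (all features)
    "max_samples": None,
}


def build_models():
    """Instantiate the three regressors; requires scikit-learn to be installed."""
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neural_network import MLPRegressor
    from sklearn.svm import SVR

    return {
        "ANN": MLPRegressor(**ANN_PARAMS),
        "SVR": SVR(**SVR_PARAMS),
        "RF": RandomForestRegressor(**RF_PARAMS),
    }
```

Calling `build_models()` returns unfitted estimators, each of which can then be trained with `fit(X_train, y_train)` on the feature subset under study.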

Share and Cite

MDPI and ACS Style

Mo, Y.; Li, B.; Yan, C.; Hu, X. Influence of Data Structure on Prediction Error in Machine Learning-Based Concrete Compressive Strength Models. Buildings 2026, 16, 1537. https://doi.org/10.3390/buildings16081537
