Article

Evolutionary Polynomial Regression Algorithm with Uncertain Variables: Two Case-Studies in the Field of Civil Engineering

by Alessandra Fiore, Sebastiano Marasco and Rita Greco

1 Department of Architecture, Construction and Design (ArCoD), Politecnico di Bari, 70125 Bari, Italy
2 Department of Structural, Geotechnical and Building Engineering, Politecnico di Torino, 10129 Torino, Italy
3 DICATECh, Politecnico di Bari, 70125 Bari, Italy
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8432; https://doi.org/10.3390/app15158432
Submission received: 24 June 2025 / Revised: 26 July 2025 / Accepted: 27 July 2025 / Published: 29 July 2025

Abstract

Data-driven approaches and calibration techniques for mathematical models, starting from observed data, are attracting more and more interest in the field of civil engineering. Among them, evolutionary polynomial regression (EPR) is an artificial intelligence (AI) technique that combines genetic algorithms (GAs) and regression strategies. However, the difficulties and uncertainties inherent in the method have shown that proper computational methods, together with recent and qualified databases of experimental data, are essential to derive reliable formulations. In this framework, this paper explores a new robust EPR approach able to remove potential outliers and leverage points often occurring in biased datasets and simultaneously to account for the effects of probabilistic uncertainties. Uncertainties are incorporated in the EPR methodology by adopting the direct perturbation method. In particular, the importance of setting the parameters representative of experimental and analytical dispersions on the basis of the homogeneity of the database is shown. With this purpose, two different case-studies are analyzed, dealing with the shear capacity of RC beams without stirrups and the compressive strength of cement-based mortar specimens, respectively. Finally, the best capacity equations are selected and discussed.

1. Introduction

Data-driven methodologies are attracting great interest in all engineering fields for modeling natural phenomena and predicting physical quantities. A growing number of researchers have recently advanced the use of computational methods with the aim of managing and processing large amounts of data [1]. Recent studies have proved that data-driven methods can be particularly beneficial in the development of capacity equations, allowing us to preserve the physical meaning of the resisting mechanism and to extract useful information about the examined phenomenon at the same time.
Over the past few decades, artificial intelligence (AI) methods have been utilized to extract models from observed datasets, particularly in areas such as real-time analysis, large-scale data interpretation, and optimization tasks. Among the various numerical approaches, Artificial Neural Networks (ANNs) and Evolutionary Algorithms (EAs) stand out as particularly effective for data analytics applications.
ANNs work similarly to the human brain’s neural network and are formed by hidden units interconnected by weights that can be updated according to quality parameters, which evaluate how close the obtained response is to the target one [2]. The networks are organized in layers; the first layer takes the input and the last layer produces the outputs, while the middle layers have no connection with the external world and for this reason are called hidden layers. Within this structure, ANNs typically take the form of feed-forward networks trained by backpropagation.
Although ANNs are accurate predictors, the technique suffers from several drawbacks, including its black-box nature, that is, its inability to return explicit formulae or equations. On the contrary, in order to find the best mathematical solution within a certain search space, EAs exploit an analogy with Darwin’s theory, according to which the driving force behind natural evolution is the capability of a population of individuals to reproduce and deliver new populations of individuals that are better fitted to their environment [3]. One of the important features of evolutionary computation is that a large population of solutions is considered at once and only the better solutions are allowed to “have children”, while the worse ones are quickly eliminated. In this way the search is performed by evolving solutions, based on recombination and/or mutation, characterized by higher quality [4,5]. After a number of generations, the best selected models are substantially better than their initial long-dead ancestors.
Among the various AI methodologies, this paper explores the capabilities of the Evolutionary Polynomial Regression (EPR) modeling strategy. EPR is a data-driven hybrid approach that merges the advantages of Genetic Algorithms (GAs) with traditional numerical regression techniques [6], aiming at simple and interpretable mathematical expressions. In this method, GAs are employed to identify the structure and exponents of the symbolic formulas, while Ordinary Least Squares (OLS) is used to estimate the model parameters. The EPR approach generates a set of symbolic expressions that best fit the training data and remain accessible and understandable to practitioners [7]. Each resulting model can be further validated by assessing its consistency with the underlying physical mechanics, thus generalizing its practical application beyond the original dataset.
The EPR technique has been successfully applied in the field of civil engineering [7,8,9,10,11,12,13]. Current EPR techniques include both conventional and robust regression approaches to estimate the model parameters. The first approach involves minimizing the Root-Mean-Square Error (RMSE) between observed and predicted data by adopting a linear function, or alternatively, maximizing the likelihood function [14]. However, the OLS method yields consistent results only when certain assumptions are met—specifically, when the regressors are exogenous and the errors are normally distributed and homoscedastic. In practice, these assumptions are frequently violated when experimental datasets include outliers and/or leverage points. As a consequence, a robust regression can be necessary to limit the effects of the anomalous data points in the evaluation of model parameters [15]. Several efforts have been made to overcome the inconsistencies of anomalous observations and, with this aim, several robust approaches have been implemented to obtain the model parameters [16,17,18,19,20].
Both conventional and robust EPR methods rely on a frequentist estimation framework, which treats model parameters as fixed but unknown quantities. This approach is particularly appropriate when the disturbances are normally distributed and homoscedastic. However, a key limitation of the frequentist approach is that it is not able to quantify the uncertainty around the estimates. To overcome these limitations, a perturbation-based approach could be adopted in order to include uncertainty in the analysis.
The issue of the uncertainties inherent in the prediction of the response of structures has been widely discussed in the literature [21,22,23,24,25]. Within this framework, the direct perturbation method is an effective approach to calculate the effect of parameter uncertainty, by approximating the response as a polynomial in the uncertain parameters [26]. The polynomial form is generally a Taylor series about the mean value of the uncertain parameter. One major advantage of this approach is the possibility of estimating the statistical moments of the capacity model without choosing a specific probability distribution for the uncertain parameters. On the other hand, a higher-order perturbation approach could be necessary to improve the estimation of the statistical properties when the relationship between the output and the uncertain variables is highly nonlinear. Thus, as a further attempt to achieve robust and accurate models, this paper implements a novel EPR technique including uncertainty. The optimal polynomial structure is selected by using GAs. The model parameters are assumed to be Random Variables (RVs), with a normal probability distribution, that are independent of each other. The proposed approach provides an advancement over traditional EPR techniques by introducing a general system identification algorithm capable of accurately identifying the optimal mathematical model.
The present research explores, for the first time, the benefits of using the direct perturbation method within an EPR framework. This paper is organized as follows. A technical overview of conventional EPR methodology is firstly furnished. Then, the proposed approach, based on the direct perturbation method, is described in detail. Finally, the robust EPR procedure is applied to predict (i) the shear capacity of RC beams with rectangular cross section without stirrups, starting from an experimental database [13]; (ii) the compressive strength (CS) of cement-based mortar specimens, by adopting an extensive database recently provided in the literature [27].

2. Overview of the EPR Technique

In this paper, an advanced data modeling strategy, based on the evolutionary polynomial regression (EPR) approach, has been applied. EPR is carried out in two primary stages: an evolutionary process and a conventional numerical regression step. The first stage employs a multi-objective Genetic Algorithm (GA) to explore and identify the polynomial structure of the expressions, while the second stage utilizes the Ordinary Least Squares (OLS) method to estimate the optimal values of the model parameters [7,12]. The general model structure that EPR can manage is given by:
Y = a_0 + \sum_{j=1}^{m} a_j \, X_1^{ES(j,1)} \cdots X_k^{ES(j,k)} \, f\!\left(X_1^{ES(j,k+1)} \cdots X_k^{ES(j,2k)}\right)    (1)
where m is the number of additive terms; a1, …, am are numerical parameters to be estimated; X1, …, Xk represent the candidate explanatory variables; and ES(j, z) (with z = 1, …, k) is the exponent of the z-th input within the j-th term. The exponents ES(j, z) are selected from a user-defined set of candidate values, as is the inner function f(·), which is chosen among some possible alternatives (including the option of no function). The product of the candidate variables and the selected function gives the transformed variable:
Z_j = X_1^{ES(j,1)} \cdots X_k^{ES(j,k)} \, f\!\left(X_1^{ES(j,k+1)} \cdots X_k^{ES(j,2k)}\right)    (2)
The method starts from a number n of available observations representing the dataset size, by performing a random initialization of the population of explanatory variable exponents. Next, the model parameters are estimated using OLS by solving a linear inverse problem, which involves minimizing the squared errors between the observed and predicted data. This process guarantees a unique correspondence between each model structure and its associated parameters. Furthermore, EPR provides a nonlinear mapping of the data with a limited number of coefficients, thus simplifying the identification of the optimal pseudo-polynomial model. Thus, common over-fitting issues can be prevented [28]. Naturally, model accuracy tends to improve as complexity increases, which is reflected in the number of pseudo-polynomial terms and/or the number of input variables included in the model. Therefore, identifying the optimal model structure involves an optimization process aimed at maximizing or minimizing a predefined Objective Function (OF). The most accurate models are generated by the application of genetic operators: crossover and mutation. Crossover works by randomly recombining the exponent vectors of the best-performing solutions, while mutation introduces randomness by altering one or more exponent values. These operators iteratively produce new populations of exponent vectors, addressing the search for improved model structures. The search is completed when the algorithm returns the required value of the OF.
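As an illustration of this two-stage scheme, the following minimal sketch (not the authors' code; names and values are hypothetical) builds the transformed variables of Equation (2) for a given exponent matrix ES and estimates the parameters a0, …, am by OLS; the inner function f and the GA search over ES are omitted.

```python
# Minimal sketch of one EPR evaluation step (not the authors' code): given a
# candidate exponent matrix ES, build the transformed variables of Equation (2)
# and estimate a_0, ..., a_m by ordinary least squares. The inner function f
# and the GA search over ES are omitted; all names and values are hypothetical.
import numpy as np

def epr_fit(X, Y, ES):
    """X: (n, k) candidate inputs; Y: (n,) observations; ES: (m, k) exponents.
    Returns the parameter vector [a_0, ..., a_m] and the fitted values."""
    n, k = X.shape
    m = ES.shape[0]
    # Transformed variables: Z_j = X_1^ES(j,1) * ... * X_k^ES(j,k)
    Z = np.column_stack([np.prod(X ** ES[j, :], axis=1) for j in range(m)])
    A = np.column_stack([np.ones(n), Z])       # prepend the bias term a_0
    a, *_ = np.linalg.lstsq(A, Y, rcond=None)  # linear inverse problem (OLS)
    return a, A @ a

# Example with two inputs and two additive terms
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(100, 2))
Y = 3.0 * X[:, 0] ** 1.5 * X[:, 1] ** -0.5 + 1.0
a, Y_fit = epr_fit(X, Y, ES=np.array([[1.5, -0.5], [0.5, 0.0]]))
```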
A flowchart summarizing the EPR procedure is depicted in Figure 1.
The database consists of n observations described through k candidate input variables. The symbolic expression is composed of m parameters, defined by setting the error distribution. The distribution of the errors between the estimated and observed values follows a given Probability Density Function (pdf) with mean μ and standard deviation σ. The errors are usually assumed to be normally distributed with a null mean and a given variance σ². In order to limit the negative effects of extreme experimental values, the EPR model parameters are evaluated by adopting a robust regression approach. It consists of assigning a weight ψi to each observation, set as inversely proportional to the residual εi obtained from the estimates. Robust regression therefore requires a prior definition of an influence function ψ related to the residuals. In parallel, the GA is used to select the exponents to attribute to the explanatory variables appearing in the symbolic models. Among the whole set of solutions, the first S models characterized by the highest values of the OF are selected to pass to the next generation of models. Crossover acts by recombining the exponents assigned to two selected models, so creating a new set of exponents, while mutation consists of modifying one or more exponents of the newly generated model.
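For concreteness, a sketch of the two operators acting on exponent vectors is given below; it is an illustrative implementation, not the authors' code, and the candidate set and mutation rate are assumptions (the range [−2, 2] with step 0.01 follows Section 5.3, the 20% rate follows Table 1).

```python
# Illustrative sketch (not the authors' implementation) of the two genetic
# operators acting on exponent vectors. The candidate set and the mutation
# rate are assumptions: the range [-2, 2] with step 0.01 follows Section 5.3,
# the 20% rate follows Table 1.
import numpy as np

rng = np.random.default_rng(1)
CANDIDATES = np.round(np.linspace(-2.0, 2.0, 401), 2)

def crossover(parent_a, parent_b):
    """Randomly recombine the exponents of two well-performing models."""
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate(exponents, rate=0.20):
    """Replace each exponent, with probability `rate`, by a random candidate value."""
    out = exponents.copy()
    for idx in np.flatnonzero(rng.random(out.shape) < rate):
        out[idx] = rng.choice(CANDIDATES)
    return out

child = mutate(crossover(np.array([1.5, -0.5, 0.0]), np.array([0.7, 1.2, -2.0])))
```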

3. Direct Perturbation Method

The direct perturbation method is based on the main idea that the response can be expressed through a Taylor-series expansion in terms of a set of zero-mean random variables [26]. Let μ and σ² be the mean and variance operators, respectively. If all the uncertain variables Xi (i = 1, …, k) of the model are normally distributed and independent of each other, then the first-order approximation gives:
Y_{lin} = Y(\bar{X}) + \sum_{i=1}^{k} \left.\frac{\partial Y}{\partial X_i}\right|_{\bar{X}} \left(X_i - \mu_{X_i}\right)    (3)
where X = (X1, X2, …, Xk) is the set of the uncertain variables, and \bar{X} is the design point (with the restriction \bar{X}_i = \mu_{X_i}). It is thus possible to rewrite Equation (3) as a function of the so-called sensitivity factors β_i as follows:
Y_{lin} = Y(\bar{X}) + \sum_{i=1}^{k} \beta_i \left(X_i - \mu_{X_i}\right)    (4)
with \beta_i = \left.\partial Y / \partial X_i\right|_{\bar{X}}. Thus, the mean value and the variance of Y_{lin} become:
\mu_{Y_{lin}} = Y(\bar{X})    (5)
\sigma_{Y_{lin}}^2 = \sum_{i=1}^{k} \sum_{l=1}^{k} \beta_i \beta_l \, \mathrm{Cov}\!\left(X_i, X_l\right)    (6)
Under the further hypothesis that the elements of vector X are fully uncorrelated, Equation (6) simplifies to:
\sigma_{Y_{lin}}^2 = \sum_{i=1}^{k} \beta_i^2 \, \sigma_{X_i}^2    (7)
This parameter reflects the analytical uncertainty of the model. Furthermore, one major advantage of this approach is its ability to estimate the first- and second-order statistical moments of the response variable without requiring the assumption of a specific probability distribution for the uncertain parameters. However, a limitation of the perturbation method is that its linear approximation may fail to accurately capture the nonlinear relationships between the model and the input parameters of vector X [29], a drawback that becomes especially significant in the presence of large uncertainties.
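A minimal numerical sketch of Equations (5) and (7) is given below, assuming uncorrelated variables and sensitivities obtained by central finite differences at the mean point; the power-law model g is only a placeholder, not one of the EPR expressions.

```python
# Minimal numerical sketch of Equations (5) and (7), assuming uncorrelated
# uncertain variables; the sensitivities beta_i are obtained by central finite
# differences at the mean point. The power-law model g is only a placeholder.
import numpy as np

def first_order_moments(g, mu, sigma, h=1e-5):
    """g: callable model Y = g(x); mu, sigma: means and standard deviations."""
    mu = np.asarray(mu, dtype=float)
    beta = np.empty_like(mu)
    for i in range(mu.size):
        dx = np.zeros_like(mu)
        dx[i] = h
        beta[i] = (g(mu + dx) - g(mu - dx)) / (2.0 * h)   # beta_i = dY/dX_i at X = mu
    mean_Y = g(mu)                                         # Equation (5)
    var_Y = np.sum(beta ** 2 * np.asarray(sigma) ** 2)     # Equation (7)
    return mean_Y, var_Y

g = lambda x: 0.5 * x[0] ** 0.7 * x[1] ** -0.3             # placeholder model
mean_Y, var_Y = first_order_moments(g, mu=[300.0, 3.0], sigma=[30.0, 0.3])
```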

4. Combination of Objective Functions

Many optimization problems make use of Single-Objective (SO) functions, which are well suited to the global optimal search. Usually, these functions are associated with the predicted variance/errors, such as the Root Mean Square Error (RMSE) and the coefficient of determination (R2). Most engineering practice reveals that parameter optimization is intrinsically high-dimensional, nonlinear and combinatorial [30]. Furthermore, each SO function refers to specific behavioral errors in the model. Therefore, the optimal model search needs to be performed by combining different objective functions, each of which places emphasis on a particular feature of the error. The combination of objective functions improves the local search around the best models and provides higher accuracy.
In this paper, the analytical and experimental natures of the model’s uncertainties are analyzed. The analytical uncertainty is expressed in Equation (7) by the term \sigma_{Y_{lin}}^2. It reflects the variability in the predicted model due to the first-order derivatives of the observed quantity with respect to each explanatory variable. The experimental uncertainty refers to the variance of the model’s error, calculated as shown in Equation (8):
\sigma^2 = \frac{\sum_{i=1}^{n} \left(\hat{Y}_i - Y_i\right)^2}{n - 1}    (8)
where Yi refers to the i-th observed value and \hat{Y}_i is the i-th fitted value obtained through the EPR procedure. The term σ2 is evaluated based on the observed data points; therefore, its experimental nature is clear. The experimental variance is adopted in the definition of common objective functions, such as the R2 or RMSE.
A combination of the two described error-based functions is proposed as a novel OF suitable for the EPR-based technique. The basic idea lies in modifying the common mathematical formulation of R2 by replacing the model’s variance with a combination of experimental and analytical variances, according to Equation (9):
R_N^2 = \left[1 - \frac{(n-1)\,\sigma^2}{\sum_{i=1}^{n}\left(Y_i - \sum_{i=1}^{n} Y_i / n\right)^2}\right]\left(1 - f\!\left(\sigma_{Y_{lin}}\right)\right)    (9)
where f(σYlin) is a function of the analytical standard deviation, while the denominator represents the mean square error of the data points. The combination is expressed as:
f\!\left(\sigma_{Y_{lin}}\right) = \delta \left(\frac{\sigma_{Y_{lin}}}{\mu}\right)^{\beta}    (10)
\delta = \{0; 1\}    (11)
The function f(σYlin) is defined as a power of the ratio between the analytical standard deviation (σYlin) and the experimental mean value (μ) of the observed variable. This ratio represents the coefficient of variation of the analytical model, while β is the exponent in Equation (10) and δ is a dummy variable assuming the values 0 or 1 (Equation (11)). The influence of the analytical dispersion on the optimal search procedure is governed by β in the case that δ = 1. Moreover, when the dummy variable is equal to zero, Equation (9) coincides with the standard formulation of the R2. The model that shows the best trade-off between the newly proposed formulation of accuracy (RN2) and complexity (number of model parameters) is selected as the optimal one.
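The following sketch assembles the objective function from Equations (8)–(11) as reconstructed above (an illustrative implementation, not the authors' code); note that it reduces to the standard R2 when δ = 0.

```python
# Sketch of the proposed objective function, following Equations (8)-(11) as
# reconstructed above (illustrative, not the authors' code); sigma_lin is the
# analytical standard deviation obtained from Equation (7).
import numpy as np

def r_n_squared(Y_obs, Y_fit, sigma_lin, beta, delta=1):
    Y_obs, Y_fit = np.asarray(Y_obs, float), np.asarray(Y_fit, float)
    n = Y_obs.size
    sigma2 = np.sum((Y_fit - Y_obs) ** 2) / (n - 1)            # Equation (8)
    r2 = 1.0 - (n - 1) * sigma2 / np.sum((Y_obs - Y_obs.mean()) ** 2)
    f = delta * (sigma_lin / Y_obs.mean()) ** beta             # Equations (10)-(11)
    return r2 * (1.0 - f)                                      # Equation (9); equals R2 when delta = 0
```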

5. Case Studies

The selection of homogeneous and reliable databases is essential for training the proposed EPR approach and accounting for uncertainties. With this aim, two different case studies have been considered herein. The first refers to a homogeneous dataset of the shear capacity of reinforced concrete (RC) beams without stirrups [13]. The second case study deals with a non-homogeneous dataset of compressive strength [27,31]. The two case studies are introduced in detail in the next subsections.

5.1. Problem Description 1

The first experimental dataset refers to the shear capacity of RC beams with a rectangular cross section without stirrups and was recently adopted by the authors to develop the first data-driven model for the shear strength based on a new robust hybrid methodology [13]. It includes a total of 374 samples. The candidate explanatory variables within the considered database are: the effective depth d, the web width bw, the geometrical ratio of the tensile longitudinal reinforcement (in percent) ρl, the shear span-to-depth ratio a/d, the concrete compressive strength fc, and the yield stress of the longitudinal reinforcement fyl. All forces and lengths are expressed in N and mm, respectively, whereas the experimental values of shear strength are in kN. The compressive strength of concrete fc ranges from 12.00 to 105.4 MPa, the yield stress of the longitudinal reinforcement fyl ranges from 283 to 1779 MPa, the longitudinal steel percentage ρl ranges from 0.139 to 6.635%, the shear span-to-depth ratio a/d ranges from 2.4 to 8.1, the effective depth d ranges from 110 to 2000 mm, and the web width bw ranges from 90 to 1000 mm.
The frequency of the variables occurring within the experimental database is shown in Figure 2. The number of input data corresponding to each specified range of variation shows a fairly homogeneous distribution, and all input values are greater than zero. Nevertheless, the histogram of the shear strength does not reveal a perfect Gaussian distribution, since the dispersion of the data is not symmetric with respect to the median value [32,33]. Thus, the observations furthest from the median value may be considered as leverage points and could affect the accuracy of the regression analyses. As a consequence, a robust regression approach could be appropriate for estimating the parameters of the mathematical models [13,14,15,16].

5.2. Problem Description 2

The considered database is the one recently described in [27]. It includes 424 experimental results about the compressive strength (CS) of cement-based mortar specimens and was obtained by collecting 20 published research works on the behavior of cement-based mortars, with and without metakaolin, as well as superplasticizers.
The candidate explanatory variables are the following: the age of specimen (AS), the cement grade (CG), the metakaolin percentage in relation to total binder (MK/B), the water-to-binder ratio (W/B), the superplasticizer (SP), which is the percentage of the addition of superplasticizer in relation to the total binder (%w/w), and the binder-to-sand ratio (B/S). The AS ranges from 0.67 to 91 days, the CG ranges from 32 to 53.5 MPa, the MK/B ranges from 0 to 30%w/w, the W/B ranges from 0.3 to 0.6 w/w, the SP ranges from 0 to 2.35%w/w, the B/S ranges from 0.33 to 0.51 w/w, and the CS of specimens ranges from 4.1 to 115.25 MPa (Figure 3).
Four of the candidate input variables assume values that are always greater than zero, whereas the parameters MK/B and SP can assume null values. These two parameters are intended as sensitivity variables aimed at capturing the effect of their use on the compressive strength. Therefore, the MK/B and SP parameters cannot be used as explanatory variables, since they lack a regular dependency on the compressive strength. Based on these observations, the EPR technique has been performed by considering AS, CG, W/B, and B/S as explanatory variables. This choice appears reasonable, also because in [27] it is demonstrated that MK/B and SP are the parameters with the lowest influence on the CS.
The amount of input data associated with each specified range of variation reveals a high heterogeneity in the experimental input distributions. Furthermore, the histogram of the cement-based mortar compressive strength does not reflect a perfect Gaussian distribution. The majority of the experimental values range between 10 and 70 MPa, while some observations are far from the median value; therefore, they may be considered as leverage points. These extreme values can affect the regression analysis, causing false estimates.
A deeper insight into the problem is given in Figure 4, where the relationship between the compressive strength values and each explanatory variable is depicted. Figure 4 shows how the variability of each explanatory variable is limited. In the case of the parameters CG and B/S, only five different values are observed. Instead, a wider range of variability is observed for the parameters AS and W/B. The limited range of the explanatory variables may result in a significant dispersion of the observed variable [28]. In addition to that, the presence of leverage points in the dataset may lead to vertical extreme values which affect the accuracy of the regression estimates. To avoid these problems, a robust regression approach is needed to estimate the parameters of the regression model.

5.3. Robust EPR Settings

Both selected databases have been randomly divided into training (70%) and testing (30%) sets, so that each subset is representative of the complete database. The training set was used only to identify the capacity equations, while the testing set was used to verify whether the models fitted on the training data also maintained adequate accuracy.
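A generic sketch of such a split is shown below (illustrative only; the placeholder array and the seed are assumptions, not the authors' setup).

```python
# Generic sketch of the random 70/30 split described above (illustrative only;
# the placeholder array and the seed are assumptions, not the authors' setup).
import numpy as np

rng = np.random.default_rng(42)
data = rng.random((374, 7))                  # placeholder: n = 374 rows, as in database 1
idx = rng.permutation(data.shape[0])
n_train = int(0.7 * data.shape[0])
train, test = data[idx[:n_train]], data[idx[n_train:]]
```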
The entire EPR computational flow is based on the genetic operators listed in Table 1.
A robust regression model has been employed to estimate the model parameters [13]. Based on the observations given in the previous section, the Cauchy weight function has been adopted in the robust regression, since it performs better than the other available algorithms in the case of vertical extreme observations [34]. The robust EPR technique has been iteratively performed by selecting different models comprising a number of parameters ranging from one to four. This allows the experimental and analytical variance of each model to be explored as its complexity increases. The exponents have been allowed to vary within the range [−2, 2] with a step of 0.01. A maximum number of 1000 iterations has been fixed, while a termination condition has been set to avoid inefficient loops and limit the computational effort. Based on that, the iterative procedure ends when the accuracy does not vary appreciably (±5‰) for at least one-third of the maximum number of iterations.
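As an illustration of the robust step, the sketch below performs iteratively reweighted least squares with the Cauchy weight function on a generic EPR design matrix; it is not the authors' implementation, and the tuning constant c = 2.385 (the value commonly associated with Cauchy weights) and the MAD scale estimate are assumptions.

```python
# Sketch of the robust estimation step: iteratively reweighted least squares
# with the Cauchy weight function on a generic EPR design matrix A (bias column
# plus the transformed variables Z_j). Not the authors' implementation; the
# tuning constant c = 2.385 and the MAD scale estimate are assumptions.
import numpy as np

def cauchy_irls(A, Y, n_iter=50, c=2.385):
    a, *_ = np.linalg.lstsq(A, Y, rcond=None)                     # ordinary LS start
    for _ in range(n_iter):
        r = Y - A @ a
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12  # robust scale (MAD)
        w = 1.0 / (1.0 + (r / (c * s)) ** 2)                      # Cauchy weights
        sw = np.sqrt(w)
        a, *_ = np.linalg.lstsq(A * sw[:, None], Y * sw, rcond=None)  # weighted LS
    return a
```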

6. Results

The proposed procedure based on the EPR technique, including both analytical and experimental dispersions, is herein applied to the databases introduced in the previous section. The best models are selected on the basis of their accuracy, as defined by Equation (9). The procedure was repeated several times, setting the parameter δ = 1 and varying the parameter β as indicated in Table 2. The case with δ = 0 was also analyzed, corresponding to the procedure that analyzes only the experimental dispersion in the evaluation of accuracy. After fixing the values of δ and β, the EPR technique was implemented considering models with increasing complexity (from m = 1 to m = 4).

6.1. Problem 1

The results in terms of experimental dispersion versus analytical dispersion and of iteration versus accuracy are reported in Figure 5 and Figure 6 for the first case study.
It can be observed that, for each level of complexity, the absolute value of accuracy grows while the value of β increases. For example, in the case of β = 0.2 the maximum accuracy achieved is 0.30, while for β = 50 an accuracy equal to 0.95 is reached.
This result can be explained by an analysis of Equation (10), where the parameter β represents the exponent of the ratio between the analytical dispersion and the experimental mean value. Since this ratio is generally less than one in the case of fairly homogeneous datasets, for increasing values of the exponent β, the value of the function f(σYlin) will tend towards zero. In other words, the analytical variability will tend to become negligible; so for β tending towards infinity, the accuracy RN2 will assume the same value as the classical accuracy defined as a function of just the experimental dispersion.
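For instance, assuming an illustrative ratio σYlin/μ = 0.3, Equation (10) gives f ≈ 0.3^0.2 ≈ 0.79 for β = 0.2 but f ≈ 0.3^50 ≈ 10^−26 for β = 50, so the analytical contribution practically vanishes for large β.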
The above findings are confirmed by the results obtained and represented in Figure 5 and Figure 6. For small values of β, the analytical variance assumes values smaller than those corresponding to large values of β. At the same time, a completely opposite trend is observed for the experimental variance, that is, it diminishes as β grows.
So it is important to underline how the choice of the parameter β can affect the model performance and that, even if the value of RN2 increases moving from low to high values of β, a suitable trade-off between analytical and experimental dispersions already occurs for values of β quite close to 1.
From a deeper analysis of the results, it emerges that, for values of β ≤ 1, the models with reduced complexity present a higher analytical dispersion than the ones with increasing complexity. Conversely, as the model complexity increases, the experimental dispersion also grows. This effect is more marked for the model with m = 1, whose analytical dispersion is significantly higher than that of the models with more terms (m > 1). In the cases with β > 1, there is a better uniformity between the analytical and experimental dispersions, for different levels of complexity.
In the following models, the unit for Vopt is [kN] while all the mechanical and geometrical quantities are expressed in [N] and [mm], respectively.
By considering only the experimental dispersion (δ = 0), it can be observed that there is no substantial difference in terms of dispersion. In such cases, therefore, it is convenient to use the one-parameter model (m = 1), with this being the least complex and the most accurate:
V_{opt} = 0.00107 \cdot d^{0.7} \cdot b_w^{1.2} \cdot \rho_l^{0.6} \cdot \left(a/d\right)^{-0.7} \cdot f_c^{0.2} \cdot f_{yl}^{0.1} \qquad R_N^2 = 0.943    (12)
Conversely, in all the other cases, the best model has to be selected taking into account not only the complexity but also the right trade-off between experimental and analytical dispersions, embedded into the accuracy parameter RN2.
The best accuracy together with acceptable experimental and analytical dispersions are achieved using the following model, corresponding to the parameters δ = 1, β = 50, m = 1:
V_{opt} = 0.003075 \cdot d^{0.7} \cdot b_w^{0.9} \cdot \rho_l^{0.3} \cdot \left(a/d\right)^{-0.1} \cdot f_c^{0.2} \cdot f_{yl}^{0.1} \qquad R_N^2 = 0.947    (13)
By accepting a lower accuracy, the following expressions, characterized by lower dispersions, are also proposed, corresponding to the parameters δ = 1, β = 7, m = 1 and m = 2, respectively:
V_{opt} = 0.023331 \cdot d^{0.4} \cdot b_w^{0.8} \cdot \rho_l^{0.2} \cdot \left(a/d\right)^{-0.5} \cdot f_c^{0.1} \cdot f_{yl}^{0.3} \qquad R_N^2 = 0.849    (14)
V_{opt} = 0.1903 \cdot d^{0.6} \cdot b_w^{0.8} \cdot \rho_l^{0.1} \cdot \left(a/d\right)^{-0.2} \cdot f_c^{0.3} \cdot f_{yl}^{0.1} + 6.42 \cdot 10^{-5} \cdot d^{1.1} \cdot b_w^{0.3} \cdot \rho_l \cdot \left(a/d\right)^{0.3} \cdot f_c^{1.9} \cdot f_{yl}^{0.6} \qquad R_N^2 = 0.899    (15)
With reference to the above models, the reported accuracies concern the whole database and are practically coincident with those obtained for the training and testing sets; Figure 6b illustrates the relationship between the experimental and predicted shear capacities.
The obtained accuracies are comparable with the ones achieved in [13], but additionally, the proposed robust formulations allow for uncertainties to be accounted for.
It is worth noting that Equations (12)–(15) allow a physical insight into the studied phenomenon. Consistently with the recognized shear transfer mechanisms and with the main building code formulations [35,36,37,38,39,40,41,42], the proposed shear resistance models are directly proportional to the material strengths and the amount of longitudinal reinforcement, while they are inversely proportional to the shear span-to-depth ratio.
Many experimental investigations have in fact demonstrated that the shear strength of RC beams without stirrups tends to decrease with an increase in the effective depth d, a phenomenon commonly referred to as the size effect. Some researchers have attributed this behavior to the observation that deeper members tend to develop wider cracks, which in turn significantly reduce the efficiency of shear transfer along the cracked interface.
The size effect has also been interpreted through the principles of fracture mechanics, particularly by considering the role of aggregate interlocking in resisting shear forces. Furthermore, it is well established that a portion of the shear force can be transmitted directly to the support via an inclined compression strut—commonly known as arch action. This mechanism becomes relevant when considering deep beams, and is therefore influenced by the shear span-to-depth ratio (a/d).
Thus, the inverse proportionality of the shear resistance from the shear span-to-depth ratio is representative of both the aggregate interlock and the arch effect. In addition, it has been observed that dowel action is weakened with a low reinforcement ratio. Therefore, the longitudinal tensile reinforcement ratio ρl takes into account the dowel action.
By observing the EPR models, it can be noted that Equations (12)–(14) clearly reflect the above mechanisms, while Equation (15) is more complex. So, in order to demonstrate this physical coherence, Figure 7 reports some graphs depicting Vopt versus fc and a/d. They are obtained by keeping all the explanatory variables constant (equal to their mean values) except for the analyzed input parameter, and they clearly show the consistency of the proposed EPR model. As for the dependence of Vopt on fyl in Equation (15), by repeating the same procedure, it was found that Vopt decreases slowly with increasing fyl until it becomes practically constant, so this variable does not make any significant contribution to the model itself.
Finally, Equations (12)–(14) are characterized by a low complexity, making them particularly useful for practical purposes.
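As a usage illustration, the sketch below evaluates Equation (12) as transcribed above (the negative sign of the a/d exponent follows the inverse proportionality discussed in the text) for a hypothetical beam; the input values are illustrative only.

```python
# Usage sketch of Equation (12) as transcribed above (the negative a/d exponent
# follows the inverse proportionality discussed in the text); the beam data
# below are hypothetical and serve only to illustrate the evaluation.
def v_opt_eq12(d, bw, rho_l, a_over_d, fc, fyl):
    """d, bw in mm; rho_l in %; fc, fyl in MPa; returns V_opt in kN."""
    return (0.00107 * d ** 0.7 * bw ** 1.2 * rho_l ** 0.6
            * a_over_d ** -0.7 * fc ** 0.2 * fyl ** 0.1)

print(v_opt_eq12(d=300, bw=200, rho_l=1.5, a_over_d=3.0, fc=30, fyl=450))  # ≈ 72 kN
```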

6.2. Problem 2

The results achieved for the second case study are summarized in Figure 8 and Figure 9.
Differently from the previous case, which was characterized by a homogeneous database, herein there is neither a progressive decrease in the experimental dispersion nor a significant increase in the analytical dispersion as β rises. This is due to the fact that the value of the function f(σYlin) does not tend towards zero as the parameter β increases, since the inhomogeneity of the database implies that the ratio σYlin/μ is greater than 1. So, for this dataset, the best trade-off between experimental and analytical dispersions is observed for values of β close to 1. Furthermore, the maximum accuracy, equal to about 0.8, is already achieved for values of β close to 1. Therefore, for inhomogeneous datasets, it is convenient to select the model with a value of β close to 1 and a complexity as low as possible.
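For instance, assuming an illustrative ratio σYlin/μ = 1.5, Equation (10) gives f = 1.5 for β = 1 but f ≈ 1.5^7 ≈ 17 for β = 7, so the analytical term grows rather than vanishing as β increases.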
In the models constructed, the unit used for the CSopt and the CG is [MPa] while all the other input parameters are dimensionless. The highest accuracy together with acceptable experimental and analytical dispersions are achieved by the following model, corresponding to the parameters δ = 1, β = 50, m = 4:
CS_{opt} = 0.129 \cdot AS^{0.1} \cdot CG^{1.5} \cdot (W/B)^{0.7} \cdot (B/S)^{0.6} + 4.9187 \cdot AS^{0.8} \cdot CG^{1} \cdot (W/B)^{1.3} \cdot (B/S)^{1.6} - 0.9903 \cdot AS^{1.2} \cdot CG^{0.8} \cdot (W/B)^{0.8} \cdot (B/S)^{0.7} + 22.8565 \cdot AS^{0.1} \cdot CG^{0.3} \cdot (W/B)^{1.5} \cdot (B/S)^{0.7} \qquad R_N^2 = 0.823    (16)
Nevertheless, the best compromise between accuracy, complexity and analytical/experimental dispersions is obtained by the following models, characterized by the parameters δ = 1, β = 7, m = 1, 2, respectively:
CS_{opt} = 0.33786 \cdot AS^{0.2} \cdot CG^{0.9} \cdot (W/B)^{-1.4} \cdot (B/S)^{0.2} \qquad R_N^2 = 0.704    (17)
CS_{opt} = 3.0173 \cdot AS^{1.4} \cdot CG^{0.6} \cdot (W/B)^{0.3} \cdot (B/S)^{0.7} + 0.35213 \cdot AS^{0.1} \cdot CG^{1.2} \cdot (W/B)^{0.8} \cdot (B/S)^{0.5} \qquad R_N^2 = 0.814    (18)
Conversely, by choosing δ = 0, the best solution is given by Equation (19), with m = 2:
CS_{opt} = 0.0011 \cdot AS^{0.7} \cdot CG^{1.7} \cdot (W/B)^{1.2} \cdot (B/S)^{0.9} + 0.18981 \cdot AS^{0.2} \cdot CG^{1.2} \cdot (W/B)^{0.9} \cdot (B/S)^{0.3} \qquad R_N^2 = 0.806    (19)
Also in this case, the indicated accuracies refer to the whole database; for each model, Figure 9b depicts the relationship between the experimental and estimated compressive strengths.
It is important to underline that the proposed formulations allow some physical interpretation of the studied phenomenon. In particular, the attention is focused on the input variables CG and W/B, since evidence from the literature shows that a significant relationship can only be recorded between the mortar CS and these two parameters [27]. So, consistently with the experimental results in the literature and with the main building code prescriptions [35,36], the proposed mortar CSopt models are directly proportional to the CG, while they are inversely proportional to the W/B ratio. This behavior can be immediately recognized in Equations (17)–(19), while it is less evident in Equation (16). For this reason, Figure 10 reports some graphs depicting CSopt versus CG and W/B. They are obtained by keeping all the explanatory variables constant (equal to their mean values) except for the input parameter under consideration, and they clearly confirm the coherence of Equation (16).
Finally, the presented robust method provides slightly lower accuracies than the ones obtained in [27] by applying machine learning techniques such as AdaBoost and RF, but conversely it provides simple analytical formulae for the numerical estimation of the mortar CS. Clearly the selected optimum EPR models exhibit reliable predictions that also allow the effects of probabilistic uncertainties to be controlled.
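As a usage illustration, the sketch below evaluates Equation (17) as transcribed above (the negative W/B exponent follows the inverse proportionality discussed in this section) for a hypothetical mortar mix; the input values are illustrative only.

```python
# Usage sketch of Equation (17) as transcribed above (the negative W/B exponent
# follows the inverse proportionality discussed in this section); the mix
# parameters below are hypothetical illustration values.
def cs_opt_eq17(AS, CG, WB, BS):
    """AS in days; CG in MPa; W/B and B/S dimensionless; returns CS_opt in MPa."""
    return 0.33786 * AS ** 0.2 * CG ** 0.9 * WB ** -1.4 * BS ** 0.2

print(cs_opt_eq17(AS=28, CG=42.5, WB=0.5, BS=0.4))  # ≈ 42 MPa
```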

7. Conclusions

This paper elaborated upon a novel robust and well-conditioned EPR technique able to remove potential outliers and leverage points lying in a dataset and simultaneously to account for the effects of probabilistic uncertainties.
A robust approach, based on the Cauchy algorithm, was implemented to define the model’s parameters. Uncertainties were included in the analysis by adopting the direct perturbation method and by introducing a new mathematical formulation of the coefficient of determination R2, embedded with experimental and analytical variances. EPR was then used to combine an evolutionary approach with multivariate regression techniques to find an optimal polynomial model.
The selected case-studies concerned the shear capacity of RC beams without stirrups and the compressive strength of cement-based mortar specimens; they illustrated how to set some introduced parameters representative of experimental and analytical dispersions, δ and β, on the basis of the homogeneity properties of the database, in order to obtain the best compromise between reliability, accuracy and complexity. In particular, for a quite homogeneous dataset, the model’s accuracy grew while the value of the main parameter β increased; conversely, for an inhomogeneous dataset, it was convenient to select a value of β close to 1. Each dataset is in fact different from the other and, within this premise, the two analyzed case-studies show that the choice of a suitable value of β, leading to a higher or lower accuracy/complexity, is available to the technician/researcher, who, for a new database, should analyze the results for different values of β. In addition, the two proposed datasets are representative of some uncertain characteristics that are recurrent in civil engineering problems and could thus guide the choice of the coefficient β.
Uncertainty quantification is, on its own, a complex task, and the proposed procedure represents a first attempt to derive models that include the uncertainties of the variables. Moreover, in the developed examples, a first-order Taylor-series approximation was found to be sufficient, as proved by the consistency of the obtained models. Nevertheless, the investigation of a second-order perturbation, for highly nonlinear relationships, will be the next step of this research.
Finally, the numerical results have shown a satisfactory performance in terms of the accuracy and consistency of the new proposed model for any degree of complexity. The obtained formulae also allowed for a physical description of the studied phenomena and represent useful and practical tools which can assist in the calculation of shear capacity and mortar compressive strength; notably, simplified formulae that can be used to directly predict mortar compressive strength are missing from both the literature and technical codes.

Author Contributions

Methodology, A.F., S.M. and R.G.; Software, S.M.; Validation, A.F.; Formal analysis, A.F. and S.M.; Writing—original draft, A.F.; Supervision, R.G.; Project administration, A.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This work is framed within the research projects: PRIN PNRR 2022 “Artificial Intelligence for ENVIronmental impact minimization of SEismic Retrofitting of Structures (AI-ENVISERS)”; PRIN 2022 “Digitalized life-cycle management of historic bridges by an integrated monitoring and modeling CDE platform—HBridgeIM (Historic Bridge Information Modelling)” (2022744YM9); PRIN 2022 “REtrofitting Historic Architecture foR Zero Emissions (REHARZE)”.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wu, X.; Zhu, X.; Wu, G.-Q.; Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 2013, 26, 97–107. [Google Scholar] [CrossRef]
  2. Yegnanarayana, B. Artificial Neural Networks; PHI Learning Pvt. Ltd.: Delhi, India, 2009. [Google Scholar]
  3. Smits, G.F.; Kotanchek, M. Pareto-front exploitation in symbolic regression. In Genetic Programming Theory and Practice II; Springer: Berlin/Heidelberg, Germany, 2005; pp. 283–299. [Google Scholar]
  4. Box, G.E.; Tiao, G.C. A further look at robustness via Bayes’s theorem. Biometrika 1962, 49, 419–432. [Google Scholar] [CrossRef]
  5. Fonseca, T.C.; Ferreira, M.A.; Migon, H.S. Objective Bayesian analysis for the Student-t regression model. Biometrika 2008, 95, 325–333. [Google Scholar] [CrossRef]
  6. Jacquier, E.; Polson, N.G.; Rossi, P.E. Bayesian analysis of stochastic volatility models with fat-tails and correlated errors. J. Econom. 2004, 122, 185–212. [Google Scholar] [CrossRef]
  7. Fiore, A.; Berardi, L.; Marano, G.C. Predicting torsional strength of RC beams by using evolutionary polynomial regression. Adv. Eng. Softw. 2012, 47, 178–187. [Google Scholar] [CrossRef]
  8. Giustolisi, O.; Simeone, V. Optimal design of artificial neural networks by a multi-objective strategy: Groundwater level predictions. Hydrol. Sci. J. 2006, 51, 502–523. [Google Scholar] [CrossRef]
  9. Marano, G.C.; Quaranta, G.; Greco, R. Multi-objective optimization by genetic algorithm of structural systems subject to random vibrations. Struct. Multidiscip. Optim. 2009, 39, 385–399. [Google Scholar] [CrossRef]
  10. Ahangar-Asr, A.; Faramarzi, A.; Mottaghifard, N.; Javadi, A.A. Modeling of permeability and compaction characteristics of soils using evolutionary polynomial regression. Comput. Geosci. 2011, 37, 1860–1869. [Google Scholar] [CrossRef]
  11. Altomare, C.; Laucelli, D.B.; Mase, H.; Gironella, X. Determination of Semi-Empirical Models for Mean Wave Overtopping Using an Evolutionary Polynomial Paradigm. J. Mar. Sci. Eng. 2020, 8, 570. [Google Scholar] [CrossRef]
  12. Marasco, S.; Cimellaro, G.P. A new evolutionary polynomial regression technique to assess the fundamental periods of irregular buildings. Earthq. Eng. Struct. Dyn. 2021, 50, 2195–2211. [Google Scholar] [CrossRef]
  13. Marasco, S.; Fiore, A.; Greco, R.; Cimellaro, G.P.; Marano, G.C. Evolutionary Polynomial Regression Algorithm Enhanced with a Robust Formulation: Application to shear strength prediction of rc beams without stirrups. J. Comput. Civ. Eng. 2021, 35, 04021017. [Google Scholar] [CrossRef]
  14. Hutcheson, G.D. Ordinary least-squares regression. In The SAGE Dictionary of Quantitative Management Research; Moutinho, L., Hutcheson, G.D., Eds.; SAGE Publications: Thousand Oaks, CA, USA, 2011; pp. 224–228. [Google Scholar] [CrossRef]
  15. Lange, K.L.; Little, R.J.; Taylor, J.M. Robust statistical modeling using the t distribution. J. Am. Stat. Assoc. 1989, 84, 881–896. [Google Scholar] [CrossRef]
  16. Li, G. Robust regression. In Vol. 281 of Exploring Data Tables, Trends, and Shapes; U340; Wiley: Hoboken, NJ, USA, 1985. [Google Scholar]
  17. Fox, J.; Monette, G. An R and S-Plus Companion to Applied Regression; SAGE: Los Angeles, CA, USA, 2002. [Google Scholar]
  18. Rousseeuw, P.J.; Leroy, A.M. Robust Regression and Outlier Detection; Wiley: New York, NY, USA, 2005. [Google Scholar]
  19. Andersen, R. Modern Methods for Robust Regression; SAGE: Los Angeles, CA, USA, 2008. [Google Scholar]
  20. Mohsenijam, A.; Siu, M.-F.; Lu, M. Modified stepwise regression approach to streamlining predictive analytics for construction engineering applications. J. Comput. Civ. Eng. 2017, 31, 04016066. [Google Scholar] [CrossRef]
  21. Ding, X.H.; Luo, B.; Zhou, H.T.; Chen, Y.H. Generalized solutions for advection–dispersion transport equations subject to time-and space-dependent internal and boundary sources. Comput. Geotech. 2025, 178, 106944. [Google Scholar] [CrossRef]
  22. Yin, Q.; Xin, T.; Zhenggang, H.; Minghua, H. Measurement and analysis of deformation of underlying tunnel induced by foundation pit excavation. Adv. Civ. Eng. 2023, 2023, 8897139. [Google Scholar] [CrossRef]
  23. Niu, Y.; Wang, W.; Su, Y.; Jia, F.; Long, X. Plastic damage prediction of concrete under compression based on deep learning. Acta Mech. 2024, 235, 255–266. [Google Scholar] [CrossRef]
  24. Zhang, H.; Xiang, X.; Huang, B.; Wu, Z.; Chen, H. Static homotopy response analysis of structure with random variables of arbitrary distributions by minimizing stochastic residual error. Comput. Struct. 2023, 288, 107153. [Google Scholar] [CrossRef]
  25. Chen, Y.; Zhang, L.; Xu, L.; Zhou, S.; Luo, B.; Ding, K. In-situ investigation on dynamic response of highway transition section with foamed concrete. Earthq. Eng. Eng. Vib. 2025, 24, 1–17. [Google Scholar] [CrossRef]
  26. Fiore, A.; Greco, R. Influence of Structural Damping Uncertainty on Damping Reduction Factor. J. Earthq. Eng. 2022, 26, 1899–1920. [Google Scholar] [CrossRef]
  27. Asteris, P.G.; Koopialipoor, M.; Armaghani, D.J.; Kotsonis, E.A.; Lourenço, P.B. Prediction of cement-based mortars compressive strength using machine learning techniques. Neural Comput. Appl. 2021, 33, 13089–13121. [Google Scholar] [CrossRef]
  28. Jin, L.; Kuang, X.; Huang, H.; Qin, Z.; Wang, Y. Study on the overfitting of the artificial neural network forecasting model. Acta Meteorol. Sin. 2005, 19, 216–225. [Google Scholar]
  29. Igusa, T.; Der Kiureghian, A. Response of Uncertain Systems to Stochastic Excitation. J. Eng. Mech. 1988, 114, 812–832. [Google Scholar] [CrossRef]
  30. Huo, J.; Liu, L.; Zhang, Y. Comparative research of optimization algorithms for parameters calibration of watershed hydrological model. J. Comput. Methods Sci. Eng. 2016, 16, 653–669. [Google Scholar] [CrossRef]
  31. Luo, B.; Su, Y.; Ding, X.; Chen, Y.; Liu, C. Modulation of initial CaO/Al2O3 and SiO2/Al2O3 ratios on the properties of slag/fly ash-based geopolymer stabilized clay: Synergistic effects and stabilization mechanism. Mater. Today Commun. 2025, 47, 113295. [Google Scholar] [CrossRef]
  32. Andrews, D.F.; Mallows, C.L. Scale mixtures of normal distributions. J. R. Stat. Soc. Ser. B (Methodol.) 1974, 36, 99–102. [Google Scholar] [CrossRef]
  33. Abanto-Valle, C.A.; Dey, D.K. State space mixed models for binary responses with scale mixture of normal distributions links. Comput. Stat. Data Anal. 2014, 71, 274–287. [Google Scholar] [CrossRef]
  34. Polat, E. The effects of different weight functions on partial robust M-regression performance: A simulation study. Commun. Stat.-Simul. Comput. 2020, 49, 1089–1104. [Google Scholar] [CrossRef]
  35. ACI (American Concrete Institute). Building Code Requirements for Structural Concrete (ACI 318-08) and Commentary; ACI: Farmington Hills, MI, USA, 2008. [Google Scholar]
  36. BSI (British Standards Institution). Eurocode 2: Design of Concrete Structures. Part 1-1: General Rules and Rules for Buildings; BSI: London, UK, 2004. [Google Scholar]
  37. Bažant, Z.P.; Yu, Q. Designing against size effect on shear strength of reinforced concrete beams without stirrups: I. Formulation. J. Struct. Eng. 2005, 131, 1877–1885. [Google Scholar] [CrossRef]
  38. Jeong, C.-Y.; Kim, H.-G.; Kim, S.-W.; Lee, K.-S.; Kim, K.-H. Size effect on shear strength of reinforced concrete beams with tension reinforcement ratio. Adv. Struct. Eng. 2017, 20, 582–594. [Google Scholar] [CrossRef]
  39. Oreta, A.W.C. Simulating size effect on shear strength of RC beams without stirrups using neural networks. Eng. Struct. 2004, 26, 681–691. [Google Scholar] [CrossRef]
  40. Rebeiz, K.S. Shear strength prediction for concrete members. J. Struct. Eng. 1999, 125, 301–308. [Google Scholar] [CrossRef]
  41. Russo, G.; Venir, R.; Pauletta, M. Reinforced concrete deep beams-shear strength model and design formula. ACI Struct. J. 2005, 102, 429–437. [Google Scholar]
  42. Zararis, P.D.; Papadakis, G.C. Diagonal shear failure and size effect in RC beams without web reinforcement. J. Struct. Eng. 2001, 127, 733–742. [Google Scholar] [CrossRef]
Figure 1. Flowchart of the EPR procedure.
Figure 2. Histograms of input and target parameters, database 1.
Figure 3. Histograms of input and target parameters, database 2.
Figure 4. Relationships between CS and each considered explanatory variable.
Figure 5. Problem 1. Experimental dispersion versus analytical dispersion and iteration versus accuracy for m in the range 1 to 4 and for: δ = 1 and β = 0.2, 0.5, 1, 2, 7, 50.
Figure 6. Problem 1. (a) Experimental versus analytical dispersion and iteration versus accuracy for m in the range 1 to 4 and for δ = 0; (b) experimental versus predicted shear capacity for Equations (12)–(15) (results for the whole database).
Figure 7. Problem 1. Shear strength Vopt by Equation (15) versus fc and a/d, respectively.
Figure 8. Problem 2. Experimental dispersion versus analytical dispersion and iteration versus accuracy for m in the range 1 to 4 and for: δ = 1 and β = 0.2, 0.5, 1, 2, 7, 50.
Figure 9. Problem 2. (a) Experimental dispersion versus analytical dispersion and iteration versus accuracy for m in the range 1 to 4 and for δ = 0; (b) experimental versus predicted concrete strength for Equations (16)–(19) (results for the whole database).
Figure 10. Problem 2. Mortar CSopt by Equation (16) versus W/B and CG, respectively.
Table 1. Setting of the genetic operators.
Population Size (P) [-]: 1000; Selection Rate (SR) [%]: 30; Crossover Rate (CR) [%]: 50; Mutation Rate (MR) [%]: 20.
Table 2. Setting of the parameters δ and β.
Implementation 1: δ = 1, β = 0.2
Implementation 2: δ = 1, β = 0.5
Implementation 3: δ = 1, β = 1
Implementation 4: δ = 1, β = 2
Implementation 5: δ = 1, β = 4
Implementation 6: δ = 1, β = 10
Implementation 7: δ = 1, β = 50
Implementation 8: δ = 0, β = -