Applications of Gene Expression Programming and Regression Techniques for Estimating Compressive Strength of Bagasse Ash based Concrete

Abstract: Compressive strength is one of the most important properties of concrete and depends on many factors. Most predictive models for concrete compressive strength rely mainly on available literature data and are too simple to consider all the contributing factors. This study adopted a new approach to predict the compressive strength of sugarcane bagasse ash concrete (SCBAC). A database compiled from the literature and fifteen laboratory-tested concrete samples with different dosages of bagasse ash were used, respectively, to calibrate and validate the models. The novel Gene Expression Programming (GEP), Multiple Linear Regression and Multiple Non-Linear Regression were used to model SCBAC compressive strength. The water cement ratio, bagasse ash percent replacement, quantity of fine and coarse aggregate and cement content were used as inputs for model development. Three statistical indicators, i.e., NSE, $R^2$ and RMSE, were used to assess the performance of the models. For GEP, the results indicated a strong correlation between observed and predicted values, with NSE and $R^2$ both above 0.8 during calibration and validation. GEP outperformed all the other models in predicting SCBAC compressive strength. The validity of the model was further verified using data from fifteen tests conducted in the laboratory. Moreover, the sensitivity analysis revealed the cement content in the mix as the most sensitive parameter, followed by the water cement ratio. The GEP model fulfilled all the criteria for external validity. The simple formulae derived in this study could be used reliably for the prediction of SCBAC compressive strength.


Introduction
A large amount of waste is generated due to rapid industrial development, and the generation and disposal of this waste pose serious environmental issues [1,2]. The construction industry is responsible for 30% of global carbon dioxide emissions and also consumes the most energy [3]. Concrete is the most consumed material, and the production of 1 ton of concrete releases approximately 0.05-0.13 tons of CO$_2$ into the environment [4][5][6]. Therefore, researchers are focusing on green technologies for sustainability [7][8][9]. Meanwhile, different types of wastes and industrial by-products have been used as supplementary cementitious materials for the production of green concrete. Such materials are considered low-carbon alternatives to ordinary Portland cement [10,11]. Incorporating these materials in concrete production can lead to durable and sustainable concrete. Hence, the production and use of green concrete is indispensable for reducing adverse environmental impacts [12,13]. Sugarcane Bagasse Ash (SCBA) is an agricultural waste produced in the sugar industry from the cogeneration process [14]. The use of SCBA as an alternative cementitious material in a concrete mix is considered advantageous because it reduces the environmental impact of the construction industry and preserves the natural environment [1].
Compressive strength is the fundamental property of concrete and indicates its ability to resist compressive loads [15]. Concrete properties depend on many factors, such as the mix design, type of material, testing procedure and, above all, the concrete ingredients [16,17]. The properties and amount of supplementary material, the curing technique and the concrete density [18] also affect the mechanical properties [19,20]. Therefore, the relationship between mechanical properties, percentage replacement and water cement ratio must be known before such concrete is used routinely in practice. Laboratory experiments for assessing concrete properties are time-consuming, labor-intensive and costly, so models could be employed to reduce the experimental workload [21]. However, because of small test records and restricted parametric ranges, the available models cannot provide accurate results. A database covering a wide variety of parameters is needed to create accurate equations that can easily predict multiple properties [1]. Developing strong prediction models for key properties saves time and helps in designing an effective mixture [22]. For this purpose, statistical regression techniques were employed by many researchers [23,24]. Some notable drawbacks were observed while using regression for empirical modeling. Firstly, the structure of the model must be fixed in advance as a well-defined linear or nonlinear equation before the regression analysis can be executed. Secondly, regression analysis assumes that the residuals are normally distributed, which is not always the case [25,26]. There are also concerns regarding the equations available in codes and standards for forecasting compressive strength, because such equations were developed from tests on concrete without supplementary cementitious material [26].
Recent progress in artificial intelligence (AI) techniques has resulted in the development of accurate and reliable models for structural engineering problems [1], and has made it easier to produce models that cope with the difficulties related to concrete properties [1]. In Civil Engineering specifically, AI has been in focus in both empirical and academic fields over the last decade [27]. The Artificial Neural Network (ANN) is the most widely used AI method to assess different concrete properties [28][29][30], although ANN is considered a black box model since it does not provide information about the principles behind its predictions. Formulations based on neural networks are often too complex to be used in practice [1]. Moreover, the ANN is unable to provide practical prediction equations [22]. Genetic programming (GP) is another type of AI technique that automatically generates models based on genetic evolution. GP is an influential optimization technique based on neural and regression techniques; however, it is much more advantageous than conventional techniques. Without assuming a base form, GP has the ability to produce simple expressions. In regression, the functional form must be defined first and the defined function is then analyzed. In GP there is no need to predefine the functions: GP itself adds or deletes various combinations of parameters while searching for formulations that fit the experimental outcomes. In this regard, Gene Expression Programming (GEP) is considered superior to other techniques [16]. The output of GEP is a simple mathematical equation that is easy to apply and highly accurate. GEP has a unique multi-genic nature which permits the development of more complex programs. Recently, GEP has been commonly used as a prediction method in the field of Civil Engineering [31][32][33].
Various researchers have modelled different properties of concrete using GEP. Abdollahzadeh et al. [16] predicted the 28-day compressive strength of concrete with the help of two GEP models considering the amount of cement, silica fume, CA, FA, water and superplasticizers. Data of 159 concrete mixes were taken from the literature. The authors concluded, on the basis of different statistical parameters, that the GEP models can predict the compressive strength of high-strength concrete for each testing phase. Gholampour et al. [1] used GEP to model the mechanical properties of recycled aggregate concrete utilizing data from the literature. The authors reported a close agreement of the model results with the test results. Saridemir and Bilir [19] studied the elastic modulus of fly ash concrete using GEP models. The data for the models were collected from 132 concrete mixes available in the literature. In the first model, the elastic modulus was predicted from the compressive strength of fly ash concrete. In the second model, the elastic modulus was predicted from the amount and strength of fly ash concrete. Iqbal et al. [34] predicted the mechanical properties of waste foundry sand based concrete with the help of GEP. Extensive data from the literature were used, with four different parameters as inputs. They confirmed the high accuracy and prediction capacity of the GEP model. Mousavi et al. [22] modeled the compressive strength of high performance concrete with GEP. A database was generated from a literature study, and they observed that GEP is an effective technique to predict the compressive strength of high performance concrete.
However, most of the above-mentioned studies depend mainly on data collected from the literature, and it is very difficult to consider all the factors responsible for compressive strength. No organized and detailed study has combined available literature data for model calibration with experimental testing of concrete for model validation. This study applies the novel Gene Expression Programming (GEP), Multiple Linear Regression (MLR) and Multiple Non-linear Regression (MNLR) techniques to predict the compressive strength of Sugarcane Bagasse Ash Concrete (SCBAC). The literature data were used for model calibration, and compressive strength results derived from laboratory tests were used to validate the models. SCBA was used to replace Portland cement in quantities of 0%, 10%, 20%, 30% and 40% by weight of cement. Formulating models that accurately estimate the concrete compressive strength from a minimum number of parameters, and then validating them against practical laboratory test results, significantly enhances the utility of forecasting models and builds trust in them in many research areas.

Genetic Algorithm and Gene Expression Programming (GEP)
The genetic algorithm (GA) was first introduced by J. Holland [35] and was inspired by Darwin's theory of evolution. GAs mimic the biological evolution process in a simplified form, and each solution is represented as a fixed-length chromosome. Similarly, Genetic programming (GP) was first proposed by Cramer [36] and further enhanced by Koza [37,38]. GP is basically an extension of GA, a type of machine learning that automatically generates models based on genetic evolution. GP is an influential optimization technique based on neural and regression techniques. The computer program evolves into a problem-independent solution based on the Darwinian reproduction principle.
A modified version of GP, based on a population evolutionary algorithm, was proposed by Ferreira [39] and is named gene expression programming (GEP). GEP encodes each program as a simple linear chromosome of fixed length, which is an advantage over GP, where a tree-like structure of variable length is used [40]. In the evolutionary GEP algorithm, the linear fixed-length chromosomes and the nonlinear expression trees were inherited from GA and GP, respectively. GEP is effective because it combines the fixed-length linear chromosomes of GA with the expression trees of GP. Moreover, because the genetic operators act at the chromosome level, GEP remains simple to operate while its multi-genic behavior permits the development of complex and nonlinear programs. GEP comprises five components: the function set, the terminal set, the fitness measure, the parameter set and the criteria for terminating the run. Each individual in GEP is a fixed-size linear string, which is called a genome. Furthermore, during the reproduction stage, the chromosomes are modified by genetic operators [41,42]. The schematic diagram of the GEP algorithm is presented in Figure 1. GEP itself adds or deletes various combinations of parameters while searching for formulations that fit the experimental outcomes [16]. In this paper, the GEP algorithm was applied with the help of GeneXproTools 5.0 [44]. Firstly, a chromosome of fixed length was randomly created for each individual. In the second step, the expression trees representing the chromosomes were created, and the fitness of each expression tree was evaluated in terms of statistical indicators. Finally, the reproduction process was initiated for each individual according to its fitness, as assessed by a fitness function (RMSE for the developed model).
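The relationship between a fixed-length linear chromosome and its expression tree can be illustrated with a small sketch. The code below is not GeneXproTools and makes no claim about its internals; it is a minimal, generic illustration of how a single gene written in Karva notation might be decoded level by level into a tree and evaluated. The function symbols, the gene string and the input values are all hypothetical.

```python
import math

# Arity of each function symbol; any other symbol is treated as a terminal.
# "Q" denotes the square root. This function set is an illustrative assumption.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}


def decode(gene):
    """Build an expression tree from a Karva-notation gene, level by level."""
    symbols = iter(gene)
    root = [next(symbols), []]              # node = [symbol, children]
    frontier = [root]
    while frontier:
        next_frontier = []
        for node in frontier:
            for _ in range(ARITY.get(node[0], 0)):
                child = [next(symbols), []]
                node[1].append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root                             # unused tail symbols are ignored


def evaluate(node, inputs):
    """Evaluate a decoded tree for one record of input values."""
    symbol, children = node
    if symbol not in ARITY:
        return inputs[symbol]               # terminal: look up its value
    args = [evaluate(child, inputs) for child in children]
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "-":
        return args[0] - args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "/":
        return args[0] / args[1]
    return math.sqrt(args[0])               # symbol == "Q"


# Hypothetical gene: head "Q*+-" (length 4) and tail "abcda" (length 5).
gene = list("Q*+-abcda")
tree = decode(gene)                         # encodes sqrt((a + b) * (c - d))
value = evaluate(tree, {"a": 5.0, "b": 3.0, "c": 3.0, "d": 1.0})
print(value)
```

Because the tail contains only terminals and is long enough to supply every argument the head can request, any string of this layout decodes to a valid expression, which is why genetic operators can act directly on the linear string.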

Multiple Linear and Non-Linear Regression
The multiple linear regression (MLR) model describes the relationship between a dependent variable and several independent variables. The model expresses the value of a predicted variable as a linear function of one or more predictor variables. The basic assumption of MLR is that the association between $Y_i$ and the $p$-vector of regressors $X_i$ is linear. In the multiple non-linear regression (MNLR) model, the relationship between the dependent and independent variables is considered to be non-linear. The MNLR technique estimates the model by fitting a nonlinear relationship between the dependent and independent variables. The fundamental difference in MNLR is that the estimated equation depends nonlinearly on the input variables [45]. Equation (1) and Equation (2) represent the MLR [46,47] and MNLR [48], respectively,
where $a$ is the intercept, $\beta$ is the slope or coefficient and $k$ is the number of observations. In this study, the MLR and MNLR models were developed using the statistical analysis software Statistical Package for Social Sciences (SPSS), which supports statistical analyses ranging from basic to complex predictive functionality. All the datasets were carefully analyzed before performing the linear and non-linear regression in SPSS.
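As an illustration of the two regression approaches, the following sketch fits a linear model by ordinary least squares and a non-linear model by log-linearization. It uses NumPy rather than SPSS, and the power-law form $Y = a \prod_j X_j^{\beta_j}$ is only one possible non-linear form, assumed here for demonstration; it is not necessarily the form of Equation (2). The data are synthetic and are not the SCBAC database.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares fit of y = a + sum_j beta_j * X[:, j]."""
    design = np.column_stack([np.ones(len(X)), X])   # first column -> intercept a
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


def fit_power_law(X, y):
    """Fit y = a * prod_j X[:, j]**beta_j by linear regression on logarithms.

    Assumed non-linear form for illustration; requires strictly positive X and y.
    """
    log_a, beta = fit_mlr(np.log(X), np.log(y))
    return np.exp(log_a), beta


# Synthetic, noise-free data with known coefficients.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(30, 2))
y_lin = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]
y_pow = 4.0 * X[:, 0] ** 0.5 * X[:, 1] ** -2.0

a_lin, beta_lin = fit_mlr(X, y_lin)
a_pow, beta_pow = fit_power_law(X, y_pow)
print(a_lin, beta_lin)
print(a_pow, beta_pow)
```

In both cases the model structure has to be chosen before fitting, which is the limitation of regression that motivates the use of GEP.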

Datasets
The dataset of SCBAC was collected from 21 published papers available in the literature [14,49-66]. The final database from the literature comprised 65 data points. Each record of the collected data contained information about the water/cement ratio (W/C), SCBA percentage replacement (SCBA%), fine aggregate in the mix (FA), cement content (CC), slump value (S), coarse aggregate in the mix (CA), water absorption (WA), specific gravity of the mix (SG) and compressive strength ($f'_c$). Numerous trials were performed to assess the consistency and validity of the data. Only data points within 20% deviation from the global trend were retained for model development and performance evaluation [34]. The description of the collected data is shown in Table 1. The overall data from the literature were used to develop and calibrate the GEP and regression models. For model validation, concrete specimens with varying dosages of SCBA were cast and then tested for compressive strength.

Model Development and Performance Evaluation
Prior to model development, the first and essential step is to choose the input parameters that can affect the SCBAC properties. All the available parameters in the database were studied in detail, and several preliminary runs were conducted in order to determine the most influential parameters for SCBAC and to establish a generalized relationship. As a result, the compressive strength of SCBAC is considered to be a function of the following input parameters, Equation (3):
$$f'_c = f(\mathrm{W/C},\ \mathrm{SCBA\%},\ \mathrm{FA},\ \mathrm{CC},\ \mathrm{CA}) \qquad (3)$$
In order to assess the performance of the developed models, three indicators were used: Nash-Sutcliffe efficiency (NSE), coefficient of determination ($R^2$) and Root Mean Squared Error (RMSE) [67,68]. These statistical indicators were used to compare the model simulations with the actual data. The NSE ranges from $-\infty$ to 1, where 1 is a perfect match. An NSE value greater than 0.65 indicates very good agreement [69,70]. The $R^2$ value lies between 0 and 1; higher values indicate fewer errors, and a value of 1 represents completely matched data [70]. The RMSE is an error-index indicator commonly used in modelling studies; the lower the RMSE, the better the model. The mathematical expressions for RMSE, NSE and $R^2$ are given as Equations (4)-(6), respectively,
where $n$ is the number of input samples, $M_i$ and $P_i$ are the measured and predicted values, respectively, and $\bar{M}$ and $\bar{P}$ are the averages of the measured and predicted values, respectively.
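The three indicators can be computed directly from paired measured and predicted values. The sketch below uses the standard textbook definitions of RMSE, NSE and the squared Pearson correlation, which are assumed here to correspond to Equations (4)-(6) because they use the same symbols ($n$, $M_i$, $P_i$, $\bar{M}$, $\bar{P}$). The three sample values are invented purely for demonstration.

```python
import numpy as np


def rmse(measured, predicted):
    """Root mean squared error between measured and predicted values."""
    m, p = np.asarray(measured, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((m - p) ** 2)))


def nse(measured, predicted):
    """Nash-Sutcliffe efficiency: 1 minus error variance over data variance."""
    m, p = np.asarray(measured, float), np.asarray(predicted, float)
    return float(1.0 - np.sum((m - p) ** 2) / np.sum((m - m.mean()) ** 2))


def r_squared(measured, predicted):
    """Squared Pearson correlation between measured and predicted values."""
    m, p = np.asarray(measured, float), np.asarray(predicted, float)
    dm, dp = m - m.mean(), p - p.mean()
    return float(np.sum(dm * dp) ** 2 / (np.sum(dm ** 2) * np.sum(dp ** 2)))


# Invented values in MPa, for demonstration only.
measured = [20.0, 25.0, 30.0]
predicted = [22.0, 25.0, 28.0]
print(rmse(measured, predicted), nse(measured, predicted), r_squared(measured, predicted))
```

In this toy case $R^2$ equals 1 while NSE is 0.84, which shows why the indicators are reported together: $R^2$ measures only linear association, whereas NSE and RMSE also penalize systematic deviation from the 1:1 line.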

Mix Proportions and Specimen Designation
In order to validate the robustness of the developed GEP and regression models, fifteen concrete specimens were cast. Mixes were abbreviated as CM and BC: CM represents the control mix without addition of ash, whereas BC represents mixes in which cement was replaced with bagasse ash. For example, the designation 10BC indicates that 10 percent of the cement was replaced with SCBA. The target design strength was 21 MPa. SCBA was used to replace OPC in quantities of 0%, 10%, 20%, 30% and 40% by weight of cement. Each mix had the same water to cementitious material ratio of 0.5, and the overall cementitious content was kept constant at 366 kg/m$^3$. Furthermore, X-ray Fluorescence (XRF) analysis was conducted to determine the chemical composition of the sugarcane bagasse ash. The XRF results are shown in Table 2. The sum of silica (SiO$_2$), alumina (Al$_2$O$_3$) and iron oxide (Fe$_2$O$_3$) is 77.47% (>70%), which meets one of the requirements for a pozzolan as per ASTM C618-05. Mix proportions of the concrete mixes are shown in Table 3. Each mix was tested in the fresh as well as the hardened state. The fresh concrete properties, i.e., slump and fresh concrete density, were determined as per ASTM C143 and ASTM C138M-01, respectively. At the curing age of 28 days, the compressive strength of the SCBAC and control concrete was determined according to the BS 1881: Part 116: 1983 standard.
The slump and fresh density results are shown in Figure 2a,b, respectively. The results show that the slump of the concrete increases as more ash is added. The maximum and minimum slump values of 58 mm and 29 mm were found for the 40BC and control mixes, respectively. The maximum fresh density, 2308.4 kg/m$^3$, was achieved by the CM. The fresh density decreased with the introduction of SCBA as cement replacement. In comparison to the control concrete, the decrease in fresh concrete density for the 10BC, 20BC, 30BC and 40BC mixes was found to be 0.85%, 1.82%, 2.34% and 2.62%, respectively. The decrease in fresh density occurs because density is a function of specific gravity: since the specific gravity of cement is higher than that of bagasse ash, the density of the CM mix is the highest.
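The binder quantities implied by the stated mix design can be worked out with simple arithmetic. The sketch below assumes that replacement is by weight of the total cementitious content (366 kg/m$^3$) and that the water content follows from the water to cementitious material ratio of 0.5; the function name is hypothetical, and the aggregate quantities, which are given only in Table 3, are not derived here.

```python
def binder_quantities(total_binder, replacement_pct, w_b_ratio):
    """Return (cement, SCBA, water) in kg/m^3 for replacement by weight."""
    scba = total_binder * replacement_pct / 100.0
    cement = total_binder - scba
    water = total_binder * w_b_ratio
    return cement, scba, water


# Mix designations follow the text: CM for the control mix, e.g. 10BC for 10% SCBA.
mixes = {
    ("CM" if pct == 0 else f"{pct}BC"): binder_quantities(366.0, pct, 0.5)
    for pct in (0, 10, 20, 30, 40)
}

for name, (cement, scba, water) in mixes.items():
    print(f"{name}: cement={cement:.1f}, SCBA={scba:.1f}, water={water:.1f} kg/m^3")
```

Under these assumptions the water content is the same for every mix, so the differences in slump and fresh density between mixes stem from the ash replacement rather than from the water dosage.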

Sensitivity Analysis
The behavior of the developed models and their relationship with the input variables were examined using sensitivity analysis. It is a procedure for identifying the most sensitive parameters, given that there are uncertainties in the model input values. The models can provide acceptable results during calibration and validation, but this does not guarantee the same performance on unknown data sets. Therefore, it is necessary to perform a sensitivity analysis to evaluate the relative contribution of each input parameter to the model output. In the current study, the method developed by Gandomi et al. [71] was adopted. The authors used Equation (7) and Equation (8) to find the contribution of each input variable to the output.
Here, $f_{max}(x_i)$ and $f_{min}(x_i)$ are the maximum and minimum of the estimated output over the domain of the $i$-th input.
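A sketch of this kind of sensitivity calculation is given below. It assumes that each input is swept over its observed range while the other inputs are held at their mean values, that the range of the output is $N_i = f_{max}(x_i) - f_{min}(x_i)$, and that the relative contribution is $N_i / \sum_j N_j \times 100$. These details are assumptions about the procedure, and `toy_model` is a hypothetical stand-in for a fitted model.

```python
import numpy as np


def sensitivity(model, X, n_steps=50):
    """Relative contribution (%) of each input column of X to the model output.

    Each input is swept over its observed range while the remaining inputs
    are held at their mean values (an assumption of this sketch).
    """
    X = np.asarray(X, float)
    means = X.mean(axis=0)
    spans = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        grid = np.tile(means, (n_steps, 1))
        grid[:, i] = np.linspace(X[:, i].min(), X[:, i].max(), n_steps)
        out = model(grid)
        spans[i] = out.max() - out.min()    # N_i = f_max(x_i) - f_min(x_i)
    return 100.0 * spans / spans.sum()      # share of each N_i in the total


def toy_model(X):
    """Hypothetical model in which the first input matters four times more."""
    return 4.0 * X[:, 0] + 1.0 * X[:, 1]


X = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
sa = sensitivity(toy_model, X)
print(sa)
```

The contributions always sum to 100%, so the values describe the relative rather than the absolute influence of each input.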

SCBAC Mixes Compressive Strength from Laboratory Tests
For validation of the GEP and regression models, concrete specimens were cast with different percentages of SCBA and tested for compressive strength at the curing age of 28 days. The results are tabulated in Table 4. The results show that the 10BC mix has a higher strength than all the other mixes; the difference in strength between 10BC and CM is 3.58%. Beyond 10% replacement, the compressive strength decreases consistently with the addition of SCBA. This may be attributed to the fact that cement is the primary binding material: cement replacement of 30% and 40% reduces the lime content required for the pozzolanic reaction. The optimum level of cement replacement with SCBA is found to be 10%.

GEP Based Formulation for Compressive Strength of SCBAC Mixes
Equation (9) is proposed to predict the 28-day compressive strength of SCBAC. The GEP model was selected after a series of GEP runs that started from a basic function set, a small head size and a single-gene chromosome. The important parameters of the GEP settings are tabulated in Table 5. The structural organization of the chromosomes and the function sets were chosen before running the GEP algorithm, because the capability of GEP is highly affected by parameter selection. Basic mathematical operators were used in order to obtain the best and simplest model. Due to the number of potential outcomes and the complexity of evaluating each trial model, three population sizes of 50, 100 and 150 were considered.
For GEP model simulation, five parameters, i.e., W/C, SCBA%, FA, CC and CA, were used as inputs. It is evident from Figure 3a,b that the developed GEP model effectively considered the influence of all the input parameters for both model calibration and validation. The statistical indicators, i.e., NSE, $R^2$ and RMSE, are 0.83, 0.83 and 6.67 during model calibration and 0.87, 0.85 and 4.57 for model validation, respectively. An $R^2$ value of more than 0.8 is adequate [31]. Moreover, lower values of RMSE and higher values of $R^2$ and NSE show that the developed models adequately simulated the results [40].
The MNLR model simulation results for SCBAC compressive strength are shown in Figure 5a,b for calibration and validation, respectively. The NSE, $R^2$ and RMSE are found to be 0.55, 0.58 and 11.05, respectively, for calibration and 0.58, 0.56 and 5.52 for validation of the MNLR model. The absolute error between actual and predicted data for all the models is shown in Figure 6.

Sensitivity Analysis
The contribution of the most relevant parameters (W/C, SCBA%, CC, FA and CA) in the model was evaluated through sensitivity analysis, which identified the parameters that are essential for modelling the compressive strength of SCBAC. The results are shown graphically in Figure 7. The cement content (CC) in the mix is the most sensitive parameter for SCBAC compressive strength, with a relative contribution of 55.73%. The same trend was observed during experimental testing of SCBAC, where the strength of the concrete decreased consistently with the addition of SCBA up to 40% (that is, with the reduction in cement content). The model results are thus in line with the experimental results, which further supports the robustness of the developed model. The second and third most sensitive parameters turned out to be W/C, with a relative contribution of 17.14%, and the coarse aggregate (CA) content in the mix, with 16.98%. Furthermore, the compressive strength is least affected by the FA content in the mix.

External Validation of the Models
Various researchers have suggested that the performance of a proposed model depends considerably on the ratio of the number of data points to the number of inputs [72,73]. For the data to be sufficient to establish the relationship between the selected variables, this ratio should be more than 5 [73]. In the current study, the ratio is 13 (65 data points for five inputs), which fulfills the criterion. Furthermore, among the criteria for external verification proposed by Golbraikh and Tropsha [74], the slope of the regression line should be close to 1. Roy and Roy [75] introduced a new indicator ($R_m$) for the external predictability of proposed models and suggested that $R_m$ should be more than 0.5 for a satisfactory model. Alavi et al. [76] suggested that the squared correlation coefficients $R_o^2$ and $R_o'^2$ between the observed and predicted values should be close to 1. In this paper, the performance of the developed models was evaluated against the above-mentioned criteria, and the results are summarized in Table 6. The novel GEP model strongly fulfills all the aforementioned criteria.
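Two of these checks are simple enough to sketch in a few lines. The code below computes the ratio of data points to inputs and the least-squares slopes of regression lines forced through the origin. Which variable is regressed on which in the cited criteria is not stated in the text, so both slopes are returned; the function name and the sample values are hypothetical, and the $R_m$, $R_o^2$ and $R_o'^2$ indicators are not implemented here.

```python
import numpy as np


def external_checks(measured, predicted, n_points, n_inputs):
    """Data-to-input ratio and through-origin regression slopes."""
    m = np.asarray(measured, float)
    p = np.asarray(predicted, float)
    cross = np.sum(m * p)
    return {
        "ratio": n_points / n_inputs,                      # should exceed 5
        "slope_pred_on_meas": float(cross / np.sum(m ** 2)),  # fits p = k * m
        "slope_meas_on_pred": float(cross / np.sum(p ** 2)),  # fits m = k' * p
    }


# Invented strengths in MPa; 65 data points and 5 inputs as in the study.
checks = external_checks([20.0, 25.0, 30.0], [22.0, 25.0, 28.0], 65, 5)
print(checks)
```

A slope close to 1 in both directions indicates that the predictions neither systematically overestimate nor underestimate the measured values.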

Comparison of GEP, MLR and MNLR Models
To the authors' best knowledge, no detailed study has applied both regression and the novel GEP technique to predict the compressive strength of SCBAC and also validated the model results by laboratory concrete testing. Therefore, GEP, MLR and MNLR models were developed to predict the compressive strength of SCBAC, and the results were compared. The NSE and $R^2$ values are highest for GEP, followed by MNLR. Similarly, the RMSE is lowest for the GEP model during both calibration and validation. Overall, GEP outperformed all the other models during calibration and validation. Moreover, the GEP-developed equation can easily be used to predict the compressive strength of SCBAC. The performance of both MLR and MNLR, in terms of the statistical indicators, was reduced during validation, which may be considered one of the shortcomings of the regression-based models. The average errors between actual and predicted data are shown in Figure 6 and were found to be 0.79%, 2.12% and 4.52% for the GEP, MLR and MNLR models, respectively.

Conclusions
The present study reported the application of the novel gene expression programming and of regression methods, i.e., multiple linear and non-linear regression, for modeling and predicting the compressive strength of sugarcane bagasse ash concrete. A new approach was proposed in which the data already available in the literature were used to develop and calibrate the proposed models. To validate the models, laboratory specimens of concrete with different dosages of bagasse ash were cast, and the test results were used for validation. Despite the large number of factors responsible for variation in the compressive strength, the developed models were successfully calibrated and validated. The goodness of fit of the models was estimated in terms of NSE, $R^2$ and RMSE. A good correlation between actual and predicted values was observed in both the calibration and validation periods. However, gene expression programming outperformed all the other techniques and is capable of forecasting the compressive strength of sugarcane bagasse ash concrete for a known set of input parameters. The empirical equation developed by gene expression programming could be used routinely to estimate the compressive strength. Gene expression programming played a major role in identifying suitable relations for describing the processes responsible for compressive strength. The technique does not assume any prior form of the solution and establishes a suitable relation between dependent and independent variables, which makes it superior to the other techniques. The performance of the multiple linear and non-linear regression models was reduced during model validation, which may be considered one of the drawbacks of regression-based models. The novel gene expression programming applied in this study could serve as a baseline to model and predict the mechanical properties of concrete with high accuracy and a minimum number of parameters.