Predicting Marshall Flow and Marshall Stability of Asphalt Pavements Using Multi Expression Programming

: The traditional method to obtain optimum bitumen content and the relevant parameters of asphalt pavements entails time-consuming, complicated and expensive laboratory procedures and requires skilled personnel. This research study uses innovative and advanced machine learning techniques, i.e., Multi-Expression Programming (MEP), to develop empirical predictive models for the Marshall parameters, i.e., Marshall Stability (MS) and Marshall Flow (MF) for Asphalt Base Course (ABC) and Asphalt Wearing Course (AWC) of ﬂexible pavements. A comprehensive, reliable and wide range of datasets from various road projects in Pakistan were produced. The collected datasets contain 253 and 343 results for ABC and AWC, respectively. Eight input parameters were considered for modeling MS and MF. The overall performance of the developed models was assessed using various statistical measures in conjunction with external validation. The relationship between input and output parameters was determined by performing parametric analysis, and the results of trends were found to be consistent with earlier research ﬁndings stating that the developed predicted models are well trained. The results revealed that developed models are superior and efﬁcient in terms of prediction and generalization capability for output parameters, as evident by the correlation coefﬁcient (R) (in this case >0.90) for both ABC and AWC.


Introduction
The parameters for the Mix Design of Asphalt Concrete are used to grade characteristics of aggregates and the types and percentages of bitumen used. The MS and MF of asphalt concrete have been modeled widely using the distinctive features of most common technique of AI, i.e., ANN [6,8,[18][19][20][21][22][23][24][25][26][27][28][29][30][48][49][50][51][52][53]. These algorithms, with their abilities to recognize patterns, result in simplified engineering problems that are complex in nature [11,[54][55][56]. NN are labeled as Black-Box Algorithms (BBA) and can perform well only over a specific set of problems considered for optimization. BBA do not take into consideration any physical phenomenon or data information of the problem under evaluation [57][58][59]. The majority of ANN procedures slack in a way that complex numerical expressions are created for the prediction of output parameters based on input parameters. ANN-based modeling is considered as a correlation of input parameters with that of output parameters, and the relationship in the developed models is either linear or is based on a predefined base function [60]. Acceptable performance of ANN models is the main attraction for their application, but one of the shortcomings is the problem with generalization for traditional ANN-based models [50]. The development of models by ANN have the potential to overfit the data. Moreover, the utmost difficult task to carry out in studies using ANN is to identify the ideal number of neurons and layers in hidden layers with a trial-and-error approach to identify optimal network architecture [8,51]. In the recent decade, the majority of research studies have placed their focus on GEP and NN in order to model the MS and MF, the output parameters of M2DM, although MEP has certain advantages over similar algorithms. Usually, a large database is utilized to model MS and MF of M2DM. In GP, the utilization of a genetic tree cross-over operator results in the creation of a parse tree with a large population, which in turn, leads to increased simulation time and a requirement for large memory [42]. Additionally, the non-linear structure of GP works as a phenotype and genotype, which makes it challenging for the algorithm to devise a suitable mathematical expression needed for the desired properties. [61]. However, the MEP's inclusion of a linear variant enables it to easily differentiate between the phenotype and genotype of an individual [34]. There is a threshold limit in the rate of success rate in GP, by increasing the number of genes in the chromosomes. Overfitting is likely to appear beyond the threshold, limiting the applications of the model in the construction industry [62][63][64]. However, when the intricacy of the targeting model expression is undefined, which is major problem in material engineering, MEP is predominantly more beneficial, where a slight variation in input parameters has a significant impact on the output parameters [46]. In MEP, the encoding of numerous solutions in a single chromosome and the linearity in the chromosomes allows the model to search in a broader space for the prediction of output parameters [59,65]. The obvious benefits of MEP over EAs mentioned above would result in the creation of more precise models in the field of pavement engineering. Despite the fact that MEP has significant advantages over other approaches, it has been hardly been utilized in civil engineering tasks, and in the field of pavement engineering, its applications are near to none in predicting the output parameters of M2DM, i.e., MF and MS, despite its obvious advantages.
In the current research study, to predict the output parameters of M2DM, i.e., MS and MF, models have been developed utilizing the MEP technique. The modeling is combined with detailed statistical analysis and external validation, in conjunction with parametric study to warrant the accuracy, precision and effectiveness of the model. The availability of reliable and consistent models will endorse the utilization of the MEP technique in the construction sector, in general, and the pavement industry specifically, as it will bypass the hectic and time-consuming experimental procedures used for M2DM. This would contribute toward the reduction of time for testing and promote the use of the MEP technique. Additionally, the current methodology for modeling will pave the way for similar and accurate complex modeling of engineering phenomena.

Experimental Database
An enormous database was collected from the construction industry in Pakistan. The data for M2DM was compiled from various construction companies working on various road projects in Pakistan. The comprehensive dataset was compiled, using values from

Data Division and Preprocessing
A dataset of 253 samples for ABC and 343 for AWC was collected for the development of models employing the MEP approach. The model's efficiency is dependent on the distribution of the datasets [66]. The accuracy of newly constructed models for prediction mainly depends on (a) size of the data, (b) characteristics of the data, and (c) the relationship between input and output parameters [67]. From the literature, it is observed that if input parameters are considered in excess, which have low correlation with output parameters, then it can escalate the complexity of the model, placing adverse effects on the performance of the models [68]. Finally, eight input parameters were selected for the development of the MEP models. For training of MS and MF of ABC, 70% of the data (177 data points) was used, while 15% (38 data points) was used for testing and 15% (38 data points) was used for validation of the developed models, whereas for training of MS and MF of AWC, 70% of data (241 data points) was used, while 15% (51 data points) was used for testing and 15% (51 data points) was used for validation of the developed models.

Multi-Collinearity
The multi-collinearity problem, which developed owing to the inter-dependence of the input parameters, is a predominant issue in the applications related to machine learning algorithms [12,69,70]. It has the ability to raise the strength of relationships between variables, thus dropping the efficiency of the models being developed. It has advocated that the R between two input values of parameters should be less than 0.8 to prevent this problem [71]. R is calculated for all input parameter combinations, and Tables 1 and 2 show that whether positive or negative, R is smaller than the stipulated limit, i.e., 0.8, indicating that there would be no risk of multi-collinearity among input parameters during modeling. The only parameters that have direct relations with each other have values of 1 or near to 1, e.g., the relationship between Ps (%) and Pb (%).

Data Statistical Information
The generalization ability of the developed models is influenced by the distribution of their input variables. The frequency histograms are provided in Figures 1 and 2 for the ABC and AWC datasets, respectively. The distribution of input parameters is not uniform, and the frequencies of the input parameters are adequately high, as shown in Figures 1 and 2. It is vital to recall that if variables have high frequencies, the chances of obtaining a better model are increased. Tables 3 and 4 provide a statistical range of input data for the datasets ABC and AWC, respectively, in order to make the presentation of the data more comprehensible. Tables 3 and 4 show the units of the parameters, the data center (mean and median), most frequent values (mode), dispersion (standard deviation, sample variance and coefficient of variance), data extremes (minimum and maximum), and shapes of the distribution (kurtosis and skewness), making data interpretation relatively straightforward. Tables 3 and 4 show that Pb (%) ranges from 2.5-5.0 and from 2.5-5.5 for ABC and AWC, respectively. The statistics of the datasets demonstrate that the proposed machine learning models are applicable to a wide range of data inputs, enhancing their utility. Only few research studies have predicted MS and MF separately for ABC and AWC. As a result, distinct datasets have been compiled for ABC and AWC and have been considered for the respective development of models.

Model Development and Evaluation Criteria
The selection of influential input parameters for the prediction of output parameters is the starting step in the development of a model. In order to develop the model, the parameters affecting MS and MF were selected. Numerous trials were made, and the results were calculated to assess the best and simplest influencing parameters for the development of the model. Several fitting input parameters are required in MEP, which need to be identified before model development for a generalized and robust model. The fitting input parameters are carefully chosen, considering the previous recommendations and the trial-anderror approach [72]. The size of the population specifies the number of programs that are to be evolved. A model developed with large population size might be relatively accurate but it would be more complex and might take longer time to converge. Although, as size increases beyond a threshold limit, issues may arise regarding the overfitting of the model.
At the beginning, a subpopulation size of 10 and 100 generations were considered for

Model Development and Evaluation Criteria
The selection of influential input parameters for the prediction of output parameters is the starting step in the development of a model. In order to develop the model, the parameters affecting MS and MF were selected. Numerous trials were made, and the results were calculated to assess the best and simplest influencing parameters for the development of the model. Several fitting input parameters are required in MEP, which need to be identified before model development for a generalized and robust model. The fitting input parameters are carefully chosen, considering the previous recommendations and the trial-and-error approach [72]. The size of the population specifies the number of programs that are to be evolved. A model developed with large population size might be relatively accurate but it would be more complex and might take longer time to converge. Although, as size increases beyond a threshold limit, issues may arise regarding the overfitting of the model.
At the beginning, a subpopulation size of 10 and 100 generations were considered for the initiation of the project, with basic mathematical parameters, i.e., subtraction, addition, division, and multiplication. The parameters, including subpopulation and number of generations, were gradually increased in trials by addition of mathematical parameters in the models to reduce error size. The final selection of parameters for the four models, based on an acceptable error range, is shown in Table 5. The accuracy level that the model's algorithm should achieve is determined by the number of generations prior to its termination. The larger the number of generations in a run, the less the statistical errors will be. Likewise, crossover and mutation rates indicate the offspring's probability of undergoing these genetic operations. The range for the rate of crossover lies between 50-95%. Various combinations of the settings shown in Table 5 were tested on the data sample, and the optimum set of combinations was chosen based on the model's overall performance attributes, which are shown in Table 5. Overfitting of the data is one of major challenges in AI-based modeling. The model's efficiency is high, when using the original data, however, the efficiency reduces considerably when unseen data is used. To avoid this issue, it is recommended that the training model be tested on a testing or unseen dataset [73,74]. Consequently, the entire dataset was randomly divided into sets of training, validation and testing. In modeling, datasets of training and validation were processed. The validation model was then tested on the testing dataset, which was not included in the development of the model. The data distribution of all three datasets was assured to be consistent. For the current research study, 70%, 15%, and 15% of the data were used as training, testing, and validation, respectively. On all three datasets, the final developed models outperformed the competition. MEPX v 2021.08.05.0-beta, a commercially available computing tool, was used to implement the MEP algorithm.
The algorithm begins by creating a population of the most possible solutions. The process of the algorithm is iterative, and with each generation, it arrives closer to the solution. Within the solution population, each generation's fitness is assessed. The algorithm of MEP continues to advance until the function for pre-specified fitness, such as root mean squared error (RMSE) or R, remains unchanged. For each trained model, the objective function (OF) was also assessed in this research study because it reflects the influence of R, RMSE and frequency of data points to quantify the overall efficiency. If the results of the model for all datasets, i.e., training, validation, and testing, are not accurate, the process is repeated, increasing the size and number of subpopulations incrementally. After that, the model is finalized based on the minimum value OF. It was determined that certain models performed better on the dataset of training than on the testing set, indicating overfitting of the proposed model, which ought to be avoided. The number of generations it takes for a model to evolve has an in impact on the mode's accuracy. A model would keep evolving indefinitely in these types of algorithms due to induction of additional variables into the system. All the models in this research study were terminated at 1000 generations.
Furthermore, an ideal model should meet the criteria for several performance indicators, as elaborated in the following discussion.
The effectiveness of the models was assessed by calculating several statistical error metrics. These statistical errors included for the assessment of the models were R, mean absolute error (MAE), RMSE, relative squared error (RSE), relative root mean square error (RRMSE), and the performance index (ρ). Moreover, another strategy to prevent the model's overfitting was to choose an optimal model by reducing the value of OF, as advocated by Azim et al. [69], which is referred to as the fitness function. These statistical measures have the following Expressions (1)- (7): where p i , x i , p i and x i , denote the ith predicted, experimental, mean predicted and mean experimental values, respectively, and the symbol n denotes the total number of values in the dataset used for the development of the models. The training and testing sets are denoted by abbreviations T and TE, respectively. An accurate model has a high R value, while the statistical errors are low. R has been recommended by the researchers to assess the linear dependency among input and output parameters [75], with a value greater than 0.8 indicating a decent correlation between experimental and predicted values [41,76]. Due to the insensitivity of R with the division and multiplication of outputs with a constant, it could not be considered solely as a measure for overall model efficiency. The average magnitude of errors can be measured using MAE and RMSE. Both parameters, however, have their own implication. In RMSE, errors are squared before average is estimated, giving larger errors more weight. A high RMSE value indicates that the amount of high-error predictions is significantly more than the expected and should be excluded. MAE, however, gives large errors a low weight and is always smaller than RMSE. Likewise, Despotovic et al., (2016) recommended that a model is considered to be outstanding if RRMSE values are between 0 and 0.10 and fair if the value is between 0.11 and 0.20 [77]. The range of values for OF and ρ is 0-infinity. If the values of ρ and OF are less than 0.2, the model can be considered as good [66]. While using OF, it considers three factors at the same time, i.e., R and RRMSE with relative percentage of data in various datasets (training and testing). As a result, a low value of OF indicates that the model's overall performance is superior. As stated previously, numerous trial runs were carried out, and the models with the lowest values of OF stated in this research study. Additionally, the validation of developed models was also carried out using criteria suggested by various researchers, which are described in Table 6. Table 6. External validation criteria.

Discussions
The models for MS and MF of ABC and AWC, respectively, were developed using MEP programming in MEPX v 2021.08.05.0-beta as already explained in Section 3. The results obtained from the equations obtained from the developed models and their analysis is described in this chapter.

Formulation of Mechanical Properties
The decoded mathematical equations for the calculation of corresponding properties based on eight input parameters are taken from MEP for ABC and AWC. The explicit expressions for MS and MF of ABC and MS and MF of AWC, respectively, are shown in Equations (8) where a = Ps (%) + cos(VMA (%)) b = cos(VMA (%)) + cos(VMA (%))  Figures 3 and 4 below for all of the datasets, i.e., training, validation and testing. Additionally, regression line expression is also displayed in these graphs. For an ideal scenario, the slope of the regression line should be close to 1. It can be considered from Figures 3 and 4 below that the developed models have a significant correlation between the values of predicted and experimental data as manifested by the slopes of 0.9742, 0.9759, and 0.9624 for training, validation, and testing, respectively, for MS of ABC; 0.9742, 0.9759, and 0.9624 for training, validation, and testing, respectively, for MF of ABC; 0.9530, 0.9714, and 0.9029 for training, validation, and testing, respectively, for MS of AWC; and 0.9581, 0.9783, and 0.9727 for training, validation, and testing, respectively, for MF of AWC. The developed models have been trained thoroughly on the input parameters in order to effectively predict the values of MS and MF for ABC and AWC, respectively. Furthermore, for the data points for all three datasets, the results are quite comparable to one another and close to an ideal fit, showing that the models have been well trained and possess strong generalization capability, i.e., the performance of the models on unseen data will be equally well. The models have performed exceptionally well on the testing data. This demonstrates that the problem regarding the overfitting of the data points employed for modeling has been greatly removed in all the models. Moreover, the number of data points employed for such empirical models has a significant impact on the applicability and accuracy of such models [78]. The largest number of points, i.e., 241, have been chosen in the collated database for MS and MF of AWC; hence, minimum errors with high accuracy have been achieved. for MF of ABC; 0.9530, 0.9714, and 0.9029 for training, validation, and testing, respectively, for MS of AWC; and 0.9581, 0.9783, and 0.9727 for training, validation, and testing, respectively, for MF of AWC. The developed models have been trained thoroughly on the input parameters in order to effectively predict the values of MS and MF for ABC and AWC, respectively. Furthermore, for the data points for all three datasets, the results are quite comparable to one another and close to an ideal fit, showing that the models have been well trained and possess strong generalization capability, i.e., the performance of the models on unseen data will be equally well. The models have performed exceptionally well on the testing data. This demonstrates that the problem regarding the overfitting of the data points employed for modeling has been greatly removed in all the models. Moreover, the number of data points employed for such empirical models has a significant impact on the applicability and accuracy of such models [78]. The largest number of points, i.e., 241, have been chosen in the collated database for MS and MF of AWC; hence, minimum errors with high accuracy have been achieved.   for MS of ABC; 0.9742, 0.9759, and 0.9624 for training, validation, and testing, respectively, for MF of ABC; 0.9530, 0.9714, and 0.9029 for training, validation, and testing, respectively, for MS of AWC; and 0.9581, 0.9783, and 0.9727 for training, validation, and testing, respectively, for MF of AWC. The developed models have been trained thoroughly on the input parameters in order to effectively predict the values of MS and MF for ABC and AWC, respectively. Furthermore, for the data points for all three datasets, the results are quite comparable to one another and close to an ideal fit, showing that the models have been well trained and possess strong generalization capability, i.e., the performance of the models on unseen data will be equally well. The models have performed exceptionally well on the testing data. This demonstrates that the problem regarding the overfitting of the data points employed for modeling has been greatly removed in all the models. Moreover, the number of data points employed for such empirical models has a significant impact on the applicability and accuracy of such models [78]. The largest number of points, i.e., 241, have been chosen in the collated database for MS and MF of AWC; hence, minimum errors with high accuracy have been achieved.

Performance Evaluation of MEP Models
The volume of datasets used in developing the models are also important because the reliability of developed models is dependent on these. The literature recommends a ratio of greater than 5 for the number of data points to the number of input parameters used in both unseen and training (testing and validation) stages [66,79,80]. For the training stage, the ABC and AWC models have a ratio of 22.13 and 30.13, respectively. While in the testing stage, the ABC and AWC models have values of 5.75 and 6.38, respectively. Statistical measures such as R, MAE, RSE, RMSE, RRMSE, OF, and ρ are used to evaluate the performance of the developed models, as explained in Section 3. Table 7 below shows the results of these statistical measures for training, testing, and validation for the ABC and AWC models. It is seen from Table 7  The results of ρ is less than 0.20 for all three datasets for all four developed models, demonstrating that all four developed models are reliable and have the capability to predict output parameters.
For the ABC and AWC models, the values of OF are 0.033 and 0.046, respectively. These values are almost near to zero, verifying overall performance and demonstrating that the problems regarding the overfitting for the proposed models has been addressed properly.
To analyze the statistics of absolute errors, the datasets for all four developed models are plotted in Figures 5-8 below, together with absolute errors in their respective points of the datasets. The mean error of predicted values for ABC-MF and AWC-MF is 0.62 and 0.35, with a maximum error of 3.59 and 1.64, respectively. Out of 253 data points for ABC-MS, only four points have a value greater than 100 kg, which accounts for 1.58% of the total data points, while for AWC-MS, only eight points have a value greater than 80 kg out of 343 data points, which accounts for 2.33% of the total data points. According to the findings, it was concluded that for ABC-MS, ABC-MF, AWC-MS, and AWC-MF, 85% of the results have errors less than 67 kg, 1.05 (0.25 mm) and 48 kg, 0.63 (0.25 mm), respectively. MS, only four points have a value greater than 100 kg, which accounts for 1.58% of the total data points, while for AWC-MS, only eight points have a value greater than 80 kg out of 343 data points, which accounts for 2.33% of the total data points. According to the findings, it was concluded that for ABC-MS, ABC-MF, AWC-MS, and AWC-MF, 85% of the results have errors less than 67 kg, 1.05 (0.25 mm) and 48 kg, 0.63 (0.25 mm), respectively.   total data points, while for AWC-MS, only eight points have a value greater than 80 kg out of 343 data points, which accounts for 2.33% of the total data points. According to the findings, it was concluded that for ABC-MS, ABC-MF, AWC-MS, and AWC-MF, 85% of the results have errors less than 67 kg, 1.05 (0.25 mm) and 48 kg, 0.63 (0.25 mm), respectively.       Table 8 below shows the results of the additional criteria utilizing external validation of the developed models. The slopes of the regression lines (k or k') running from the origin have been suggested to be close to one [81]. A confirmed indicator (Rm) was proposed by Roy and Roy (2008) as a measure of the model's external predictability. When Rm > 0.5, this criteria is satisfied [82]. Table 8 below indicates the all four models have satisfied the criteria considered in the external validation of the models, indicating that the all the developed models are credible, and these are not a mere correlation between input and output parameters.

Parametric Analysis
In the case of model development based on AI, numerous analyses are required to guarantee that models are strong and that their performance is well across a variety of  Table 8 below shows the results of the additional criteria utilizing external validation of the developed models. The slopes of the regression lines (k or k ) running from the origin have been suggested to be close to one [81]. A confirmed indicator (Rm) was proposed by Roy and Roy (2008) as a measure of the model's external predictability. When Rm > 0.5, this criteria is satisfied [82]. Table 8 below indicates the all four models have satisfied the criteria considered in the external validation of the models, indicating that the all the developed models are credible, and these are not a mere correlation between input and output parameters.

Parametric Analysis
In the case of model development based on AI, numerous analyses are required to guarantee that models are strong and that their performance is well across a variety of datasets. A higher performance of the current datasets, i.e., training, testing, and validation, does not imply that the models are superior overall. Numerous research has advocated for the use of parametric analysis, and it is used in this research study to determine whether the models are well trained and not only a correlation of the input and output parameters. All the input parameters are set to their mean value, and the output variation is plotted against the change in one of the input parameters over its entire range. This procedure is carried out for each of the input parameters separately. The results of the parametric analysis for MS of ABC and AWC and MF of ABC and AWC for their respective developed models are shown in Figures 9 and 10, respectively.
tive developed models are shown in Figures 9 and 10, respectively.
The equations of MS and MF of ABC and AWC as developed by MEP does not use all the input parameters in each equation. The MEP algorithms have chosen the input parameters that have given the best results. Based on the developed equations of MEP models, only the parameters drawn in Figures 9 and 10 are used in the equations developed by MEP and have a significant impact on the output parameters.   tive developed models are shown in Figures 9 and 10, respectively.
The equations of MS and MF of ABC and AWC as developed by MEP does not use all the input parameters in each equation. The MEP algorithms have chosen the input parameters that have given the best results. Based on the developed equations of MEP models, only the parameters drawn in Figures 9 and 10 are used in the equations developed by MEP and have a significant impact on the output parameters.    Figure 9 shows that as Ps (%), Pb (%), and Va (%) increases, MS increases to a point and subsequently drops. Additionally, it has been witnessed in the collected datasets that when Ps (%), Pb (%), and Va (%) increase, MS first increases and then drops. MS is likewise seen to decrease linearly when Gsb and VMA (%) increase. The collected dataset also shows that as Gsb and VMA (%) increases, MS decreases. MS increases linearly as VFA (%) increases. In addition, it has been observed in the collected datasets that when VFA (%) increases, MS increases as well. Previous research studies have found similar trends in the parametric analysis of MS [8].

Marshall Flow Analysis
MF increases with increasing Pb (%), Gmb, and VFA (%), as shown in Figure 10. In the collected datasets, it was also discovered that when Pb (%), Gmb, and VFA (%) increases, MF increase as well. MF decreases linearly when Gsb, Gmm, Va (%), and VMA (%) increase. The collected datasets also show that as Gsb, Gmm, Va (%), and VMA (%) increase, MF decreases. Previous research studies have found similar trends in the parametric analysis of MS [8].

Conclusions
This research study utilizes MEP, an innovative AI technique in the area of pavement engineering to develop predictive models for MS and MF of M2DM for the ABC and AWC of flexible pavements. For this reason, wide and comprehensive datasets were produced from various projects in Pakistan. The researchers developed high-precision models, and the following are the main conclusions of this research study:

•
The developed models have produced results that are consistent with the experimental data and function equally well for unknown data.

•
Various performance measures such as R, RRMSE, RSE, RMSE, MAE were used to assess the reliability and correction of the developed models. Furthermore, OF and ρ showed highly generalization capability of the developed models, with the issue of overfitting effectively addressed. The results of the statistical parameters validated the accuracy of the proposed MEP developed models. The developed models also met a number of external validation criteria taken from the literature.

•
The models developed have successfully incorporated input parameters and have the capability to predict the trends of MS and MF for flexible pavements, as revealed from the parametric study. • It is convincing from the modeling approach being proposed, i.e., MEP in conjunction with validation parameters, that MEP can be utilized for predicting the Marshall parameters.
It is suggested that different AI techniques such as SVM, Ensemble random forest regression, eXtreme gradient boosting (XGBoost), GEP, and ANFIS should be used to predict MS and MF and should then be compared to each other to see which AI technique is more efficient in predicting MS and MF.
It is recommended that bitumen with different penetration grades such as 85/100 and 45/50 be tested on AI-based Marshall parameter modeling.
The most influential parameter in M2DM is the grading of aggregates, whereas the impact of grading on M2DM has to be discussed by various researchers. Hence, finding the influence of different types of grading on Marshall parameters using various AI methods is also recommended.