Stiffness Moduli Modelling and Prediction in Four-Point Bending of Asphalt Mixtures: A Machine Learning-Based Framework

Abstract: Stiffness modulus represents one of the most important parameters for the mechanical characterization of asphalt mixtures (AMs). At the same time, it is a crucial input parameter in the process of designing flexible pavements. In the present study, two selected mixtures were thoroughly investigated in an experimental trial carried out by means of a four-point bending test (4PBT) apparatus. The mixtures were prepared using spilite aggregate, a conventional 50/70 penetration grade bitumen, and limestone filler. Their stiffness moduli (SM) were determined while samples were exposed to 11 loading frequencies (from 0.1 to 50 Hz) and 4 testing temperatures (from 0 to 30 °C). The SM values ranged from 1222 to 24,133 MPa. Observations were recorded and used to develop a machine learning (ML) model. The main scope was the prediction of the stiffness moduli based on the volumetric properties and testing conditions of the corresponding mixtures, which would provide the advantage of reducing the laboratory efforts required to determine them. Two of the main soft computing techniques were investigated to accomplish this task, namely decision trees with the Categorical Boosting algorithm and artificial neural networks. The outcomes suggest that both ML methodologies achieved very good results, with Categorical Boosting showing better performance (MAPE = 3.41% and R² = 0.9968) and resulting in more accurate and reliable predictions in terms of the six goodness-of-fit metrics that were implemented.


Introduction
In the civil engineering domain, transportation infrastructure represents one of the major application fields, and flexible pavements prepared with asphalt concretes are still the most used technological solutions in every road network all over the world [1]. To evaluate whether an asphalt mixture (AM) is suitable for civil engineering applications, road agencies charged with pavement maintenance and construction often refer to performance indicators related to the mechanical behavior of the mixture under consideration. A failure to reach the standard thresholds related to the parameters considered could result in unsuitable mechanical behavior, causing the formation of typical fatigue or low-temperature cracks that would inevitably reduce the service life of the pavement. Serviceability would then be compromised, thus causing serious safety-related issues for every road user. For this reason, it is crucial to properly characterize the mechanical behavior of asphalt mixtures from a performance-based perspective [2][3][4].
The conventionally adopted approach is experimental: it allows the mechanical performances of the investigated AMs to be accurately evaluated, but it is time-consuming, and very expensive laboratory tests are often required. Furthermore, this approach has other drawbacks, such as (i) the need to repeat the experimental campaign when any compositional variable of the investigated mixture changes, with resulting impacts on time and costs; and (ii) the need for skilled laboratory technicians who are familiar with all the test protocols [5][6][7][8][9][10].
To overcome these drawbacks, the efforts of the scientific community have been focused on the development of mathematical models that allow each parameter involved in the mechanical characterization of a mixture to be individually handled. This achievement involved identifying complex constitutive equations that, embedded within specialized materials mechanics software, allow the performance parameters of a mixture to be accurately predicted [11][12][13][14][15][16][17][18]. These advanced constitutive models provide a detailed understanding of the mechanical responses of bituminous mixtures, coupled with useful methodologies for pavement monitoring [19].
However, in recent years, alternative non-physically-based models which use soft computing techniques have gained popularity within the academic community. Unlike constitutive equations, these machine learning (ML) models do not depend on the nature of a problem, and they are capable of modelling even very different phenomena [20]. They do not require a priori knowledge of the relationships between inputs and their corresponding outputs, but they still allow accurate, fast, and reliable predictions to be produced [21][22][23][24].
Some of the most commonly used soft computing techniques are based on artificial neural networks (ANNs) or decision trees (DTs). The functioning of the former is intended to mimic that of the human biological nervous system by using simple elementary units named artificial neurons. Being highly interconnected and organized in successive layers, artificial neurons allow neural networks to model even highly nonlinear phenomena by producing both fast and accurate predictions in terms of several output variables, namely permeability, interface shear stiffness modulus, compressive strength, stress at failure, and so on [25][26][27][28][29][30][31][32][33][34]. However, the difficulty of interpreting neural models, coupled with the challenges associated with the optimization of the multiple hyperparameters involved [35], makes them somewhat difficult to handle.
In contrast, simple decision rules allow DT-based algorithms to produce equally accurate predictions [36,37], making them competitive with those produced by ANN models, but without the interpretability issues. This factor makes them preferable for solving many regression or classification problems [38]. Therefore, two different soft computing techniques were investigated and implemented in the present study: the former involved the development of an ANN-based model, whereas the latter was based on a DT algorithm known in the scientific literature as CatBoost. By way of example, in recent years, CatBoost was successfully implemented to predict the Pavement Condition Index values of asphalt concrete overlays [39].
The main purposes of the present study are (i) to mechanically characterize two different asphalt mixtures for pavement construction by investigating a fundamental behavioral parameter, such as stiffness modulus (SM); and (ii) to model and predict the performance of each mixture using ANN- and CatBoost-based ML algorithms. The two investigated mixtures were designed for binder and base layers and were prepared using spilite aggregate, a conventional 50/70 penetration grade bitumen, and limestone filler. An extensive four-point bending test (4PBT) experimental campaign was carried out to determine the stiffness modulus values of the mixtures under 11 loading frequencies and 4 testing temperatures, ranging from 0.1 to 50 Hz and from 0 to 30 °C, respectively. Three to five specimens were tested for each condition, and the results were averaged to obtain the dataset subsequently used to train and test the two different ML models.
State-of-the-art procedures were implemented in both the developed methodologies, including k-fold cross-validation, overfitting detection, and extensive grid searches to find the best hyperparameter sets for each algorithm. Six different performance metrics were investigated to determine the accuracy of the developed models and to evaluate their generalization capabilities, namely mean absolute error, mean absolute percentage error, mean squared error, root mean squared error, Pearson correlation coefficient, and determination coefficient.
The results obtained emphasize that the developed ML model can provide accurate and reliable stiffness modulus predictions, thus allowing these values to then be exploited within the well-established design procedures. This represents the main contribution provided by the present study to the existing scientific literature. In addition, the sensitivity analysis that was carried out confirms that ML models are able to understand the functional relationships between the variables that were investigated, thus inspiring future scientific applications.
The remainder of the paper is organized as follows: Section 2 describes the volumetric characterization of the prepared mixtures and the experimental campaign that was carried out. Section 3 provides an overall explanation of the ML framework, explaining all the pre-processing and resampling techniques that were implemented in the developed models. Section 4 describes the obtained predictive results and compares the performances of the CatBoost and ANN models. Finally, Section 5 outlines the main conclusions and points out future developments.

Materials and Methods
Two asphalt mixtures were investigated in the present study. They were designed for binder and base course layers, respectively. The former involved the utilization of spilite aggregate with a nominal maximum aggregate size of 16 mm (AML16) obtained from the Bělice quarry (Benešov, Czech Republic). The latter was also prepared using spilite aggregate, but the nominal maximum aggregate size was 22 mm (AMP22). The bitumen used to prepare both the mixtures was a conventional one, with a penetration grade (PG) at 25 °C ranging between 50 and 70 mm/10, meeting the technical specifications set by the European standard EN 12591 [40] for paving grade binders (see Table 1). The binder was obtained from the Litvinov refinery (Litvínov, Czech Republic). Finally, limestone filler from the Velke Hydcice quarry was used. The mix design followed the requirements set by Czech technical standard CSN 73 6121 [41]. The grading curves of both mixtures are given in Table 2. A volumetric characterization of both the mixtures was carried out, and the results are summarized in Table 3. The following additional requirements were also tested: (i) moisture susceptibility according to EN 12697-12 [42] determined at 15 °C; and (ii) stiffness tested on 6 cylindrical test specimens via repeated indirect tensile stress tests (IT-CY) according to EN 12697-26 [43] at 15 °C. An experimental trial using a four-point bending test (4PBT) apparatus was performed under several testing conditions. The test method principles and general conditions related to the testing apparatus and the test specimens that were used are given in Annex B.
The resulting stiffness values were investigated by exposing the prismatic specimens to different temperatures and frequencies. According to EN 12697-26 [43], the dimensions of the test specimens should be 405 × 50 × 50 mm. To determine stiffness, eleven different loading frequencies (from 0.1 to 50 Hz) and four different testing temperatures (from 0 to 30 °C) were selected. As for the AMP22 mixture, five testing temperatures were investigated since the stiffness moduli were also tested at 15 °C. The technical standard does not specifically set a fixed number of frequencies to be tested, and, similarly, particular temperatures are not prescribed. For national products, the standards may specify additional requirements, e.g., one particular temperature, which is then linked to the mechanistic modelling of the performance of the pavement structure. According to existing good practice in asphalt mix testing, several temperatures are usually used either to simulate the response of the mixture to the conditions occurring in the pavement (e.g., 0 °C for winter and 30 °C for summer) or, with at least three different temperatures, to plot the so-called master curve, in which the resulting stiffness values are shifted by applying the time-temperature superposition principle to one selected temperature. This curve is then used to interpret the behavior of the asphalt mix under the different conditions which can occur on a road, such as low or high temperatures and different traffic loading intensities. The latter effect is simulated in the test by the frequencies used, i.e., very low or very high frequencies simulate either slow- or fast-moving traffic, and the pavement is put through cyclic loading and resting periods. The frequency is changed according to the speed and intensity that are being simulated. To dynamically simulate this traffic loading effect, test methods such as 4PBT are able to determine a large variability in the responses of the specimen depending on the loading frequency and temperature.
The test setup and instrumentation for 4PBT are shown in Figure 1. For each temperature, and for the whole range of frequencies between 0.1 Hz and 50 Hz, three to five specimens were tested, and the average results are reported in Table 4. These outcomes constituted the dataset that was later used to train and test the developed ML algorithms.
CivilEng 2023, 4
With the aim of obtaining a detailed understanding of the collected dataset, a preliminary exploratory analysis was carried out by diagramming the stiffness modulus (SM) as a function of the testing temperature and loading frequency. This practice is useful in identifying potential existing relationships and correlations between different pairs of features [44]. The marked trends between the considered variables can be observed in Figure 2. Since the data came from the analysis of two different mixtures, cyan and red triangles were used to represent the data collected during the tests of the AML16 and AMP22 mixture specimens, respectively.
To quantify the correlations between SM and the influencing variables under consideration, Pearson correlation factors were computed, thus creating a so-called Pearson correlation matrix (Figure 3). This tool provides a measure of the strength of an estimated linear correlation between two given variables [45]. This strength can vary within a range from −1 to +1. The minus and plus signs represent inverse and direct proportionality, respectively. The closer the absolute value of the Pearson correlation factor is to unity, the stronger the correlation between the variables considered.
It can be observed that the testing temperature showed the strongest negative correlation with the stiffness modulus (R = −0.92), while the loading frequency displayed a small positive correlation (R = +0.28). A categorical variable was assigned to distinguish between the two mixtures, and it was subsequently encoded according to alphabetical order to determine its Pearson correlation factor with the SM values. As such, 0 and +1 were assigned to identify the AML16 and AMP22 mixtures, respectively. However, the Pearson correlation between the encoded categorical variable and the stiffness modulus was comparatively low (R = −0.04). No correlation could be found between the loading frequency, the testing temperature, and the categorical variable, thus allowing these three variables to be considered independently and therefore making them usable as inputs in subsequent predictive modeling steps [46].
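The correlation analysis described above can be illustrated with a minimal sketch in plain Python. The observations below are hypothetical placeholders (they are not the values of Table 4), and the helper name `pearson` is an assumption introduced only for illustration; the alphabetical 0/1 encoding of the mixture label follows the description in the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation factor between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observations, NOT the measured values reported in Table 4.
temperature = [0, 0, 10, 10, 20, 20, 30, 30]            # degrees Celsius
frequency = [1, 10, 1, 10, 1, 10, 1, 10]                # Hz
mixture = ["AML16", "AML16", "AMP22", "AMP22",
           "AMP22", "AMP22", "AML16", "AML16"]
stiffness = [18000, 21000, 12000, 15500, 6500, 9500, 2500, 4500]  # MPa

# Alphabetical label encoding: AML16 -> 0, AMP22 -> 1.
codes = {name: i for i, name in enumerate(sorted(set(mixture)))}
mixture_code = [codes[m] for m in mixture]

# Full Pearson correlation matrix as a nested dictionary.
columns = {"T": temperature, "f": frequency, "mix": mixture_code, "SM": stiffness}
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

In this toy dataset the three inputs were deliberately arranged to be mutually uncorrelated, which mirrors the independence condition discussed in the text.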

Categorical Boosting
This section focuses on the development of a predictive model called Categorical Boosting (or CatBoost) which is capable of predicting the mechanical behavior of AMs in terms of SM. The methodology is based on a decision tree ensemble approach in which each generated decision tree sequentially learns from previous ones to fine-tune its learning and improve its predictive performance [47].
In 2017, Yandex (Moscow, Russia) engineers proposed CatBoost as an advanced machine learning algorithm. Unlike standard gradient-boosting techniques, CatBoost employs innovative ordered boosting to address target leakage and prediction shift issues [48]. Furthermore, one of the most significant advantages of this algorithm lies in its ability to deal with various data formats and sizes, unlike other conventional ML techniques. CatBoost can automatically operate with categorical variables, encoding them without showing any conversion issues.
Overall, the main advantages of CatBoost can be summarized as follows [49,50]: (a) effectiveness in dealing with small-scale datasets and with categorical features; (b) high stability and efficiency; (c) less parameter tuning; and (d) oblivious tree building reduces overfitting, improving accuracy and generalizability.
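To make the notion of an oblivious tree more concrete, the following sketch evaluates a single oblivious (symmetric) decision tree and sums the scaled outputs of an ensemble of such trees. It is only an illustration written in plain Python, not the CatBoost implementation: the function names, split thresholds, leaf values, base value, and learning rate are all hypothetical.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate one oblivious (symmetric) decision tree.

    splits: one (feature_index, threshold) pair per depth level; all nodes
        on the same level share the same test, which makes the tree oblivious.
    leaf_values: 2 ** len(splits) values, indexed by the bit pattern formed
        by the outcomes of the level tests.
    """
    index = 0
    for feature, threshold in splits:
        index = (index << 1) | (1 if x[feature] > threshold else 0)
    return leaf_values[index]


def boosted_predict(x, trees, learning_rate, base_value):
    """Add the scaled corrections of all trees to a base prediction."""
    correction = sum(oblivious_tree_predict(x, s, v) for s, v in trees)
    return base_value + learning_rate * correction


# Hypothetical depth-2 tree on normalized inputs
# x = [temperature, frequency, mixture_code].
tree = ([(0, 0.5), (1, 0.5)], [4000.0, 6000.0, -6000.0, -4000.0])

cold_fast = [0.2, 0.8, 0]   # low temperature, high frequency
hot_slow = [0.9, 0.1, 1]    # high temperature, low frequency

print(boosted_predict(cold_fast, [tree], learning_rate=0.5, base_value=10000.0))
print(boosted_predict(hot_slow, [tree], learning_rate=0.5, base_value=10000.0))
```

Because every level applies a single test, a tree of depth d is fully described by d splits and 2^d leaf values, which keeps the individual learners simple.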

Artificial Neural Network
Unlike decision tree-based models, a neural model is based on an artificial neural network whose structure is similar to that of the human nervous system. Such mathematical models are characterized by multiple and interconnected artificial neurons which are typically organized in sequences of three different layers called input, hidden, and output [51]. The first layer is composed of as many neurons as there are pieces of input information, namely the variables influencing the SM. The second layer serves to process the input information, and the number of neurons of which it is composed determines the computational power of the neural network. Finally, the last layer provides the predicted value of the target variable, here referred to as the SM. Both the hidden and the output layers are equipped with activation functions. To deal with nonlinearities, the former usually employs a nonlinear activation function, while the latter usually employs a simple linear transformation.
Each connection is weighted, and the strength of this methodology lies in the variability of the connection weights, which are iteratively determined based on the loss function results achieved during the training phase. In general, for each iteration of the training algorithm, the output parameter estimation provided by a neural model (Equation (1)) can be described as follows:
$$\widehat{SM} = W_2 \, f_A\left(W_1 X\right) \qquad (1)$$
where $\widehat{SM}$ is the estimated output, $X$ denotes the input information vector, $f_A$ denotes the activation function of the hidden layer, and $W_1$ and $W_2$ denote the matrices of the weights related to the connections between the input and hidden layers and between the hidden and output layers, respectively. The mathematical formulations by which the weight matrices are accordingly adjusted are known as training algorithms, and the main ones are described in detail in the relevant scientific literature [52-54].
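The forward estimate described above (the input vector multiplied by the first weight matrix, passed through the hidden activation function, and then multiplied by the second weight matrix with a linear output) can be sketched as follows. The weights and inputs are hypothetical, ReLU is used as the hidden activation only as an example, and bias terms are omitted because the text does not define them.

```python
def relu(vector):
    """Element-wise ReLU, used here as an example of a nonlinear f_A."""
    return [max(0.0, v) for v in vector]


def matvec(matrix, vector):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def forward(x, w1, w2, activation=relu):
    """Single-hidden-layer estimate: W2 * f_A(W1 * x), linear output layer."""
    hidden = activation(matvec(w1, x))
    return matvec(w2, hidden)[0]


# Hypothetical weights: 3 inputs -> 2 hidden neurons -> 1 output.
W1 = [[1.0, -1.0, 0.0],
      [0.5, 0.5, 0.5]]
W2 = [[2.0, 0.5]]

# Normalized inputs: [temperature, frequency, mixture_code].
x = [0.2, 0.4, 1.0]
print(forward(x, W1, W2))
```

During training, only the entries of `W1` and `W2` change; the structure of the computation stays the same at every iteration.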
Over the years, machine learning models based on artificial neural networks have proven able to successfully approximate even highly nonlinear functions, returning outstanding performance in several pavement engineering applications [55][56][57].

Grid Search and k-Fold Cross-Validation
To enhance both the speed of convergence and the accuracy of the model, each feature underwent min-max normalization. This pre-processing procedure ensured that the observation values were scaled to a range between 0 and +1, according to Equation (2):
$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (2)$$
where, for each variable, $x_{norm}$ is the normalized value, $x$ is the observed value, and $x_{min}$ and $x_{max}$ represent its minimum and maximum values, respectively. Subsequently, roughly 75% of the dataset was used to train the models (75 observations out of 99). A fivefold cross-validation technique was employed to fairly evaluate the resulting training and validation performances of both models (CatBoost and ANN), and to subsequently optimize their parameters according to a grid search [58]. Detailed descriptions of the hyperparameters and search ranges of each model are provided in Table 5. Once the best hyperparameter combination was identified, the best-calibrated model was tested on an independent testing set made from the remaining 24 observations (roughly 25% of the dataset). This allowed several goodness-of-fit measures to be determined, including mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), root mean squared error (RMSE), Pearson correlation coefficient (R), and determination coefficient (R²) [47].
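The pre-processing and resampling steps described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the function names, the random seed, and the hyperparameter grid are hypothetical (the actual search ranges are those of Table 5), and the split is a simple random shuffle.

```python
import random
from itertools import product


def min_max_normalize(values):
    """Scale a list of observations to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def train_test_indices(n_observations, n_train, seed=0):
    """Randomly split observation indices into training and testing sets."""
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


def k_fold(indices, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation


def parameter_grid(search_space):
    """Enumerate every hyperparameter combination of a grid search."""
    names = list(search_space)
    for combination in product(*(search_space[name] for name in names)):
        yield dict(zip(names, combination))


temperatures_norm = min_max_normalize([0, 10, 20, 30])
train_idx, test_idx = train_test_indices(99, 75)   # 75 training, 24 testing
folds = list(k_fold(train_idx, k=5))

# Hypothetical grid, for illustration only.
grid = list(parameter_grid({"depth": [2, 3, 4], "learning_rate": [0.01, 0.1]}))
print(len(train_idx), len(test_idx), len(folds), len(grid))
```

In a full workflow, each combination produced by `parameter_grid` would be scored on the five validation folds, and the combination with the best average score would be retained before the final evaluation on the testing set.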
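Their mathematical formulations (Equations (3)-(8)) are presented below:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|SM_i^T - SM_i^P\right| \qquad (3)$$
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{SM_i^T - SM_i^P}{SM_i^T}\right| \qquad (4)$$
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(SM_i^T - SM_i^P\right)^2 \qquad (5)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(SM_i^T - SM_i^P\right)^2} \qquad (6)$$
$$R = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{SM_i^T - \mu_T}{\sigma_T}\right)\left(\frac{SM_i^P - \mu_P}{\sigma_P}\right) \qquad (7)$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(SM_i^T - SM_i^P\right)^2}{\sum_{i=1}^{n}\left(SM_i^T - \mu_T\right)^2} \qquad (8)$$
where $SM_i^T$ is the actual i-th value of the stiffness modulus, $SM_i^P$ is the model-predicted stiffness modulus value for the i-th observation, $i$ is the observation index, $n$ is the total number of observations, and $\mu$ and $\sigma$ are the mean and the standard deviation values, respectively (the subscripts T and P refer to the actual and predicted values). An overfitting detection algorithm was also implemented to avoid any potential overfitting phenomena. During the training phase, before proceeding with the subsequent iteration, both ML models determined their corresponding best loss value coupled with the number of iterations since this optimal value was achieved. If the latter exceeded a predetermined upper threshold (here set as 20), the algorithm automatically stopped the training phase.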
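The overfitting detection rule described above can be sketched as a simple patience-based stopping criterion. The function name and the sequence of loss values are hypothetical; only the threshold of 20 iterations is taken from the text, and the sketch assumes that the monitored loss is to be minimized.

```python
def detect_overfitting(losses, patience=20):
    """Scan a sequence of per-iteration loss values.

    Returns (best_iteration, best_loss, last_iteration), where last_iteration
    is the iteration at which training stops: either because the number of
    iterations since the best loss exceeded `patience`, or because the
    sequence ended.
    """
    best_loss = float("inf")
    best_iteration = -1
    last_iteration = -1
    for iteration, loss in enumerate(losses):
        last_iteration = iteration
        if loss < best_loss:
            best_loss, best_iteration = loss, iteration
        elif iteration - best_iteration > patience:
            break  # no improvement for more than `patience` iterations
    return best_iteration, best_loss, last_iteration


# Hypothetical loss history: improves for 10 iterations, then stagnates.
history = [1.0 / (i + 1) for i in range(10)] + [0.2] * 50
print(detect_overfitting(history, patience=20))
```

With this rule, the model state associated with the best iteration is the one that would normally be kept, since later iterations brought no improvement.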

Results and Discussion
To justify the choice of the CatBoost approach and highlight its better performance compared with that of the ANN model, an in-depth comparative analysis was carried out in the present study. It is important to mention that both models were calibrated and subsequently tested using the same observations to ensure a fair performance comparison in terms of the same goodness-of-fit measures.
A one-to-one visual comparison between the experimental target vector, the outputs predicted by CatBoost, and those predicted by the ANN can be observed in Figure 4. To gain a deeper comprehension of the capabilities of each model, regression diagrams for the two outlined models can be observed in Figure 5. The observed experimental values and the corresponding model predictions are plotted on the x-axis and the y-axis, respectively. The 45-degree black line represents perfect accuracy, i.e., 100% accurate predictions. Training, validation, and testing observations are denoted by red triangles, green crosses, and blue circles, respectively. A testing Pearson correlation coefficient for each model is displayed at the top of the corresponding regression diagram. It can be observed that the predictions of both models are closely aligned to the perfect accuracy line, which further confirms the excellent results that we obtained. Again, the CatBoost predictions showed a slightly higher accuracy, highlighted by a higher Pearson correlation coefficient (0.9990). This condition is further highlighted in Figure 6, in which only the results obtained during the testing phase are shown.
A detailed comparison of the results achieved by the two outlined models can be found in Table 6, described in terms of the six goodness-of-fit measures. As was previously mentioned, to ensure a fair comparison, both ML models were calibrated using the same observations, and both employed the same pre-processing and resampling algorithms.
Based on the results obtained during the testing phase, it can be observed that the difference between the two models in terms of prediction accuracy is roughly an order of magnitude for each goodness-of-fit metric. The CatBoost model displayed the lowest error metric values (MAE, MAPE, MSE, and RMSE) and the highest correlation metric values (R and R²), as is shown in Table 6.
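For readers who wish to reproduce this kind of comparison, the six goodness-of-fit measures can be computed with a short plain-Python function such as the one sketched below. The definitions are the standard ones and are assumed here (in particular, the sample standard deviation with n − 1 in the Pearson coefficient, and R² computed as one minus the ratio of residual to total sum of squares); the function name and the example values are hypothetical and do not correspond to Table 6.

```python
import math


def goodness_of_fit(actual, predicted):
    """Return MAE, MAPE (%), MSE, RMSE, Pearson R and R2 as a dictionary."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 / n * sum(abs(e / a) for e, a in zip(errors, actual))
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Sample standard deviations (n - 1 in the denominator).
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / (n - 1))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / (n - 1))
    r = sum(((a - mean_a) / std_a) * ((p - mean_p) / std_p)
            for a, p in zip(actual, predicted)) / (n - 1)

    r2 = 1.0 - sum(e * e for e in errors) / sum((a - mean_a) ** 2 for a in actual)
    return {"MAE": mae, "MAPE": mape, "MSE": mse, "RMSE": rmse, "R": r, "R2": r2}


# Hypothetical stiffness moduli in MPa, for illustration only.
measured = [1000.0, 2000.0, 4000.0]
estimated = [1100.0, 1900.0, 4000.0]
metrics = goodness_of_fit(measured, estimated)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that, with these definitions, R² is not simply the square of R, so the two correlation metrics can rank models slightly differently.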
Although the performance of each model was remarkable, it can be concluded that the most reliable and successful one was CatBoost. For this reason, we decided to perform a sensitivity analysis of the results obtained in terms of SM with respect to the input variables of the model, namely a categorical variable identifying the mixture, loading frequency, and testing temperature. Determining the influence of a specific feature on the predictions of an ML model can be difficult. To address this challenge, an algorithm for the feature importance calculation was implemented. This algorithm helps determine how much the CatBoost predictions change on average when the values of a particular feature are modified. A higher importance value indicates a greater prediction variation when that particular feature is modified. The importance values were also normalized to return a total importance value equal to 100%. As can be observed in Figure 7, the testing temperature had the highest importance value (79.82%), followed by the loading frequency (18.93%) and the categorical variable (1.25%).
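The idea behind this kind of sensitivity analysis can be illustrated with a simplified, permutation-style sketch: each feature is shuffled in turn, the average absolute change in the predictions is recorded, and the results are normalized to 100%. This is a stand-in technique chosen for illustration, not the feature importance algorithm built into CatBoost, and the linear `toy_model`, its coefficients, and the input grid are entirely hypothetical.

```python
import random


def prediction_change_importance(predict, rows, seed=0):
    """Average absolute prediction change when each feature is shuffled,
    normalized so that the importance values sum to 100%."""
    rng = random.Random(seed)
    baseline = [predict(row) for row in rows]
    raw = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Replace only feature j with its shuffled values.
        changed = [predict(row[:j] + [value] + row[j + 1:])
                   for row, value in zip(rows, column)]
        raw.append(sum(abs(c - b) for c, b in zip(changed, baseline)) / len(rows))
    total = sum(raw)
    return [100.0 * value / total for value in raw]


def toy_model(row):
    """Hypothetical stiffness model on normalized inputs
    [temperature, frequency, mixture_code]; coefficients are invented."""
    temperature, frequency, mixture_code = row
    return 20000.0 - 15000.0 * temperature + 4000.0 * frequency - 200.0 * mixture_code


# Hypothetical full-factorial grid of normalized testing conditions.
rows = [[t, f, float(m)]
        for t in (0.0, 1 / 3, 2 / 3, 1.0)
        for f in (0.0, 0.5, 1.0)
        for m in (0, 1)]

importance = prediction_change_importance(toy_model, rows)
for name, value in zip(("temperature", "frequency", "mixture"), importance):
    print(f"{name}: {value:.2f}%")
```

As in the analysis reported above, a feature whose modification barely alters the predictions receives an importance close to zero, regardless of how it was encoded.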
The results that were obtained show how the SM values are mainly dependent on the testing temperature (as was expected from the Pearson correlation matrix results), and secondarily dependent on the loading frequency. The categorical variable introduced to distinguish the two mixture types is of minor importance, probably because of the quite similar mechanical behavior exhibited by the two investigated mixtures.
In this respect, it should be mentioned that all the results described in the present study refer to the experimental trial under consideration. Therefore, any other application that differs from the proposed one will necessarily require new calibration efforts and new research to identify the best model hyperparameters.

Conclusions
To produce high-performance pavement for road infrastructure, the stiffness modulus (SM) of the asphalt mixture needs to be determined during the mix design procedures since it represents a crucial mechanical parameter. Expensive and time-consuming experimental trials are usually carried out to this end, but machine learning-based methodologies could provide useful tools to reduce laboratory efforts. If properly developed and implemented, reliable SM predictions of asphalt mixture performance could be provided by these soft computing techniques, thus helping pavement engineers during data analysis procedures.
The current paper aimed to discuss two different approaches to modeling and predicting the stiffness modulus in four-point bending tests (4PBTs). The first approach is based on decision trees and is named Categorical Boosting (CatBoost), and the second relies on artificial neural networks (ANNs). Both methodologies were developed using Python 3.9.12 [60]. The experimental campaign involved two selected mixtures: one developed for binder layers and the other for base layers. Both were prepared using standard 50/70 PG bitumen, spilite aggregate, and limestone filler. Three to five specimens were tested for each of the eleven loading frequencies (from 0.1 to 50 Hz) and four testing temperatures (from 0 to 30 °C). The obtained results were averaged and used to train and subsequently test the developed ML models. The following conclusions can be drawn:

•
Based on mixture composition and testing conditions, both the models were able to reliably predict the resulting stiffness modulus of each mixture, properly balancing accuracy and generalizability.This was ensured by the careful optimization of the hyperparameters of both models using three different algorithms, namely an extensive grid search, a five-fold cross-validation, and an overfitting detection.

• The optimal CatBoost model was characterized by a maximum tree depth of 3, a learning rate of 0.01, and a maximum number of training iterations of 5000. Conversely, the optimal ANN model involved the Adam solver, and its architecture was characterized by 38 hidden neurons, a ReLU activation function, and a maximum number of training iterations of 1000.

• Based on six goodness-of-fit metrics, CatBoost proved to be the more suitable algorithm to model the phenomena under investigation, outperforming the ANN. Its predictions were characterized by outstanding accuracy, expressed by MAE, MAPE, and R² values equal to 300.49 MPa, 3.41%, and 0.9968, respectively. The corresponding ANN error metrics were roughly an order of magnitude higher, resulting in a comparatively lower prediction accuracy.

• A sensitivity analysis carried out on the CatBoost model revealed that the testing temperature had the strongest influence on the SM predictions (79.82% of total importance), followed by the loading frequency (18.93%) and the categorical variable (1.25%).
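
To make the accuracy figures quoted in the conclusions above easier to interpret, the following minimal sketch shows how three of the reported goodness-of-fit metrics (MAE, MAPE, and R²) are commonly computed. The function names and the sample values are illustrative assumptions and are not taken from the study. R² is computed here as the coefficient of determination, which is assumed to match the definition adopted in the paper.

```python
def mae(observed, predicted):
    """Mean absolute error, in the same unit as the data (MPa for SM)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical stiffness moduli in MPa, for illustration only (not study data)
observed_sm = [1000.0, 2000.0, 4000.0]
predicted_sm = [1100.0, 1900.0, 4000.0]

print(f"MAE  = {mae(observed_sm, predicted_sm):.2f} MPa")
print(f"MAPE = {mape(observed_sm, predicted_sm):.2f} %")
print(f"R2   = {r_squared(observed_sm, predicted_sm):.4f}")
```

Because MAE is expressed in MPa while MAPE is relative, the two should be read together: with SM values spanning from roughly 1000 to over 24,000 MPa, the same absolute error weighs far more heavily on the softest test conditions.
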
The outlined methodology represents an encouraging starting point, and it could help pavement engineering professionals enhance mix design procedures.
As for future developments, the present research could be expanded in many ways: (i) the generalization capabilities of the CatBoost model could be improved by collecting more observations from several alternative mixtures; (ii) different optimization algorithms could be further investigated to find new potential optimal solutions for the best hyperparameter combinations; and (iii) other alternative modelling variables (such as fatigue or permanent deformation resistance) could be implemented as output, thus allowing a comprehensive mechanical characterization to be properly carried out.
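
As a baseline against which the alternative optimization algorithms mentioned in point (ii) could be compared, the combination of an exhaustive grid search with five-fold cross-validation can be sketched as follows. This is a sketch under stated assumptions and not the authors' implementation: a simple k-nearest-neighbour regressor stands in for CatBoost and the ANN, the data are synthetic, and all function names (`k_fold_indices`, `knn_predict`, `cv_mae`, `grid_search`) are hypothetical. Overfitting detection is not included.

```python
import itertools
import random
import statistics


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, query, k):
    """Stand-in model: average the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return statistics.fmean(train_y[i] for i in nearest)


def cv_mae(x, y, k, folds):
    """Mean absolute error averaged over the validation folds."""
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        errors = [abs(y[i] - knn_predict(train_x, train_y, x[i], k)) for i in fold]
        fold_errors.append(statistics.fmean(errors))
    return statistics.fmean(fold_errors)


def grid_search(x, y, grid, n_folds=5):
    """Score every hyperparameter combination in the grid and return the best one."""
    folds = k_fold_indices(len(x), n_folds)
    names = sorted(grid)
    results = {}
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results[values] = cv_mae(x, y, folds=folds, **params)
    best = min(results, key=results.get)
    return dict(zip(names, best)), results[best]


# Synthetic toy data (not from the study): a linear trend plus Gaussian noise
rng = random.Random(42)
x = [i / 4 for i in range(40)]
y = [3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

best_params, best_score = grid_search(x, y, {"k": [1, 3, 5, 7]})
print(best_params, round(best_score, 3))
```

Every candidate is scored on the same folds, so differences in the cross-validated error reflect the hyperparameters rather than the data split. The cost grows with the product of the grid sizes, which is the main motivation for exploring other optimizers.
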

Figure 3. Pearson correlation factors between SM and its influencing variables.
In Figure 4, the histogram displays black bars representing the test set values, grey bars representing the CatBoost predictions, and cyan bars representing the ANN predictions. The horizontal axis shows the IDs of the 24 observations that make up the test set. Both the ANN and CatBoost predictions closely match the observed SM experimental data, with CatBoost performing slightly better than the ANN. This finding is significant since it highlights the accuracy and reliability of both developed models.

CivilEng 2023, 4, FOR PEER REVIEW 10

Figure 4. Observed SM values and those predicted by CatBoost and the ANN.

Figure 5. Regression plots for both the CatBoost (left) and the ANN (right) models.

This condition is further highlighted in Figure 6, in which only the results obtained during the testing phase are shown. Based on the same testing vector, the predictions made by the CatBoost model (grey circles) are closer to the perfect accuracy line than those made by the ANN model (cyan circles).

Figure 6. Direct comparison between the CatBoost and ANN model predictions.

Table 4. Outcomes of the 4PBT experimental trial carried out on AML16 and AMP22 mixtures.

Table 5. Grid search for the best hyperparameters for each model.

Table 6. Training and testing error metrics of the investigated ML models.