Machine Learning-Based Identiﬁcation Strategy of Fuel Surrogates for the CFD Simulation of Stratiﬁed Operations in Low Temperature Combustion Modes

Surrogates the CFD Simulation of Stratiﬁed Operations in Low Abstract: Many researchers in industry and academia are showing an increasing interest in the deﬁnition of fuel surrogates for Computational Fluid Dynamics simulation applications. This need is mainly driven by the necessity of the engine research community to anticipate the effects of new gasoline formulations and combustion modes (e.g., Homogeneous Charge Compression Ignition, Spark Assisted Compression Ignition) to meet future emission regulations. Since those solutions strongly rely on the tailored mixture distribution, the simulation and accurate prediction of the mixture formation will be mandatory. Focusing purely on the deﬁnition of surrogates to emulate liquid phase and liquid-vapor equilibrium of gasolines, the following target properties are considered in this work: density, Reid vapor pressure, chemical macro-composition and volatility. A set of robust algorithms has been developed for the prediction of volatility and Reid vapor pressure. A Bayesian optimization algorithm based on a customized merit function has been developed to allow for the efﬁcient deﬁnition of surrogate formulations from a palette of 15 pure compounds. The developed methodology has been applied on different real gasolines from literature in order to identify their optima surrogates. Furthermore, the ‘unicity’ of the surrogate composition is discussed by comparing the optimum solution with the most different one available in the pool of equivalent-valuable solutions. The proposed methodology has proven the potential to formulate surrogates characterized by an overall good agreement with the target properties of the experimental gasolines (max relative error below 10%, average relative error around 3%). In particular, the shape and the end-tails of the distillation curve are well captured. Furthermore, an accurate prediction of key chemical macro-components such as ethanol and aromatics and their inﬂuence on evaporative behavior is achieved. The study of the ‘unicity’ of the surrogate composition has revealed that (i) the unicity is strongly correlated with the accuracy and that (ii) both ‘unicity’ and accuracy of the prediction are very sensitive to the high presence of aromatics.


Introduction
Nowadays the need to reduce the fossil fuel consumption and to accomplish the increasingly stringent emissions standard required to place gasoline-powered vehicles on market is leading to dramatical changes in the framework of internal combustion engines. In the last decades Direct Injection Spark Ignition (DISI) engines have become a well-established technology to fulfill fuel economy and tailpipe emissions thanks to support devices and strategies such as downsized turbo-engines, increasing injection pressure (300-400 bar are available in the current market, 700 bar can be achieved for specific applications e.g., the Magneti Marelli injection system implemented in the Mazda's Skyactiv-X [1]). However, nitrogen oxides (NOx) emissions and engine efficiency remain a concern strictly connected to the use of classical spark plug devices. In this scenario, mixtures of different well characterized pure components. The number and type of the pure components comprised in the surrogate strongly depend on which aspects of the real gasoline are under analysis.
In the current literature the study of algorithms for the design of comprehensive fuel surrogates for practical gasolines is widely discussed. Most of the published works (e.g., [7][8][9][10][11]) are focused on the use of different searching techniques e.g., regression models and genetic algorithms, for the formulation of surrogates capable to mimic the chemical aspects or both chemical and physical aspects of real gasolines. It must be underlined that, as reported in [12], the latter approach suffers from the current poor chemical kinetics characterization of different essential pure compounds, for which kinetic mechanisms need to be developed and validated yet. On the other hand, a wide number of reliable physical properties of the same pure compounds are commonly available. As shown by literature results [7][8][9][10][11] the mimic of physical and chemical properties at the same time with a single surrogate is extremely complex regardless of the searching strategy. In [7][8][9][10][11] the Authors achieved surrogates with values of liquid density, carbon-to-hydrogen ratio, chemical composition in reasonable agreement with experimental data while poor representations of the end-tails of the distillation curve and octane rating were presented. To the best knowledge of the present Authors, a very few works have approached the fuel surrogate formulation by using machine learning techniques or by placing emphasis on only the physical properties. In [13] Pati et al., adopted a genetic algorithm to search surrogates representative of the distillation and vapor pressure curves of four gasolines with different levels of ethanol blending. Two different sizes of the initial palette were tested, and surrogates comprising 5 and 7 components were achieved. The results showed that the proposed methodology was capable to capture with reasonable accuracy the vapor pressure curve and the early distillation behavior (20% to 40% distilled volume). However, significant mismatches were reported for the very early distillation behavior (0% to 10% distilled volume) and for values in the right end-tail of the distillation curve, indeed, temperature differences between 10-25 • C were detected in those ranges. In [14] Muelas et al., aimed to capture the evaporative behavior and the soot tendency of a reference heating oil representative of the physicochemical complexity of fuels. A multi-objective regression algorithm was used to capture liquid density, carbon-to-hydrogen ratio, molecular weight, soot tendency and distillation curve by means of 2-components surrogates. Errors around 15% were detected in reproducing distillation curve, soot tendency and molecular weight likely due to the complexity of the multi-property optimization associated to the 2-component approximation. In [15] Zhu et al., applied a Bayesian multi kernel learning model to define surrogates capable to predict the soot tendency of fuels based on an in-house developed extensive experimental database. The work was focused on jet, diesel and bio-diesel fuels. The soot tendency of the surrogates presented by the Authors were in line with experiments, indeed coefficients of correlation over 0.95 were reported.
It must be underlined that according to the standard industrial practice in threedimensional CFD engine simulation, two different gasoline surrogates can be used during the same calculation: (i) firstly, a n1-components surrogate is adopted to simulate the mixture formation i.e., spray formation, spray evaporation, spray-wall impingement, liquid film evaporation; (ii) then, the spatial distributions of pressure, temperature and mixture are given as input to correlations or models for laminar flame speed and ignitiondelay time whose tuning parameters are taken from the chemical kinetic simulation of a different n2-components surrogate designed to simulate the combustion behavior of the same gasoline.
This work proposes an advanced in-house developed machine-learning based methodology to define fuel surrogates capable to mimic the evaporative properties of nowadays gasoline blends. The components of the surrogate are selected from a palette of 15 pure compounds carefully chosen by the present Authors after a literature review. A Bayesian algorithm is trained to provide the user with a multi-component fuel mixture representative of the evaporative behavior of the target fuel. The provided composition can be imple- mented in CFD codes when the liquid fuel spray setup is required in the pre-processing phase. The algorithm performs a multi-objective optimization considering two types of constraints: (i) primary constraints, which are the fuel properties that the code aims to pursue during the optimization task; (ii) secondary or light constraints, which are fuel properties that are not included in the multi-objective task, instead they simply must fall in an acceptability range with respect to experimental value. In this work the primary constraints (optimization targets) are liquid density, Reid Vapor Pressure (RVP), volatility (distillation TX, temperature at X% recovered volume), chemical composition. Research Octane Number (RON) and Motor Octane Number (MON) are used as light constraints. The experimental data of four different gasolines from literature are used as validation benchmark. As mentioned before, published works focused on techniques to design surrogates capturing liquid and liquid-vapor equilibrium physical properties are quite rare in the current literature. Moreover, in the few works that have approached the present topic, the fuel properties emulation by surrogate formulation is not in line with the current accuracy requirements from modeling. Besides investigating a rare topic with greater accuracy and increased complexity with respect to similar literature works, the original contribution of the present work is the use novel advanced techniques such as Bayesian machine learning algorithms instead of common practice regression models and genetic algorithms. Moreover, in contrast to literature works [7][8][9][10][11], the novelty revealed by the present methodology is the potential to define surrogates capable to capture both physical properties (distillation behavior, vapor pressure, density) and chemical information (RON, MON, composition) at the same time with almost the same accuracy. It is remembered that most of literature works are focused on the formulation of surrogates for Fuels for Advanced Combustion Engines (FACE) gasolines, whose octane rating was strongly lower than that of nowadays blends. Therefore, it is worthwhile that the present Authors have tested the methodology with success on nowadays high-octane rating gasolines, for which the simultaneous match of RON, MON and physical properties is more complex. A further original contribution of the present work is the here called 'unicity analysis', which consists in the comparison between the optimum surrogate and the surrogate in the same solution pool that differs the most by the optimum formulation, with the aim to assess if an almost unique composition is allowed. Otherwise, the characteristics of the real fuel that affect the composition variability of possible surrogates is studied and the sensitivity of the target properties to the variability is analyzed. The current literature lacks of analyses of that type and the results are expected to be helpful for the improvement of the present methodology and the development of new ones by increasing the knowledge in addressing the searching process towards specific subsets of the exploration space.

Base Palette
After a review of gas chromatography data in findings regarding gasolines for light duty application, a palette of 15 pure substances comprising one oxygenate, four n-paraffins, two i-paraffins, two olefins, two cycloalkanes and four aromatics was considered. The compounds and a set of their representative properties including liquid density (ρ), vapor pressure (p V ), boiling temperature (T b ), RON, MON, are listed in Table 1. Liquid density, vapor pressure and boiling temperature are taken from [16], values of RON and MON from [17] and [18], respectively.

Optimization Targets and Their Modeling
As pointed out in 1., the formulation of a proper gasoline surrogate depends on which characteristics of the real gasoline are needed. Since the main focus of this work is the reproduction of the evaporative behavior of nowadays gasolines, the following properties are selected as targets (primary constraints) for the multi-objective optimization task: (1) macroscopic chemical composition, given as EPIONA (Ethanol, n-Paraffins, i-Paraffins, Olefins, Naphthenes, Aromatics) volumetric composition; (2) liquid density (ρ) at 15 • C; (3) a selection of distillation curve data, namely temperatures from T10 to T90 with steps of 10% distilled volume; (4) the Reid Vapor Pressure. As described in the last paragraph on 1., RON and MON values are used as light constraints to check that the defined surrogates can be representative of nowadays octane rating. Further details on the use of RON and MON values are given in Section 2.3.

Macroscopic Chemical Composition
As mentioned before, the macroscopic chemical composition of the surrogate is given by the EPIONA volumetric composition. Once the surrogate attempt has been formulated, EPIONA is directly calculated from the composition of the latter as the sum of the volumetric contribution of each component to the corresponding chemical class (Ethanol, n-Paraffin, i-Paraffin, Olefin, Naphthene, Aromatic).

Liquid Density
The density of the surrogate mixture is calculated by means of the mass weighted average of the density values of each component. The density of the pure compounds is estimated according to the temperature-dependent correlations in [16] at 15 • C.

Volatility
Since evaporation properties are not additive, the calculation of distillation data and the RVP required dedicated sub-models. Based on the works of Slavinskaya et al. [19] and Abianeh et al. [20] the present Authors developed a zero-dimensional three steps distillation model in Python-3 environment. The model is provided with the surrogate composition and returns the distillate temperature both at the bottom of the kettle (in the liquid phase), which is called kettle distillation temperature (T K , Figure 1), and at the top of the kettle before the re-condensation pipe, which is called head distillation temperature (T H , Figure 1). The vapor re-condensation that may occur in the kettle neck was neglected. Energies 2021, 14, x FOR PEER REVIEW kettle before the re-condensation pipe, which is called head distillation tempera Figure 1). The vapor re-condensation that may occur in the kettle neck was negle The mentioned three steps of the model are: (1) modelling of the first boiling ature (T0) by means of a pure thermodynamic model; (2) modelling of the distilla cess (T0 < TX ≤ T90) by coupling the same abovementioned pure thermodynam with a mathematical model of no-reflux simple batch distillation operations; (3) tion of the energy balance to determine the head distillation temperature from t distillation temperature. It must be underlined that most of the experimental di rigs record only the head distillation temperature since the measure of the kettle tion temperature is quite complex due to the position of the thermocouple, wh fected by the continuous phase change. Therefore, the third modeling step is nec order to perform comparison with a sufficient number of real gasolines.
First modeling step: T0 is calculated according to the classical Vapor-Liquid rium (VLE) bubble point temperature problem. In order to take into account the n behavior of the mixture, the VLE is expressed by the equality between liquid an fugacity in place of the classical Raoult's law. The estimation of the mixture fu approached with the so called dual-fugacity method, which correlates the fugac fluid behavior) with the partial pressure (ideal fluid behavior) by means of the mensional fugacity coefficient (φ) for both phases (liquid (L) and vapor (V)) res Equation (1), where p is the pressure, x and y are the moles fractions of the i-th com in the liquid and in the vapor, respectively. Equation (1) is then adjusted by intr the so-called equilibrium ratio (k) so that y = k ‧ x for each component. The equ ratio is determined as the ratio between the liquid and the vapor fugacity coeffic The mentioned three steps of the model are: (1) modelling of the first boiling temperature (T0) by means of a pure thermodynamic model; (2) modelling of the distillation process (T0 < TX ≤ T90) by coupling the same abovementioned pure thermodynamic model with a mathematical model of no-reflux simple batch distillation operations; (3) application of the energy balance to determine the head distillation temperature from the kettle distillation temperature. It must be underlined that most of the experimental distillation rigs record only the head distillation temperature since the measure of the kettle distillation temperature is quite complex due to the position of the thermocouple, which is affected by the continuous phase change. Therefore, the third modeling step is necessary in order to perform comparison with a sufficient number of real gasolines.
First modeling step: T0 is calculated according to the classical Vapor-Liquid Equilibrium (VLE) bubble point temperature problem. In order to take into account the non-ideal behavior of the mixture, the VLE is expressed by the equality between liquid and vapor fugacity in place of the classical Raoult's law. The estimation of the mixture fugacity is approached with the so called dual-fugacity method, which correlates the fugacity (real fluid behavior) with the partial pressure (ideal fluid behavior) by means of the non-dimensional fugacity coefficient (ϕ) for both phases (liquid (L) and vapor (V)) resulting in Equation (1), where p is the pressure, x and y are the moles fractions of the i-th component in the liquid and in the vapor, respectively. Equation (1) is then adjusted by introducing the so-called equilibrium ratio (k) so that y = k · x for each component. The equilibrium ratio is determined as the ratio between the liquid and the vapor fugacity coefficient. The fugacity coefficients are calculated via equation of state as in Equation (2) by means of the Soave-Redlich-Kwong (SKR) cubic equation [21], where Z is the compressibility factor, A, B, α, β are coefficients that include constants, equilibrium pressure and temperature and the two SKR mixture parameters, namely a and b.
The two SKR parameters are calculated according to the Van der Waals mixing rules, namely the quadratic rule for a (Equation (3)) and the linear rule for b (Equation (4)). In Equations (3) and (4) z is the moles fraction (x if liquid, y if vapor), a i and b i are the SKR parameters for pure components, C ij is a tuning coefficient implemented by the original Authors to capture the enhancement of the interactions between pairs of molecules with significant differences in terms of size and polarity. The coefficients a i and b i are calculated according to the original formulation, the interested reader can check those equations in [21]. In this work, C ij was set at zero after internal tests and evaluations.
The compressibility factor required by Equation (2) is calculated by solving the cubic SKR equation reported in Equation (5). Once the pressure is set at the atmospheric pressure and the initial liquid composition is given, the solution is constrained by the vapor mass balance (Equation (6)).
Equations (1) and (6) are adjusted according to a root-finding problem and, together with Equation (5), are solved with the fsolve built-in root-finding multidimensional algorithm implemented in the SciPy-Optimize open library. Given the initial liquid composition, the initial solution array (T0, vapor composition, compressibility factor) is defined as follows: (i) for T0, the moles weighted average of the normal boiling temperatures of the components has been used; (ii) the first attempt vapor composition (y = k · x) has been estimated through Equation (7) by Wilson [22], where T is the temperature, the subscript C represents critical pressure and temperature, ω is the acentric factor; (iii) for Z, Equation (5) is solved twice by means of a standard root-finding built-in function. Solving for the liquid phase, the minimum real root is Z L ; solving for the vapor phase, the maximum real root is Z V .
Second modelling step: Once the value of T0 is returned by the first modelling step, the continuous liquid fuel evaporation under VLE conditions is modelled by means of the Rayleigh's differential equation (Equation (8)), where R is the moles amount of the residual liquid. Equation (8) is representative of a simple batch distillation operation with no reflux. Since no reflux and no re-condensation phenomena occur under the assumptions of the present work, the vapor composition at each step corresponds to the composition of the collected distillate. dR In order to integrate Equation (8) the distillation process is discretized into steps (distillate cuts) of given distillate fraction. Named r j the residual liquid step change between two consecutive cuts (r j = R j / R j-1 ) the integration of Equation (8) from T0 to the generic cut J leads to Equation (9). In Equation (9) the equilibrium ratio is taken at a reference temperature T J *, which is an intermediate temperature between the distillate temperature of the cut under calculation (J) and the one of the previous cut (J-1).
Since the distillation sub-model performs timeless computation, the distillate cuts are increased by 1% leading to 100 computation points. This fine discretization allowed the present Authors to assume that the j-th reference temperature in Equation (9) is the distillate temperature of the previous cut increased by 0.5 K. In this second modelling step, Equations (1), (5) and (9) are solved together with the liquid mass balance constraint (Equation (10)) by means of the same aforementioned fsolve algorithm.
The present model (first and second modelling steps) was validated against the experimental data of different characterized mixtures whose composition was known a-priori. The validation mixtures listed in Table 2 were taken from [23,24] where the kettle distillation temperature data were available.  In Figure 2 is visible that the model is capable to capture the main distillate features regardless of the mixture's complexity (i.e., number of components) even though for the 6-components mixture (Mix4) the T70-T90 range is slightly over-estimated. Figure 3 shows the absolute relative error (experimental against model) for each mixture and the mean absolute error. As visible in Figure 3, the model achieved the best performance at the end-tails of the distillation curve at which errors smaller than 5% are present. The capture of the end-tails is of primary importance for the present study since T10 and T90 are representative of the fuel early (T10) and delayed (T90) evaporation. Bell-shaped error curves are observed for binary and ternary mixtures, whilst the 6-components mixture shows an almost flat behavior in the intermediate temperature range. The 2-components mixture (Mix2) is characterized by the greater error value (around 8%) in the T50-T60 range likely due to the abrupt composition change from 50/50 n-pentane/n-heptane to almost pure n-heptane. With regards to the most complex mixture (Mix4) the error remains around or below 5%. Since it is common that surrogates of light duty gasolines are composed of 6-7 components, the error in predicting the distillation curve of the identified surrogates is expected to behave as seen for Mix4.   Third modeling step: Once the kettle temperature distillation curve has been mined, the energy balance in Equation (11) is applied to the fluid domain comprise the liquid-gas interface to the height at which the upper thermocouple is placed, u around the re-condensation pipe inlet ( Figure 1). According to [25], the present A assumed that the temperature drop of the fuel vapor from TK to TH along the f caused by the heat losses towards the wall of the flask and the air within. At the e rium, it is assumed that both air and walls are heated up from the head tempera the previous cut (TH1) to the temperature of fuel vapor recorded at the head in the p cut. In Equation (11) cp is the specific heat of gaseous species (V = fuel vapors, A calculated at TH1 according to the temperature-dependent correlations in [16], CW the heat capacity of the wall, m are the gas masses.  Third modeling step: Once the kettle temperature distillation curve has been d mined, the energy balance in Equation (11) is applied to the fluid domain comprised the liquid-gas interface to the height at which the upper thermocouple is placed, us around the re-condensation pipe inlet ( Figure 1). According to [25], the present Au assumed that the temperature drop of the fuel vapor from TK to TH along the fl caused by the heat losses towards the wall of the flask and the air within. At the eq rium, it is assumed that both air and walls are heated up from the head temperatu the previous cut (TH1) to the temperature of fuel vapor recorded at the head in the pr cut. In Equation (11) cp is the specific heat of gaseous species (V = fuel vapors, A calculated at TH1 according to the temperature-dependent correlations in [16], CW (J the heat capacity of the wall, m are the gas masses. At each cut: (i) the mass of fuel vapor involved in the heat transfer is calculated Third modeling step: Once the kettle temperature distillation curve has been determined, the energy balance in Equation (11) is applied to the fluid domain comprised from the liquid-gas interface to the height at which the upper thermocouple is placed, usually around the re-condensation pipe inlet ( Figure 1). According to [25], the present Authors assumed that the temperature drop of the fuel vapor from T K to T H along the flask is caused by the heat losses towards the wall of the flask and the air within. At the equilibrium, it is assumed that both air and walls are heated up from the head temperature of the previous cut (T H1 ) to the temperature of fuel vapor recorded at the head in the present cut. In Equation (11) c p is the specific heat of gaseous species (V = fuel vapors, A = air) calculated at T H1 according to the temperature-dependent correlations in [16], C W (J/K) is the heat capacity of the wall, m are the gas masses. At each cut: (i) the mass of fuel vapor involved in the heat transfer is calculated based on the evaporated species y i and the evaporated volume, which is constant under the assumption of uniform cuts of 1%/iter; (ii) the air mass is the sum of that in the flask's neck if it was fully filled with air, and that in the sphere; (iii) the heat capacity of the wall is estimated as the ratio between the specific heat and the solid mass involved in the heat exchange. For the solid wall, pyrex glass is considered as material (ρ = 2230 kg/m 3 , c p = 753 J/kg/K). The mass of air and solid are calculated as the product between density (perfect gas law at ambient pressure and T H1 for air, constant value for solid walls) and volume. Volumes are calculated according to the measures reported in Table 3, which are taken by the guidelines for the distillation experimental bench reported in the ASTM-D86 regulation. Both the air mass and the solid mass that are present in the spherical volume are updated at each distillation cut according to the liquid volume consumption. The RVP of a fuel mixture is defined as the value of the vapor pressure at 100 • F. In this work, the RVP is calculated by using the same pure thermodynamic model described for the first modeling step of the distillation curve. It is underlined that the workflow described to determine T0 at ambient pressure is adjusted to solve the classical bubble point pressure formulation. Thus, the pressure (RVP) is set as the output variable whilst the temperature is set at 100 • F.

Octane Rating
Currently, the prediction of RON and MON values of fuel mixtures by means of the moles weighted average is widely adopted. Despite the reasonable results achieved for the so-called Toluene Primary Reference Fuels (TPRF) i.e., mixtures of various composition of toluene, i-octane and n-heptane, the present Authors adopted a different model in order to take into account the non-linear Octane Number (ON) increasing with the addition of oxygenates. According to [26], the ONs are calculated as the sum of two contributions (Equation (12)): (i) the first is the ON moles weighted average including both oxygenates and hydrocarbons; (ii) the second accounts for the quadratic dependence on the oxygenate presence multiplied by the difference between the ON of the pure oxygenate (subscript OX) and the one of the unblended gasoline (subscript G, ON moles weighted average including only hydrocarbons). The second term can be adjusted by means of the tuning coefficient K g . The present Authors set the tuning coefficient at 0.45 for RON and at 0.94 for MON in line with the experimental guidelines given in [26].

Surrogate Optimization Strategy
Considering the large number of independent variables (equal to the size of the chosen palette) and the non-linearity of some of the functions needed to reconstruct the target values, the optimization process cannot be addressed by a standard root-finding algorithm. Thus, it needs to be performed with a more complex workflow. State of the art optimization approaches have been successfully applied in literature to this aim. However, a customized implementation of the distillation curve model, to predict the distillation curve of a generic mixture, largely increases the amount of computing time required to run a sufficient number of evaluations. To reduce this effect, a Bayesian optimization algorithm has been employed, which takes advantage of a surrogate model to guide the maximum search. The final algorithm is a combination of the Upper Confidence Bound (UCB) exploration strategy and the use of Gaussian Process Priors as 'surrogate model' for the target function. Differently from other algorithms, the Bayesian optimization approach relies on a probabilistic surrogate function, that is continuously updated, to prescribe the next point to be evaluated inside the domain. The reason why Gaussian Processes are usually employed in this application is that (in their classical form) they interpolate precisely the observations on which they were trained. Furthermore, they are a probabilistic algorithm that can simultaneously predict new values and their confidence intervals. Therefore, Gaussian Processes (GP) are suitable in combination with the upper confidence bound acquisition strategy, whose complete acquisition function α is defined in Equation (13), in which X is the evaluation point, K is a tuning hyperparameter indicating the level of 'thrust' that is given to the Gaussian Process prediction, µ and σ are the model predicted mean and standard deviation, respectively.
The complete algorithm, inspired by [27], can be resumed as follow: (i) a number of base combinations is initially evaluated, and the objective function is calculated for them; (ii) the Gaussian Process Priors are initially fit to those available data points; (iii) the objective function is evaluated in the point X MAX that optimizes the acquisition function over the pre-trained Gaussian Process model; (iv) if successive iterations are expected, the Gaussian Process model is updated with the newly evaluated point, else the point X MAX is returned.
The multi-objective function that the algorithm needs to maximize is defined in Equation (14), where for each i-th target property ε is the percentage absolute relative error between the predicted and the experimental value of the property and W is the weight assigned to the property. In this way, the optimized mixture is placed on a weighted pareto frontier with the minimum Euclidean distance from the optimal composition. The algorithm was implemented with the Bayesian-Optimization open source constrained global optimization tool for Python.
The fuel properties that are taken into account by the objective function, here called primary constraints, are liquid density, RVP, distillation temperatures from T10 to T90 with steps of 10%vol, volumetric chemical macro-composition (EPIONA). In order to limit the number of components, a further primary constraint called 'complexity' was added and weighted in the objective function as ε complexity = N/P, where N is the number of components of the surrogate attempt and P is the size of the initial palette, i.e., 15. As mentioned, RON and MON values are not included in the right-hand-side sum of Equation (14). Instead, they are used as light (secondary) in the following solution screening strategy: (i) an acceptability range of experimental ON ± 3 is defined as threshold; (ii) after each surrogate attempt (x = x i , i = 1, . . . ,15), RON and MON values are calculated for the surrogate as in Section 2.2.5; (iii) if the predicted values exceed the acceptability threshold, the objective function F is set at zero, whatever is the value achieved for that point. Once the algorithm has accomplished the optimization task, the screening of the viable solutions (F > 0) is performed, and the surrogate associated with the maximum F value is considered as the optimum (x OPT = x i,OPT , i = 1, . . . ,15).
In the surrogate formulation step, the exploration range for the presence of each component is 0-80%vol. Once the surrogate is defined, the components that are present in volume fraction below 4% are automatically excluded and the other components are renormalized according to the reduced composition. It is underlined that for non-oxygenate blends, the possibility to choose ethanol in the surrogate formulation is directly denied by the user i.e., the presence of ethanol is limited at 0%vol. For purely hydrocarbon target blends this is necessary in order to avoid ethanol-blending solutions that could have been promoted by the need to match the RON and MON values.
After internal convergence evaluations, the number of base combinations has been set to 50 to generate a sufficiently accurate model for initializing the search, whilst the maximum number of iterative optimization steps is set to 200. The value of the tuning hyperparameter K has been set to 0.5, with a decay term the more evaluations are performed. The optimized hyperparameters and setting of the machine learning model are listed in Table 4 in which ν is the smoothness controlling parameter of the function, α is the variance of the additional white noise on the training data. A general example of the searching process performed by the implemented algorithm is shown in Figure 4, where blue markers and red markers represent exploration and the following exploitation steps, respectively. In Figure 4 it is visible that the Root Mean Squared Error (RSME) approaches an asymptotic minimum value after around a half of the complete searching. In the second half the final solution is refined. The red peaks that are visible in the latter are due to sporadic attempts of further exploration during the refinement that are correlated to the value of the degree of thrust (K). After internal convergence evaluations, the number of base combinations has been set to 50 to generate a sufficiently accurate model for initializing the search, whilst th maximum number of iterative optimization steps is set to 200. The value of the tuning hyperparameter K has been set to 0.5, with a decay term the more evaluations are per formed. The optimized hyperparameters and setting of the machine learning model ar listed in Table 4 in which ν is the smoothness controlling parameter of the function, α i the variance of the additional white noise on the training data. A general example of the searching process performed by the implemented algo rithm is shown in Figure 4, where blue markers and red markers represent exploration and the following exploitation steps, respectively. In Figure 4 it is visible that the Roo Mean Squared Error (RSME) approaches an asymptotic minimum value after around half of the complete searching. In the second half the final solution is refined. The red peaks that are visible in the latter are due to sporadic attempts of further exploration dur ing the refinement that are correlated to the value of the degree of thrust (K).

Analysis of the Surrogate Unicity
An important point of this work is the evaluation of the 'unicity' of the winner sur rogate with respect to other minor viable composition solutions which show a differen

Analysis of the Surrogate Unicity
An important point of this work is the evaluation of the 'unicity' of the winner surrogate with respect to other minor viable composition solutions which show a different distribution of the accuracy over the properties of interest. Given the optimum surrogate, a further surrogate of the same solution set is selected for comparison as follow: (i) once the optimized surrogate (maximum value of F) has been identified, the solutions comprised in the 5% bound of the maximum F value are considered equally acceptable (F ≥ 0.95 max(F)); (ii) for the solutions that fall in the condition (i), an Euclidean distance from the optimum based on the light constraints (RON, MON) is computed, then the most distant point, here called bound surrogate, is selected for the comparison with the optimum surrogate (x B = x i,B , i = 1, . . . ,15). A schematic view of the workflow is reported in Figure A1.

Results
The implemented algorithm has been applied to the formulation of the surrogates for four real gasolines fully experimentally characterized from literature [28,29]. From [28], two Co-Optima gasoline fuels, namely Co-Optima Aromatic and Co-Optima Cycloalkane, have been selected. From [29], the EU6 compliant gasolines Fuel-15 (hydrocarbon blend with properties in line with the current commercial RON95 gasoline) and Fuel-11 (E10 gasoline-ethanol blend) are selected among the available fuels according to Author's defined criteria i.e., ethanol-blending ≤ E10; RON ≥ 95, olefins volume content ≤ 10%. The main properties of the validation gasolines collected from [28,29] are resumed in Table A1. It is underlined that the EPIONA of Fuel-11 and Fuel-15 are calculated by the present Authors based on the Detailed Hydrocarbon Analysis (DHA) available in [29]. For the sake of simplicity, in the followings Co-Optima Aromatic, Co-Optima Cycloalkane, Fuel-11, Fuel-15 will be named A, B, C, D, respectively. Figure 5 shows both the chemical classes, represented by colors, and the constituting compounds, represented by patterns, of the optima surrogates of the four validation gasolines. Furthermore, the experimental EPIONA composition (empty bars) is also placed at the left of the corresponding surrogate bar. Since surrogates may comprise different compounds that belong to the same chemical class, thus, they would be filled with the same color, a unique pattern is assigned to each compound of a certain chemical class in order to distinguish the stacks of the same chemical class.

Optima Surrogates Results
It can be observed that the surrogates of A and B gasolines lack of olefins. Since A and B real gasolines are poor in olefins (1.57%vol and 4.54%vol, respectively) the algorithm automatically removes the olefins available in the palette since their target presence is below the minimum threshold (4%vol). The optima surrogates of C and D gasolines are characterized by olefins over-estimation, also missing n-paraffins are observed in the surrogate of C gasoline. Those differences are due to the mutual compensation of RVP and volatility, whose match is searched with greater importance (Table 4). In Figure 5, for each couple of experimental-surrogate bars, an overall nice agreement is visible, in particular for i-paraffins, aromatics and ethanol. Despite the match of each chemical class is pursued with the same weight value, in the results analysis more importance is given to the capture of ethanol, i-paraffins and aromatics composition. This particular focus is due to the key role played by those chemical classes in determining fundamental features of the gasoline evaporation i.e., flattening of the distillation temperature profile in a significant distilled volume range due to oxygenates, average evaporative behavior mainly driven by i-paraffins, delayed evaporation with formation of rich pockets of mixture due to the presence of aromatics. With regards to the species comprised in optima surrogates, in Figure 5 it is visible that heptane and decane are not included in any surrogate formulation due to their non-positive contribution to RON and MON values. The components that are in common between the four optima surrogates are i-pentane, toluene and 1-2-4trimethylbenzene. It is highlighted that the representation of aromatics required the most various composition. The detailed composition of the optima surrogates is released in Table A2. Figure 5 shows both the chemical classes, represented by colors, and the constituting compounds, represented by patterns, of the optima surrogates of the four validation gasolines. Furthermore, the experimental EPIONA composition (empty bars) is also placed at the left of the corresponding surrogate bar. Since surrogates may comprise different compounds that belong to the same chemical class, thus, they would be filled with the same color, a unique pattern is assigned to each compound of a certain chemical class in order to distinguish the stacks of the same chemical class. Figure 5. Volumetric composition of the optimum surrogate (filled bars) for the four validation gasolines: the filling color represents the EPIONA composition; the filling pattern represents the constituting pure compounds. Experimental EPIONA volumetric composition (empty bars) is also provided.

Optima Surrogates Results
In Figure 6a, it can be seen that the shape of the distillation curve of the optima surrogates is in solid agreement with experiments along the whole distillation axis. Surrogates of A, C, D gasolines achieve a greater match while a general under-estimation is noticed for the surrogate of B gasoline. Figure 6b shows the temperature difference between the predicted distillation temperature of the surrogate and the experimental value associated to the real gasoline. Dashed lines are placed at ±7 • C, which are here considered as acceptable temperature differences in the validation. It can be observed that the surrogates of A and B gasolines lack of olefins. Since A and B real gasolines are poor in olefins (1.57%vol and 4.54%vol, respectively) the algorithm automatically removes the olefins available in the palette since their target presence is below the minimum threshold (4%vol). The optima surrogates of C and D gasolines are characterized by olefins over-estimation, also missing n-paraffins are observed in the surrogate of C gasoline. Those differences are due to the mutual compensation of RVP and volatility, whose match is searched with greater importance (Table 4). In Figure 5, for each couple of experimental-surrogate bars, an overall nice agreement is visible, in particular for i-paraffins, aromatics and ethanol. Despite the match of each chemical class is pursued with the same weight value, in the results analysis more importance is given to the capture of ethanol, i-paraffins and aromatics composition. This particular focus is due to the key role played by those chemical classes in determining fundamental features of the gasoline evaporation i.e., flattening of the distillation temperature profile in a significant distilled volume range due to oxygenates, average evaporative behavior mainly driven by i-paraffins, delayed evaporation with formation of rich pockets of mixture due to the presence of aromatics. With regards to the species comprised in optima surrogates, in Figure 5 it is visible that heptane and decane are not included in any surrogate formulation due to their non-positive contribution to RON and MON values. The components that are in common between the four optima surrogates are i-pentane, toluene and 1-2-4-trimethylbenzene. It is highlighted that the representation of aromatics required the most various composition. The detailed composition of the optima surrogates is released in Table A2.
In Figure 6a, it can be seen that the shape of the distillation curve of the optima surrogates is in solid agreement with experiments along the whole distillation axis. Surrogates of A, C, D gasolines achieve a greater match while a general under-estimation is noticed for the surrogate of B gasoline. Figure 6b shows the temperature difference between the predicted distillation temperature of the surrogate and the experimental value associated to the real gasoline. Dashed lines are placed at ±7 °C, which are here considered as acceptable temperature differences in the validation. Surrogates of A and D gasolines show temperature differences with experiments below 5 • C along the whole distillation axis. The surrogate of C gasoline as well is is characterized by very low temperature differences even though a predicting performance degradation is shown in the interval T50-T60, where border temperature differences are present. It must be underlined that experimental works in literature [30] have shown that the addition of oxygenates in light-duty gasolines particularly affects the range 30-60%vol. Therefore, it is worth mentioning that an overall accurate prediction (differences below 5 • C) is achieved in this range for the E10 surrogate (C). Thus, a good performance of the algorithm in predicting the evaporation of oxygenates can be expected. The steep increase of the temperature difference in the T50-T60 interval is caused by a performance degradation of the surrogate's EPIONA composition at the 'interface' between the evaporation of the residual ethanol and the first vapor of the next evaporating components. Border difference temperature values can be seen for the surrogate of B gasoline in particular around T50 and T90. This performance decrease is correlated to the high aromatic content of gasoline B (46%vol) with respect to that of A, C, D gasolines (30-35%vol). It is assumed that including more aromatic species in the palette could help to reduce those differences. However, it must be underlined that the aromatics content in nowadays commercial gasolines is limited at maximum values around those reported for A, C, D.
The radar charts in Figure 7 report the comparison between real gasolines and the optima surrogates in terms of light constraints (RON, MON), three key distillation temperatures i.e., T10, T50, T90, two key components from EPIONA i.e., ethanol and aromatics, density, RVP. A general solid agreement can be seen for the surrogates of A ( Figure 7a) and C (Figure 7c) gasolines. The properties of gasoline D as well are in a good shape with experimental measures despite the relative error around 10% in predicting the RVP (Figure 7d).
Considering the radar charts in Figure 7a,c,d it is worth mentioning that the bond between chemical (ethanol, aromatics composition) and physical (density, RVP, key distillation temperatures) properties is well captured, furthermore RON and MON values are reasonably reproduced. To the best knowledge of the Authors the current literature lacks of methodologies performing the simultaneous accurate prediction of high-octane gasolines properties such as composition and evaporation features. Thus, the proven capability of the present algorithm to perform this task is worthwhile, in particular for gasolines whose properties are in line with nowadays RON95 (D) and RON98 (A, C) gasolines. As reported for the distillation validation, the surrogate of B gasoline suffers from a general loss of prediction performance as visible in Figure 7b (MON, RVP, T10, T50, T90), confirming the sensitivity of the model to the aromatics content.
optima surrogates in terms of light constraints (RON, MON), three key distillation temperatures i.e., T10, T50, T90, two key components from EPIONA i.e., ethanol and aromatics, density, RVP. A general solid agreement can be seen for the surrogates of A ( Figure  7a) and C (Figure 7c) gasolines. The properties of gasoline D as well are in a good shape with experimental measures despite the relative error around 10% in predicting the RVP (Figure 7d).

Composition Unicity
In this section, the results of the 'winner' surrogate are compared with the results achieved by the most distant surrogate comprised in the subset of acceptable solutions in the same pool (objective function F ≥ 0.95 max(F)), which will be called bound surrogate. The comparison is performed in order to discuss the 'unicity' of the surrogate composition that is assessed by means of the parameter called compatibility, which is defined as in Equation (15), where P is the number of components in the initial palette, x i,OPT and x i,B represent the volume fractions of the i-th compound in the optimum surrogate and in the bound surrogate, respectively. Figure 8 reports the results of both the overall and the class-by-class compatibility. It is underlined that the optimum surrogate and the bound surrogate for A, B, D gasolines are fully compatible in ethanol a-priori since the presence of ethanol in those surrogates was constrained at zero by the user, being the real A, B, D gasolines lacking in oxygenates (Table A1). by-class compatibility. It is underlined that the optimum surrogate and the bound surr gate for A, B, D gasolines are fully compatible in ethanol a-priori since the presence ethanol in those surrogates was constrained at zero by the user, being the real A, B, gasolines lacking in oxygenates (Table A1).
• 100 (1  As visible in Figure 8, an almost full compatibility is detected for the two most different surrogates of gasolines A (magenta, 95%) and C (green, 80%), confirming the solidity of those surrogates. Therefore, it is observed that the need of accuracy in predicting the evaporative features of those gasolines converges towards an about unique possible composition. The high compatibility in ethanol (96%) achieved by the two most different surrogate solutions of C gasoline is a further sign that the algorithm clearly identifies the role of the ethanol addition. A lower overall compatibility is detected between the two surrogates of D gasoline (blue, 70.5%) due to slight differences in the compatibility of each single chemical class and a significant n-paraffinic incompatibility caused by the replacement of n-pentane with n-hexane (Table A2). On the other hand, the composition variability between the two surrogates representing B gasoline (red) is large, indeed the overall compatibility of the bound surrogate is around 43%. The latter is mainly due to the low aromatic compatibility, which is around 70%. As reported in Table A1, the properties of B gasoline and A gasoline are similar except for the aromatics content (45%vol against 38%vol), which in its turn affects the distillation temperature differences. The compatibility analysis suggests that the 'unicity' of the surrogate formulation as well is very sensitive to the presence of aromatics.
The sensitivity of the target properties to the different levels of compatibility is shown in the following parity plot figures (Figure 9), where the value of compatibility is reported next to the respective bound surrogate marker. 38%vol), which in its turn affects the distillation temperature differences. The compatibility analysis suggests that the 'unicity' of the surrogate formulation as well is very sensitive to the presence of aromatics.
The sensitivity of the target properties to the different levels of compatibility is shown in the following parity plot figures (Figure 9), where the value of compatibility is reported next to the respective bound surrogate marker.  As expectable, the lower is the level of compatibility, the larger is the bound-optimum difference in predicting the properties. However, different sensitivities to the variability of the composition are experienced by different properties. It is visible that regardless to the level of compatibility, an increasing composition sensitivity is detected for the RVP (Figure 9a Table 4, it is clear that for different surrogates of the same solution pool that reasonably accomplish the accuracy target, the higher is the importance (weight value) given to the target property, the lower is its sensitivity to the composition variability of the surrogate. It must be noticed that a high level of compatibility is a sign of robustness and reliability of the surrogate solution. From Figure 9 it is clear that the compatibility recorded in a pool of surrogate solutions is correlated to the prediction performance. Indeed, the higher levels of compatibility (green, 80% and magenta 95%) provide an improved prediction performance over the whole set of target properties, whilst lower levels of compatibility may result in good performance on some target properties and significant mismatch on others. As a result, one can assume that the high degree of accuracy equally distributed over the presented target properties relies on the 'unicity' of the surrogate composition (high compatibility). Whether the 'unicity' of the surrogate composition is always achievable remains an open quest that needs further validation and strongly depends on the size and the variety of the base palette, the maximum number of components allowed in the surrogate and the capability of the searching algorithm to target the proper space of solutions in affordable computation times.

Discussion
This work deals with the implementation of a machine-learning based methodology to identify the proper surrogate for nowadays and future gasolines to provide threedimensional CFD engine simulations under fuel stratification operations. Different inhouse developed sub-models and a customized searching algorithm have been developed. A palette of 15 different pure compounds is provided to the searching algorithm. A Bayesian optimization strategy is implemented to evaluate different compositions with the aim to achieve the optimum surrogate according to a merit function defined to pursue the minimum Euclidean distance between experimental and surrogate targets namely, density, Reid vapor pressure, chemical macro-composition, distillation properties, number of components. RON and MON of the surrogate are constrained at a maximum difference of 3 units with respect to experimental values. The methodology is applied to identify the optima surrogates of four different gasolines fully experimentally characterized in literature. The implemented algorithm has proven to be capable to perform a fine prediction of the target properties of two hydrocarbon-blend gasolines and one ethanol-gasoline blend, whose characteristics are in line with the nowadays requirements for commercial gasolines i.e., RON 95-98, olefins < 10%vol and aromatics < 35%vol. Performance losses are detected in predicting the RVP and the distillation curve of the fourth gasoline, which differs from the others by a higher aromatics content (45%vol). Therefore, it is found that the formulation of surrogates aimed to mimic the evaporative features of real gasolines are very sensitive to the presence of aromatics. The possibility to achieve a 'unique' surrogate composition to provide CFD codes is also discussed by comparing optima results with the most different surrogate characterized by almost the same value of the merit function (i.e., same overall accuracy). The comparison is performed by means of a synthetic index introduced by the present Authors, namely the compatibility, which reports the degree of likeness between the optimum solution and the most different acceptable solution. For the three gasolines characterized by the standard presence of aromatics, high levels of compatibility (70-95%) are observed, meaning that the composition capable to accomplish the emulation of the evaporative features properly is about unique. Considering the results for the gasoline characterized by the high presence of aromatics, a significant sensitivity of the compatibility (43%) to the aromatics composition is revealed. A further worth mentioning result is that the compatibility analysis has shown that the higher is the level of compatibility between the two most different surrogates of the solution pool, the higher is the level of accuracy in predicting each target property. Therefore, compatibility has proven to be an index of the reliability of the identified surrogate. As a result, four optima surrogates comprising 8-9 components are achieved based on the target accuracy. It is underlined that the present Authors have tested those surrogates in 3D-CFD simulations with multi-component spray injection and evaporation. Despite the fact that the number of components is slightly greater than that of common practice surrogates (6-7 components), the overall computation time is not affected significantly. Therefore, the surrogates identified in the present work can be used for three-dimensional CFD calculations with no concerns on the time spent.  Data Availability Statement: Data will be available upon reasonable requests.

Conflicts of Interest:
The authors declare no conflict of interest.   Figure A1. Main workflow of the implemented surrogate identification strategy.