The Electrical Conductivity of Ionic Liquids: Numerical and Analytical Machine Learning Approaches

Abstract: In this paper, we incorporate experimental measurements from high-quality databases to construct a machine learning model that is capable of reproducing and predicting the properties of ionic liquids, such as electrical conductivity. Empirical relations traditionally determine the electrical conductivity with the temperature as the main component, and each investigation typically focuses on a specific ionic liquid. In contrast, our proposed method takes into account environmental conditions, such as temperature and pressure, and supports generalization by further considering the liquid molecular weight in the prediction procedure. The electrical conductivity is extracted through both numerical machine learning methods and symbolic regression, which provides an analytical equation with the aid of genetic programming techniques. The suggested platform is capable of providing either a fast, numerical prediction mechanism or an analytical expression, both purely data-driven, that can be generalized and exploited in similar property prediction projects, avoiding expensive experimental procedures and computationally intensive molecular simulations.


Introduction
The investigation of complex materials has attracted ever-growing interest among researchers in the area of fluid mechanics. Building on an in-depth understanding of the internal atomic/molecular structure and the physics behind the interaction mechanisms, advanced simulation techniques and experimental procedures are employed to extract fluid properties and open the road to advances in the manufacturing and control of novel devices. The numerical modeling of such processes has long been an efficient, fast, and accurate choice for addressing these objectives, serving as an alternative to complex, time-consuming, and costly experiments. Among the proposed computational methods, machine learning (ML) techniques have now become a standard, showing remarkable efficiency, reduced processing time, and accuracy [1,2].
The existence of a sufficient amount of data in a reliable database is a prerequisite for the adoption of ML. Data-driven approaches have been exploited to deal with complex physical processes that are not described by analytical expressions and are mostly difficult to measure [3]. In most studies, the research data are obtained under limited experimental conditions. For fluid and material research, experimental results may not be sufficient to meet ML demands, limiting its further development. Even when the research data are enriched with simulation results, and are therefore sufficient, there may still be inherent processing difficulties because of the large number of input features needed to extract the desired prediction [4]. Therefore, high-quality training data production [5], along with ML adoption complementary to simulation and experiments [6], can advance material discovery and investigation.
In material science and engineering, the field of application is enormous. Imaging data from microscopic studies and advanced informatic tools have been exploited for material characterization [7], and images from molecular dynamics (MD) simulations have been used to predict ice nucleation from ambient water [8]. The construction of potential energy surfaces (PESs), which had traditionally been a demanding ab initio simulation task, has been boosted by Gaussian process regression (GPR) methods [9,10]. ML has also been successfully incorporated for the prediction of behavior from data in the fields of biological, biomedical, and behavioral sciences [11]. Fluid research has much to gain from reduced-order models, turbulence modeling, fluid property extraction, and potential map creation with ab initio accuracy [12,13].
All these applications are only a small fraction of the true potential that data science and ML have to offer. New research directions focus on integrating physics-oriented parameters and domain knowledge with the proposed ML models. Physics-informed techniques have been suggested, integrating the knowledge of fundamental physics inside an algorithmic procedure [14]. Moreover, as a step toward explainable and generalizable ML, the method of symbolic regression (SR) has evolved, providing not only accurate predictions but also, more significantly, mathematical expressions that describe the phenomena under investigation, beyond classical regression methods [15,16].
In this paper, ML is approached from the perspective of ionic liquids (ILs), a class of solvents that have lately attracted increasing attention due to their unique properties. Their important feature is that the melting point is so low that they remain liquid at ambient temperature, while common salts are usually solid at ambient temperature and melt at several hundred degrees Celsius [17]. Their other characteristic properties include negligible vapor pressure, high thermal and chemical stability, high ion conductivity, and nonflammability [18]. IL properties can also be tuned for a specific application by the proper manipulation of anions and cations, with uses ranging from catalysis and electrochemistry to liquid crystal development and fuel production, and as electrolytes in lithium batteries, supercapacitors, and fuel cells [19]. ILs may serve as unique solvents in electrochemical processes where the use of water is forbidden [20], such as in electroplating and the electrodeposition of metals. Moreover, they are capable of dissolving organic compounds of great biological and ecological importance, such as enzymes, proteins, and cellulose [21].
The experimental measurements of ILs' physical quantities, such as conductivity, viscosity, and density, as a function of temperature or pressure, are usually performed with optofluidic and microscopic techniques [22,23], and empirical relations have been drawn to guide the experiments [24]. On the other hand, research efforts on property calculation have been mainly based on trial-and-error methods to bind anions and cations to constitute an IL of desired properties. Computational property estimation can bring research to the next level through the incorporation of novel ML methods [25]. Recent studies apply ML techniques to the prediction of IL properties, such as viscosity and electrical conductivity [26], CO₂ capture capability [27], and density, heat capacity, and thermal conductivity [28], among others, trying to depict the relationship between each property, the molecular structure, and the environmental conditions.
Next, we present ML data-driven methods to extract the electrical conductivity, σ, of ionic liquids, both in numerical and analytical form. The incorporated data and preprocessing methods are presented, the ML techniques are described, and the validity of the predictions is discussed. We conclude that the proposed ML-based method is able to reproduce the electrical conductivity values for complex ILs, taking into account the environmental conditions (temperature and pressure) and the molecular weight of the IL of interest. To our knowledge, there has not been another numerical or analytical method able to extract IL properties from these three input parameters, and it can provide a fast and efficient choice to replace or complement time-consuming and costly experiments, especially when the experimental conditions are extreme.

The Electrical Conductivity of Ionic Liquids
The physical properties of ILs, such as viscosity, conductivity, and density, are vital for deciding whether a salt is appropriate for a given application [29]. The electrical conductivity of ILs is of primary importance for understanding their behavior and the applications that may profit from tuning its value. ILs can remain in the liquid state over a wide range of temperatures, and many electrochemical applications incorporate them as solvents. Thus, there is a clear need to define the parameters that affect electrical conductivity. In most studies in the literature, the temperature is the only parameter taken into account, and it is usually analyzed by fitting Vogel-Fulcher-Tammann (VFT) curves to the measured data [30,31]. The empirical VFT equation is given by

$$\sigma = \sigma_\infty \exp\left(-\frac{B}{T - T_0}\right) \qquad (1)$$

which is also examined in its linear form as

$$\ln \sigma = \ln \sigma_\infty - \frac{B}{T - T_0} \qquad (2)$$

where the maximum conductivity is $\sigma_\infty$, and the activation energy for conduction is $E_a = B \cdot k_B$, where $k_B$ is the Boltzmann constant, both of which are derived from fitting the experimental measurements [32], and $T_0$ is the Vogel temperature.
Another empirical relation connects σ with the molecular volume $V_m$, i.e., the sum of the ionic volumes of the constituent ions, as [33]

$$\sigma = c \, e^{-d V_m} \qquad (3)$$

where c and d are the empirical constants of the best fit on the experimental data, while approaches that replace experimental measurements with computational models have also been proposed [34].
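The temperature dependence described by the VFT relation can be illustrated with a short numerical sketch. It assumes the standard VFT form, σ = σ∞ exp(−B/(T − T0)), and its logarithmic counterpart; the function names and the parameter values (`SIGMA_INF`, `B`, `T0`) are hypothetical and chosen only for illustration, not fitted to any measured ionic liquid.

```python
import math

def vft_conductivity(T, sigma_inf, B, T0):
    """Standard VFT form: sigma = sigma_inf * exp(-B / (T - T0)), for T > T0."""
    if T <= T0:
        raise ValueError("The VFT relation is only defined for T > T0")
    return sigma_inf * math.exp(-B / (T - T0))

def vft_log_conductivity(T, sigma_inf, B, T0):
    """Linearized form: ln(sigma) = ln(sigma_inf) - B / (T - T0)."""
    return math.log(sigma_inf) - B / (T - T0)

# Hypothetical parameters (illustrative assumptions, not fitted values).
SIGMA_INF = 100.0  # S/m, limiting conductivity
B = 600.0          # K
T0 = 180.0         # K, Vogel temperature

for T in (280.0, 300.0, 350.0):
    sigma = vft_conductivity(T, SIGMA_INF, B, T0)
    print(f"T = {T:.0f} K -> sigma = {sigma:.4f} S/m")
```

In the linearized form, plotting ln σ against 1/(T − T0) gives a straight line with slope −B, which is why fits are often carried out on the logarithm of the measured conductivity.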

Electrical Conductivity Data
For the computational model adopted in this paper, we followed the steps shown in the flowchart in Figure 1. The modeling started with database creation. High-quality experimental data (2274 points) from the NIST IL-Thermo database [35,36] were gathered for pure ionic liquids, with electrical conductivity being the property of interest. The parameters that affect electrical conductivity, as shown from the experiments, are the temperature, T, the pressure, P, and the liquid molecular weight, M_w. Table S1 (see Supplementary Material File S1) presents all the details for data origin and characteristics, while the IL database is provided in Supplementary Material File S2. The details on the incorporated experimental methods can be found in the respective references [32,37-60].

Figure 1. Machine learning model for electrical conductivity prediction, providing both numerical and analytical output.

Pre-Processing
It is common practice, before entering the ML procedure, to normalize the data in order to restrict the input value range:

$$\bar{x} = \frac{x - x_{\mathrm{mean}}}{x_{\mathrm{std}}} \qquad (4)$$

A correlation check was also performed to ensure that the input variables are not correlated with each other; Figure 2 presents the correlation matrix. No appreciable correlation was found between the inputs, while the output was mostly correlated with temperature, T.
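The two preprocessing steps can be sketched in a few lines. The example below assumes that normalization means z-score standardization (subtracting the mean and dividing by the standard deviation) and that the correlation check uses the Pearson coefficient; both are assumptions about the procedure. The input columns are synthetic random values drawn within the ranges reported for the dataset, not the NIST data.

```python
import math
import random

def standardize(values):
    """Return (x - mean) / std for each x, using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

# Synthetic stand-in columns (NOT the NIST data), sampled within the reported ranges.
rng = random.Random(0)
columns = {
    "T": [rng.uniform(203.4, 528.55) for _ in range(500)],
    "P": [rng.uniform(0.1, 250.9) for _ in range(500)],
    "Mw": [rng.uniform(108.1, 556.18) for _ in range(500)],
}

# Print the correlation matrix row by row.
names = list(columns)
for a in names:
    row = [pearson(columns[a], columns[b]) for b in names]
    print(a.ljust(3), " ".join(f"{r:+.2f}" for r in row))
```

Off-diagonal entries close to zero indicate that the inputs carry independent information, which is the property the correlation check is meant to confirm.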

Machine Learning
A supervised machine learning algorithm accepts a number of input data, is trained on a percentage of the data, and enters a computational process to extract the predicted values for the model's output(s) [61]. Data quality and quantity are important factors here. When the data are representative (uniformly distributed) and their number is adequate to train the algorithm, the predicted output can be obtained, as long as the incorporated algorithm is able to capture their behavior. The result is verified on the remaining part of the input dataset (the testing set). The training/testing split of the total data points was taken here as 80/20.
Here, we incorporated six different numerical ML algorithms, namely the multiple linear regression (MLR), k-nearest neighbor (KNN), decision tree (DT), random forest (RF), gradient boosting regressor (GBR), and multi-layer perceptron (MLP) models, to propose the one that provided the best fit to our experimental data. These were implemented with the aid of the respective functions from the scikit-learn Python package [62]. Moreover, the symbolic regression (SR) algorithm was constructed and adjusted from a Julia package [63], in order to provide an analytical expression exclusively extracted from the data and generalizable to electrical conductivity predictions for all input cases, even those outside the data range.
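A minimal sketch of this workflow, assuming scikit-learn and NumPy are installed, is given below. The dataset is synthetic (a VFT-like function of T with a mild dependence on P and M_w, invented for illustration), and all hyperparameters other than the 80/20 split and the three hidden layers of 20 nodes are library defaults or assumptions, so the printed scores say nothing about the real ionic liquid data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the NIST IL-Thermo measurements).
rng = np.random.default_rng(0)
n = 600
T = rng.uniform(250.0, 500.0, n)
P = rng.uniform(0.1, 250.0, n)
Mw = rng.uniform(110.0, 550.0, n)
X = np.column_stack([T, P, Mw])
y = 50.0 * np.exp(-600.0 / (T - 180.0)) * (200.0 / Mw) * (1.0 - 0.0005 * P)

# 80/20 training/testing split, as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "MLR": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    # Three hidden layers of 20 nodes with the Adam solver. The text reports a
    # learning rate of 0.5; the library default is kept here as an assumption.
    "MLP": MLPRegressor(hidden_layer_sizes=(20, 20, 20), solver="adam",
                        max_iter=2000, random_state=0),
}

scores = {}
for name, model in models.items():
    # Standardize the inputs before each model, mirroring the preprocessing step.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    scores[name] = r2_score(y_test, pipeline.predict(X_test))
    print(f"{name}: R2 = {scores[name]:.3f}")
```

Wrapping each estimator in a pipeline keeps the scaling parameters tied to the training set only, so no information from the testing set leaks into the fit.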

Multiple Linear Regression
Regression analysis refers to either a univariate method to analyze the relationship between a dependent variable and one independent variable or a model with one dependent variable and more than one independent variable, in which case it is called multiple linear regression (MLR) [64]. In MLR (Figure 4a), we consider n independent input variables, linearly combined to extract the dependent variable Y as

$$Y = w_1 X_1 + w_2 X_2 + w_3 X_3 + b \qquad (5)$$

where $w_1$, $w_2$, and $w_3$ are the weights imposed on the three respective inputs $X_1 = T$, $X_2 = P$, $X_3 = M_w$, and b is a bias term.

k-Nearest Neighbors
The k-nearest neighbor (k-NN) algorithm selects k training points over a local region of a data point x and labels neighboring points on the basis of the Euclidean distance (Figure 4b). Each sample is a pair including an input vector and the desired output. After sorting the calculated distances from the lowest to the highest, the most prevalent outcome among the first k rows is the predicted result [65]. This algorithm is often accurate; however, there are cases where it may result in slow execution speed and large memory requirements [66].
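For a continuous target such as σ, the usual regression variant of k-NN replaces the most prevalent label with the average of the neighbors' target values. The sketch below illustrates that variant in plain Python; the function name and the toy points are hypothetical, and in practice the inputs would be normalized first so that no single feature dominates the Euclidean distance.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a continuous value as the mean target of the k nearest training points."""
    # Pair each Euclidean distance with its target and sort from nearest to farthest.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy two-feature data (hypothetical, already on a common scale).
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

print(knn_predict(train_X, train_y, (0.1, 0.1), k=3))
```

Every prediction scans all training points, which is the source of the execution speed and memory costs mentioned above.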
Decision Trees
The decision tree (DT) algorithm functions in the sense of a tree flowchart, with nodes, branches, and leaves. Each node represents a test on a feature, and each branch represents the result of that test [67]. The DT model's response is predicted by following the decisions from the start to the end node (the leaf), as shown by the dotted line in Figure 4c. The feature space is recursively partitioned based on the splitting attribute. Each final region is assigned a value to estimate the target output. The DT algorithm is considered easy to apply, although it might need to be combined with other statistical methods to prevent overfitting [68].

Random Forest
A random forest (RF) algorithm consists of various DTs working in parallel (Figure 4d). Each tree outputs a different prediction, and their average is taken as the final prediction. Higher accuracy is usually obtained when the number of trees in the forest increases. In the literature, it has been shown that the RF algorithm is much simpler to implement than complex neural network structures and has been the most accurate choice for fluid applications, such as slip length estimation [69] and the extraction of fluid transport properties [12,70]. All the trees' outputs are averaged (b is the number of trees) as

$$\hat{y} = \frac{1}{b} \sum_{i=1}^{b} f_i(x) \qquad (6)$$

where $f_i(x)$ is the prediction of the i-th tree, providing a more accurate result than the single-tree structure that is also less prone to overfitting.

Gradient Boosting Regressor
The gradient boosting regressor (GBR) algorithm is another implementation of a decision tree algorithm that combines various simple functions (learners) that constitute an ensemble function. Initial learners may be weak, but when combined, they may form strong learners. GBR follows three main steps sequentially: it optimizes the loss function, spots the weaker learner, and improves it by adding more trees to increase accuracy [71]. As shown in Figure 4e, sequential DTs were incorporated, and the output of each one was weighted to enter the next DT. The weights were selected so as to minimize the induced errors [72].
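The sequential character of boosting can be made concrete with a deliberately small sketch. It assumes a squared-error loss, for which the negative gradient is simply the residual, and uses one-split regression stumps on a single input as the weak learners. All names and the toy data are hypothetical; this is not the scikit-learn implementation used for the results.

```python
import math

def fit_stump(x, residuals):
    """Find the single split (threshold, left_mean, right_mean) with the lowest squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_stages=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_stages):
        # For squared loss, the negative gradient equals the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump_predict(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, learning_rate, stumps

def boosted_predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)

def mean_squared_error(y_true, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_hat)) / len(y_true)

# Toy data: a VFT-like curve in temperature (illustrative values only).
x = [250.0 + 10.0 * i for i in range(26)]
y = [50.0 * math.exp(-600.0 / (t - 180.0)) for t in x]

model = gradient_boost(x, y)
baseline_mse = mean_squared_error(y, [sum(y) / len(y)] * len(y))
boosted_mse = mean_squared_error(y, [boosted_predict(model, t) for t in x])
print(f"baseline MSE = {baseline_mse:.4f}, boosted MSE = {boosted_mse:.6f}")
```

The learning rate shrinks each stump's contribution, so many weak corrections accumulate into a strong learner instead of one tree fitting the data in a single step.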

Multi-Layer Perceptron
The traditional perceptron, when presented in multiple layers, i.e., the input, the output, and a number of internal hidden layers, constitutes the multi-layer perceptron (MLP) algorithm. The number of hidden layers is usually determined by trial and error, although various other methods have also been proposed, such as genetic programming [73]. Here, we considered three hidden layers, each one with 20 nodes, with the Adam stochastic solver [74] and a learning rate equal to 0.5 (Figure 4f). The data flow between neurons depends on the activation functions applied in every internal node and a weight function imposed on every input. These weights are adjusted so that the predicted output resembles the expected output with minimum error. The training of the MLP was performed iteratively, with backward computation capability.

Symbolic Regression
SR can also be represented by tree structures (Figure 4g); however, here, the tree nodes are mathematical operators, and leaf nodes correspond to input variables/constants [75]. The algorithm begins by considering a random parent tree structure, calculates the mean squared error (MSE) of the specific implementation, and follows an iterative procedure, in which a node or a branch of nodes is substituted until the minimum MSE is achieved, with low complexity. Complexity refers to the number of leaves and nodes used in the proposed SR tree. The Julia-based SR algorithm by Cranmer et al. [63], which we have widely incorporated in similar works [16,76], accepts a set of mathematical operators {+, −, *, /, ^, e, log} and the input variables {T, P, M_w} and creates an equation pool, from which it selects the best candidates that adhere to the Pareto front, i.e., those that present the minimum MSE values and small complexity, along with physical correspondence to the problem.
Although more computationally intensive and demanding, SR is capable of providing an analytical expression at hand, which, if it fits the dataset under investigation, is superior to other ML techniques, since it can be easily applied for a wide range of inputs. However, care has to be taken so that this expression remains simple and is bound to physical laws [76].
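The tree representation, the complexity count, and the MSE score can be sketched as follows. Expressions are written as nested tuples whose first element is an operator; this encoding, the function names, and the candidate expression are hypothetical illustrations and do not reproduce the internals of the Julia package or any equation obtained in this work.

```python
import math
import operator

# Internal nodes are operators; leaves are variable names or numeric constants.
BINARY = {"+": operator.add, "-": operator.sub, "*": operator.mul,
          "/": operator.truediv, "^": operator.pow}
UNARY = {"exp": math.exp, "log": math.log}

def evaluate(tree, variables):
    """Recursively evaluate an expression tree for one set of variable values."""
    if isinstance(tree, (int, float)):
        return float(tree)
    if isinstance(tree, str):
        return variables[tree]
    op, *args = tree
    values = [evaluate(a, variables) for a in args]
    if op in BINARY:
        return BINARY[op](*values)
    return UNARY[op](*values)

def complexity(tree):
    """Count every operator node and every leaf in the tree."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(complexity(a) for a in tree[1:])

def mse(tree, rows, targets):
    """Mean squared error of the expression over a list of variable dictionaries."""
    errors = [(evaluate(tree, r) - t) ** 2 for r, t in zip(rows, targets)]
    return sum(errors) / len(errors)

# Hypothetical candidate, exp(-600 / (T - 180)); not an expression from the paper.
candidate = ("exp", ("/", -600.0, ("-", "T", 180.0)))

rows = [{"T": 280.0}, {"T": 300.0}, {"T": 350.0}]
targets = [math.exp(-600.0 / (r["T"] - 180.0)) for r in rows]

print("complexity:", complexity(candidate))
print("MSE:", mse(candidate, rows, targets))
```

A genetic programming search would mutate such trees by swapping nodes or whole branches and keep the candidates that lower the MSE without inflating the complexity.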

Metrics of Accuracy
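A number of popular metrics were applied to every algorithmic result to determine which one best satisfies the accuracy criteria. These were the coefficient of determination, R², the mean absolute error (MAE), the mean squared error (MSE), and the average absolute deviation (AAD) [77], as shown in Equations (7)-(10):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_{\mathrm{exp},i} - y_{\mathrm{pred},i}\right)^2}{\sum_{i=1}^{n} \left(y_{\mathrm{exp},i} - \bar{y}_{\mathrm{exp}}\right)^2} \qquad (7)$$

with $\bar{y}_{\mathrm{exp}}$ the mean value of the expected output,

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_{\mathrm{exp},i} - y_{\mathrm{pred},i} \right| \qquad (8)$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_{\mathrm{exp},i} - y_{\mathrm{pred},i}\right)^2 \qquad (9)$$

$$\mathrm{AAD} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_{\mathrm{exp},i} - y_{\mathrm{pred},i}}{y_{\mathrm{exp},i}} \right| \qquad (10)$$

where n is the number of data points, and $y_{\mathrm{exp},i}$ and $y_{\mathrm{pred},i}$ are the experimental and predicted values, respectively.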
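The four metrics can be computed directly from paired lists of experimental and predicted values. In the sketch below, R², MAE, and MSE follow their standard definitions, whereas the AAD is assumed to be the mean relative absolute deviation expressed as a percentage; that form is an assumption and should be checked against [77]. The sample values are arbitrary.

```python
def r2(y_exp, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_exp = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot

def mae(y_exp, y_pred):
    """Mean absolute error."""
    return sum(abs(e - p) for e, p in zip(y_exp, y_pred)) / len(y_exp)

def mse(y_exp, y_pred):
    """Mean squared error."""
    return sum((e - p) ** 2 for e, p in zip(y_exp, y_pred)) / len(y_exp)

def aad(y_exp, y_pred):
    """Average absolute deviation in percent (ASSUMED relative form)."""
    return 100.0 / len(y_exp) * sum(abs((e - p) / e) for e, p in zip(y_exp, y_pred))

# Arbitrary illustrative values.
y_exp = [1.0, 2.0, 4.0]
y_pred = [1.0, 2.5, 3.5]

print(f"R2  = {r2(y_exp, y_pred):.4f}")
print(f"MAE = {mae(y_exp, y_pred):.4f}")
print(f"MSE = {mse(y_exp, y_pred):.4f}")
print(f"AAD = {aad(y_exp, y_pred):.2f} %")
```

Because a relative deviation divides by the experimental value, data points with very small σ can dominate the AAD even when MAE and MSE remain low.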

Partial Dependence
To analyze the effect of each input parameter on the acquired electrical conductivity value, σ, a partial dependence plot was constructed (Figure 5). The partial dependence plot shows the average marginal effect on the σ prediction when only one input variable changes its value while the remaining inputs are kept unchanged. The estimated partial dependence (normalized value) is shown on the vertical axis and the respective input on the horizontal axis. In Figure 5a, it is observed that the molecular weight significantly affected σ, especially at smaller values, around 200-230. The effect of temperature was prominent, especially for values above 270 K (Figure 5b). On the other hand, pressure had only a slight, inversely linear effect on σ for small pressure values, since the partial dependence decreased as the pressure increased (Figure 5c). Furthermore, it seems that σ was practically unaffected by P for values above 100 kPa.
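The computation behind such a plot can be sketched as follows, using the usual definition in which one feature is forced to each grid value in turn while the other features keep their observed values, and the predictions are averaged over the dataset. The stand-in model `toy_model` and the two data rows are hypothetical and only illustrate the mechanics.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over all rows with one feature fixed to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical stand-in for a trained regressor taking (T, P, Mw)."""
    T, P, Mw = row
    return 0.01 * T - 0.001 * P + 100.0 / Mw

# Two illustrative rows of (T, P, Mw).
rows = [(300.0, 1.0, 200.0), (400.0, 10.0, 400.0)]

pd_T = partial_dependence(toy_model, rows, feature_index=0, grid=[300.0, 400.0])
print(pd_T)
```

For this toy model, the curve rises by 0.01 per kelvin regardless of the other inputs, because the model is additive; a trained tree ensemble would generally give a nonlinear curve.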

Machine Learning Results
The results from the application of the numerical ML algorithms on the electrical conductivity dataset are gathered in Figure 6a-f, in identity plots that assess the model's accuracy by plotting the experimental against the predicted data along the 45° diagonal line. The prediction is more accurate when the data points lie close to the line [78].
The linear regression method (MLR) in Figure 6a presented a rather poor fit for the ionic liquid data. This is somewhat expected if we take into account the empirical relations from Equations (1)-(3), where the electrical conductivity value seems to have a logarithmic dependence on the temperature or molecular volume. Thus, we expected that nonlinear ML methods would achieve better results. The KNN algorithm in Figure 6b showed better performance than MLR. Nevertheless, the three tree-based algorithms that follow in Figure 6c-e, i.e., DT, RF, and GBR, respectively, fit the experimental data well, as it seems that their tree structure was better suited to the problem. The neural network (NN) architecture in Figure 6f did not achieve adequate prediction capability for the specific implementation (three hidden layers of 20 nodes each). We also tested different architectures by trial and error but did not manage to obtain better results. However, NNs are a distinct field of investigation, and further work is needed to find the optimal architecture, which is beyond the scope of this paper. In conclusion, most of the algorithms investigated here (except for MLR) achieved acceptable prediction performance on the available dataset.
The accuracy metrics for the fittings shown in Figure 6, i.e., R², MSE, MAE, and AAD, are shown in Table 1. The table values confirmed our observation that the tree-based algorithms achieved the best fit on the data, as the coefficient of determination reached values close to unity (R² = 0.99), while MAE and MSE took their lowest values compared with the remaining algorithms. However, the AAD values differed significantly. The AAD value expresses the average deviation of the predicted data from the experimental base data. From Table 1, it is evident that the GBR method is superior to RF, followed by DT.
Let us now turn our attention to finding the most important input feature that controls the internal mechanism of the algorithmic decisions for the GBR. The feature importance plot in Figure 7a presents an estimation of the importance of each input variable for the prediction of the electrical conductivity value. Temperature, T, was found to be the most important parameter that guided the decisions between the DTs and the branches of the GBR architecture. This is in agreement with the widely used empirical Equations (1)-(2), where T is the only parameter that affects electrical conductivity. The next important feature was the molecular weight, M_w, as it was the main parameter in the proposed model that differentiated between the various types of incorporated ILs. Pressure, P, had only a small effect on the final result. As is also shown in the partial dependence plot of Figure 5, P affected σ only for small values (around P = 100 kPa), and no effect was observed above this limit.
Another important outcome that aids the interpretation of the ML model is the learning curve diagram in Figure 7b, which reveals whether the proposed algorithm was efficiently trained on the dataset. This is connected to the ability of the algorithm to make new predictions. Here, we observe that the cross-validation score increased as the number of training data points increased, reaching its highest value after about 1500 data points. This is evidence that the dataset used in this model (2274 data points) is capable of providing accurate predictions that could be generalized to research cases inside and outside the range of the parameters that constitute the dataset.

Obtaining an Analytical Expression
Symbolic regression is capable of providing analytical expressions to fit the dataset under investigation, without a priori knowledge of the system. This means that the SR algorithm starts with the random construction of expressions and iteratively searches for the best candidate equation. The proposed equations are of various complexity levels, and the choice of a simple or a more complicated one depends on the application and the desired accuracy. Here, we present three possible expressions that describe the electrical conductivity, σ, of ionic liquids, with input variables T, M_w, and P. Table 2 presents the mathematical expressions, along with the calculated metrics. The SR algorithm proposed three different classes of solutions, namely an exponential form (σ1), a nonlinear fractional form (σ2), and a combined fractional/square root form (σ3). These forms appeared most frequently in the output expression pool. We have to note here that the SR output included a total of 20 equations, with increasing complexity (Comp. = 1-20), per iteration run, for 40 parallel runs (for more details refer to [76]), i.e., 800 candidate expressions.
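Selecting candidates from such an expression pool amounts to keeping the entries that are not dominated in the complexity/error plane. The sketch below shows one straightforward way to filter a pool; the dictionary layout and the numerical values are hypothetical, and the actual SR package applies its own selection logic.

```python
def pareto_front(candidates):
    """Keep candidates for which no other candidate is at least as good in both
    complexity and MSE and strictly better in at least one of them."""
    front = []
    for c in candidates:
        dominated = any(
            o["complexity"] <= c["complexity"] and o["mse"] <= c["mse"]
            and (o["complexity"] < c["complexity"] or o["mse"] < c["mse"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["complexity"])

# Hypothetical pool of (complexity, MSE) pairs; not results from the paper.
pool = [
    {"complexity": 1, "mse": 4.0},
    {"complexity": 3, "mse": 2.5},
    {"complexity": 5, "mse": 2.6},  # dominated: more complex and less accurate than the previous one
    {"complexity": 7, "mse": 1.0},
    {"complexity": 9, "mse": 1.0},  # dominated: same error at higher complexity
]

for c in pareto_front(pool):
    print(c)
```

The final choice among the surviving expressions still requires judgment, since the text also asks for physical correspondence to the problem, which no error metric captures.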
We observe that Equation σ1 captured the exponential behavior shown in the empirical Equation (1); however, the two equations cannot be directly compared, since Equation σ1 considers M_w and P in addition to T. Nonetheless, this is a simple equation that captures the physical behavior of the ionic liquids with satisfactory error metrics; its disadvantage is the high AAD value, denoting an increased distance from the real experimental values. The increased complexity of Equations σ2 and σ3 yielded better error metrics, with Equation σ3 reaching R² = 0.883 and Equation σ2 having the smallest AAD = 149,183.5.

Conclusions
Ionic liquid research is a field of investigation mainly based on experimental measurements, and fundamental information is hard to obtain. The need for incorporating novel computational techniques has opened the road to ML techniques that can assist in this direction.
We incorporated various ML algorithms in this paper that showed a good fit on the employed ionic liquid dataset for the electrical conductivity prediction. The best fit was obtained for the GBR algorithm, whose tree-based procedure and ensemble approach to processing the data successfully captured the electrical conductivity behavior. Although the numerical ML algorithms performed well in their predictions, the SR-based investigation also presented in this paper approached the problem analytically, providing mathematical expressions that can be used without further implications, thus overcoming the black-box nature of numerical ML algorithms.
We believe that by further enriching the dataset with values deriving from either experiments or carefully established molecular simulations, ML data-driven techniques can become part of the property calculations of ionic liquids. It is important to suggest novel evolutionary processes, reliable pre- and post-processing techniques, and physics-oriented justification to establish an integrated computational platform that can be used by scientists and engineers who wish to harness the vast volume of data involved in their field.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/10.3390/fluids7100321/s1, Table S1: The database of ILs incorporated for our model, with 2274 data points; Table S2: Ionic Liquids Database.

Figure 1. Machine learning model for electrical conductivity prediction, providing both numerical and analytical output.

Figure 2. Correlation matrix for the three inputs and the output, σ.

The statistical information for the input data can be obtained from the pair plot diagram in Figure 3. The distribution of the three input parameters, T, P, and Mw, and the output, σ, is shown. The investigated ionic liquid is distinguished by the value of the molecular weight, which, in this paper, lay in the range 108.1 ≤ Mw ≤ 556.18. The temperature and pressure conditions were 203.4 K ≤ T ≤ 528.55 K and 0.1 MPa ≤ P ≤ 250.9 MPa, respectively. The output was in the range of 3 × 10⁻⁷ S/m ≤ σ ≤ 19.3 S/m.

Figure 3. A pair plot diagram, showing input and output parameter distribution. The diagonal bar plots display the distribution of each parameter, while the remaining figures are scatter plots between all input pairs.

Figure 5. Partial dependence plot, where each input is investigated on the effect on electrical conductivity (σ). (a) molecular weight; (b) temperature; (c) pressure.
, in identity plots that estimate the model's accuracy by fitting the experimental and predicted data on the 45° diagonal line. The prediction is more accurate when the data points lie close to the line [78].

Figure 7. Interpretation output diagrams from the application of GBR algorithm on the prediction of electrical conductivity: (a) feature importance plot; (b) learning curve diagram.

Table 1. Accuracy metrics and comparison of 6 ML algorithms for the ionic liquids' dataset.

Table 2. The three SR-extracted equations for electrical conductivity and their respective accuracy metrics. Comp. is the equation complexity.