Machine Learning Techniques for Fluid Flows at the Nanoscale

Abstract: Simulations of fluid flows at the nanoscale produce massive amounts of data, and machine learning (ML) techniques have been developed in recent years to leverage them, with unique results. This work employs ML tools to provide insight into properties obtained from molecular dynamics (MD) simulations, covering missing data points and predicting states not previously located by the simulation. Taking the fluid flow of a simple Lennard-Jones liquid in nanoscale slits as a basis, ML regression-based algorithms are exploited to provide an alternative for the calculation of the transport properties of fluids, e.g., the diffusion coefficient, shear viscosity and thermal conductivity, and the average velocity across the nanochannels. Through appropriate training and testing, ML-predicted values can be extracted for various input variables, such as the geometrical characteristics of the slits, the interaction parameters between particles and the flow driving force. The proposed technique could act in parallel to simulation as a means of enriching the database of material properties, assisting in coupling between scales, and accelerating data-based scientific computations.


Introduction
The study of physical phenomena and the extraction of material properties near the atomic scale have matured with the aid of various computational techniques during recent decades, along with experimental efforts that validate fundamental knowledge. Although it remains a challenge to fabricate devices at the nanoscale [1], experimental nanofluidics has suggested the construction of nanodevices for DNA applications [2], charge-sensitive biosensing [3], nanofilters, filtration membranes and desalination [4,5], to name a few. Continuum theory may sometimes be accurate in calculating bulk fluid properties; however, in cases where wall/fluid interaction becomes significant, a bulk description of the fluid is not valid, and molecular dynamics (MD) simulations are employed to describe the flow behavior [6].
Apart from static properties, such as density, velocity or temperature distribution, the transport properties of fluids, e.g., the diffusion coefficient, shear viscosity and thermal conductivity, that control the rate of mass, momentum and heat transfer, are also affected in confined space [7][8][9]. Simulation of Poiseuille-like flows in nanochannels has been a popular choice among researchers to investigate fluid flows at the nanoscale, along with experiments, where possible [10]. Surface topology, atomic, thermal or geometrical roughness [11] and wall material properties such as particle mass and degree of wettability [12] result in density inhomogeneity near the surface [13]; the overall particle dynamics were found to slow down [14] and slip lengths arise that violate the no-slip assumption from the macroscale [15].
Even though simulations are applicable from the nano- to the micro-scale, for systems containing some millions of particles, there are cases where the underlying physics would necessitate extreme computational load and effort. In the new computational era, machine learning (ML) has arisen as an efficient alternative technique for classical physical problems. The statistical nature of ML, based on its implementation simplicity, has favored a unique

System Model
The datasets used for training the ML model came both from simulations in our previous works and from relevant literature datasets. As far as our generic system model is concerned, a Lennard-Jones (LJ) monoatomic liquid flows between two infinite solid walls, which can be flat or grooved (Figure 1). Periodic boundary conditions are considered in the x- and y-directions. The distance between the two walls in the z-direction is h, while groove height and length are h d and h l , respectively. Fluid/fluid, wall/fluid and wall/wall interactions are described by the Lennard-Jones (LJ) 12-6 potential:
$$u^{LJ}(r_{ij}) = 4\varepsilon\left[\left(\frac{\sigma}{r_{ij}}\right)^{12} - \left(\frac{\sigma}{r_{ij}}\right)^{6}\right] \qquad (1)$$
with a cut-off radius r c = 2.5σ. The values of the LJ parameters σ and ε and the masses of the particles were chosen to correspond to argon (Ar) in liquid state, i.e., σ f = σ w = 0.3405 nm (w: wall; f : fluid), ε f /kB = 119.8 K and m Ar = 39.95 a.u. The system was simulated for different wall/fluid interaction energy ratios, ε w /ε f , which is analogous to surface wettability (hydrophobic when ε w /ε f < 1 and hydrophilic when ε w /ε f ≥ 1; see [26] for details).
The ratio of wall-to-fluid interaction ε w /ε f defines how close fluid atoms approach the wall. A flow originates due to the application of an external force F ext equally applied to all fluid particles, while the system temperature remains constant through the application of Nosé-Hoover thermostats in the NVT ensemble. Depending on the simulation, various time steps have been considered.
In most cases, each simulation begins with an NVE equilibration stage; at the second stage, fluid particles attain random velocities and, finally, consecutive NVT simulations are performed to provide simulation outputs, each one for at least 20 ns total time, which are averaged to provide the final parameter value.
To understand the complexity of MD simulations, we have to keep in mind that at each time step, the interactions of all atoms are calculated, and then, the atoms are moved to their next positions by incorporating the resulting forces. By using atom positions and velocities during the simulation, one can obtain several material properties through appropriate relations. The relations used to extract the three transport properties of fluids investigated, i.e., the diffusion coefficient, D, shear viscosity, η, and thermal conductivity, k, are given in Appendix A. There is a strong effect of the walls on fluid flows in small dimensions; however, all properties approach their respective bulk values as the channel width increases (e.g., for the argon case, h > 20σ) [27]. Calculations arise from tracking-down of the microscopic state of the system being simulated and demand time-consuming calculations to achieve satisfying accuracy. Specifically, for the calculation of shear viscosity and thermal conductivity, precise and long simulations are needed in order to obtain statistically significant and convergent values. Thus, having the alternative to predict them with ML techniques would be an asset.
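As an illustration of the pair interaction described in this section, the following minimal sketch evaluates the LJ 12-6 potential and the corresponding radial force in reduced units (σ = ε = 1) with the 2.5σ cut-off. The function names and the choice of a plain truncation at the cut-off (no energy shift) are assumptions made for this example and are not taken from the simulation code used in this work.

```python
import numpy as np

SIGMA = 1.0           # reduced LJ length unit (0.3405 nm for argon)
EPSILON = 1.0         # reduced LJ energy unit
R_CUT = 2.5 * SIGMA   # cut-off radius r_c = 2.5 sigma


def lj_potential(r, epsilon=EPSILON, sigma=SIGMA, r_cut=R_CUT):
    """LJ 12-6 pair energy, set to zero beyond the cut-off (assumed plain truncation)."""
    r = np.asarray(r, dtype=float)
    sr6 = (sigma / r) ** 6
    u = 4.0 * epsilon * (sr6 ** 2 - sr6)
    return np.where(r < r_cut, u, 0.0)


def lj_force(r, epsilon=EPSILON, sigma=SIGMA, r_cut=R_CUT):
    """Radial force -du/dr (positive values are repulsive), zero beyond the cut-off."""
    r = np.asarray(r, dtype=float)
    sr6 = (sigma / r) ** 6
    f = 24.0 * epsilon * (2.0 * sr6 ** 2 - sr6) / r
    return np.where(r < r_cut, f, 0.0)


# Location of the potential minimum, r = 2^(1/6) sigma, where u = -epsilon and the force vanishes
r_min = 2.0 ** (1.0 / 6.0) * SIGMA
```

In an MD code, functions of this kind are evaluated for every interacting pair at every time step, which is where most of the computational cost discussed above originates.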

Figure 1. Channel model, with the y-direction normal to the page, where F ext is the driving force applied in the x-direction, the channel height is h, the length of the grooves is h l and their height is h d .

Machine Learning
Machine learning is a subfield of Artificial Intelligence (AI) that involves the use of statistical methods to investigate and construct algorithms that are trained on data inputs and make predictions for data outputs. The algorithms inferred operate by building a model from example inputs, follow a decision process and provide predictions, which are usually verified by the same input dataset. Estimating an output from an input dataset is called regression, a type of supervised learning in machine learning. Learning corresponds to adjusting the parameters so that the model makes the most accurate predictions on the data [28].
In a simple regression model, if Y is the predicted variable, X is the input variable, b is the bias term and w is the weight of the variable, then:
$$Y = wX + b \qquad (2)$$
For a set of n independent input variables (i.e., the regressors), the multiple linear regression model is:
$$Y = w_1 X_1 + w_2 X_2 + \dots + w_n X_n + b \qquad (3)$$
In the above expression, w 1 , w 2 , . . . , w n are a set of unknown parameters representing the impact of the respective independent inputs X 1 , X 2 , . . . , X n on the dependent variable Y, and b is the bias term, which accounts for the unknown error imposed in the model.
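The multiple linear regression model can be illustrated with a short least-squares fit. The sketch below uses NumPy on synthetic data; the weights, bias, noise level and sample size are hypothetical values chosen only for the example, and the fit is an ordinary least-squares solution, which is what a standard linear regressor such as scikit-learn's `LinearRegression` also computes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples of 5 inputs (all values are hypothetical)
n_samples, n_inputs = 200, 5
X = rng.uniform(0.0, 1.0, size=(n_samples, n_inputs))
w_true = np.array([1.5, -2.0, 0.5, 0.0, 3.0])   # assumed weights w_1..w_5
b_true = 0.7                                     # assumed bias term b
Y = X @ w_true + b_true + rng.normal(0.0, 0.01, size=n_samples)

# Append a column of ones so that the bias is fitted together with the weights
A = np.column_stack([X, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
w_fit, b_fit = coef[:-1], coef[-1]
```

The fitted `w_fit` plays the role of the weights w 1 , . . . , w n and `b_fit` that of the bias b; with low noise, both should lie close to the values used to generate the data.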
A useful metric for the success of the predicted value over the real value is the root mean square error (RMSE). It is given by:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (4)$$
The mean square error (MSE) is given by:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (5)$$
where y_i is the real value, \hat{y}_i is the predicted value and n is here the number of data points.
The model investigated in this work is graphically presented in Figure 2. The algorithm was written in Python, with functions employed from the scikit-learn library [29]. There are five inputs fed to the ML algorithm: the external force F ext , the channel height h, the length percentage h l /h, the height percentage h d /h of the grooves (if they exist) and the ratio of wall-to-fluid interaction ε w /ε f . The algorithm is expected to estimate the weight of each input and its impact on each one of the four independent outputs: the diffusion coefficient D, the shear viscosity η, the thermal conductivity k and the average channel velocity u m .
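The error metrics can be computed directly from arrays of real and predicted values. The following sketch assumes the standard definitions of MSE, RMSE, the mean absolute error (MAE) and the coefficient of determination R 2 (the latter two are reported alongside RMSE in the accuracy results); the function name and the four sample values are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired real/predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical example values
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(y_true, y_pred)
```

For these sample values the squared errors sum to 0.10, so the MSE is 0.025 and R 2 is 0.98.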

The choice of these input parameters was made because relevant simulation evidence supports the assumption that they significantly affect most flow and transport properties in nanochannels [9,21,30,31]. With h, h l /h and h d /h being the geometrical characteristics of the channels, ε w /ε f affecting atomic interactions and F ext being the main factor defining the Reynolds number, we believe that we cover a wide range of simulation cases. The external driving force is considered only for the training and prediction of u m ; it has no significant effect on the three transport properties D, η, and k [7], at least in the range of forces studied so far.
Furthermore, one could also consider the system temperature T, the average fluid density ρ, the LJ parameter σ, the particle mass m or the surface stiffness K as parameters affecting the flow in nanochannels [26,30]. Nevertheless, the simulation complexity would increase and data from the literature would be hard to obtain.

Dataset Creation
An extensive literature search was performed to employ simulation data to train and test the model, along with our in-house simulation data. This is a non-trivial task since the obtained data should be in accordance with our input data. Therefore, we had to be careful with what values we could use from the vast database of MD papers found in the literature. It must be clarified that although data from our own simulations have been extracted under the same conditions, data from the literature may differ. For example, different types of thermostats may have been used, or different simulation parameters, such as the set temperature or time step, fluid and wall density, the wall spring constant K that keeps wall atoms around their original positions, etc. However, we believe that these differences may only have a small effect on the accuracy of the model, and they can still be incorporated to quantitatively verify our model. Table 1 presents the literature sources and the types of data incorporated to create the dataset. From each of these sources, only values corresponding to similar simulation conditions were kept. Each number under the output properties denotes the number of points extracted from the respective reference. The dataset may seem small; however, it is regarded as a representative set of parameters that could represent simulation results in a qualitative manner, while more data points are to be added in a future work.

Data Preprocessing
During the process of producing output data in our simulation system, each independent input variable (h, h l /h, h d /h, ε w /ε f and F ext ) covers a range of values while the four others are kept constant. The range of input values for the simulations is tabulated in Table 2.
Table 2. Range of input data in reduced Lennard-Jones (LJ) units.
The complete dataset was divided into training points to feed the model and testing points to compare with predicted data, in a percentage of 80/20, respectively. Training points can be selected randomly or from carefully selected data points. For the dataset employed in this work, training points were chosen randomly and cover the entire dataset length.
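A random 80/20 division of this kind can be sketched as follows. The helper name, the seed and the dataset size of 50 points are assumptions for illustration only; scikit-learn's `train_test_split` offers the same functionality.

```python
import numpy as np


def train_test_split_indices(n_points, test_fraction=0.2, seed=0):
    """Randomly split indices 0..n_points-1 into training and testing sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_points)
    n_test = int(round(test_fraction * n_points))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 50 points: 40 for training, 10 for testing
n_points = 50
train_idx, test_idx = train_test_split_indices(n_points)
```

Because the indices are drawn from a random permutation, the training points are spread over the entire dataset length, and no point appears in both sets.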
Data inputs/outputs were pre-processed before being fed to the regression model in Figure 2. The flowchart in Figure 3 demonstrates the complete data flow. After data collection, a normalization stage followed to restrict the input value range.
Fluids 2021, 6, 96
Next, a correlation check was performed. There are five independent input variables in the model that define, in a weighted manner, each one of the four dependent output variables. It is common practice in statistics to check whether any correlations exist between the independent variables. A popular measure is the Pearson correlation coefficient, r xy . It quantifies the correlation between two inputs, X i and Y i , of length n, with mean values \bar{X} and \bar{Y}, respectively, as follows:
$$r_{xy} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}} \qquad (7)$$
The variance inflation factor (VIF) provides an estimate of multicollinearity between variables and is given by:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \qquad (8)$$
where R_i^2 is the coefficient of determination for an independent variable [43]. In general, VIF values greater than 10 denote that the respective input can be omitted.
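The Pearson coefficient used for the correlation check can be sketched in a few lines. The data below are synthetic and purely hypothetical: two variables are constructed to be negatively correlated, loosely mimicking the behavior reported for the two groove characteristics, and a third is independent. The variable names are illustrative only.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))


rng = np.random.default_rng(1)
groove_length = rng.uniform(0.0, 1.0, 100)                           # hypothetical h_l/h values
groove_height = 1.0 - groove_length + rng.normal(0.0, 0.05, 100)     # anti-correlated by construction
channel_height = rng.uniform(5.0, 20.0, 100)                         # independent input

r_geom = pearson_r(groove_length, groove_height)

# Full correlation matrix; each row of the stacked array is one variable
corr_matrix = np.corrcoef(np.vstack([groove_length, groove_height, channel_height]))
```

The entry `corr_matrix[0, 1]` coincides with `r_geom`, and a matrix of this kind is what a correlation heat map visualizes.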

Apart from possible input rejection due to collinearity, the regression analysis can spot output points, the so-called "outliers", that lie far from the regression lines and whose behavior needs further investigation. They could be considered either as "bad" predictions, or, in many cases, they may have resulted from statistical errors, noisy data or some kind of computational inaccuracy during the simulations [44].
A statistical measure used to identify the contribution of a data point to the total regression quality, and thus to identify outliers, is Cook's distance [45], given by:
$$\mathrm{CD}_i = \frac{\sum_{j=1}^{n}\left(y_j - y_{j(i)}\right)^2}{p\,\sigma^2} \qquad (9)$$
where y j is the jth fitted output value, y j(i) is the jth fitted output value after the removal of the ith data point, p is the number of regression coefficients and σ² is the estimated variance from the fit.

Correlations
Figure 4 presents the correlation matrices of each one of the three transport properties (dependent variables) and the average channel velocity, u m , according to the Pearson coefficient calculation (Equation (7)). In all correlation matrices, it is shown that there is a strong negative correlation between the two geometrical wall characteristics, h l /h and h d /h. In grooved channels, the simulations have shown that the length of the grooves has an inversely proportional effect compared to the groove height; for example, when the groove length h l /h is large (compared to the channel height), the flow resembles the smooth channel case, while, on the other hand, when the groove height h d /h is large, it blocks the flow, affecting all parameters. It is expected that diffusion coefficient values in Figure 4a are affected mainly by the channel width h (large channel-large D).
In Figure 4b, the respective correlation matrix for the shear viscosity η does not locate any correlations between the inputs. MD simulations have revealed the prominent effect of the channel width h on η (large channel-small shear viscosity [7,33]). The correlation matrix for thermal conductivity presents no significant correlation between the inputs (Figure 4c). In contrast to the other dependent variables, the correlation matrix reveals a remarkable behavior for the average velocity u m case, shown in Figure 4d. The input parameters ε w /ε f and F ext are highly correlated.
This finding indicates possible multicollinearity, and further investigation is to be performed. The VIF (Equation (8)) was calculated for every input and the values are shown in Table 3. All input parameters are only slightly correlated, with VIF values below the threshold of 10, and this denotes that the ML procedure is to be executed keeping all input parameters.
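The VIF check can be illustrated by regressing each input on all the others and converting the resulting R 2 into a VIF value. The sketch below uses NumPy on synthetic data; the function name and the three test variables are hypothetical, and one variable is deliberately built to be nearly collinear with another so that the threshold of 10 is exceeded.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column of X, from regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for i in range(k):
        target = X[:, i]
        # Remaining columns plus an intercept column
        others = np.column_stack([np.delete(X, i, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[i] = 1.0 / (1.0 - r2)
    return vifs


rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)   # nearly collinear with a (by construction)

vif_independent = variance_inflation_factors(np.column_stack([a, b]))
vif_collinear = variance_inflation_factors(np.column_stack([a, b, c]))
```

For independent inputs the VIF stays close to 1, whereas the nearly collinear pair yields values far above 10, which is the situation in which an input would be considered for omission.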

Model Accuracy
To scrutinize the regression model performance, calculated and predicted values for each output are plotted in Figure 5a-d. Each diagram includes the training (blue squares) and the testing (yellow circles) points. The lines correspond to the linear ML model regression fits. Inset figures include the 95% confidence intervals, i.e., a statistical measure that quantifies the uncertainty of predicted values over values used to test the model. The calculated prediction accuracy results (RMSE, MAE and R 2 ), as well as the weights for every input according to Equation (3), are presented in Table 4. The predictions of two of the three transport properties, D and k, show remarkable agreement between the tested and predicted values, as shown by the high R 2 values. In contrast, the model performance on shear viscosity, η, and the average channel velocity, u m , is lower, albeit acceptable. The data points in Figure 5a [...]
However, further investigation is needed to characterize a data point as an outlier or not. To this end, we have employed two of the most widely used statistical tools, the residuals plot and the Cook's distance plot. Visualization for these is made possible with the Python Yellowbrick package [46]. The residuals plot presents the calculated difference between the real value and the predicted value, i.e., the prediction error. Figure 6a is a residuals plot for D train and test data. Data points are scattered around the horizontal axis. A regression fit is considered good when data are close to the horizontal line. The respective histogram shows that the induced error is distributed around zero. There are data points in the histogram far from zero; nevertheless, the main distribution is around zero. Train and test R 2 values shown in the diagram are similar to the average value shown in Table 4.
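The quantities behind a residuals plot can be reproduced with a few lines of NumPy. This is a minimal sketch on synthetic data (the two inputs, the weights and the noise level are hypothetical) and not the Yellowbrick visualization itself; it computes the residuals as real minus predicted values and bins them as a histogram.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic two-input dataset with small Gaussian noise (hypothetical values)
x = rng.uniform(0.0, 1.0, size=(80, 2))
y = 2.0 * x[:, 0] - 1.0 * x[:, 1] + 0.3 + rng.normal(0.0, 0.05, 80)

# Least-squares linear fit with an intercept
A = np.column_stack([x, np.ones(len(x))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_pred = A @ coef

# Residual = real value minus predicted value
residuals = y - y_pred

# Histogram of the residuals, as shown next to a residuals plot
counts, bin_edges = np.histogram(residuals, bins=10)
```

For a least-squares fit with an intercept, the residuals of the fitted data average to zero, so a histogram centered on zero is the expected picture; isolated bins far from zero point to candidate outliers.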
To strengthen our statistical evidence, Figure 6b depicts the calculated Cook's distance for the diffusion coefficient in our model (Equation (9)), a measure that identifies the influential outliers. It is presented as a stem plot over the data index, where a horizontal line is drawn at the 4/n threshold. Stems above this line are possible outliers and their percentage is shown in the figure legend. For the diffusion coefficient, D, the simulation data point with index = 24 is considered an outlier. Going back to the dataset, it is found that this point belongs to a simulation result from an h = 18.5σ nanochannel, taken from our in-house simulations, with the extreme value of wall/fluid interaction ε w /ε f = 5.0, which is found to have a decreasing effect on D, as reported in [8].
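The Cook's distance test with the 4/n threshold can be sketched as follows. The code uses the closed-form expression based on the leverage (the diagonal of the hat matrix), which is algebraically equivalent to refitting the model with each point removed. The dataset is synthetic, and the point at index 24 is deliberately displaced to act as an influential outlier; the index is arbitrary and only echoes the example discussed above.

```python
import numpy as np


def cooks_distance(X, y):
    """Cook's distance of every data point for a linear fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([X, np.ones(n)])
    p = A.shape[1]                         # number of regression coefficients
    hat = A @ np.linalg.pinv(A)            # hat (projection) matrix
    leverage = np.diag(hat)
    resid = y - hat @ y
    s2 = (resid @ resid) / (n - p)         # estimated variance of the fit
    return resid ** 2 * leverage / (p * s2 * (1.0 - leverage) ** 2)


rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, size=(40, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.05, 40)

# Displace one point far from the input range and off the trend (synthetic outlier)
X[24, 0] = 2.0
y[24] = 0.0

d = cooks_distance(X, y)
threshold = 4.0 / len(y)                   # the 4/n rule of thumb
flagged = np.flatnonzero(d > threshold)    # indices of possible outliers
```

Points whose distance exceeds `threshold` correspond to the stems above the horizontal line in a Cook's distance plot; refitting without them shows how much they influence the regression.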
If we remove this outlier from the dataset, we obtain the respective residuals plot and Cook's distance in Figure 6c,d. The outlier removal does not seem to affect the accuracy of the regression model, as shown in the residuals plot; a single outlier is not very influential. No other possible influential outliers exist, as all data points are now below the threshold horizontal line (Figure 6d).
For the shear viscosity, η, a residual plot for train and test data is shown in Figure 7a. Although data are mainly scattered around the horizontal line, there are a few points that keep the R 2 value low. The Cook's distance (Figure 7b) depicts these outliers, and after their removal, we observe that residuals have significantly improved and the data distribution is around zero, as shown in the respective histogram plot in Figure 7c. No other outliers remain in the dataset (Figure 7d). We argue that linear regression has reached its prediction limits for shear viscosity with acceptable accuracy, at least for this dataset range. Previous works have shown that shear viscosity values are high in small nanochannels (from h = 2σ) and reach the bulk value for h > 10-12σ [6]. Moreover, η also increases when roughness elements "block" the flow region inside nanochannels, i.e., h d /h ≥ 0.15, and when the walls are strongly hydrophilic, i.e., ε w /ε f = 2-5 [8]. Therefore, our ML model fails to predict shear viscosity values for small nanochannels with roughness elements that block the flow and strongly hydrophilic walls, creating outliers. However, in all other cases, the accuracy obtained with multivariate regression is good.
For thermal conductivity, k, the residual plot in Figure 8a shows good accuracy for training data. We must point out that the small R 2 in test data is circumstantial, since our model randomly selects which data from the dataset to use for training and testing. The Cook's distance (Figure 8b) depicts two outliers, and after their removal, we observe that R 2 is improved for test data. The large ε w /ε f = 2-5 ratio (strongly hydrophilic wall) is also responsible for the outliers in thermal conductivity values. We note that thermal conductivity has shown remarkable accuracy with the regression method investigated here.
The residuals plot for u m in Figure 9a reveals good accuracy of the regression model, while the histogram on the same plot presents a normal distribution. This is evidence that linear regression is a viable choice for predicting average velocity values with ML in systems with similar characteristics. After the outlier removal, the model accuracy is further increased, as shown in Figure 9c. As in previous cases, outliers for u m are due to the roughness element height, h d /h, and hydrophilic walls.

Discussion
The ML regression technique incorporated in this work has shown good performance in predicting the three transport properties of fluids, D, η and k, and the average velocity across the channel, u m , a property that is a basic element in many computational fluid mechanics equations, such as the estimation of the Reynolds number [38]. To our knowledge, data from nanoscale simulations have been mainly used for coupling ab initio calculations to MD simulations, for the construction of coarse-grained systems or for decreasing the order of ordinary and partial differential equations. Nevertheless, as ML is currently a widely investigated topic in the field of condensed matter physics, it is expected to continuously provide new research results.
Data curation is an important issue when data are obtained from various databases, as indicated by early papers in this domain [47], although there seems to be no commonly accepted protocol or set of procedures for data preprocessing; data regularization/normalization is one of the widely used techniques. Our input data were normalized before being fed to the regression model. A small, though representative, dataset was chosen, which covers a wide range of simulation cases. The quality of the datasets is considered high, judging by the impact of the journals from which they were taken. Checking for outliers and their effect on the resulting model can also act as a control for the quality of the dataset. As regards the number of points to incorporate in such a procedure, there is no clear answer. In the literature, successful ML models have been reported with datasets ranging from fewer than a hundred [48] to thousands of values [25]. It is generally accepted that for smaller datasets, classical and statistical ML approaches (e.g., regression, support vector machines, k-nearest neighbors and decision trees) are more suitable [49].
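As an illustration of the normalization step mentioned above, the following minimal Python sketch standardizes one input column to zero mean and unit variance. The text does not specify which normalization was applied, so the z-score form, the function name `standardize` and the sample channel heights are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores (zero mean, unit population variance) of a list of numbers."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical channel heights, for illustration only (not data from this work).
heights = [10.0, 20.0, 30.0, 40.0]
scaled_heights = standardize(heights)
```

Each input column would be treated the same way, so that variables with very different magnitudes contribute on a comparable scale to the regression.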
Our work focused on fluid flows at the nanoscale. As simulation systems become bigger and multiscale methods succeed in coupling flow phenomena among scales, it has to be investigated whether some time- and hardware-demanding computations can be replaced with procedures that are easier to perform. In the previous sections, it was shown that, even with common multivariate regression techniques, ML models can be constructed that are capable of predicting values close to properties extracted from MD simulations found in the literature.
Calculation of the three transport properties, D, η and k, is computationally demanding, especially in nano-confined systems where the impact of the walls is significant and stronger shear stresses exist. Calculating the interactions between all atoms in a system is challenging, and many researchers have suggested relations from the macroscale that could be applied at the nano- and micro-scale after some modifications [50][51][52]. The equations for the extraction of the three transport properties used in our simulations are presented in Appendix A. For the ML model exploited here, five inputs are fed into a regression-based ML procedure: the geometrical channel characteristics, namely the channel height, wall groove length and groove height (h, h_l/h and h_d/h, respectively), the interaction ratio between wall/fluid atoms, ε_w/ε_f, and the external force F_ext used to drive the flow (taken into consideration only for the u_m extraction). These independent parameters have been proven to be uncorrelated for the prediction of D, η, k and u_m.
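A multivariate regression of the kind described above can be sketched in a few lines of Python. This is a minimal ordinary least squares fit through the normal equations, not the exact procedure used in this work. For brevity only three of the five inputs are used, and the sample values, the coefficients and the function names (`solve_linear_system`, `fit_linear`, `predict`) are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(features, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, feature_row):
    """Evaluate the fitted linear model for one set of inputs."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], feature_row))


# Hypothetical, already scaled inputs: (h, eps_w/eps_f, F_ext).
samples = [
    (1.0, 0.5, 0.1), (1.0, 1.0, 0.2), (2.0, 0.5, 0.2),
    (2.0, 1.5, 0.1), (3.0, 1.0, 0.3), (3.0, 1.5, 0.2),
]
# Synthetic targets generated from assumed coefficients (not simulation data).
true_coeffs = [0.2, 0.1, -0.05, 0.4]
targets = [predict(true_coeffs, s) for s in samples]

coeffs = fit_linear(samples, targets)
new_prediction = predict(coeffs, (2.5, 0.8, 0.15))
```

Because the synthetic targets are exactly linear in the inputs, the fit recovers the assumed coefficients; with real simulation data the residuals would be non-zero and should be inspected.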
Predicted values from our model are well described by linear regression fits. Based on the residual plots presented in Section 3.2, it is inferred that multiple linear regression can be a good choice for data prediction, at least at the nanoscale, with accuracy comparable to MD results. Values that influence the model accuracy have been spotted for each output. From the interpretation of the results in Figures 6-9, it is inferred that statistical tools such as residual plots and Cook's distance can locate outliers in a database. Since one deals with simulation data, inherent statistical errors or noise would be expected to affect the ML model. However, these inaccuracies do not seem to qualitatively affect our regression-based ML procedure; it is rather values taken from extreme simulation conditions, such as large ε_w/ε_f ratios or channels with grooves of large height, that seem to affect the efficiency. These points could be removed from our dataset to achieve increased accuracy, yet this would affect the physical meaning of the ML model. Since intuition plays a key role in selecting which outliers are to be removed [44], we believe that the better option is to enrich the training samples with more extreme data points, so that the model is fully trained and achieves higher accuracy.
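The outlier screening discussed above can be sketched for the simplest case of a straight-line fit with a single predictor, where the leverage has a closed form. The model in this work is multivariate, so the example below is a simplified illustration only; the function name `cooks_distances` and the data points are hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance of each point for a straight-line fit y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point (diagonal of the hat matrix) for simple regression.
    leverages = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical data: the last point deviates from the trend of the others.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_values = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]

distances = cooks_distances(x_values, y_values)
most_influential = max(range(len(distances)), key=distances.__getitem__)
```

A point with a large Cook's distance combines a large residual with high leverage, which is why values from extreme simulation conditions tend to stand out in such a check.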
Another approach would be the incorporation of other ML algorithms, such as various types of neural networks and deep learning. It is anticipated, though, that this would demand a larger dataset which, in order to comply with our simulation data, must be created from scratch. Training and test data, apart from our simulation database, were also drawn from the literature. We have to note that slight inaccuracies may have occurred during the data extraction from the respective published papers. Moreover, there may be simulation conditions different from our own, with different simulation techniques, time steps, temperatures, etc., that may also induce some inaccuracies.
In a broad sense, this work aimed to couple machine learning and computational condensed matter physics. Overall, our results, despite the approximations necessarily made to permit the inclusion of data coming from different sources, appear to be in qualitative agreement with a number of literature results and achieve satisfactory accuracy. Simulation techniques combined with machine learning analysis enable us to use scarce data more effectively [53].

Conclusions
It is widely believed that combining classical and quantum simulations with ML methods could change the way predictions are made in condensed matter physics. In this work, we focused on flow simulations in nanoslits of various dimensions for a range of characteristics affecting the flow, such as the wall structure, the interaction strength between fluid and solid, and the external driving forces. Along with data obtained from the literature, a small, albeit indicative, database was created. Transport and flow properties of a simple LJ fluid were predicted after employing the appropriate technique, with multivariate regression showing good accuracy.
We have shown that, in this context, ML can be a valuable predictive tool, especially where data are missing among various scales. This would increase our ability to replace some simulation points and, as a next step, further facilitate coupling across scales. The key concept in this direction is the creation of a statistically large database that could be incorporated into a powerful machine learning framework. However, it should be kept in mind that the proposed method should not be viewed as a replacement for current simulation techniques, which have been verified and tested over various conditions throughout the years. Simulations and ML techniques could coexist in order to unlock new, promising possibilities in computational science and engineering problems.

Data Availability Statement: Data may be available from the authors upon request.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A
The diffusion coefficient is obtained using Einstein's relation:

$$D = \lim_{t \to \infty} \frac{1}{2dNt} \left\langle \sum_{j=1}^{N} \left[ \mathbf{r}_j(t) - \mathbf{r}_j(0) \right]^2 \right\rangle$$

where $\mathbf{r}_j$ is the position vector of the jth atom and d is the dimensionality of the system (d = 1 for diffusivity calculation in one direction, d = 2 in two directions and d = 3 in three directions). The brackets indicate the time average, while N is the number of LJ fluid atoms.
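Einstein's relation can be illustrated with a short Python sketch that evaluates the finite-time estimate of D from initial and final atom positions. It uses a single time origin rather than a full time average, and the one-dimensional random walk used as test data is a hypothetical stand-in for an MD trajectory; the function name `einstein_diffusion` is likewise an assumption.

```python
import random


def einstein_diffusion(initial, final, elapsed_time, dim):
    """Finite-time Einstein estimate: sum_j |r_j(t) - r_j(0)|^2 / (2 d N t)."""
    n_atoms = len(initial)
    total_sq_disp = 0.0
    for r0, rt in zip(initial, final):
        total_sq_disp += sum((b - a) ** 2 for a, b in zip(r0, rt))
    return total_sq_disp / (2.0 * dim * n_atoms * elapsed_time)


# Hypothetical test data: independent 1D random walkers taking unit steps
# once per unit time, for which the expected diffusion coefficient is 0.5.
rng = random.Random(0)
n_walkers, n_steps = 2000, 100
start = [(0.0,) for _ in range(n_walkers)]
end = [
    (float(sum(rng.choice((-1, 1)) for _ in range(n_steps))),)
    for _ in range(n_walkers)
]
d_estimate = einstein_diffusion(start, end, float(n_steps), dim=1)
```

In practice the mean squared displacement is averaged over many time origins and D is taken from its long-time slope, which reduces the statistical noise of a single-origin estimate.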
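Shear viscosity and thermal conductivity for systems in equilibrium can be calculated using the Green-Kubo formalism. Shear viscosity η for a pure fluid is computed by the relation

$$\eta = \frac{1}{V k_B T} \int_0^{\infty} \left\langle J_p^{xy}(t)\, J_p^{xy}(0) \right\rangle dt$$

where $J_p^{xy}$ is the off-diagonal component of the microscopic stress tensor, V is the volume, $k_B$ is the Boltzmann constant and T is the temperature. Thermal conductivity k can be calculated by the integration of the time-autocorrelation function of the microscopic heat flow $J_q^x$, i.e.,

$$k = \frac{1}{V k_B T^2} \int_0^{\infty} \left\langle J_q^{x}(t)\, J_q^{x}(0) \right\rangle dt$$

where the microscopic heat flow $J_q^x$ is given by

$$J_q^x = \frac{1}{2} \sum_i m_i \upsilon_i^2\, \upsilon_i^x - \sum_i \sum_{j>i} \left[ r_{ij}^x \frac{\partial u(r_{ij})}{\partial r_{ij}^x} - I\, u(r_{ij}) \right] \upsilon_i^x$$

where $m_i$ is the mass of atom i, $\upsilon_i$ is the velocity magnitude of atom i, $u(r_{ij})$ is the LJ potential between atoms i and j at separation $r_{ij}$, and I is the identity matrix.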
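The numerical side of the Green-Kubo formalism can be sketched as follows: compute the time-origin-averaged autocorrelation of a flux time series and integrate it with the trapezoidal rule. This is a minimal illustration, not the implementation used in our simulations; the function names, the short flux series and the values of V, k_B and T (in reduced units) are hypothetical.

```python
def autocorrelation(series, max_lag):
    """Time-origin-averaged <J(t0) J(t0 + lag)> for lags 0..max_lag."""
    n = len(series)
    acf = []
    for lag in range(max_lag + 1):
        products = [series[i] * series[i + lag] for i in range(n - lag)]
        acf.append(sum(products) / len(products))
    return acf


def green_kubo(acf, dt, prefactor):
    """Prefactor times the trapezoidal integral of the autocorrelation function."""
    integral = sum(0.5 * (acf[i] + acf[i + 1]) * dt for i in range(len(acf) - 1))
    return prefactor * integral


# Hypothetical stress-tensor samples J_p^xy recorded every dt (reduced units).
flux_series = [0.8, 0.3, -0.2, -0.5, -0.1, 0.4, 0.6, 0.1, -0.3, -0.4]
volume, k_b, temperature, dt = 100.0, 1.0, 1.0, 0.005

acf = autocorrelation(flux_series, max_lag=4)
# Shear viscosity uses 1/(V k_B T); thermal conductivity would use 1/(V k_B T^2).
eta_estimate = green_kubo(acf, dt, prefactor=1.0 / (volume * k_b * temperature))
```

For a real trajectory the series must be long enough for the autocorrelation function to decay to zero within the integration window, otherwise the integral does not converge.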