Data Mining and Machine Learning Techniques for Aerodynamic Databases: Introduction, Methodology and Potential Benefits

Abstract: Machine learning and data mining techniques are now used in many business sectors to exploit data in order to detect trends, discover features and patterns, or even predict the future. However, in the field of aerodynamics, the application of these techniques is still in its initial stages. This paper explores the benefits that machine learning and data mining techniques can offer to aerodynamicists in order to extract knowledge from CFD data and to make quick predictions of aerodynamic coefficients. For this purpose, three aerodynamic databases (NACA0012 airfoil, RAE2822 airfoil and 3D DPW wing) have been used, and the results show that machine learning and data mining techniques also have great potential in this field.


Introduction
In the field of aerodynamics, complex steady flows are simulated daily in industry by computational fluid dynamics (CFD), since CFD tools have already reached an acceptable level of maturity. These simulations are usually performed over full-aircraft configurations or several aircraft components, where meshes of hundreds of millions of points are required in order to capture the features of the flow precisely. In addition, simulations are performed for different parameters to properly explore the design space. This implies a high computational cost that may, in certain situations, still be infeasible today. To overcome this limitation, the CFD solver could be replaced by a surrogate model which produces a fast prediction of the aerodynamic features, based on previous simulations or wind-tunnel data. Machine learning techniques commonly used in the areas of artificial intelligence (AI) and data mining (DM) can provide valuable support in reducing the computational cost required for aerodynamic analysis.
The objective of this paper is to investigate the application of machine learning and data-driven approaches to aerodynamic analysis. While these techniques have been broadly used in other sectors such as finance or risk analysis, their application in the aeronautical sector is still in its infancy. The novelty of this paper lies in assessing the feasibility and potential benefits of applying these techniques to the aerodynamic analysis of aeronautical configurations. The application test cases have been selected from those commonly used in the literature for validation purposes, in order to be able to quickly generate the databases required for testing the methods, and to provide comparable results. To this end, this paper covers all the aspects required in any machine learning project, such as data analysis, feature scaling, model construction, and accuracy measurement.
The main motivation for this research is to analyze the potential of data mining and machine learning techniques for a fast aerodynamic features prediction based on previous and existing

Brief Review of the State of the Art
This section will review the state of the art in the technical fields involved in this research, namely machine learning and data-driven approaches for aerodynamic analysis.
In the last five years, there has been an increasing interest in the development of techniques to handle aerodynamic data coming from different sources, such as CFD simulations, wind tunnel experiments or even flight test data. The ability to handle this vast amount of heterogeneous data is a crucial factor in enabling machine learning methods to be applied in the aeronautic industry.
Table 1 reviews the most recent state-of-the-art studies in the scientific literature.

Table 1. Summary of the recent state-of-the-art studies in the scientific literature.

| Ref. | Year of Publication | Main Use of Machine Learning | Summary of the Main Advance Proposed |
|---|---|---|---|
| [1] | 2020 | To predict the distribution of a coarse grid CFD local error and correct the fluid-flow variables. | Authors propose a surrogate model trained to predict the distribution of a coarse grid CFD local error. The proposed surrogate model is built using ML regression algorithms. They tested artificial neural network (ANN) and random forest (RF), and the test case selected was a three-dimensional turbulent flow inside a lid-driven cavity. |
| [2] | 2020 | To perform wake modeling of wind turbines. | A combined framework with CFD and machine learning techniques is presented to improve the turbine wake predictions. In particular, an ANN model in combination with a reduced-order turbine model, ADM-R (actuator disk model with rotation), is proposed and demonstrated to be capable of handling large amounts of data with complex relations between the parameters involved. The selected test case was a standalone Vestas V80 2 MW wind turbine. |
| [3] | 2020 | To develop a machine learning framework for Reynolds-averaged Navier-Stokes (RANS) models. | Authors propose a new data-driven machine learning method to model RANS equations. The proposed CFD-driven machine learning approach was applied to model development for wake mixing in turbomachines. |
| [4] | 2019 | To develop surrogate models to augment the capability of the current turbulence models. | Authors propose a mapping approach between the turbulent eddy viscosity and the mean flow variables by ANNs. The study includes several test cases using well-known airfoils, such as the NACA0012. The data-driven turbulence model is applied to predict eddy viscosity, lift/drag coefficients, and skin friction distributions. |
| [5] | 2019 | To develop surrogate models to augment the capability of the current turbulence models. | A method based on neural networks is applied to reduce the error of RANS simulations by using the data-augmented turbulence model approach in an integrated way. In addition, a new layered approach for the NN is proposed to reduce the required training times. |
| [6] | 2019 | To develop surrogate models for fast prediction of aerodynamic coefficients of aeronautical configurations. | Authors propose to use support vector machines as surrogate models for quick prediction of aerodynamic coefficients. The research also included testing on different aeronautical configurations such as NACA0012, RAE2822 and DPW wing. |
| [7,8] | 2019 | To develop machine learning techniques to predict uncertainties. | In these papers, data-driven models are built for uncertainty quantification in aerodynamics. |
| [9] | 2019 | To develop machine learning methods for aerodynamic shape optimization. | Authors propose a new optimizer based on machine learning techniques, in particular reinforcement learning, transfer learning and deep neural networks. The proposed approach is tested for a typical aerodynamic shape optimization of missile control surfaces with computational fluid dynamics (CFD). |
| [10] | 2018 | To accelerate RANS simulations using a data-driven method. | Authors demonstrated that machine learning can be used to improve the RANS modeled Reynolds stresses by leveraging data from high-fidelity simulations. |
| [11] | 2018 | To predict aerodynamic data in transonic flows. | Authors propose a local decomposition method to improve the accuracy of aerodynamic fields in transonic conditions. Tests were performed on the AS28G aircraft configuration. |
| [12] | 2018 | To develop machine learning methods for prediction of airfoil lift coefficient. | A convolutional neural network is developed to learn the lift coefficients of several airfoils of different shapes and parameters such as Mach, Reynolds and AoA. |
| [13] | 2018 | To predict wind turbine wakes. | Authors propose to use a deep neural network with transfer learning ability for efficient prediction of wind turbine wakes and efficiency. |
| [14] | 2017 | To predict aerodynamic coefficients of transport airplanes. | An artificial neural network model is proposed to predict aerodynamic coefficients of transport airplanes. The proposed model is able to efficiently predict both lift and drag coefficients in wing-fuselage configurations. |
| [15] | 2017 | To develop machine learning methods for nonlinear unsteady aerodynamic reduced-order modeling. | Authors propose a multi-kernel neural networks approach to improve the accuracy and generalization capability by linearly combining the Gaussian and wavelet basis functions as the hidden basis functions. |
| [16] | 2017 | To develop machine learning methods for condition monitoring of wind turbines. | A surrogate model is proposed to monitor wind turbine conditions and to detect possible anomalies in turbine performance which have the potential to result in unexpected failure. |
| [17] | 2016 | To reduce the computational cost of the aerodynamic shape design process. | In order to reduce the computational cost of the aerodynamic shape design process of aeronautical configurations, authors propose to use evolutionary optimization methods in combination with support vector machines to speed up the design stage while preserving a certain level of accuracy. |
| [18] | 2016 | To predict the objective function within aerodynamic shape optimization. | Authors present a comparison of Kriging and Support Vector Machines for Regression (SVR) surrogate models applied to the objective function prediction within an aerodynamic shape optimization framework. |
| [19] | 2015 | To develop machine learning methods for unsteady aerodynamics modeling. | Authors propose to use support vector machines for unsteady aerodynamic modeling at high angles of attack. |
| [20] | 2015 | To develop machine learning methods to model aerodynamic loads. | Author proposes to integrate Centroidal Voronoi tessellation, leave-one-out cross validation, proper orthogonal decomposition, and multidimensional interpolation for the evaluation of steady aerodynamic loads. |
In summary, over the last 5 years, machine learning techniques have been used to predict aerodynamic features, to accelerate or improve the precision of turbulence models, to speed up the shape design optimization process and to quantify and manage uncertainties in the flow fields, amongst others.
All the papers found, however, focus more on the application, and do not provide an overall view of all the steps and requirements needed to properly handle and prepare the data for machine learning techniques. This paper aims to fill this gap and provides, through simple examples, an overall scheme of all the steps needed to obtain proper models by machine learning.

Methodology and Results
Figure 1 shows the main steps of a typical machine learning process.
Energies 2020, 13, x FOR PEER REVIEW 4 of 22

In this section, each of the steps in the figure above is explained and applied to the selected aerodynamic databases. All the research in this paper was performed with Python 3.8.2 and the Scikit-learn 0.22.1, pandas 1.0.1, and matplotlib 3.1.3 libraries, all included in the Anaconda distribution. The CFD computations for database generation were performed with the DLR TAU code (release 2019.1.0, Spalart-Allmaras turbulence model, convergence criteria based on minimum residuals).
In order to allow other researchers to perform experiments on existing aerodynamic databases and avoid the repetition of the CFD simulations required to build such databases, Appendix A provides all the required data for free use within the research community.

Data Polishing and Statistical Analysis
The first step was to quickly explore what the databases looked like. As mentioned previously, the complete databases for all the tests performed in this paper can be found in Appendix A.
An important requirement for machine learning databases is that they contain no empty values; if any exist, they should be substituted by the average, by zero, or by other values, depending on the specific case, so as not to affect the surrogate model performance. Therefore, the databases were checked for empty values in any of the variables and, as can be observed in Table 2, none were found.
Table 2. Initial information of the aerodynamic databases.
As can be observed in Table 2, all three databases are composed of 4 columns (corresponding to the Mach, AoA, lift coefficient and drag coefficient). The data size varies with the configuration: the NACA0012 database has 185 rows (i.e., CFD computations), the RAE2822 database has 122 samples, and the DPW database includes 100 samples.
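The check for empty values described above can be sketched with pandas as follows. This is only an illustrative sketch: the data frame is a small synthetic stand-in (the real databases are not reproduced here), and the column names and value ranges are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one aerodynamic database (placeholder ranges,
# NOT the values from the paper's CFD computations).
rng = np.random.default_rng(0)
db = pd.DataFrame({
    "Mach": rng.uniform(0.3, 0.8, size=10),
    "AoA": rng.uniform(0.0, 10.0, size=10),
    "Cl": rng.uniform(0.0, 1.2, size=10),
    "Cd": rng.uniform(0.005, 0.1, size=10),
})

n_rows, n_cols = db.shape               # number of samples and of variables
missing_per_column = db.isna().sum()    # count of empty cells in each column
has_missing = bool(missing_per_column.any())

print(f"rows={n_rows}, columns={n_cols}")
print(missing_per_column)

# If gaps existed, one of the options named in the text is to substitute
# them by the column average; with no gaps this leaves the data unchanged.
db_filled = db.fillna(db.mean())
```

Whether the mean, zero, or another value is the right substitute depends on the variable, as the text notes; the `fillna` call only shows where such a choice would be made.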
Some statistics of the aerodynamic data in the databases are given in Tables 3-5. In these tables, the rows labeled "count", "mean", "min", and "max" correspond to the number of samples, the mean value of the parameter, and its minimum and maximum values, respectively. The row labeled "std" shows the standard deviation of the values of each parameter. The rows labeled 25%, 50%, and 75% show the percentiles, i.e., the value below which the given percentage of the observations falls.
From these statistics, there is one important aspect to consider. The AoA values have a high standard deviation and this will have to be considered further when deciding how to scale the training data to not affect the model performance.
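The row labels listed above match those produced by the pandas `describe` method, so the statistics tables can presumably be obtained along the following lines. The data here are synthetic placeholders with assumed column names and ranges, used only to show how the spread of the columns can be compared.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder data (assumed ranges, not the paper's databases).
rng = np.random.default_rng(1)
db = pd.DataFrame({
    "Mach": rng.uniform(0.3, 0.8, size=50),
    "AoA": rng.uniform(0.0, 10.0, size=50),
    "Cl": rng.uniform(0.0, 1.2, size=50),
    "Cd": rng.uniform(0.005, 0.1, size=50),
})

# count, mean, std, min, 25%, 50%, 75%, max for every numeric column
stats = db.describe()
print(stats)

# Rank the columns by standard deviation to see which one dominates in scale
std_ranking = stats.loc["std"].sort_values(ascending=False)
print(std_ranking)
```

A column whose standard deviation is orders of magnitude larger than the others, as AoA is relative to Cd here, is the kind of imbalance that motivates the feature scaling step.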
It is also possible to plot the histograms of each of the considered parameters, to better understand the type of data to deal with. Figures 2-4 show the histograms of each variable in the database for the three cases considered. As can be deduced from these figures, all three databases were generated with a Latin Hypercube Sampling (LHS) method in the parameters AoA and Mach. This is why the histograms show bars of the same height, except where the CFD solver did not converge (the cases with the highest angles of attack, which did not achieve the minimum residual convergence criteria) and the samples were eliminated from the database.
In the case of Cd, the histogram shows a strong concentration of values near 0, while the Cl histogram shows the main concentration for values between 0.7 and 1.25, especially in the two airfoil databases.
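The reason an LHS design yields histogram bars of equal height can be illustrated with a minimal NumPy sketch. The function below is a generic textbook Latin hypercube, not the sampler used to build the databases, and the Mach and AoA bounds are placeholder assumptions.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, rng):
    """Return an (n_samples, n_dims) Latin hypercube design inside `bounds`."""
    bounds = np.asarray(bounds, dtype=float)
    n_dims = bounds.shape[0]
    # One random point inside each of n_samples equal-width strata, per dimension
    u = (np.arange(n_samples)[:, None] + rng.random((n_samples, n_dims))) / n_samples
    # Shuffle the strata independently in each dimension to decouple the columns
    for j in range(n_dims):
        u[:, j] = rng.permutation(u[:, j])
    # Map the unit hypercube onto the physical parameter ranges
    return bounds[:, 0] + u * (bounds[:, 1] - bounds[:, 0])

rng = np.random.default_rng(42)
bounds = [(0.3, 0.8), (0.0, 10.0)]   # assumed (Mach, AoA) ranges, for illustration
design = latin_hypercube(100, bounds, rng)

# With 100 samples and 10 bins, every bin covers exactly 10 strata
mach_counts, _ = np.histogram(design[:, 0], bins=10, range=bounds[0])
aoa_counts, _ = np.histogram(design[:, 1], bins=10, range=bounds[1])
print(mach_counts, aoa_counts)
```

Removing the samples for which the solver did not converge would lower the corresponding bars, which is consistent with the shorter bars at the highest angles of attack mentioned in the text.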

Splitting Training and Test Sets
The next step was to split the database into training and test sets. In this example, the split was done with a pure random sampling method, taking 80% of the initial samples for the training set and the remaining 20% for the test set. This step could be improved by using other strategies (such as cross-validation, as in the work performed in [21]), but since the purpose of this paper is to give a general overview of the machine learning process for aerodynamic analysis, this random sampling technique was adopted.
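A random 80/20 split of this kind can be sketched with Scikit-learn's `train_test_split`. The data frame is a synthetic placeholder with 185 rows (the size quoted for the NACA0012 database), and the `random_state` value is an arbitrary assumption added only for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic placeholder with the same number of rows as the NACA0012 database
rng = np.random.default_rng(0)
db = pd.DataFrame({
    "Mach": rng.uniform(0.3, 0.8, size=185),
    "AoA": rng.uniform(0.0, 10.0, size=185),
    "Cl": rng.uniform(0.0, 1.2, size=185),
    "Cd": rng.uniform(0.005, 0.1, size=185),
})

# Pure random sampling: 80% of the rows for training, 20% held out for testing
train_set, test_set = train_test_split(db, test_size=0.2, random_state=42)

print(len(train_set), len(test_set))
```

Fixing `random_state` keeps the same rows in the test set between runs, which makes later comparisons between models easier to interpret.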

Exploring the Training Set
Now, the kind of data that will be used to build the model is explored in more detail, as can be observed in Figure 5. In addition, since the datasets are of a manageable size, it is also possible to compute Pearson's correlation coefficient r for every pair of variables, as displayed in Tables 6-8. Since the main diagonal of these plots would be full of straight lines, a histogram of each attribute is displayed there instead (note that these histograms look different from the ones shown previously, since now only the training dataset is considered).
From these pictures, it can be observed that, for predicting the lift coefficient, the AoA has a very strong importance, while the Mach number is less important. However, for the prediction of the Cd, both parameters have almost the same importance, especially in the 2D cases. This aspect is less clear in the DPW test case, where the Mach number seems to be less important than the AoA for Cd prediction.
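The pairwise Pearson coefficients discussed in this subsection can be computed with the pandas `corr` method, as in the sketch below. The responses are a toy model invented for this example (lift roughly linear in AoA, drag growing with both inputs), not CFD results, so the numbers only illustrate how such a correlation table is read.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 150
mach = rng.uniform(0.3, 0.8, n)
aoa = rng.uniform(0.0, 10.0, n)

# Toy responses (assumptions for illustration only, NOT CFD data)
cl = 0.1 * aoa + 0.05 * mach + rng.normal(0.0, 0.02, n)
cd = 0.01 + 0.002 * aoa**2 + 0.05 * mach**2 + rng.normal(0.0, 0.002, n)

train_set = pd.DataFrame({"Mach": mach, "AoA": aoa, "Cl": cl, "Cd": cd})

# Pearson's r for every pair of variables (symmetric, ones on the diagonal)
corr = train_set.corr(method="pearson")
print(corr.round(3))

# Variables ranked by the strength of their linear relation with the lift coefficient
print(corr["Cl"].drop("Cl").abs().sort_values(ascending=False))
```

For the pairwise scatter plots with histograms on the diagonal, `pandas.plotting.scatter_matrix(train_set, diagonal="hist")` produces a layout of the kind described in the text. Pearson's r only measures linear dependence, so a low value does not rule out a nonlinear influence.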

Preparing the Data for Machine Learning Algorithms
The first step here would be to handle the missing values in the database but, as mentioned before, the databases have no missing values, since the cases without solver convergence were not incorporated in the database. Next, it is necessary to apply one of the most important transformations to the data, namely feature scaling. Machine learning algorithms do not behave well when the parameters have different ranges, and this is the case here, since the scales of the AoA and the Cd, for instance, are very different, as also mentioned previously. In this research, standardization [22] was applied to each column of the training database (zero mean and unit variance). The particular scaling method selected does not have a relevant impact on the results; what is crucial is that all features are scaled, in order to help the machine learning methods provide efficient predictions.
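Standardization to zero mean and unit variance can be sketched with Scikit-learn's `StandardScaler`. The arrays below are synthetic placeholders; fitting the scaler on the training data and reusing the same statistics for the test data is a common practice assumed here, not a detail stated in the text.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder features: columns are (Mach, AoA) with assumed ranges
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.uniform(0.3, 0.8, 80), rng.uniform(0.0, 10.0, 80)])
X_test = np.column_stack([rng.uniform(0.3, 0.8, 20), rng.uniform(0.0, 10.0, 20)])

scaler = StandardScaler()
# Learn the mean and standard deviation of each column from the training set only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test set
X_test_scaled = scaler.transform(X_test)

print("training means:", scaler.mean_)
print("training standard deviations:", scaler.scale_)
```

After the transformation both training columns have zero mean and unit variance, so neither Mach nor AoA dominates distance-based or kernel-based models purely because of its units.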

Model Construction
Until now, the problem has been established, the data have been obtained and examined, a training set and a test set have been sampled, and feature scaling has been performed to adapt the data for machine learning algorithms. In this subsection, a machine learning model is selected and trained.
Since this paper aims to provide a global overview of the machine learning process, a deep evaluation of different models and the tuning of their parameters are out of scope. Instead, only two regression models, linear regression and support vector regression (SVR), are selected and used with the default parameters defined in the Scikit-learn package.
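Training the two regressors with their Scikit-learn defaults might look like the sketch below. The training data are a toy model of the lift coefficient invented for this example, and scaling the target as well as the features is an assumption made here because the default SVR tube width (`epsilon=0.1`) is expressed in target units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Toy training data (assumption for illustration, NOT CFD results)
rng = np.random.default_rng(0)
n = 120
X_train = np.column_stack([rng.uniform(0.3, 0.8, n), rng.uniform(0.0, 10.0, n)])
y_train = 0.1 * X_train[:, 1] + 0.2 * X_train[:, 0] ** 2 + rng.normal(0.0, 0.01, n)

# Standardize the features and the target using training statistics
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
X_scaled = x_scaler.transform(X_train)
y_scaled = y_scaler.transform(y_train.reshape(-1, 1)).ravel()

# Both models with default parameters; SVR defaults to an RBF kernel
models = {"linear_regression": LinearRegression(), "svr": SVR()}
for name, model in models.items():
    model.fit(X_scaled, y_scaled)

# Predict the coefficient at a new (Mach, AoA) point and undo the target scaling
x_new = x_scaler.transform(np.array([[0.7, 2.5]]))
cl_pred = {
    name: float(y_scaler.inverse_transform(model.predict(x_new).reshape(-1, 1))[0, 0])
    for name, model in models.items()
}
print(cl_pred)
```

Both estimators predict a single output, so in this sketch one model would be trained per aerodynamic coefficient (for example, one for Cl and one for Cd).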

Model Validation
Once the model is trained, the final step is to use it for predicting new values and validate its performance.
First, the linear regression model was tested. Table 9 shows the typical regression metrics for model comparison. Figure 9 shows the comparison of true vs. predicted coefficient values with the linear regression method.
Then, the support vector regression model was tested. Table 10 shows the typical regression metrics for model comparison. Figure 10 shows the comparison of true vs. predicted coefficient values with the SVR method. As can be observed in these figures, the SVR model behaves better than the linear regression model, as was expected. The R² metrics show reasonable accuracy values for the constructed model. Of course, a more robust cross-validation strategy, as well as the optimization of the SVR hyperparameters, could have been performed, but the objective of this paper was only to give an overall view of the whole machine learning process for the analysis of aerodynamic databases and the prediction of aerodynamic coefficients.
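Regression metrics of the kind reported in such comparison tables can be computed with `sklearn.metrics`, as sketched below. Only R² is named in the text; including the mean absolute error and the root mean squared error is an assumption of this example, and the true and predicted values are made-up numbers rather than results from the paper.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return R2, MAE and RMSE for a set of true vs. predicted values."""
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        # RMSE is the square root of the mean squared error
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }

# Hypothetical held-out values and model predictions (illustrative numbers only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

report = regression_report(y_true, y_pred)
for metric, value in report.items():
    print(f"{metric}: {value:.4f}")
```

Applying the same function to the test-set predictions of each model gives a like-for-like comparison; plotting `y_pred` against `y_true` with the identity line gives the true vs. predicted view described in the text.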

Conclusions
This paper focuses on exploring the benefits that machine learning and data mining techniques can offer to aerodynamicists in order to extract knowledge from the CFD data and to make quick predictions of aerodynamic coefficients. The main objective of this paper has been to introduce all the steps in a typical machine learning process and apply these steps to aerodynamic databases. For this purpose, three aerodynamic databases (NACA0012 airfoil, RAE2822 airfoil and 3D DPW wing) have been used and results have demonstrated the feasibility and potential benefits of applying machine learning and data-driven techniques for aerodynamic analysis of aeronautical configurations.
As future work, there is still further potential to be exploited: a smarter generation of the samples in the initial dataset (other than LHS), the use of more robust model validation strategies, such as cross-validation, the combination of multi-fidelity data within the aerodynamic database (e.g., CFD, wind tunnel, and flight testing data), and the comparison of different regression models together with the tuning of their parameters. In addition, the use of these models for uncertainty quantification is another topic for future research.
Finally, it is important to mention that all databases used in this paper are freely available for the scientific community.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. Databases
In order to allow other researchers to perform experiments on existing aerodynamic databases and avoid the repetition of the CFD simulations required to build such databases, this appendix provides all the required data for free use within the research community (databases will be provided on request by email).