Exploration of Axial Fan Design Space with Data-Driven Approach

Since the 1960s, turbomachinery design has mainly been based on similarity theory and empirical correlations derived from experimental data and manufacturing experience. Over the years, this knowledge was consolidated and summarized by parameters such as specific speed and specific diameter, which represent the flow features on the meridional plane but hide the direct correlations between the actual design parameters (e.g., blade number or hub-to-tip ratio). Today a series of statistical tools developed for big data analysis sheds new light on correlations among turbomachinery design and performance parameters. In the following article we explore a dataset of about 4000 axial fans by means of principal component analysis and projection to latent structures. The aim is to find correlations between design and performance features and to comment on the capability of this approach to give new insights into the design space of axial fans.


Introduction
Since the 1960s, turbomachinery design has relied on similarity theory and empirical correlations obtained by regression of experimental data [1]. Consolidated design experience is condensed into normalized parameters, namely specific speed (Ns) and specific diameter (Ds), following established design rules [2][3][4][5]. In this way, it is possible to select a fan type (axial, mixed, or radial) for a specific duty point, together with the best expected efficiency, using Balje charts [4] and similar performance maps [1]. However, Ns and Ds, which are intended to represent the meridional flow geometry, hide the contribution that the individual design parameters make to the final performance of the fan. By definition, Ns and Ds depend only on rotational speed, maximum diameter, flow rate, and specific work at best-efficiency operation. In reality, a larger set of parameters contributes to performance, such as blade aspect ratio, chord and twist distributions along the blade span, hub-to-tip ratio, solidity, blade number, and tip gap [6], all related to the three-dimensional design criterion or even to the manufacturing process of the fan. All these parameters must be selected during the design process, and this is often done with further charts, which were also derived from consolidated manufacturer experience [7]. In these charts, correlations between some of the design parameters are summarized into coefficients and correction factors that enrich the classic design space given by Euler work analysis. Starting from the work of Balje, Lieblein, and Howell [4][5][6], many scholars derived correlations and corrections that account for different design parameters to improve the performance of turbomachinery [8]. Still today, most of these works are limited to linear regression approaches, and often to specific classes of fans [9].
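To make the dependence of Ns and Ds on only four quantities concrete, the following minimal sketch evaluates them for one duty point. It assumes the commonly used dimensionless forms Ns = ω Q^(1/2) / Y^(3/4) and Ds = D Y^(1/4) / Q^(1/2), with the specific work approximated as Y = ΔP/ρ for incompressible flow. The text does not state which form of the definitions it adopts, and the duty-point numbers and function names are hypothetical.

```python
import math

def specific_speed(omega, flow_rate, specific_work):
    """Dimensionless specific speed, Ns = omega * Q**0.5 / Y**0.75."""
    return omega * math.sqrt(flow_rate) / specific_work ** 0.75

def specific_diameter(diameter, flow_rate, specific_work):
    """Dimensionless specific diameter, Ds = D * Y**0.25 / Q**0.5."""
    return diameter * specific_work ** 0.25 / math.sqrt(flow_rate)

# Hypothetical duty point (illustrative numbers, not taken from the dataset)
rho = 1.2        # air density, kg/m^3 (assumed)
delta_p = 300.0  # pressure rise at best efficiency, Pa
q = 4.0          # volume flow rate, m^3/s
omega = 150.0    # rotational speed, rad/s
d_tip = 0.8      # maximum (tip) diameter, m

y = delta_p / rho  # specific work, J/kg (incompressible approximation)
ns = specific_speed(omega, q, y)
ds = specific_diameter(d_tip, q, y)
print(f"Ns = {ns:.2f}, Ds = {ds:.2f}")
```

Blade number, solidity, hub-to-tip ratio, and the chord and twist distributions do not appear anywhere in these expressions, which is the limitation discussed above.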
In recent years, social networks have overhauled not just social dynamics and media, but also the approach to big data analysis. The formidable amount of data exchanged by users on large platforms needs to be classified and correlated to be monetized [10]. This led to the application and revamping of old statistical approaches, but also to the development of new methods for big data analysis [11,12]. One of the key properties of big data analysis is that correlations and relationships inside the dataset can be unveiled independently of the nature of the data [13]. Therefore, it is possible to use the same algorithm to classify photos on Instagram [14], customers of a bank [15], or galaxies inside all-sky surveys [16,17]. This opened new research perspectives in finance, astrophysics, molecular chemistry, turbulence modelling, and other fields where large datasets are available [18]. Industrial product research and development is another potential test bed for the application of data-driven analysis.
The following article presents preliminary work on the exploration of correlations between axial flow fan design parameters and performance, carried out on a database of about 4000 individuals. The idea is that this procedure can be applied to a dataset that is heterogeneous, incomplete, and populated with a significant number of samples, for example the database of a fan manufacturer. Once the population has enough samples, the source of the data is not important. In this work, we refer to a specific class of turbomachinery: axial flow fans with a rotor-only arrangement.
The aim of this work is to explore the possibilities and limits of big data analytics in the design and optimization of industrial turbomachinery, through a combination of two multivariate statistical approaches: principal component analysis (PCA) and projection to latent structures (PLS).
In the next sections, the methods used for the analysis are illustrated. Then the process of dataset creation is described and correlated to the considerations typical of the axial flow fan design and manufacturing process. Results of the analysis of said dataset are then discussed and finally conclusions are drawn.

Data Mining
Two different data mining approaches are used: principal component analysis and projection to latent structures. The present work advocates the combination of these approaches to characterize the features of the dataset, identifying possible clusters among the individuals (PCA) and unveiling design rules among an augmented variable set (PLS).

Principal Component Analysis
Principal component analysis is a multivariate statistical method that reduces the dimensionality of the feature space while retaining most of the variance in the dataset [19]. An orthogonal transformation converts a set of N samples, each described by K possibly correlated features, into a new set of linearly uncorrelated variables, the p principal components, which are linear combinations of the original variables [20]. The first principal component explains the largest possible variance, representing the direction along which the samples show the largest variation. The second component is computed under the constraint of being orthogonal to the first and of having the second largest possible variance [21]. The following components, constructed with the same criterion, account for as much of the remaining variability as possible.
This means that the original data matrix X is decomposed into two matrices, V and U, that are mutually orthogonal. The V matrix is called the loadings matrix, while U is the scores matrix. Loadings are the weights of each original feature in the calculation of a principal component, while U contains the original data in a rotated coordinate system (Figure 1).
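As an illustration of the decomposition into loadings and scores, the sketch below extracts the first principal component of a small two-feature dataset. It is not the tool used in this work: it assumes a plain power iteration on the sample covariance matrix, uses only the Python standard library, and the data values are invented for the example.

```python
import math

def center(X):
    """Subtract the column mean from every feature."""
    n, k = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(k)]
    return [[row[j] - means[j] for j in range(k)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already centered data."""
    n, k = len(Xc), len(Xc[0])
    return [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(k)]
            for a in range(k)]

def first_loading(C, iters=500):
    """Power iteration: dominant eigenvector of the covariance matrix."""
    k = len(C)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Invented samples: the second feature is roughly twice the first
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
Xc = center(X)
C = covariance(Xc)

v1 = first_loading(C)  # first column of the loadings matrix V
u1 = [sum(x * l for x, l in zip(row, v1)) for row in Xc]  # first column of U

# Fraction of the total variance carried by the first component
total_var = sum(C[j][j] for j in range(len(C)))
explained = sum(s * s for s in u1) / (len(X) - 1) / total_var
print(v1, explained)
```

The loading vector shows how much each original feature weighs in the component, and the scores are the coordinates of each sample along it. Further components could be obtained by deflating the covariance matrix and repeating the iteration.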

Projection to Latent Structures
Projection to latent structures is a statistical method that acts on the data similarly to PCA, with the main difference that the features of the original dataset are grouped into inputs and outputs, and PLS aims to find relationships between these two sub-sets. Specifically, PLS finds the multidimensional direction in the input variable space X that explains the maximum multidimensional variance in the output variable space Y [22]. In its general form, PLS creates orthogonal score vectors (also called latent vectors or components) by maximizing the covariance between the two sets of variables [9,23]. The influence of each input variable is quantified by computing the loading vector for each considered component. As in the interpretation of PCA loadings, highly correlated variables have similar weights in the loading vector. Unlike PCA, PLS shows the influence exerted by the input variables on the selected outputs. Note that, before performing the PLS analysis, all the variables were normalized to avoid issues related to different variable units.
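The following sketch shows one way to compute the first latent variable of a PLS model. It assumes the classical NIPALS iteration and z-score normalization of every variable. The text does not specify which PLS algorithm or normalization was used, so both are assumptions, and the data are synthetic. The names w, c, t, and u follow the usual convention of input loadings, output loadings, input scores, and output scores.

```python
import math

def autoscale(M):
    """Mean-center every column and scale it to unit standard deviation."""
    n, k = len(M), len(M[0])
    cols = []
    for j in range(k):
        col = [row[j] for row in M]
        mean = sum(col) / n
        std = math.sqrt(sum((c - mean) ** 2 for c in col) / (n - 1))
        cols.append([(c - mean) / std for c in col])
    return [[cols[j][i] for j in range(k)] for i in range(n)]

def matvec(M, v):
    """Return M @ v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def tmatvec(M, v):
    """Return M.T @ v."""
    return [sum(M[i][j] * v[i] for i in range(len(M)))
            for j in range(len(M[0]))]

def first_latent_variable(X, Y, iters=200):
    """NIPALS iteration for the first PLS component (fixed iteration count)."""
    u = [row[0] for row in Y]  # start from the first output column
    for _ in range(iters):
        w = tmatvec(X, u)
        w_norm = math.sqrt(sum(x * x for x in w))
        w = [x / w_norm for x in w]          # input loadings, unit length
        t = matvec(X, w)                     # input scores
        tt = sum(x * x for x in t)
        c = [x / tt for x in tmatvec(Y, t)]  # output loadings
        cc = sum(x * x for x in c)
        u = [x / cc for x in matvec(Y, c)]   # output scores
    return w, c, t, u

# Synthetic data: three inputs, two outputs that depend linearly on them
X_raw = [[1, 5, 2], [2, 3, 1], [3, 6, 4], [4, 2, 3], [5, 4, 6], [6, 1, 5]]
Y_raw = [[2 * a - b, a + 0.5 * c] for a, b, c in X_raw]

X, Y = autoscale(X_raw), autoscale(Y_raw)
w, c, t, u = first_latent_variable(X, Y)
print("input loadings:", w)
print("output loadings:", c)
```

Inputs whose entries in w share the sign of an output entry in c vary together with that output along this latent variable, while opposite signs indicate an inverse relationship. Further latent variables would require deflating X and Y before repeating the iteration.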

Dataset
The complete set of features considered for each individual is summarized in Table 1, which also shows how PCA considers all the features together, while PLS divides them into input and output features.
The dataset for PCA and PLS was populated with three families of industrial fans generated from three parent individuals, labelled Fan A, B, and C (Table 2). The families are labelled in the same way. The parent individuals were selected according to different segments of the fan market, considering the same original size and variations in hub-to-tip ratio, blade number, rotational speed, and design duty point.
The geometry of each individual is completely defined, so it is possible to derive the chord and twist distributions of the blade, the pitch angle at the hub, and the 2D profile of the blade at different radii. Chord and twist distributions were characterized using the coefficients of a second-order interpolating polynomial. The factors that enter the analysis are C1 and C2, the linear and quadratic terms of the chord distribution, and T1 and T2, the corresponding terms of the twist distribution. To these we add C0, the chord at the hub, while T0 is neglected because the twist at the hub is equal to 0. Admittedly, a direct correlation with the work distribution along the blade span could be more accurate, yet data on the design of these fans are generally not available, especially since fans are usually operated in off-design conditions.
Starting from the three parent individuals, a population of three families with more than 1300 individuals per family was generated. The approach followed a process of scaling and cutting of the blades, so that one design can be exploited to cover a wide operating envelope. In practice, the same design can be adjusted to a different fan size by scaling in similitude. Additionally, to extend the operating range and save on manufacturing costs, the same blade can be cut at the tip to fit a higher hub-to-tip ratio. The possible variations of all the input parameters are summarized in Table 3. The performance of each fan was calculated using AxLab, an in-house axisymmetric code [8].
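The polynomial characterization of chord and twist can be sketched as follows. The example assumes a normalized span coordinate s, equal to 0 at the hub and 1 at the tip, and an interpolation through three stations (hub, midspan, tip). Neither choice is stated in the text, and the blade values are hypothetical.

```python
def quadratic_coefficients(f_hub, f_mid, f_tip):
    """Coefficients (F0, F1, F2) of f(s) = F0 + F1*s + F2*s**2 that
    interpolates the values at s = 0 (hub), 0.5 (midspan), and 1 (tip)."""
    f0 = f_hub
    f1 = -3.0 * f_hub + 4.0 * f_mid - f_tip
    f2 = 2.0 * f_hub - 4.0 * f_mid + 2.0 * f_tip
    return f0, f1, f2

def evaluate(coeffs, s):
    """Evaluate the quadratic distribution at span coordinate s."""
    f0, f1, f2 = coeffs
    return f0 + f1 * s + f2 * s * s

# Hypothetical blade: chord in metres, twist in degrees relative to the hub
chord = quadratic_coefficients(0.12, 0.10, 0.09)  # (C0, C1, C2)
twist = quadratic_coefficients(0.0, 10.0, 16.0)   # (T0, T1, T2), with T0 = 0

C0, C1, C2 = chord
T0, T1, T2 = twist
print(C0, C1, C2, T1, T2)
```

With this parameterization each blade contributes five numbers (C0, C1, C2, T1, T2) to the feature set, whatever the number of radial stations at which the geometry is originally defined.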
All the information was stored in a MySQL database and then processed through an in-house Python tool. Figures 3 and 4 show the overall population of individuals on the Q−ΔP plane and on the D_tip−χ plane, respectively, which will be used for the discussion of the results. The Q−ΔP plane shows the fan performance, and the D_tip−χ plane is related to the size of the device.

Results
Like all statistical techniques used for data analysis, PCA and PLS can work on very large datasets and provide insights into the correlations between parameters over the whole design space. Of course, this means that when the analysis is applied to all the data, the results are likely to distill general rules that are valid on the whole design space. Since a possible application of this approach is to drive an optimization algorithm, it makes more sense to focus on correlations that apply to specific design sub-spaces. These can be identified in different ways; here we decided to focus on the design point of the fan, in terms of flow rate and pressure rise, and on the size of the fan, identified by the tip diameter and the hub-to-tip ratio. We then went back to the respective charts, shown in Figures 3 and 4, and divided the design space into sub-spaces using partially overlapping grids of different sizes, to derive results linked to each specific subset. For example, here we report on analyses carried out for (i) fans operating at low flow rate and low pressure rise, (ii) fans operating at high flow rate and high pressure rise, and (iii) fans with 0.3 m < D_tip < 0.7 m. On each sub-dataset we applied PCA to identify clusters of individuals with similar features. Then we carried out a PLS analysis on each sub-dataset to find correlations between input and output features and derive design rules. A final summary of the rules derived from all the sub-datasets is given in the conclusions.
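The partition of a chart into partially overlapping sub-spaces can be sketched as below. The window layout (equal widths, a fixed overlap fraction), the record keys "Q" and "dP", and the sample values are all assumptions made for illustration; the actual grid sizes used in the analysis are not given in the text.

```python
def overlapping_windows(lo, hi, n_windows, overlap):
    """Split [lo, hi] into n_windows intervals of equal width, each
    overlapping its neighbour by the given fraction of the width."""
    width = (hi - lo) / (n_windows - overlap * (n_windows - 1))
    step = width * (1.0 - overlap)
    return [(lo + i * step, lo + i * step + width) for i in range(n_windows)]

def select(fans, q_range, dp_range):
    """Return the fans whose duty point lies inside the given rectangle."""
    return [f for f in fans
            if q_range[0] <= f["Q"] <= q_range[1]
            and dp_range[0] <= f["dP"] <= dp_range[1]]

# Hypothetical individuals: flow rate in m^3/s, pressure rise in Pa
fans = [
    {"id": 1, "Q": 1.0, "dP": 50.0},
    {"id": 2, "Q": 3.5, "dP": 120.0},
    {"id": 3, "Q": 8.0, "dP": 400.0},
]

q_windows = overlapping_windows(0.0, 10.0, n_windows=3, overlap=0.25)
dp_windows = overlapping_windows(0.0, 500.0, n_windows=2, overlap=0.2)

# One sub-dataset per grid cell, keyed by (flow-rate index, pressure index)
sub_datasets = {
    (i, j): select(fans, qw, pw)
    for i, qw in enumerate(q_windows)
    for j, pw in enumerate(dp_windows)
}
```

Because the windows overlap, an individual close to a boundary belongs to more than one sub-dataset. PCA and PLS would then be run separately on each entry of sub_datasets that holds enough samples.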

Q−∆P Analysis
In Figure 5, the PCA score plots corresponding to the sub-sets of low flow rate and low pressure rise, and of high flow rate and high pressure rise, are shown. The first plot is characterized by three clusters and, as expected, these can be directly linked to the three families of fans in the dataset. This is indirect proof of the capability of this approach to find correlations within sparse data. The second score plot in the same figure, with individuals working at high flow rates, shows two clusters of data, one with individuals from the A family, the other populated with fans belonging to the A and C families. In this case of extreme performance, only a low number of individuals is present and therefore there is no direct link to the original families.
From the PLS analysis of individuals working at lower flow rates (Table 4), it follows that the first four latent variables have a strong correlation between input and output scores. This means that it is possible to find correlations between the loading coefficients of the original variables in all four latent variables (Figure 6). From the coefficient plot of LV_1 we can see that there is a direct correlation between midspan solidity, chord at the hub (C0), and peak efficiency. As for chord and twist distributions, an increase of the quadratic terms C2 and T2 has a direct correlation with the same peak efficiency, while there is an inverse proportionality with the linear terms C1 and T1. All the other loadings are lower than the threshold value (dashed red line) and therefore must be interpreted as not significant. Here, the threshold values are calculated from the correlation coefficient of the observed latent component. In that calculation, LV_i is the i-th latent component, w_i and c_i are the loadings of input and output features, t_i and u_i are the scores of input and output features, and CM is the correlation matrix between the scores of input and output features.
If we look at the coefficients for LV_2, we can see that increasing D_tip leads to an increase of ΔP_pe, η_pe, and Q_zs: the whole stable range of operation shifts to higher flow rates and pressures. Loadings in LV_3 show that an increase in χ and ω leads to increases in ΔP_pp and ΔP_pe, and also in η_pe.
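The two checks used in reading a latent variable, namely the correlation between input and output scores and the comparison of loadings against a threshold, can be sketched as follows. The scores, loadings, feature names, and threshold value are all hypothetical. In particular, the threshold is passed in as a plain number and is not computed with the expression used in this work.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def significant(loadings, threshold):
    """Keep only the features whose loading magnitude exceeds the threshold."""
    return {name: value for name, value in loadings.items()
            if abs(value) > threshold}

# Hypothetical input scores t and output scores u of one latent variable
t = [-1.2, -0.4, 0.1, 0.6, 0.9]
u = [-1.0, -0.5, 0.2, 0.4, 0.9]
r = pearson(t, u)

# Hypothetical loadings of the same latent variable (inputs and one output)
loadings = {"solidity": 0.45, "C0": 0.38, "C1": -0.41,
            "blade_number": 0.05, "eta_pe": 0.52}
kept = significant(loadings, threshold=0.2)  # threshold value is assumed
print(round(r, 3), sorted(kept))
```

Among the retained features, loadings with the same sign as the output loading indicate a direct relationship and opposite signs an inverse one, which is how the loading plots are read in this section.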
In summary, the loadings of LV_1 show that an increase in efficiency at peak efficiency can be achieved by changing the chord and twist distributions of the blade, increasing C0, C2, and T2 while decreasing C1 and T1. Keeping C0 fixed for simplicity, a fan increases in efficiency as the rotor twist and chord distributions are changed according to Figure 7.
For the sub-dataset of fans working at high flow rate and high pressure rise, the computed scores (Table 5) indicate that all four latent variables appear to be influential. The loading plots in Figure 8, however, suggest that the only relevant relationships between inputs X and outputs Y are explained by the third and fourth latent variables, as the loadings of the outputs in the first two are below the threshold. In LV_3, we can see that an increase in efficiency at peak pressure and at peak efficiency is achieved by increasing the rotational speed of the fan, or by decreasing the number of blades or the midspan solidity. The fact that in this case no clear indication is given about chord and twist distributions is probably related to the lower number of samples in this sub-dataset.

D_tip−χ Analysis
The same data processing can be applied after selecting the fans according to their size; here we focus on fans with D_tip between 0.3 m and 0.7 m. The PCA score plot, shown in Figure 9, highlights the presence of three clusters. One of them is clearly composed of fans belonging to the C family, while the other two include fans that belong to both the A and B families. This kind of clustering seems to be related to the original rotational speed of the fans, which for family C is half that of families A and B. The low number of individuals in this dataset is reflected in the low level of correlation found in the PLS analysis (Table 6). In this case the loading plots (Figure 10) show that for LV_1 there is an inverse correlation between peak efficiency and rotational speed, a direct proportionality between peak efficiency and midspan solidity, and, in general, a strong correlation between peak efficiency and the chord and twist distributions. The second latent variable loadings show a direct proportionality of rotational speed with efficiency at peak pressure and at peak efficiency. The third and fourth latent variable loadings are not significant, as they are below the threshold values.

Conclusions
Big data methods (PCA and PLS) were applied to a dataset of about 4000 individuals belonging to three fan families. The analysis was carried out on a series of sub-datasets corresponding to different ranges of fan performance and different fan sizes, with the aim of discovering hidden correlations among design parameters and fan performance. Correlations already known from the literature were recovered, for example the increase of pressure rise with blade number, which supports the validity of the method. Other findings emerged from a deeper analysis of the PLS loadings.
As it was not possible to show all the relationships for all the sub-datasets analyzed, we summarize our findings in Figures 11-14. Figure 11 summarizes the relationship between midspan solidity and efficiency at peak pressure and at peak efficiency in different regions of the Q−ΔP plane. In particular, the region in green, corresponding to low flow rate and low pressure rise, is characterized by a direct proportionality between σ and both η_pp and η_pe. For individuals in the orange region the proportionality is limited to η_pp. Finally, in the blue region an inverse proportionality between σ and η_pe is found. In the same way, Figure 12 shows the relationships between blade number and different fan performance parameters for individuals belonging to different regions, while other relationships between design parameters and fan performance are presented in Figure 13. The same analysis, carried out on the size chart D_tip−χ, led to the results summarized in Figure 14.
Finally, the possible biases in the dataset must be highlighted: all the individuals originated from a process of scaling and cutting applied to three parent individuals. The fact that PCA highlights the presence of three clusters strongly related to the parent individuals is indirect proof that the method works, but it also means that some relationships could be a product of this generating mechanism. Furthermore, some of the sub-datasets used have a low number of individuals, and it is possible that the low correlations that emerged from PLS result from an insufficient number of samples.