Development and Optimization of VGF-GaAs Crystal Growth Process Using Data Mining and Machine Learning Techniques

: The aim of this study was to assess the ability of the various data mining and supervised machine learning techniques: correlation analysis, k-means clustering, principal component analysis and decision trees (regression and classiﬁcation), to derive, optimize and understand the factors inﬂuencing VGF-GaAs growth. Training data were generated by Computational Fluid Dynamics (CFD) simulations and consisted of 130 datasets with 6 inputs (growth rate and power of 5 heaters) and 5 outputs (interface position and deﬂection, and temperatures at various positions in GaAs). Data mining results conﬁrmed a good dispersion of the training data without the feasibility of a dimensionality reduction. Data clustering was observed in relation to the position of the crystallization front relative to the side heaters. Based on the statistical performance criteria and training results, decision trees identiﬁed the most decisive inputs and their ranges for a favorable interface shape and to keep GaAs temperature beyond limits for heavy arsenic evaporation. Decision trees are a recommendable machine learning technique with short training times and acceptable predictive accuracy based on small volume of CFD training data, capable of providing guidelines for understanding the crystal growth process, which is a prerequisite for the growth of low-cost, high-quality bulk crystals.


Introduction
The development and optimization of bulk crystal growth processes is a demanding task due to the multidisciplinarity of the phenomena associated with a phase change, numerous process parameters, challenging scale up and, in particular, the dynamic nature of the process with a considerable time delay [1].
Conventional experimental and CFD approaches to derive crystal growth process recipes are laborious, costly and time consuming. The common optimization approach described in [2] for process development based on the strategy of "inverse modeling" is limited to a small number of specific independent parameters (power of heaters) and a single optimization parameter interface deflection.
The recent tremendous success of artificial neural networks (ANN) [3] in detecting the complex patterns and relationships in non-linear static and dynamic data sets in related fields (e.g., [4]) has triggered feasibility studies on the application of ANN for the prediction of transport phenomena in crystal growth furnaces of semiconductors and optimization of growth parameters, inter alia [5][6][7][8][9][10][11][12][13][14][15][16][17][18]. In this case, the number and specification of the independent and optimization parameters are constrained only by the availability of the training data and not by the method. The process parameters could be numerous and various. Beyond ANN, there are also other various machine learning algorithms available. The guiding principles for the choice of the employed algorithm depend primarily on the quality of input data (e.g., data volume, noise, number of outliers, missing data etc.), type results, the temperatures in the key monitoring points (MPs) in the melt and crystal (points marked 1 to 5 in a Figure 1b) and their corresponding coordinates were extracted for the ML training database. The interface deflection z was measured at the melt symmetry axis with respect to the three-phase junction (melt/crystal/crucible) and varied between detrimental concave (z > 0) and favorable slightly convex (z < 0) [22]. The origin of the coordinate system corresponded to the MP5. The GaAs crystals were grown in a positive z direction.  Due to the rotational symmetry, the furnace for the growth of 4 inches GaAs crystals was described by a 2D axisymmetric model. Governing equations for CFD modeling included equations of continuity, Navier Stokes with the Boussinesq approximation and energy equation (Equations (1)-(3)).

Data Mining
Weak buoyancy-driven melt convection [22] was described by the laminar flow model. All CFD simulations were performed using quasi steady-state (QSS) assumption. The QSS is justified if the characteristic time of the growth process is much larger than the characteristic time for heat transport, as is the case in VGF growth with weak buoyancy convection.
In the QSS approach, the latent heat released during crystal growth is included as an additional heat source at the crystallization front, and the growth rate is considered as a fixed input parameter. Along the solid/liquid interface, Stefan and isotherm conditions have to be satisfied (Equations (4) and (5)): The GaAs material properties used in this study are given in [22]. The crystals were grown in Ar atmosphere under the pressure of 4 bar. The CFD simulations were performed using commercial code Crysmas.
For the generation of datasets for ML simulations, 130 combinations of heating power in 5 heaters in the range from 0-4 kW and growth rates in the range 0.1-5.4 mm/h were used as input parameters for the CFD simulations. From the obtained axisymmetric CFD results, the temperatures in the key monitoring points (MPs) in the melt and crystal (points marked 1 to 5 in a Figure 1b) and their corresponding coordinates were extracted for the ML training database. The interface deflection ∆z was measured at the melt symmetry axis with respect to the three-phase junction (melt/crystal/crucible) and varied between detrimental concave (∆z > 0) and favorable slightly convex (∆z < 0) [22]. The origin of the Crystals 2021, 11, 1218 4 of 20 coordinate system corresponded to the MP5. The GaAs crystals were grown in a positive z direction.

Data Mining
The DM techniques [25]: correlation analysis, k-means clustering and principal component analysis were used in this study before ML with the aim of assessing the quality of the training data and examining their relationships.
Correlation analysis is used to study the degree to which the variables are associated with each other, measured by the value of some kind of correlation coefficient, e.g., Pearson's coefficient of correlation r given in Equation (6).
Two variables can be positively correlated (r > 0), when they are changing in the same direction (either increasing or decreasing simultaneously), negatively correlated (r < 0), when they are changing in the opposite direction, or can have a zero correlation (r = 0), when there is no relationship between the variables. In this study, a correlation plot with correlation coefficients for all input and output pairs of variables was prepared in order to examine their relationships.
Principal component analysis (PCA) is a data mining technique typically used to structure, simplify and illustrate large data sets by approximating a large number of statistical variables with a smaller number of linear combinations (the main components) that retain as large part of the overall variance as possible. A resulting PCA biplot can be used to visualize data dispersion, variables correlation and feasibility of dimensionality reduction by means of the PCA from the point of view of the retained variance. If the projections of two original variables into the 2-dim subspace of the first and second eigenvector form a small angle and at the same time have a very high proportion of the variance, the two variables are positively correlated. If the eigenvectors are orthogonal (close to 90 • ), they are not likely to be correlated. When they form a large angle (close to 180 • ), they are negatively correlated. If vectors have lengths completely different or the vectors are concentrated in the same quadrant of biplot, the distribution of data points has to be rethought.
K-means clustering is another data mining technique that aims to find groups in the data (clusters) based on feature similarity with the number of groups represented by the variable K. The goal is to divide vector (x 1 , x 2 , . . . , x n ), into k (≤n) sets S = {S 1 , S 2 , . . . , S k } so as to maximize the sum of squared deviations between points and the cluster center in the same cluster (Equation (7)):

Machine Learning
In this study, we selected the following DM/ML techniques: regression trees (RT) and classification trees (CT) [26] for the accurate prediction of VGF-GaAs growth recipes and easy understanding of the role of various process parameters. The RT is a type of DTs where the target variables are real numbers. The CT is other kind of DT where the target variables are labels of classes. Consequently, the RT is a type of supervised learning algorithm that can be used in regression problems, while the CT can be used in classification problems.
For both kinds of DT techniques, data comes in records of the form: (x, y) = (x1, x2, . . . , xk,y). The dependent variable y is the target variable that one is trying to model and predict. The vector x is composed of the inputs, x1, x2, . . . , xk.
All DT algorithms have a tree structure (see Figure 2). Each node in the tree represents a variable from the input datasets, each branch a decision and each leaf at the end of a branch the corresponding output value. The decision process starts from the top root node (the entire dataset) downwards through the tree until a bottom leaf is reached, which contains the result (more homogeneous subsets of the initial dataset). At each node, the process of splitting and the path to be followed depends on the values y i of the output variable for the inputs x i in that node. More precisely, building a DT consists of recursive binary splitting of the training data, stopping only when each terminal node has fewer than some minimum number of datasets. In this study, that minimum leaf size was set to 4 and the maximum tree depth was set to 10. For both kinds of DT techniques, data comes in records of the form: (x y). The dependent variable y is the target variable that one is trying to mod The vector x is composed of the inputs, x1,x2…xk.
All DT algorithms have a tree structure (see Figure 2). Each no represents a variable from the input datasets, each branch a decision and end of a branch the corresponding output value. The decision process star root node (the entire dataset) downwards through the tree until a bottom which contains the result (more homogeneous subsets of the initial dataset the process of splitting and the path to be followed depends on the values variable for the inputs xi in that node. More precisely, building a DT consi binary splitting of the training data, stopping only when each terminal n than some minimum number of datasets. In this study, that minimum leaf 4 and the maximum tree depth was set to 10. For a regression tree, the splits are made in such a way that the sum of (SSE) with respect to the average values of the yi in the above subsets is min 8).
( * , * ) = min{ ( , ); ( , ) is some split of } First, this method is applied to the entire set of available input da resulting sets (S1 * ,S2 * ), etc. The splitting continues as long as necessary, form For a regression tree, the splits are made in such a way that the sum of squared errors (SSE) with respect to the average values of the y i in the above subsets is minimal (Equation (8)).
First, this method is applied to the entire set of available input data, then to the resulting sets (S 1 * ,S 2 * ), etc. The splitting continues as long as necessary, forming a hierarchy of regions (tree) in the input space. The best tree is derived using cross-validation.
In RT analysis of process parameters affecting the crystal growth, the following six inputs were studied: x1 crystal growth rate, x2 and x3 heating power in the inner-top and outer-top heater, x4 and x5 heating power in the upper-side and lower-side heater and x6 heating power in the bottom heater.
The input variables were correlated with the two key outputs: y2-interface deflection and y3-temperature at GaAs free surface at the crucible rim (corresponding to MP3 and MP1 in Figure 1b, respectively).
In the CT analysis, the input variables were correlated with the output y2 to assess conditions for the growth of convex s/l interface, i.e., interface deflection y2 > 0. The same analysis can be performed for all other output variables.
The training data were identical to those used in DM analysis and consisted of the 130 (x,y) pairs. For assessing the effect of various inputs on specific output, regression and classification tree analysis and the comparison of mean output values were applied.
For both kinds of DT simulations, the commercial software Matlab ® was used.

CFD Modeling
From the literature (e.g., [22] and beyond), it is known that typical temperature gradients in the GaAs melt and crystal in the industrial VGF processes are in the range of 2-10 K/cm and up to 15 K/cm, respectively. The crystal growth rate is typically in the range of 2-4 mm/h. The maximal temperature in the GaAs should not exceed 15 K above the melting temperature of GaAs (ca. T = 1528 K) in order to avoid a great loss of arsenic [2]. Guided by these facts, 130 process recipes were simulated and values of interface deflections, interface position and temperatures at the monitoring points at the GaAs seed bottom, end of cone and at the melt-free surface ( Figure 1b) were collected. All data in the form of 11-dimensional datasets are shown in parallel coordinates in Figure 3. Each line in the plot corresponds to one data set (x 1 . . . x 6 , y 1 . . . y 5 ). The generated database was used for DM/ML training and analysis.

CFD Modeling
From the literature (e.g., [22] and beyond), it is known that typical temperature gradients in the GaAs melt and crystal in the industrial VGF processes are in the range of 2-10 K/cm and up to 15 K/cm, respectively. The crystal growth rate is typically in the range of 2-4 mm/h. The maximal temperature in the GaAs should not exceed 15 K above the melting temperature of GaAs (ca. T = 1528 K) in order to avoid a great loss of arsenic [2]. Guided by these facts, 130 process recipes were simulated and values of interface deflections, interface position and temperatures at the monitoring points at the GaAs seed bottom, end of cone and at the melt-free surface ( Figure 1b) were collected. All data in the form of 11-dimensional datasets are shown in parallel coordinates in Figure 3. Each line in the plot corresponds to one data set (x1 …x6,y1…y5). The generated database was used for DM/ML training and analysis.
Examples of axisymmetric quasi steady-state CFD simulation results for buoyancydriven flows in the form of temperature and stream function distributions in the VGFfurnace are shown in Figure 4. Due to the cylindrical geometry of the crucible, the melt flow was always toroidal, varying between multi-vortices (Figure 4b   As expected, our results confirmed the fact that generally favorable flat and/or a slightly convex s/l interface shape (y2 ≤ 0) was easier to obtain by lower growth rates x1 ( Figure 3, blue lines). On the contrary, with an increase of the crystal growth rate (Figure 3, red lines), more latent heat is generated at the s/l interface and interface deflection turns toward concavity (y2 > 0).
Obviously, the search for the optimal VGF process parameters is a difficult task that requires advanced statistical methods.

Data Mining
As mentioned before, PCA-biplot was used to visualize training data dispersion and variables correlation and to show the feasibility of dimensionality reduction without loss of information. The findings are given in Figure 5 and summarized below.
requires advanced statistical methods.

Data Mining
As mentioned before, PCA-biplot was used to visualize training data dispersion and variables correlation and to show the feasibility of dimensionality reduction without loss of information. The findings are given in Figure 5 and summarized below.
Since the two main components' combined contribution to the variance is only 60.6%, i.e., not high enough, further discussion about the projections of original variables into the 2-dim subspace and corresponding correlations was not justified. However, the results showed that all vectors were present in all quadrants of the PCA-biplot and were of similar lengths, and therefore the distribution of points can be considered appropriate for further ML/DM analysis. The k-means clustering method was applied to all 11 variables in this study. The k value varied from k = 2 to k = 10. The best clustering was observed for k = 2. Selected results for input x1 and output y1 are shown in Figure 6. These point out that, for all growth rates x1, there are 2 clusters of data with respect to the s/l interface position y1, i.e., data corresponding to the interface position y1 in the zone of influence of the upper side heater (z > 0.423 m) behave differently compared to the data in the zone of influence of the lower side heater (z < 0.423 m). This observation was used in further correlation analysis. Since the two main components' combined contribution to the variance is only 60.6%, i.e., not high enough, further discussion about the projections of original variables into the 2-dim subspace and corresponding correlations was not justified. However, the results showed that all vectors were present in all quadrants of the PCA-biplot and were of similar lengths, and therefore the distribution of points can be considered appropriate for further ML/DM analysis.
The k-means clustering method was applied to all 11 variables in this study. The k value varied from k = 2 to k = 10. The best clustering was observed for k = 2. Selected results for input x1 and output y1 are shown in Figure 6. These point out that, for all growth rates x1, there are 2 clusters of data with respect to the s/l interface position y1, i.e., data corresponding to the interface position y1 in the zone of influence of the upper side heater (z > 0.423 m) behave differently compared to the data in the zone of influence of the lower side heater (z < 0.423 m). This observation was used in further correlation analysis. Results for the correlation plots for all data (whole crystallization process) for "lower" cluster 1 and "upper" cluster 2 are shown in Figures 7-9, respectively.
From the analysis of the correlation coefficients for all data (Figure 7), it is possible to derive how inputs and outputs are correlated. The most pronounced correlation among inputs was observed by x4 and x5, i.e., by the power of the side heaters. Their correlation coefficient showed that they were weakly up to medium-strongly negatively correlated, Figure 6. Results for k-means clustering (k = 2) for input x1 and output y1.
Results for the correlation plots for all data (whole crystallization process) for "lower" cluster 1 and "upper" cluster 2 are shown in Figures 7-9, respectively.
From the analysis of the correlation coefficients for all data (Figure 7), it is possible to derive how inputs and outputs are correlated. The most pronounced correlation among inputs was observed by x4 and x5, i.e., by the power of the side heaters. Their correlation coefficient showed that they were weakly up to medium-strongly negatively correlated, with maximal value for r x4,x5 = −0.611. This result is in agreement with the nature of the VGF process, where the position of the crystallization front corresponds to a certain amount of the heating power and power distribution, and the growth rate determines the interface shape. Consequently, more power in the upper side heater implies less power in the lower side heater and vice versa. Concerning interface deflection y2, it is the most negative correlated by the increase of the power of upper side heater x4 (r x4,y2 = −0.556) and the most positive correlated by the increase of the power of top inner heater x2 (r x2,y2 = 0.411). Please note that in this study, convex interface deflection has a negative value (y2 < 0) and that a negative correlation is beneficial for the crystal quality.  From the analysis of the correlation coefficients for the "upper" data cluster 2 ( Figure  8), interface deflection y2 was the strongest negative correlated by the power of upper side heater x4 (rx4,y2 = −0.772) and the strongest positive correlated by the power of lower side heater x5 (rx5,y2 = 0.743). The first result was identical to the result for the whole VGF process, i.e., for all data. The second result differed. It can be understood by remembering that the "upper" data cluster is related to the second half of the crystallization process with the crystallization front positioned above the lower side heater. In this case, x5 was bringing heat to the GaAs crystal that was harmful for y2.
For the temperature at the melt-free surface y3, the most pronounced negative and positive correlation were achieved by x4 (rx4,y3 = −0.255) and x6 (rx6,y3 = 0.498), respectively. These findings are similar to the case when data for the whole process are considered. For the GaAs temperature at the melt-free surface y3, the most pronounced negative correlation (weak with beneficial influence) had inputs: the power of upper side heater x4 and growth rate x1, since they decreased y3 value and therewith they limited severe As evaporation (r x4,y3 = −0.256, r x1,y3 = −0.232). The most pronounced positive correlation, but a detrimental influence on y3, had the power of the bottom heater x6 and the lower side heater x5 (r x6,y3 = 0.362, r x5,y3 = 0.165). These results can be explained if one recalls the typical heat transfer in the VGF growth, where heat is entering GaAs via melt and leaving via crystal periphery, in addition to the heat generated at the crystallization front. More heat coming from the crystallization front x1 and the upper side heater x4 means less heat coming from the top heaters x2 and x3 that are closer to the melt-free surface. More heat coming into the system from the x5 and x6 retards heat transfer out of the GaAs and consequently triggers the rise of the temperature y3.
From the analysis of the correlation coefficients for the "upper" data cluster 2 (Figure 8), interface deflection y2 was the strongest negative correlated by the power of upper side heater x4 (r x4,y2 = −0.772) and the strongest positive correlated by the power of lower side heater x5 (r x5,y2 = 0.743). The first result was identical to the result for the whole VGF process, i.e., for all data. The second result differed. It can be understood by remembering that the "upper" data cluster is related to the second half of the crystallization process with the crystallization front positioned above the lower side heater. In this case, x5 was bringing heat to the GaAs crystal that was harmful for y2.
For the temperature at the melt-free surface y3, the most pronounced negative and positive correlation were achieved by x4 (r x4,y3 = −0.255) and x6 (r x6,y3 = 0.498), respectively. These findings are similar to the case when data for the whole process are considered.  From the analysis of the correlation coefficients for the "lower" data cluster 1 ( Figure  9), interface deflection y2 was influenced detrimentally in a similar strength by the x4, x1 and x2 (rx4,y2 = 0.431, rx1,y2 = 0.421, rx2,y2 = 0.411) and the most beneficially influenced by the power of lower side heater x5 (rx5,y2 = −0.777). The results are different in comparison to the result for the whole VGF process, i.e., for all data. This can be explained by the fact that the "lower" data cluster was related to the first half of the crystallization process, with the crystallization front positioned sidewise on the lower side heater and far away down from the top and upper side heaters.
For the temperature at the melt-free surface y3, the most beneficial and detrimental influence had x5 (rx5,y3 = −0.470) and x2 (rx2,y3 = 0.532), respectively. The greater heat was coming from the GaAs side periphery (x5), and less heat was coming from the top (x2) so that, consequently, y3 decreases.
Interestingly, the analysis of all correlation plots pointed out that the power of side heaters had a much stronger influence on the interface shape and maximal GaAs temperature than solely the crystal growth rate.

Decision Trees
As mentioned before, a successful VGF process is characterized inter alia by a flat crystallization front during the growth and constrained maximal temperatures in the melt to prevent strong arsenic evaporation/loss. The purpose of the DT analysis was to better understand the role of various process parameters and to identify their suitable values for the growth of high-quality crystals. From the analysis of the correlation coefficients for the "lower" data cluster 1 (Figure 9), interface deflection y2 was influenced detrimentally in a similar strength by the x4, x1 and x2 (r x4,y2 = 0.431, r x1,y2 = 0.421, r x2,y2 = 0.411) and the most beneficially influenced by the power of lower side heater x5 (r x5,y2 = −0.777). The results are different in comparison to the result for the whole VGF process, i.e., for all data. This can be explained by the fact that the "lower" data cluster was related to the first half of the crystallization process, with the crystallization front positioned sidewise on the lower side heater and far away down from the top and upper side heaters.
For the temperature at the melt-free surface y3, the most beneficial and detrimental influence had x5 (r x5,y3 = −0.470) and x2 (r x2,y3 = 0.532), respectively. The greater heat was coming from the GaAs side periphery (x5), and less heat was coming from the top (x2) so that, consequently, y3 decreases.
Interestingly, the analysis of all correlation plots pointed out that the power of side heaters had a much stronger influence on the interface shape and maximal GaAs temperature than solely the crystal growth rate.

Decision Trees
As mentioned before, a successful VGF process is characterized inter alia by a flat crystallization front during the growth and constrained maximal temperatures in the melt to prevent strong arsenic evaporation/loss. The purpose of the DT analysis was to better understand the role of various process parameters and to identify their suitable values for the growth of high-quality crystals.
The most important DT results for both regression (RT) and classification trees (CT) are given in Figures 10-12 with errors and summarized results in Tables 1-5. The errors are  given in Tables 1 and 4, in the form of the root mean square error RMSE corresponding to the nodes in the regression tree. The root node in the regression tree stands for the average value of the studied output among all data in the database. The path from the root to leaf states the influence of a certain input on the studied output, with the highest relevance at the top, decreasing downwards. Table 1. Root mean square error RMSE corresponding to the nodes in regression tree for interface deflection y2 [m] in Figure 10.

Node
Mean y2 RMSE The resulting RT for interface deflection y2 is shown in Figure 10. It reveals x2, followed by x4 and x1, as the most decisive inputs for the favorable flat or slightly convex interface shape (interface deflection y1 ≤ 0, the branches marked red in the tree graph). Their significance decreases in the order mentioned above. The most decisive input is x2 (the heating power of the inner top heater) that has a deteriorating effect on the interface flattening, i.e., x2 should be below 678 W to strongly push y2 towards lower values (less concavity), i.e., from average y2 = 0.00272 m to average y2 = 0.00188 m. All decisive inputs and ranges of their optimal values that assure the VGF-GaAs growth with flat or slightly convex s/l interface (y2 ≤ 0 m), derived from RT analysis, are given in Table 2. For the fast growth of GaAs crystals (growth rate > 3 mm/h) with a flat or slightly convex interface, the process heat should be provided predominantly from the upper side heater (x4), while the bottom and lower side heater should be turned off. Moreover, the inner top heater should provide only a very limited amount of heat to the system. Obtained RT results are consistent with the findings from our correlation analysis, remembering the fact that the most influential input in RT doesn't mean that its influence is necessarily beneficial for the variable y2. In the literature and among the crystal growers, there is a common opinion that the growth rate x1 has the strongest influence on the interface deflection (i.e., the higher the growth rate, the more generated latent heat at the crystallization front and consequently more concave s/l interface shape). Here obtained RT results do not refute the strong influence of x1. They mean only that other inputs outperformed the importance of x1 for y2. in the set of data remaining in the node after the last splitting. Red frames correspond to branches where leaf nodes have a mean deflection y2  0m, i.e., a convex shape of s/l interface. Table 2. The most decisive inputs and their optimal values for the VGF-GaAs growth with flat or slightly convex s/l interface (y2  0 m), derived from RT analysis. The data ranges correspond to the red marked branches in Figure 10. Figure 10. Regression tree analyzing the dependence of the solid-liquid interface deflection y2 on the power of heaters, i.e., inputs x1-x6. The values in yellow marked interior nodes and leaf nodes correspond to the mean value of output y2 in the set of data remaining in the node after the last splitting. Red frames correspond to branches where leaf nodes have a mean deflection y2 ≤ 0 m, i.e., a convex shape of s/l interface. Table 2. The most decisive inputs and their optimal values for the VGF-GaAs growth with flat or slightly convex s/l interface (y2 ≤ 0 m), derived from RT analysis. The data ranges correspond to the red marked branches in Figure 10. The same optimization task for y2 was solved using a classification tree. The target variable y2 consists further of real numbers and the classification task was performed associating these numbers with labels "+" for concave interface (y2 > 0) and "-" for convex interface (y2 < 0). The CT results for interface deflection y2 are given in Figure 11, and the corresponding most influential inputs and their ranges for optimized growth are given in Table 3. Except for x3, all other inputs played some role. The most influential input for obtaining the convex interface was x4, followed by x1 and x5. As with the RT, the results showed that during the rapid growth of crystals with a slightly convex interface, the process heat should mainly be provided by the upper side heater (x4), while the lower side heater should be almost switched off. obtaining the convex interface was x4, followed by x1 and x5. As with the RT, the results showed that during the rapid growth of crystals with a slightly convex interface, the process heat should mainly be provided by the upper side heater (x4), while the lower side heater should be almost switched off. Figure 11. Classification tree analyzing the dependence of the solid-liquid interface deflection y2 on the power of heaters, i.e., inputs x1-x6. The values in leaf nodes correspond to the sign of the mean value of output y2 in the remaining set of data after the last splitting. Red frames correspond to brunch where leaf nodes have mean deflection y2 < 0, i.e., a convex shape of s/l interface. Table 3. The most decisive inputs and their optimal values for the VGF-GaAs growth with convex s/l interface (y2 < 0), derived from CT analysis. The data ranges correspond to the red marked branches in Figure 11.

Mean
Decisive Inputs y2 x1 x2 x4 x5 x6 Figure 11. Classification tree analyzing the dependence of the solid-liquid interface deflection y2 on the power of heaters, i.e., inputs x1-x6. The values in leaf nodes correspond to the sign of the mean value of output y2 in the remaining set of data after the last splitting. Red frames correspond to brunch where leaf nodes have mean deflection y2 < 0, i.e., a convex shape of s/l interface. Table 3. The most decisive inputs and their optimal values for the VGF-GaAs growth with convex s/l interface (y2 < 0), derived from CT analysis. The data ranges correspond to the red marked branches in Figure 11. The resulting RT for the temperature at the melt top rim y3 is showed in Figure 12. The most decisive inputs and the range of their optimal values that prevent great loss of arsenic are given in Table 5. The data ranges correspond to the red marked branches in Figure 12. x6 played the role. The most influential input is x6, followed by x4. Other inputs are less important and appear at the lower position in the tree. The increase of the initial average value of y3 from 1520 K1530 K after the first split, by the increase of the power of the bottom heater x6, showed their positive correlation, but detrimental influence. On the contrary, higher values of the x4 (x4 > 3070 W) after the second split caused the decrease of the average y3 values from 1520K1510 K and confirmed their negative correlation (with beneficial influence) observed by the correlation plots. As with the RT for y2, the results of RT analysis for y3 showed again that during the optimized rapid growth of crystals (x1 > 3.28 mm/h) without great loss of arsenic, the process heat should mainly be provided by the upper side heater (x4), while the lower side and bottom heaters (x5 and x6) should be almost switched off (Table 5).

Figure 12.
Regression trees analyzing the dependence of the temperature y3 at the rim of the melt-free surface on the power of heaters and the growth rate, i.e., inputs x1-x6. Values in yellow marked interior nodes and leaf nodes correspond to the mean value of output y3 in the remaining set of data after the last splitting. Red frames correspond to brunches where leaf nodes have mean T  1520K. Table 5. The most decisive inputs and their optimal values for the VGF-GaAs growth that prevent significant loss of arsenic (y3 < 1528K), derived from RT analysis. The data ranges correspond to the red marked branches in Figure 12.   Table 5. The most decisive inputs and their optimal values for the VGF-GaAs growth that prevent significant loss of arsenic (y3 < 1528 K), derived from RT analysis. The data ranges correspond to the red marked branches in Figure 12. In summary, our RT and CT analysis revealed the key process parameters, their importance and the ranges of their values for achieving beneficial conditions for the VGF-GaAs growth.

Mean
To compare the DM and ML techniques used here, it is important to note that the DM techniques measured the relationship between one pair of variables among the inputs and outputs (x 1 . . . x 6 , y 1 . . . y 5 ), while ML/DT measured the relationships between all inputs and one output (x 1 , . . . x 6 , y i ) and suggested the range of their optimal values. For the simultaneous optimization of all outputs in relation to all inputs, artificial neural networks are the best choice. The latter, however, is associated with a loss of interpretability, much higher computing times and a vast amount of training data, which is often difficult to come by.

Conclusions
This study demonstrated the capabilities of data mining and machine learning in smart derivation of the crystal growth recipes on the example of VGF-GaAs growth.
The data mining and machine learning techniques used were: correlation analysis, k-means clustering, principal component analysis and decision trees extract patterns and correlations among the raw process data. The decision trees, both regression and classification type, are an excellent choice if we need a machine learning model with short training times based on a low volume of CFD training data able to provide humancomprehensible results. The decision trees also provide ranges of process parameters (e.g., power of heaters and growth rate) where nearly-optimal values of the output variables (e.g., interface deflection or maximal temperature in GaAs) can be found.
The results of the correlation analysis, the k-means clustering and the principal component analysis showed the good scatter of the training data and the existing correlation between all variables, without the possibility of dimensionality reduction.
The regression trees predicted: (i) the most influential input to keep the GaAs temperature below the limits for high arsenic evaporation is the heating power in the bottom heater x6, followed by the heating power in the upper side heater x4; (ii) the most influential input for shaping the solid-liquid interface is the heating power of the inner top heater x2, followed by the heating power of the upper side heater x4 and the crystal growth rate x1. The classification trees predicted that the most influential input for obtaining the convex interface would be x4 followed by x1 and x5.
The proposed combination of two modern data-driven techniques and standard CFD modeling can be easily further deployed in the fast development of the novel crystal growth processes/grown materials, as well as their scale up.