Application of Artificial Neural Networks in Crystal Growth of Electronic and Opto-Electronic Materials

Abstract: In this review, we summarize the results concerning the application of artificial neural networks (ANNs) in the crystal growth of electronic and opto-electronic materials. The main reason for using ANNs is to detect patterns and relationships in non-linear static and dynamic data sets, which are common in crystal growth processes, in real time. Fast forecasting is particularly important for process control, since common numerical simulations are slow and in situ measurements of key process parameters are not feasible. This important machine learning approach thus makes it possible to determine optimized parameters for high-quality up-scaled crystals in real time.


Introduction
Crystal growth has a multi-disciplinary nature in which heat, momentum and mass transport phenomena, chemical reactions (e.g., crystal and melt contamination) and electro-magnetic processes (e.g., induction and resistance heating, magnetic stirring, magnetic brakes, etc.) play a crucial role. Phase transformation, the scaling problem (solid/liquid interface control on the nm scale in a growth system of ∼m size), the numerous parameters (10 or more [1]) that have to be optimized, the many constraints among them, and especially the dynamic character of the crystal growth process make its development a difficult task.
The primary objective of this paper is to provide a comprehensive overview of the potential of artificial intelligence (AI) in crystal growth by addressing the pros and cons of AI technology for enhancing the growth of affordable, high-quality bulk crystals with higher aspect ratios. Particular focus will be placed on the crystal growth of semiconductors, oxides and fluorides using the Czochralski (Cz), vertical gradient freeze (VGF), directional solidification (DS) and top seeded solution growth (TSSG) methods.
The content of this paper is presented as follows: a general overview of the challenges in crystal growth and the potential of AI is given. In this context, increased emphasis will be placed on ANNs as a large class of machine learning algorithms that attempt to simulate important parts of the functionality of biological neural networks. Machine learning is a subarea of AI that attempts to imitate, with computer algorithms, the way in which humans learn from previous experience.
This general overview is followed by introducing the reader to the basics of ANN modeling and other relevant statistical methods. The next section gives examples of already successful applications of ANN in the crystal growth. Finally, the main points and outlook of this highly industrially important technique are summarized.
In both research and industry, AI is increasingly used for the optimization of process parameters and the automation of manufacturing. Despite tremendous success in many fields of science and industry, including solid state materials science and chemistry [5,6], wider applications in crystal growth are still missing. The main reason lies in the fact that the ultimate success of AI is usually linked with the so-called 4V challenges: data volume, variety, veracity and velocity. In experimental crystal growth, large datasets are seldom available, the range of useful process parameters is rather narrow and data trustworthiness is an issue. Data veracity is a challenge, since in situ and in operando measurements of important process parameters are constrained by the aggressive environment and high purity requirements. In industry, the apparent volume of data is high; however, due to the ageing of the equipment and frequent small changes in the growth recipe and/or hot zone parts, the data veracity is questionable.
Recently, many different approaches have been proposed in the literature to tackle the 4V constraints, e.g., using CFD simulations to generate large and diverse datasets in combination with available experimental data for validation. On the other hand, the volume of the required training data can be reduced by using advanced machine learning methods known as active learning and transfer learning [7]. Various examples of successful ANN applications will be presented in Section 4.

Artificial Neural Networks Overview
Machine learning is an area of computer science aiming to optimize the performance of a certain task through learning from examples and/or past experience. Neural networks are by far the most widespread machine learning technique. There are many kinds of neural networks, differing most apparently in the architecture connecting their functional units, the neurons (Figure 1), each with unique strengths that determine their applications. The most important neural network types for materials science and crystal growth are the topic of this chapter.
An Artificial Neural Network (ANN) is a statistical method, inspired by biological processes in the human brain, able to detect patterns and relationships in data. ANNs are particularly powerful in correlating a very high number of variables and in capturing highly non-linear dependences [8].
An ANN is characterized primarily by its architecture, i.e., a set of artificial neurons and the connection patterns between them. The neurons are often organized into layers: an input layer, hidden (intermediate) layers and an output layer (Figure 2). Each neuron acts as a computational unit. It receives inputs x_i, multiplies them by weights w_i (a synaptic operation) and then uses the sum of the weighted inputs as the argument of a nonlinear function (a somatic operation), which yields the final output of the neuron y_j (known as the neuron activation). The whole ANN receives inputs through the neurons in the first layer and provides output through the neurons in the last layer.
The most common activation function has been taken over from logistic regression, well known in statistics, which is why it is called the logistic sigmoidal function f(x, w) (1):

f(x, w) = 1 / (1 + exp(−(Σ_i w_i x_i + b)))   (1)

By adjusting the weights w_{j,i} of the connections and the biases b_i of the artificial neurons (a process known as ANN training), one can obtain the targeted output y_j for a specific combination of inputs x_i. The final goal of ANN training is to adjust the weights and biases so as to minimize some kind of error E measured on the network.
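A single artificial neuron of this kind, i.e., a weighted sum passed through the logistic sigmoid of Equation (1), can be sketched in a few lines (the numerical values are purely illustrative):

```python
import math

def neuron_output(inputs, weights, bias):
    """Synaptic operation (weighted sum) followed by the somatic
    operation (logistic sigmoid), as in Equation (1)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # logistic sigmoid, output in (0, 1)

# Example: two inputs, two weights, one bias
y = neuron_output([0.5, -1.0], [0.8, 0.2], bias=0.1)
```

The sigmoid maps any weighted sum into the interval (0, 1), which is what makes stacking such neurons into layers produce a smooth, differentiable network.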
For crystal growth, the most relevant kind of error is the sum of squared differences between the outputs y_j of the network and the desired outputs o_j, summed over all the neurons in the output layer (2):

E = Σ_j (y_j − o_j)^2   (2)

In the simplest case, the weights can be adjusted using the method of gradient descent [9] with a constant learning rate η (3):

Δw_{j,i} = −η ∂E/∂w_{j,i}   (3)

Prior to ANN training, it is necessary to select the architecture, the activation function and the training method.
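For a single sigmoid neuron, Equations (1)-(3) combine into a very short training loop. The following is a minimal sketch with made-up data (one neuron, one training pattern), not a production training routine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, o, eta):
    """One gradient-descent update (Equation (3)) for a single sigmoid
    neuron, minimising the squared error E = (y - o)^2 (Equation (2))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = sigmoid(z)
    # dE/dw_i = 2 (y - o) * y (1 - y) * x_i  (chain rule through the sigmoid)
    delta = 2.0 * (y - o) * y * (1.0 - y)
    w = [wi - eta * delta * xi for wi, xi in zip(w, x)]
    b = b - eta * delta
    return w, b, (y - o) ** 2

w, b = [0.1, -0.2], 0.0
for _ in range(200):
    w, b, err = train_step(w, b, x=[1.0, 0.5], o=0.9, eta=0.5)
```

After a few hundred updates the squared error shrinks towards zero, illustrating how repeated small weight adjustments drive the network output towards the desired value.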
The suitability of different ANN architectures is most reliably compared by the k-fold cross validation method [11]. First, the training set is partitioned into k subsets. For each architecture, the training is performed k times, each time using one of the subsets as the validation set and the remaining subsets as the training set. In the next step, the architecture with the smallest error averaged over the validation sets from the k runs is selected. Finally, the network with that architecture is trained with all of the data. In traditional feed-forward neural networks with one or only a few hidden layers (so-called shallow networks), three training algorithms are most frequently used: Levenberg-Marquardt, Bayesian regularization and scaled conjugate gradients [10]. After being trained, the ANN model reflects the relationship between the input and output of the system.
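The architecture-selection procedure described above can be sketched as follows; `train_and_score` is a hypothetical stand-in for training an ANN of a given architecture and returning its validation error:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k (roughly equal) folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(architectures, train_and_score, data, k=5):
    """Return the architecture with the smallest mean validation error.
    train_and_score(arch, train_rows, val_rows) is assumed to train a
    model of the given architecture and return its validation error."""
    folds = k_fold_indices(len(data), k)
    best_arch, best_err = None, float("inf")
    for arch in architectures:
        errs = []
        for i in range(k):
            val = [data[j] for j in folds[i]]
            train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
            errs.append(train_and_score(arch, train, val))
        mean_err = sum(errs) / k
        if mean_err < best_err:
            best_arch, best_err = arch, mean_err
    return best_arch, best_err
```

Each architecture is thus scored on data it has never seen during training, which is what makes the comparison between architectures fair.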
If an ANN is expected to correlate variables evolving in time, a dynamic ANN should be used [12]. The forecasting of time series is a typical problem in process control applications [13]. The response of a dynamic ANN at any given time depends not only on the current input, but on the history of the input sequence. Consequently, dynamic networks have memory and can be trained to learn transient patterns. Temporal information can be included through a set of time delays between different inputs, so that the data correspond to different points in time. There are several types of dynamic ANN models that can be used for time-series forecasting, e.g., Long Short-Term Memory (LSTM), the Layer-Recurrent Network (LRN), the Focused Time-Delay Neural Network (FTDNN), the Elman Network, and Nonlinear Autoregressive Networks with Exogenous Inputs (NARX) [14,15].
NARX networks are time-delay recurrent networks suitable for short time lag tasks. They have several hidden layers that relate the current value of the output to: (i) past values of the same variable and (ii) current and past values of the input (exogenous) variables (Figure 3). Such a model can be described algebraically by Equation (4):

y[t] = f(y[t−1], ..., y[t−d_y], x[t], x[t−1], ..., x[t−d_x+1]) + θ   (4)

where y[t] ∈ R^{N_y} is the output variable, x[t] ∈ R^{N_x} is the exogenous input variable, f is a non-linear activation function (e.g., sigmoid), θ is an error term, and d_x and d_y are the input and output time delays.
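The delayed regressor underlying Equation (4) is simply a concatenation of recent inputs and past outputs. A minimal sketch of assembling it (a hypothetical helper, not taken from the cited papers), with the time histories stored as lists of vectors:

```python
def narx_input(x_hist, y_hist, t, dx, dy):
    """Assemble the NARX regressor at time t from the current and past
    dx exogenous inputs and the past dy outputs; x_hist and y_hist are
    lists of vector-valued samples indexed by time."""
    past_x = [x_hist[t - j] for j in range(0, dx)]       # x[t], ..., x[t-dx+1]
    past_y = [y_hist[t - j] for j in range(1, dy + 1)]   # y[t-1], ..., y[t-dy]
    # Flatten into one vector of dx*Nx + dy*Ny components
    return [c for sample in past_x + past_y for c in sample]
```

Feeding this vector to an otherwise static feed-forward network is what gives the NARX model its memory of the recent process history.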
The input i[t] of the NARX network has d_x N_x + d_y N_y components (5):

i[t] = [x[t]^T, x[t−1]^T, ..., x[t−d_x+1]^T, y[t−1]^T, ..., y[t−d_y]^T]^T   (5)

In the equation, T denotes the transpose of a matrix. The output y[t] of the network is governed by Equations (6)-(8):

h_1[t] = f(θ_1, i[t])   (6)
h_l[t] = f(θ_l, h_{l−1}[t]), l = 2, ..., L   (7)
y[t] = g(θ_0, h_L[t])   (8)

where h_1[t] ∈ R^{N_1} is the output of the input layer at time t, h_l[t] ∈ R^{N_l} is the output of the l-th hidden layer at time t, g(·) is a linear function, θ_1 are the parameters that determine the weights in the input layer, θ_l those in the l-th hidden layer and θ_0 those in the output layer. NARX networks are trained and cross validated in the same way as static ANNs. For solving complex long time lag tasks, LSTM networks are a better choice. The LSTM network was proposed in [16] as a solution to the vanishing gradient problem found in training ANNs with gradient-based learning methods and back-propagation, where the training process may completely stop, i.e., the weights do not adjust their values anymore. An LSTM uses a broader spectrum of information than more traditional recurrent networks. To this end, it consists of gated cells that can forget or pass on information, based on filters with their own sets of weights, usually adjusted via network learning. By maintaining a more constant error, LSTM networks can learn over many time steps and link distant occurrences to a final output.
A Convolutional Neural Network (CNN) is a special type of neural network mostly used for image and pattern recognition. A CNN consists of multiple, repeating components that are stacked in basic layers: convolution, pooling, fully connected and dropout layers, etc., similar to most other types of ANNs (Figure 4) [17]. A convolution layer applies a convolution filter to its input data. A pooling layer maximizes or averages the values in each sub-region of the feature maps. A fully connected layer connects each neuron in the next layer to every neuron in the previous layer by a weight, as in the traditional feed-forward networks described earlier in this chapter. Activation functions, as part of the convolutional layer and the fully connected layer, introduce nonlinear transformations into the CNN model. A dropout layer randomly ignores (drops out) a certain number or proportion of neurons, thereby decreasing the danger of overtraining (and also the training costs).
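The convolution and max-pooling operations can be illustrated with a minimal pure-Python sketch on toy dimensions (no CNN library, single channel, no padding or stride options):

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution (strictly, cross-correlation, as in most
    CNN libraries) of a 2D list `image` with a 2D list `kernel`."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size sub-regions."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

feature_map = conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
pooled = max_pool(feature_map, size=2)
```

Stacking such convolution and pooling stages, followed by fully connected layers, is what lets a CNN detect local patterns regardless of where they appear in the image.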
In the literature, most of the studies on the application of ANNs in crystal growth were devoted to optimization problems. Fortunately, the optima of ANNs can be determined by applying methods for differentiable functions (almost all ANNs can be differentiated). Another optimization method sometimes encountered in this context is Genetic Algorithms (GAs). Due to their popularity in various scientific fields in or adjacent to materials science [5,18,19], they will be shortly described.
A GA is probably the best known representative of evolutionary algorithms, which are stochastic methods for solving optimization problems based on the idea of biological evolution and natural selection. A GA repeatedly modifies a population of individual solutions by randomly selecting individuals from the current population, evaluating and ranking them according to their fitness value, and then either forwarding them to the next generation if they belong among those with the best fitness values, or recombining or mutating them to produce the children for the next generation. Over consecutive generations, the population evolves towards better and better solutions (Figure 5).

The probability p_S(X_i) that the individual X_i in a population of N individuals will be selected to become a parent depends on its fitness value f(X_i), which first has to be normalized according to Equation (9):

p_S(X_i) = f(X_i) / Σ_{j=1}^{N} f(X_j)   (9)

For the proportional selection scheme (roulette wheel), X_i will be selected if a random number ξ with uniform distribution on the interval [0,1] satisfies Equation (10):

Σ_{j=1}^{i−1} p_S(X_j) < ξ ≤ Σ_{j=1}^{i} p_S(X_j)   (10)

Two individuals described by vectors of real numbers X, Y that are selected as parents will recombine with probability p_c, producing the new individuals X', Y' according to Equation (11):

X' = ξX + (1 − ξ)Y,  Y' = (1 − ξ)X + ξY   (11)

where ξ is a random number with uniform distribution on the interval [0,1].
Mutation of the individual X will produce in the next generation, with probability p_m, an individual X' according to Equation (12):

X' = X + ξ   (12)

where ξ is a random vector with a Gaussian distribution with zero mean and unit variance. When combining an ANN and a GA, the search for the optimum starts by randomly generating a set of inputs and their corresponding outputs predicted by the ANN. Candidate solutions are then selected according to their fit to previously defined criteria; the GA is then used for evolving new solutions to the problem using crossover and mutation. This is repeated until the optimization criteria are fulfilled [20].
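The selection, crossover and mutation steps can be combined into a single generation update. The following is a toy sketch (not the implementation used in the cited works), assuming positive fitness values with higher values being better:

```python
import random

def ga_step(pop, fitness, pc=0.8, pm=0.1):
    """One GA generation: roulette-wheel selection (Equations (9)-(10)),
    arithmetic recombination (Equation (11)) and Gaussian mutation
    (Equation (12)). Individuals are lists of real numbers."""
    f = [fitness(ind) for ind in pop]
    total = sum(f)
    probs = [fi / total for fi in f]              # Equation (9): normalised fitness

    def select():                                 # Equation (10): roulette wheel
        xi, acc = random.random(), 0.0
        for ind, p in zip(pop, probs):
            acc += p
            if xi <= acc:
                return ind
        return pop[-1]

    children = []
    while len(children) < len(pop):
        x, y = select(), select()
        if random.random() < pc:                  # Equation (11): arithmetic crossover
            xi = random.random()
            x = [xi * a + (1 - xi) * b for a, b in zip(x, y)]
        if random.random() < pm:                  # Equation (12): Gaussian mutation
            x = [a + random.gauss(0.0, 1.0) for a in x]
        children.append(x)
    return children
```

Iterating `ga_step` drives the population towards regions of high fitness while the mutation term keeps exploring the parameter space.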
The inherent stochastic nature of crystal growth data originates from, e.g., inaccurate measurements or inaccurate simulations of the crystal growth process parameters: crucible rotational rate, crystal rotational rate, crystal pulling rate, gas pressure, gas flow rate, heating power, melt loading, etc. Addressing the uncertainty of ANN predictions is feasible if a Gaussian process model (GP) is superimposed on the ANN [21]. Due to its high potential benefit for crystal growth applications, this combined ANN and GP approach will be shortly described.
A GP is a statistical method capable of modeling the probability distribution of output values Y_x for arbitrary sets of inputs x_1, ..., x_n simultaneously [21]. A simple example of a GP in one dimension is illustrated in Figure 6. Mathematically speaking, a GP is a collection (Y_x)_{x∈R^k} of random variables Y_x assigned to points from a k-dimensional vector space R^k, such that any finite subcollection corresponding to some n points x_1, ..., x_n from that space has a multivariate Gaussian distribution:

(Y_{x_1}, ..., Y_{x_n}) ~ N(μ, Σ_{(x_1,...,x_n)})   (13)

Here, μ is the GP mean, determined by a function that models the non-stochastic part of the data, and the covariance matrix Σ_{(x_1,...,x_n)} is determined by a symmetric function K : R^k × R^k → R, called the covariance function, on which usually a Gaussian noise with a variance σ²_G is superimposed:

Σ_{(x_1,...,x_n)} = [K(x_i, x_j)]_{i,j=1,...,n} + σ²_G I_n   (14)

where I_n denotes the n-dimensional identity matrix. One possible covariance function is defined in (15), where σ²_f and σ²_l are the GP hyperparameters, i.e., the signal variance and the characteristic length scale of the Gaussians in the space R^k, respectively:

K(x, x') = σ²_f exp(−(x − x')ᵀ(x − x') / (2σ²_l))   (15)
The hyperparameters and the Gaussian noise dispersion σ²_G are usually estimated with the maximum likelihood method, i.e., through maximizing the density (13) of (Y_{x_1}, ..., Y_{x_n}) corresponding to the vector (y_1, ..., y_n) from a given training set ((x_1, y_1), ..., (x_n, y_n)). Once the hyperparameters have been estimated, allowing the value K(x, x') to be computed for any x, x' ∈ R^k, (13) and (14) can be used to predict the distribution of Y_{x*} for any x* ≠ x_1, ..., x_n. This yields:

Y_{x*} | (y_1, ..., y_n) ~ N(μ(x*) + k*ᵀ Σ⁻¹ (y − μ), K(x*, x*) + σ²_G − k*ᵀ Σ⁻¹ k*)   (16)

where k* = (K(x*, x_1), ..., K(x*, x_n))ᵀ, y = (y_1, ..., y_n)ᵀ and μ = (μ(x_1), ..., μ(x_n))ᵀ. When combining an ANN and a GP, i.e., if a trained ANN is used as the GP mean function (Equation (13)), more information is obtained about the system than from one single method: the ANN offers information about the functional dependence among the variables, and the GP about the random influences.
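The GP prediction step can be illustrated in one dimension with a zero prior mean (in the combined ANN + GP approach, the trained ANN would supply the mean instead). The following toy sketch is restricted to two training points so the matrix algebra stays explicit:

```python
import math

def sq_exp(x1, x2, sf2=1.0, sl2=1.0):
    """Squared-exponential covariance function (cf. Equation (15))."""
    return sf2 * math.exp(-(x1 - x2) ** 2 / (2.0 * sl2))

def gp_predict(xs, ys, x_star, noise=1e-6):
    """Posterior mean and variance of Y_{x*} given two training points,
    using the standard GP conditioning formulas (cf. Equation (16)),
    with zero prior mean; n = 2 so K can be inverted by hand."""
    a, b = xs
    K = [[sq_exp(a, a) + noise, sq_exp(a, b)],
         [sq_exp(b, a), sq_exp(b, b) + noise]]
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    Kinv = [[K[1][1] / det, -K[0][1] / det],
            [-K[1][0] / det, K[0][0] / det]]
    k_star = [sq_exp(x_star, a), sq_exp(x_star, b)]
    # Posterior mean: k*^T K^-1 y
    alpha = [Kinv[0][0] * ys[0] + Kinv[0][1] * ys[1],
             Kinv[1][0] * ys[0] + Kinv[1][1] * ys[1]]
    mean = k_star[0] * alpha[0] + k_star[1] * alpha[1]
    # Posterior variance: K(x*, x*) + noise - k*^T K^-1 k*
    v = [Kinv[0][0] * k_star[0] + Kinv[0][1] * k_star[1],
         Kinv[1][0] * k_star[0] + Kinv[1][1] * k_star[1]]
    var = sq_exp(x_star, x_star) + noise - (k_star[0] * v[0] + k_star[1] * v[1])
    return mean, var
```

Near a training point the predicted variance collapses towards the noise level, while between points it grows, which is exactly the uncertainty information the combined ANN + GP approach attaches to each prediction.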

Static ANN Applications
Concerning static applications, in the papers [25,26,29,31] feed-forward networks of either the mono- or multi-layer perceptron type were used to model dependences pertaining to the crystal growth process.
In [26], TSSG of SiC crystals for power devices was studied. To make high-quality large-diameter (8 inch) SiC crystals grown by the TSSG method able to compete commercially with standard SiC crystals grown by the sublimation method, it is necessary to optimize the spatial distribution of the supersaturation and the flow velocity in the solution (Figure 7a). In the literature, it was reported that solution flow from the center to the periphery gives rise to a smooth surface on the crystal [38]. The beneficial supersaturation distribution is one in which the supersaturation near the seed crystal is relatively high and the supersaturation near the crucible bottom and wall is low. The TSSG optimization is a challenging task, since the velocity and supersaturation depend on many process parameters (e.g., heater power, crucible position and rotation, seed crystal position and rotation, growth configuration of the heat insulator, crucible shape, and crucible size) that must be optimized simultaneously. Moreover, these parameters need to be optimized with respect to multiple objectives.
Common experimental and CFD approaches to the optimization of the process parameters are laborious and time consuming. The authors of [26] proposed the application of an ANN for the acceleration of CFD simulations, combined with a GA for the process optimization. The database for the ANN training was derived from CFD simulations. The resulting feed-forward ANN with 4 hidden layers, derived from 1000 steady axisymmetric CFD-simulated process recipes, was able to correlate 11 inputs (boundary temperatures, seed rotation rates, sizes of the crucible and seed, and spatial coordinates (r, z) of 400 points in the axisymmetric computational domain) with 3 outputs (flow velocity components (radial u_r, axial u_z) and chemical composition of the solution) at the points in the computational domain shown in Figure 7b. The comparison of the ANN and CFD predictions of the flow and concentration patterns is shown in Figure 7c. The ANN predictions mimicked the CFD results and were 10^7 times faster than the corresponding CFD simulations, also enabling fast optimization of the process parameters in the large parameter space. The superposition of the GA on the ANN prediction model enabled more optimal conditions to be found. The prediction of the growth conditions for upscaled SiC crystals using the same methodology was the topic of the authors' further papers [32,33].
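The surrogate workflow of [26] can be illustrated with a toy sketch: a cheap model trained on a limited number of "simulations" is then queried thousands of times during optimization. Here a 1-nearest-neighbour lookup and a synthetic one-parameter objective stand in for the ANN and the CFD solver (both are purely illustrative, not the models from the paper):

```python
import random

def expensive_simulation(x):
    """Toy stand-in for a CFD run: an objective to minimise."""
    return (x - 0.3) ** 2

# 1. Build a training database from a limited number of "simulations".
random.seed(1)
train_x = [random.random() for _ in range(50)]
train_y = [expensive_simulation(x) for x in train_x]

# 2. Cheap surrogate: here 1-nearest-neighbour instead of a trained ANN.
def surrogate(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

# 3. Optimise over the surrogate: thousands of candidate evaluations
#    now cost almost nothing compared with rerunning the simulation.
candidates = [i / 10000.0 for i in range(10001)]
best = min(candidates, key=surrogate)
```

Only 50 expensive calls were needed, while the optimizer evaluated 10,001 candidates; the ratio between the two is what makes surrogate-assisted optimization pay off as the parameter space grows.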
Concerning the proposed method of CFD acceleration by an ANN for optimization purposes, the question is often raised whether it is worth training an ANN using thousands of CFD simulations, or whether it is more efficient to increase the computational power and perform solely the CFD simulations of the required case. The answer may lie in the economics of scale. The number of CFD cases one has to run is proportional to the available processing power, while the number of cases one avoids running thanks to a trained ANN can exceed that by many orders of magnitude. Therefore, the more parameters there are to optimize, the better the economy of the ANN method for high-speed prediction of CFD results. Nevertheless, the strength of ANNs in CFD modelling lies more in model deduction than in replacing the solver itself.
Other researchers studied the application of static ANNs combined with a GA [18,31] or with the Adam optimization method [39] for the optimization of parameters affecting crystal growth [26].
For example, the prediction and optimization of parameters affecting the temperature field in the Czochralski growth of YAG crystals, using data based on axisymmetric steady state CFD simulations, was studied in [18]. In the Czochralski crystal growth process, a flat crystallization front during growth assures the production of single crystals with fewer structural defects, uniform physical properties and homogeneous chemical composition. The study focused on the influence of the crystal pulling rate, crystal rotational rate, ambient gas temperature and crucible temperature on the deflection and position of the crystallization front. An ANN with 4 inputs, 1 hidden layer and 2 outputs, derived from only 81 simulations, was used. The CFD results were verified with Cz-InP growth experiments published in [40] (Figure 8b). The moderate accuracy of the ANN predictions may originate either from the simple architecture of the ANN and the low number of training data, or from inaccurate CFD results used for the ANN training. The latter may be an issue due to the oversimplified CFD model (e.g., simple boundary conditions and the steady state assumption) and the verification of the obtained results using crystal growth experiments for another material. This example of ANN application reveals the greatest drawback of using ANNs based on CFD data: ANNs strongly rely on the veracity of the training data. An ANN can only extract the information that is present in its training set; it cannot compensate for inaccuracy of the CFD results in the absence of experimental validation of the data.
Another example of the application of a static ANN for optimization tasks is described in [37]. The authors addressed the common problem of accurate monitoring of temperatures during the directional solidification of silicon (DS-Si) with a limited number of thermocouples. They used 195 data sets generated by 2D CFD modeling to train an ANN with 8 inputs (3 heater temperatures, 4 equidistant temperatures along the crucible side wall and 1 crucible axial position) and 21 outputs (21 equidistant temperatures along the crucible side wall). The best predictions were obtained for an architecture with 2 hidden layers of 32 neurons. The top ten ranks of accurate temperature predictions contained positions around the crucible bottom, suggesting the importance of measuring temperatures in the zone of high temperature gradients. This approach and the obtained results may be of interest for predicting the locations and reducing the number of thermocouples inside small crystal growth furnaces. Nevertheless, the accuracy of the ANN predictions will again strongly depend on the accuracy of the CFD results, particularly for processes such as DS-Si, where an axisymmetric CFD model is used to describe a rectangular set-up. For these cases, prior to ANN training, the verification of the CFD model with crystal growth experiments is indispensable.
A feasible approach for ANN applications with inaccurate input values, or with more than one possible solution, is to provide uncertainty information for the ANN predictions by superposing the ANN with a Gaussian process (GP). This combination of two statistical methods was used in [29,30] for the fast prediction and optimization of magnetic parameters for temperature field management, i.e., for a flat solid-liquid interface (deflection |∆| < 0.1 mm), in magnetically driven DS-Si and VGF of GaAs.
In [29], 4 inputs (frequency, phase shift, electric current amplitude and crystal growth rate) were correlated with 1 output (the solid-liquid interface deflection ∆ in a magnetic field) using a single-hidden-layer feed-forward ANN based on 437 axisymmetric quasi-steady-state CFD simulations, verified with available experimental results (Figure 9). Finally, the ANNs were combined with GP models to derive the probability distribution of the output for every given combination of inputs (Figure 9c).
Analyzing the GP results shown in Figure 9c, an uneven narrowness of the spatial probability distribution can be noticed. From the way a GP is constructed, it follows that the uncertainty of the GP predictions depends on the local data density: where the density of training data is high, the variance of the predicted Gaussian distribution is small, while for outliers, or in sparsely populated regions of the input space, the variance is large. In view of this, a combination of ANN and GP offers more information than a single model: the ANN provides information about the functional dependence, and the GP about the random influences.
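The data-density effect described above can be reproduced with a minimal GP regression sketch. The kernel, the 1D toy data and the query points below are all hypothetical; the point is only that the predictive variance collapses near the training cluster and approaches the prior variance far from it.

```python
import math

def rbf(a, b, length=0.5):
    # Squared-exponential (RBF) covariance between two scalar inputs
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, B):
    # Gauss-Jordan elimination with partial pivoting; solves A X = B,
    # where B is a list of right-hand-side rows (one per row of A).
    n = len(A)
    M = [row[:] + rhs[:] for row, rhs in zip(A, B)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [[M[r][n + j] / M[r][r] for j in range(len(B[0]))] for r in range(n)]

def gp_predict(xs, ys, x_star, noise=1e-6):
    # Standard GP regression: mean = k*ᵀ K⁻¹ y, var = k(x*,x*) − k*ᵀ K⁻¹ k*
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(xi, x_star) for xi in xs]
    sol = solve(K, [[yi, ki] for yi, ki in zip(ys, k_star)])  # K⁻¹[y, k*]
    mean = sum(k_star[i] * sol[i][0] for i in range(n))
    var = rbf(x_star, x_star) - sum(k_star[i] * sol[i][1] for i in range(n))
    return mean, var

# Hypothetical 1D training data clustered near x = 0 (dense region),
# with nothing beyond x = 0.4 (sparse region).
xs = [0.0, 0.1, 0.2, 0.3, 0.4]
ys = [math.sin(x) for x in xs]

m_near, v_near = gp_predict(xs, ys, 0.2)   # inside the dense cluster
m_far,  v_far  = gp_predict(xs, ys, 3.0)   # far from all training data
print("variance near data: %.4f, far from data: %.4f" % (v_near, v_far))
```

Near the cluster the variance is essentially zero, while far away it recovers the prior variance of the kernel, which is exactly the behavior visible in the uneven narrowness of Figure 9c.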

Dynamic Applications
Exact control of the dynamic processes at the crystallization front is key to enhanced crystal growth yield and improved crystal quality. It is particularly important to suppress turbulent motion in the melt and to control the temperature gradients in the crystal, which are responsible for the generation of detrimental crystal defects and undesired variations of the crystal diameter. The complex solidification process is difficult to control due to large time delays, high-order dynamics and constraints on the use of suitable sensors in the crystallization furnace because of its hostile environment. Multivariable nonlinear model predictive control based on dynamic artificial neural networks is the most promising, real-time capable and accurate alternative to conventional slow controllers based on linear theory.
The description of crystal growth process dynamics by a static feed-forward ANN was the topic of [31]. In this study, 54 transient axisymmetric 2D CFD simulations were used to derive the cooling rates of two heaters and the velocity of the heat gate during the directional solidification of 850 kg quasi-mono Si crystals in an industrial G6-size furnace. These rates were correlated with crystal quality (i.e., thermal stress in the crystal and solid/liquid interface deflection) and growth time using a static ANN with 3 inputs, 1 hidden layer and 3 outputs (Figure 10). The growth recipe for the solidification step was optimized using a GA, with the total fitness of the evaluation defined in Equation (17).
The fitness weights in Equation (17) were selected in cooperation with industry; thermal stress is the most important factor causing dislocations in the crystal and was therefore assigned the highest weight. Compared with the original crystal growth recipe, the optimized recipe has a faster movement of the heat gate and a larger cooling rate of the top heater, but a smaller cooling rate of the side heater. Moreover, the cooling rates of both heaters in the optimal recipe decrease slightly with time. The authors found that optimization of the process with the coupled ANN and GA is about 45 times faster than optimization with CFD. The proposed combination of transient CFD results for the database and a static ANN has both advantages and disadvantages. Typically, static ANNs are defined by fewer parameters (weights and biases) than dynamic ANNs, i.e., they require a smaller number of datasets to assure identifiability of the parameters, and they train faster.
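A minimal sketch of this surrogate-plus-GA pattern follows. The three objective functions are invented stand-ins for the trained ANN outputs (stress, interface deflection, growth time as smooth functions of two cooling rates and a heat-gate velocity, scaled to [0,1]), and the weights are illustrative, in the spirit of Equation (17), with thermal stress weighted most heavily.

```python
import random

random.seed(1)

# Hypothetical surrogate objectives standing in for the ANN outputs.
def objectives(p):
    r1, r2, v = p
    stress     = (r1 - 0.3) ** 2 + (r2 - 0.6) ** 2
    deflection = (v - 0.5) ** 2 + 0.1 * abs(r1 - r2)
    time       = 1.0 - 0.5 * (r1 + r2) / 2      # faster cooling, shorter run
    return stress, deflection, time

W = (0.6, 0.3, 0.1)                             # illustrative fitness weights

def fitness(p):
    # Weighted sum of objectives; lower is better
    return sum(w * f for w, f in zip(W, objectives(p)))

def evolve(pop_size=40, generations=60, mut=0.1):
    pop = [[random.random() for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[:pop_size // 2]             # keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, 3)        # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(3):                  # Gaussian mutation, clipped
                if random.random() < mut:
                    child[i] = min(1.0, max(0.0, child[i] + random.gauss(0, 0.1)))
            children.append(child)
        pop = elite + children
    return min(pop, key=fitness)

best = evolve()
print("best recipe:", [round(x, 2) for x in best], "fitness %.4f" % fitness(best))
```

Because each fitness evaluation here is a cheap surrogate call rather than a transient CFD run, the GA can afford thousands of evaluations, which is the source of the roughly 45-fold speed-up reported in [31].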
The drawback is the use of the heater cooling rates as static ANN inputs, since these are not experimentally measurable during the crystal growth process; typical crystal growth furnaces use either power or temperature control of the heaters. Therefore, this approach is not suitable for process control and automation. Moreover, the proposed methodology finds the optimum of the ANN, not necessarily the optimum of the underlying crystal growth problem.

Figure 10. Optimization of the controlling recipe in quasi-mono Si growth using a feed-forward ANN coupled with GA: (a) configuration of a G6-size industrial seeded directional solidification (DS) furnace, (b) ANN architecture, (c) thermal stress field in the grown crystals for the original controlling recipe (left) and the optimal recipe (right), (d) original and optimal growth recipe. Reprinted from [31] with permission from Elsevier.
Another concept for coping with process dynamics was proposed in a proof-of-concept study [28], where transient 1D CFD results of a simplified VGF-GaAs model provided transient datasets of 2 heating powers, 5 temperatures at different axial positions in the melt and crystal, and the position of the solid/liquid interface. Altogether, 500 datasets were used for training a NARX type of dynamic ANN. The best results were obtained for a NARX architecture with 2 inputs, 2 hidden layers with 9 and 8 neurons, 6 outputs, and 2 time delays (Figure 11b). The predictions were accurate for slow growth rates (Figure 11c), but their accuracy decreased with increasing crystal growth rate. Besides the need for improved accuracy, for practical application in process automation and control it will be necessary to derive the datasets from axisymmetric CFD simulations.
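The role of the time delays in such a NARX model can be illustrated with a toy closed-loop prediction: the regressor for each step is built from the last d exogenous inputs and the last d (previously predicted) outputs. The linear map below is a hand-picked stand-in for a trained network, and the first-order "thermal" response to a heating power input is invented.

```python
# A NARX model predicts y(t) from delayed inputs u and delayed outputs y:
#   y(t) = f(u(t-1), ..., u(t-d), y(t-1), ..., y(t-d))

def narx_predict(f, u_hist, y_hist, delays=2):
    """One closed-loop NARX step from the last `delays` inputs and outputs."""
    regressor = u_hist[-delays:] + y_hist[-delays:]
    return f(regressor)

# Hand-picked stand-in for a trained network: y(t) = 0.7 y(t-1) + 0.3 u(t-1),
# a stable first-order filter mimicking a slow thermal response.
def f(reg):
    u1, u2, y1, y2 = reg          # [u(t-2), u(t-1), y(t-2), y(t-1)]
    return 0.7 * y2 + 0.3 * u2

u = [1.0] * 20                    # constant heating power input
y = [0.0, 0.0]                    # initial output history (two delayed values)
for t in range(2, 20):
    y.append(narx_predict(f, u[:t], y, delays=2))

print("steady-state output: %.3f" % y[-1])
```

In closed loop the model feeds its own predictions back into the regressor, so any one-step error compounds over time; this is one reason the accuracy reported in [28] degraded at faster growth rates, where the dynamics are harder to capture.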
Figure 11. Reprinted from [28] with permission from Elsevier.
One more example of the application of a dynamic ANN, in the crystal growth of semiconducting films, was presented in [36]. In Metal Organic Chemical Vapor Deposition (MOCVD) growth of GaN for microelectronic and optoelectronic devices, accurate temperature control is needed to maintain wavelength uniformity, control wafer bow and reduce wafer slip. The authors reported the development of a dynamic NARX ANN for the prediction of time series of 2 temperatures (2 outputs) given time series of 2 heater filament currents, the carrier rotation rate and the operating pressure (4 inputs). The time-series predictions served as the plant model in model predictive control. Very accurate temperature predictions, with errors of ~1 K, were obtained for a NARX architecture with 1 hidden layer of 10 neurons and 2 delays. The different accuracy of the NARX predictions in bulk and film crystal growth in the above-mentioned examples may be related to the different time scales of the transport phenomena in these two processes (e.g., the long time scale for the removal of latent heat from the crystallization front in large, industrial-size bulk crystals versus the short time scale in thin films). NARX neural networks have shown success in many time-series modeling tasks, particularly in control applications, but learning long-term dependencies from data remains difficult. This is often attributed to their vanishing gradient problem. More recent Long Short-Term Memory (LSTM) networks attempt to remedy this problem by preserving the error, which is then back-propagated through time and layers [16]. By maintaining a more constant error, LSTM allows recurrent nets to continue to learn over many time steps. LSTM applications in bulk crystal growth are still to come.
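The gating mechanism that lets an LSTM preserve information (and error signals) over many steps can be shown with a single scalar LSTM cell. The parameter values below are hand-picked for illustration, not trained: a nearly always-open forget gate lets the cell state written by one early input pulse survive 49 subsequent silent steps almost unchanged, whereas a plain recurrent unit would have washed it out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step for scalar input/state; W holds the gate parameters.

    The cell state c is updated additively (f * c_prev + i * g), which is
    what lets gradients flow over many time steps.
    """
    f = sigmoid(W["wf"] * x + W["uf"] * h_prev + W["bf"])    # forget gate
    i = sigmoid(W["wi"] * x + W["ui"] * h_prev + W["bi"])    # input gate
    o = sigmoid(W["wo"] * x + W["uo"] * h_prev + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h_prev + W["bg"])  # candidate state
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Illustrative, hand-picked parameters (not a trained network):
W = dict(wf=0.0, uf=0.0, bf=6.0,   # forget gate ~ sigmoid(6) ~ 0.998: remember
         wi=4.0, ui=0.0, bi=-2.0,  # input gate opens only for large inputs
         wo=0.0, uo=0.0, bo=6.0,   # output gate almost always open
         wg=1.0, ug=0.0, bg=0.0)

h, c = 0.0, 0.0
sequence = [1.0] + [0.0] * 49      # a single pulse followed by silence
for x in sequence:
    h, c = lstm_step(x, h, c, W)
print("cell state after 50 steps: %.3f" % c)
```

The cell state retains most of the value written by the initial pulse, illustrating why LSTMs are a natural candidate for the long time scales of bulk crystal growth once suitable training data become available.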


Image Processing Applications
Applications of CNNs in crystal growth are yet to come. Still, numerous papers are available on the application of CNNs in fields pertinent to crystal growth simulations and crystal characterization, e.g., the prediction of turbulence [41], the derivation of material data [5,42–46], the optimization of CFD meshes [3], and the classification of atomically resolved Scanning Transmission Electron Microscopy (STEM) [47] and Transmission Electron Microscopy (TEM) [48] images, to mention just a few.

Conclusions and Outlook
The recent boom in ANN applications in various fields of science and technology was possible thanks to increased data volumes, advanced algorithms, and improvements in computing power and storage.
For the years to come, it is reasonable to expect that novel ANN applications will significantly accelerate fundamental and applied crystal growth research. The gain for scientific research lies in the fast and accurate predictive power of ANNs, which is a stepping stone towards new crystal growth theories and hypotheses where a convincing theory is unavailable. The predictive power of ANNs enables: (1) the pre-selection of well-performing scientific models for further studies, (2) the quantitative comparison of scientific models on the basis of their prediction success, which might reveal the factors relevant for that success and thus contribute to theory development, and (3) an ultimate, reliable criterion for the successful validation of new theoretical models, free from error-prone human judgement. Concerning crystal growth applications, the need for affordable high-quality crystals of semiconductors and oxides is continuously increasing, particularly in the electronic and photovoltaic industries, i.e., for solar cells and electric and fuel cell vehicles. Fast optimization of the process parameters, and their exact control, is key to enhanced crystal growth yield and improved crystal quality. The next generation of smart crystal growth factories will use AI and automation to keep costs low and profits high.
In this paper, we reviewed recent ANN applications and discussed their advantages and drawbacks. The latest international activities have been devoted to the development of a sustainable infrastructure for the provision of experimental, theoretical, and computational research data in the fields of condensed-matter physics and materials science. Once available, open-source big crystal growth data will resolve the last bottleneck for ANN applications and strongly push the development of new, breakthrough crystalline-material-based technologies. Until then, the volume of required training data may be reduced by using advanced machine learning methods known as active learning [49–53].