Real Time Predictions of VGF-GaAs Growth Dynamics by LSTM Neural Networks

The aim of this study was to assess the aptitude of recurrent Long Short-Term Memory (LSTM) neural networks for fast and accurate predictions of the process dynamics in vertical-gradient-freeze growth of gallium arsenide crystals (VGF-GaAs), using datasets generated by numerical transient simulations. Real-time predictions of the temperatures and the solid–liquid interface position in GaAs are crucial for control applications and for process visualization, i.e., for the generation of digital twins. In the reported study, an LSTM network was trained on 1950 datasets with 2 external inputs and 6 outputs. Based on network performance criteria and training results, LSTMs showed very accurate predictions of the VGF-GaAs growth process, with median root-mean-square error (RMSE) values of 2 × 10⁻³. This deep learning method achieved superior predictive accuracy and timeliness compared with the more traditional Nonlinear AutoRegressive eXogenous (NARX) recurrent networks.


Introduction
GaAs is one of the crucial future materials for microelectronic and optoelectronic devices in 5G and 6G technologies [1], providing access to terahertz frequencies and real-time transmission for the next generation of wireless communications. The maturity level of 5G and 6G technologies highly depends on the advancement of the Vertical-Gradient-Freeze (VGF) method of GaAs manufacturing. Devices are sensitive to GaAs crystal defects that induce leakage currents, which can dramatically deteriorate device performance. Since the density and electrical activity of defects are influenced by the thermal history of the GaAs crystal, in operando process monitoring, early fault detection and precise control of the crystal growth process are essential.
More precisely, a time-variant, spatially distributed temperature profile inside the GaAs has to be established and tracked while using only limited, spatially lumped controls on the outside. This boundary control problem is made even more difficult because the resulting partial differential equations for the temperature distribution in crystal and melt are coupled by the dynamics of the crystallization front and thus form a nonlinear free boundary problem.
Moreover, direct measurement of the temperature, velocity and concentration fields in the melt and crystal may cause contamination and is, therefore, troublesome. Predictions of the transport phenomena during crystal growth based on computational fluid dynamics (CFD) numerical modeling are accurate, but too slow for real-time applications like Model Predictive Control. When applying such a control scheme, several complete simulations of the whole process must be computed to find the optimal control input every time a new input is needed. This typically requires solving times of about 1 min for a 76 h process.
Difficulties lie, i.a., in the fact that a VGF-GaAs growth process is transient in nature, with a duration of up to 3 days and a pronounced time-lag effect. The latter originates from the low heat conductivity of GaAs (λ_m = 0.178 W/cm/K and λ_s = 0.071 W/cm/K at T_m = 1512 K), which hampers latent heat removal. The growth rate is slow (~2-4 mm/h), the process temperature range is narrow (grad T_melt ~2-5 K/cm; grad T_crystal ~15 K/cm) and the critical shear stress is low (σ_cr = 0.587 MPa at 40 K below T_m) [2].
One feasible solution for the generation of digital twins and process control is the application of machine learning (ML), and particularly artificial neural networks (ANN), for the fast forecasting of the VGF-GaAs growth process. Application of ML to crystal growth is a new but fast-emerging and very promising field, as shown in, e.g., [3][4][5][6][7][8][9][10][11][12].
Recurrent neural networks (RNN) are a class of ANNs used in cases where the data to be learned are sequential, as is the case with time series. An RNN's response at any given time depends not only on the current input, but also on the previous input sequence. Consequently, RNNs have a memory and can be trained to learn transient patterns. RNNs come in many variants [13]. In our previous proof-of-concept study [3], we used the Nonlinear AutoRegressive eXogenous (NARX) [14] type of RNN for the prediction of temperature profiles and the s/l interface position in VGF-GaAs growth. The predictions were accurate for slow growth rates, but their accuracy decreased significantly with increasing crystal growth rate. In our follow-up study, we increased the number of training datasets from 500 to 2000, but no improvement in prediction accuracy was observed. The reason may lie in the combination of the pronounced time-lag effect of GaAs growth and the known difficulty of NARX networks in learning long-term dependencies due to the vanishing gradient problem, which pertains to gradient learning methods such as backpropagation. In such methods, each of the neural network's weights receives an update proportional to the current gradient of the error function in each iteration of the training process. If the gradient is too small, the weights will not change and the neural network training stops or shows no improvement.
A solution to the vanishing gradient problem could be the application of Long Short-Term Memory (LSTM) networks, which were specially developed to deal with both short- and long-term dependencies [15]. A first successful application of LSTM networks in crystal growth was described in [5] for the process control of Cz-Si growth. The authors used experimental data, i.e., time series of heating power (1 input) and crystal diameter (1 output), to train an LSTM neural network and derived an LSTM-based identification model. The pulling rate varied along a given curve and was not used as an input. The concept of identification is demonstrated by showing that the unknown parameters in the studied model are only functions of identified parameters and that these functions lead to unique solutions [16].
In this study, we will present the first results of the application of LSTM networks in the fast prediction of VGF-GaAs crystal growth trained on 1D CFD datasets with temporal profiles of heating power (2 inputs) and temperatures at different positions in GaAs as well as the position of the crystallization front (6 outputs). The final goal is to set up mathematical models of the crystal growth process that are sufficiently precise, real-time capable and structurally suitable both for control applications and for generation of digital twins. Pros and cons of this approach will be discussed and results compared with previous NARX predictions.

Generation of Training Data
Transient datasets were generated by numerically solving a 1D model of VGF-GaAs growth as described by Equations (1)-(5) below, with initial and boundary conditions given in Equations (4)-(6). The GaAs material properties were taken from [17,18]. A simplified schema of the VGF-GaAs model is shown in Figure 1. While the initial conditions always corresponded to a crucible containing fully molten GaAs with a certain axial temperature gradient, different growth velocities were chosen, for which the resulting trajectories of the solid–liquid interface were computed.
Figure 1. Sketch of the VGF-GaAs hot zone with related neural network inputs (incoming x_1 and outgoing x_2 heat fluxes) and outputs (temperatures in GaAs at 0-25-50-75-100% of its length, y_1 to y_5, and interface position y_6), related to the 1D model of melt and crystal.
Please note that there is no generic way to determine a priori the required volume of training data, given just a problem description. The number of datasets for training a neural network has to be at least as high as the number of network parameters in order to provide the identifiability of all parameters. However, a sufficient data volume can be diagnosed only by monitoring the performance of the neural network during training.
In our study, one dataset included 2 inputs and 6 outputs, denoted as x_1, x_2 and y_1-y_6, respectively. The inputs were the temporal incoming and outgoing heat fluxes q_0 and q_1. The outputs included temporal temperature profiles in GaAs at 0-25-50-75-100% of its total length L and the GaAs s/l interface position (corresponding to the crystal length). GaAs growth rates varied in the range of 1 to 5 mm/h. The maximal crystal length was L = 0.3 m. The total crystal growth time was 2.1 × 10⁵ s (58.3 h).
The numerical model was solved using the differential equations stated in [19,20].
Selected examples of the generated training data that served as temporal input and output datasets for the LSTM network are presented in Figure 2.
Figure 2. Examples of crystal growth datasets: temporal profiles of inputs x_1, x_2 (incoming and outgoing heat fluxes), temporal profiles of outputs y_1 to y_5 (temperatures at characteristic points in GaAs) and y_6 (interface position in GaAs).
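The layout of such a database can be sketched as follows. This is a purely illustrative NumPy snippet (the actual datasets come from the 1D CFD model, not from random numbers); the array names and shapes are assumptions chosen to match the description above: 1950 growth runs, 100 time steps each, 2 inputs and 6 outputs.

```python
import numpy as np

# Hypothetical layout of the training database described above:
# each growth run is a sequence of 100 time steps with 2 inputs
# (heat fluxes x1, x2) and 6 outputs (temperatures y1-y5 and the
# interface position y6). Random values stand in for the CFD results.
n_datasets, n_steps, n_inputs, n_outputs = 1950, 100, 2, 6

rng = np.random.default_rng(0)
X = rng.random((n_datasets, n_steps, n_inputs))   # heat fluxes q0, q1 over time
Y = rng.random((n_datasets, n_steps, n_outputs))  # temperatures + interface position
```

One dataset is then the pair `X[k], Y[k]`, i.e., the full input and output trajectories of run `k`.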

LSTM Neural Network Modeling
In contrast to widely used feed-forward neural networks, where information moves only in the forward direction from the input nodes organized in the input layer, through the hidden nodes in hidden layers, to the output nodes in the output layer, RNNs are a type of neural network characterized by connections between neurons in one layer and neurons in the same, or a previous, layer, i.e., where the inputs are not independent. Since an RNN's response at any given time depends not only on the current input but also on the history of the input sequence, RNNs behave as if they have a memory. RNNs are typically used in cases where the data to be learned are sequential, i.e., for problems dealing with trajectories, control systems, robotics, speech recognition, language translation, stock predictions, etc.
Although basic RNNs are a very powerful method for the prediction of time series, they include many hyperparameters, i.e., parameters that are not the result of learning: they are either properties of the network that act as constants from the view of the learning algorithm, or they affect the learning algorithm itself. Basic RNNs are therefore generally difficult to train and may suffer from the vanishing or exploding gradients problem, which hinders them from learning long-term dependencies. The LSTM network proposed by Hochreiter and Schmidhuber in 1997 [15] successfully addressed these initial shortcomings and is nowadays the most popular RNN type.
An LSTM architecture (see Figure 3) includes several LSTM layers that consist of blocks, which in turn consist of cells. Each cell has its own inputs, outputs, and memory. Cells that belong to the same block share input, output, and forget gates. The input gate decides whether a given piece of information is worth remembering; the forget gate decides how much of it is still worth remembering, i.e., how quickly it may be forgotten; and the output gate decides whether it is relevant at a given step and should be used. Each of these gates can be thought of as a neuron, and an LSTM cell as a hidden layer in a feed-forward neural network.
The first LSTM block uses the initial state of the network and the first time step of the sequence to compute the first output and the updated cell state. At time step t, the block uses the current state of the network and the next time step of the sequence to compute the output and the updated cell state, as shown in Equations (7)-(11):

i_t = σ_g(W_i x_t + R_i h_{t−1} + b_i) (7)
f_t = σ_g(W_f x_t + R_f h_{t−1} + b_f) (8)
g_t = σ_c(W_g x_t + R_g h_{t−1} + b_g) (9)
o_t = σ_g(W_o x_t + R_o h_{t−1} + b_o) (10)
h_t = o_t ⊙ σ_c(c_t) (11)

where ⊙ denotes the elementwise product of vectors and matrices, aka the Hadamard product.
In Equations (7)-(11), h_t ∈ R^d is a vector which denotes the hidden state of the cell, where d is the cell dimension. x_t and h_t ∈ R^d denote the input and output of the cell at time step t, respectively. R_i, R_o, R_f, R_g ∈ R^{d×d} denote the recurrent weight matrices of the input gate, output gate, forget gate and the cell state, respectively. Similarly, W_i, W_o, W_f, W_g and b_i, b_o, b_f, b_g denote the weight matrices corresponding to the current input and the bias vectors, respectively. The symbols g_t, i_t, o_t, f_t ∈ R^d are the cell candidate, input, output, and forget gate vectors.
The cell state at time step t is given by:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t (12)

with σ_g denoting a componentwise applied gate activation function, which is, by default, the sigmoid function that outputs values in the range [0, 1]:

σ_g(x) = 1 / (1 + e^{−x}) (13)

In this study, we focused on an LSTM network with altogether 4 layers: input, LSTM, fully connected, and output layer. In the LSTM layer, the state and gate activation functions were tanh and sigmoid, respectively.
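Equations (7)-(13) can be condensed into a single forward step of one LSTM cell. The study itself used Matlab's Neural Network Toolbox; the NumPy sketch below is only an illustration of the update rules, with the weight dictionaries `W`, `R`, `b` and the tiny dimensions being assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    # Gate activation, Equation (13)
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM time step following Equations (7)-(13).

    W, R, b are dicts of input weights, recurrent weights and biases
    for the gates 'i', 'f', 'g', 'o' (illustrative naming)."""
    i_t = sigmoid(W['i'] @ x_t + R['i'] @ h_prev + b['i'])  # input gate, Eq. (7)
    f_t = sigmoid(W['f'] @ x_t + R['f'] @ h_prev + b['f'])  # forget gate, Eq. (8)
    g_t = np.tanh(W['g'] @ x_t + R['g'] @ h_prev + b['g'])  # cell candidate, Eq. (9)
    o_t = sigmoid(W['o'] @ x_t + R['o'] @ h_prev + b['o'])  # output gate, Eq. (10)
    c_t = f_t * c_prev + i_t * g_t   # cell state update, Eq. (12); '*' is the Hadamard product
    h_t = o_t * np.tanh(c_t)         # hidden state / cell output, Eq. (11)
    return h_t, c_t

# Tiny usage example with random weights (d = 3 hidden units, m = 2 inputs):
rng = np.random.default_rng(1)
d, m = 3, 2
W = {k: rng.standard_normal((d, m)) for k in 'ifgo'}
R = {k: rng.standard_normal((d, d)) for k in 'ifgo'}
b = {k: np.zeros(d) for k in 'ifgo'}
h, c = lstm_step(rng.standard_normal(m), np.zeros(d), np.zeros(d), W, R, b)
```

Note that the output gate bounds the hidden state: since o_t ∈ (0, 1) and |tanh| < 1, every component of h_t lies strictly inside (−1, 1).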
Training an LSTM network includes selecting an optimizer and the number of delay steps, the target prediction accuracy, as well as other options.
Among the many optimizers encountered in deep learning, which typically build on stochastic gradient descent (SGD), the Adam optimizer [21] was selected for this study due to its superior performance in terms of convergence speed.
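For reference, a single Adam parameter update [21] can be sketched as follows; this is a minimal NumPy illustration of the update rule (the default hyperparameter values β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ follow the original paper), not the toolbox implementation used in the study.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on parameter vector w (minimal sketch)."""
    m = beta1 * m + (1 - beta1) * grad           # 1st-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # 2nd-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return w, m, v

# Usage: minimize f(w) = w^2, whose gradient is 2w.
w = np.array([1.0])
m = np.zeros(1)
v = np.zeros(1)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
# w converges toward the minimum at 0
```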
The number of time delays was set to 3. Data preprocessing included data normalization and classification. First of all, we normalized the data into the interval [0, 1], separately for the inputs x 1 and x 2 (heat fluxes), for the outputs y 1 -y 5 (temperatures at monitoring points in GaAs), and for the output y 6 (the position of the solid-liquid interface).
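The group-wise min-max normalization described above can be sketched as follows. The value ranges and array names are illustrative assumptions; only the grouping (x_1, x_2 together; y_1-y_5 together; y_6 alone) mirrors the text.

```python
import numpy as np

def normalize_group(a):
    """Min-max normalize an array jointly over all its elements into [0, 1]."""
    lo, hi = a.min(), a.max()
    return (a - lo) / (hi - lo)

# Normalize each variable group separately, as described above:
# inputs x1, x2 together; outputs y1-y5 together; output y6 on its own.
rng = np.random.default_rng(2)
X = rng.uniform(10, 50, (1950, 100, 2))      # heat fluxes (arbitrary units)
T = rng.uniform(1300, 1600, (1950, 100, 5))  # temperatures y1-y5 in K
P = rng.uniform(0.0, 0.3, (1950, 100, 1))    # interface position y6 in m

Xn, Tn, Pn = normalize_group(X), normalize_group(T), normalize_group(P)
```

Normalizing the groups jointly (rather than each channel separately) preserves the relative magnitudes within a group, e.g., the spatial ordering of the five temperature curves.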
We used 10% of the total number of datasets for testing and the remaining ones for training. To this end, we took a random permutation of the numbers between 1 and the total number of datasets and assigned its last 10% of elements as indices of the test samples and the first 90% of elements as indices of the training samples. This methodology was selected rather than the information-richer cross-validation [22] due to its much lower computational demands.
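The split can be written in a few lines; this sketch reproduces the permutation-based 90/10 split described above (0-based indices instead of the 1-based numbering in the text).

```python
import numpy as np

# Random permutation of dataset indices: last 10% for testing,
# first 90% for training, as described above.
n_total = 1950
rng = np.random.default_rng(3)
perm = rng.permutation(n_total)

n_test = n_total // 10          # 195 test datasets
train_idx = perm[:-n_test]      # 1755 training datasets
test_idx = perm[-n_test:]
```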
A number of hyperparameters are associated with LSTM network training and require proper tuning. Important hyperparameters concerning the learning algorithm include: mini-batch size, number of epochs, epoch size, learning rate, number of hidden layers, and validation frequency. The mini-batch size is the number of datasets considered for each gradient descent step, i.e., for each full update of the weights through backpropagation. An epoch denotes one full forward and backward pass through the whole dataset; consequently, the number of epochs denotes the number of such passes across the dataset required for optimal training. The learning rate is the intensity with which gradient descent influences the update. If the learning rate is too low, training takes a long time; if it is too high, training might diverge. The initial learning rate for the Adam optimizer was set to 0.001.
The optimal network and learning hyperparameters differ for different kinds of training data and are related to how diverse the training data are. Moreover, some hyperparameters are interdependent (e.g., batch size may interact with learning rate). In this study, the hyperparameters were tuned manually [23].
Altogether, 21 combinations of three learning hyperparameters and of data volume were considered during the hyperparameter tuning, given in Table 1.
In our study, learning was always performed for the maximum number of epochs. The accuracy of the predictions y_p,i relative to the actual targets y_i for each of the 100 time steps in all 1950 datasets was estimated using the root mean squared error (RMSE), defined in Equation (14):

RMSE = √( (1/n) ∑_{i=1}^{n} (y_p,i − y_i)² ) (14)

Herein, a lower RMSE implies a better prediction accuracy.
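Equation (14) is straightforward to implement; a minimal sketch with a hand-checkable example:

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean squared error over all elements, Equation (14)."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Example: errors (0, -2) -> mean square 2 -> RMSE = sqrt(2)
err = rmse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))  # ≈ 1.414
```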
Due to low critical shear stress of GaAs and the associated high risk for polycrystalline growth, for practical applications and especially for feedback control, a high prediction accuracy is required for the process temperatures. Therefore, in this study, the LSTM prediction accuracy was assessed based on the whole distribution of obtained RMSE values. For ANN simulations, the commercial software Matlab ® and its Neural Network Toolbox™ were used.
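Assessing accuracy via the whole distribution of RMSE values, rather than a single summary number, amounts to computing the median, the interquartile range and the outlier bounds that a box plot visualizes. A small NumPy sketch with synthetic RMSE values (the lognormal sample is purely illustrative):

```python
import numpy as np

# Synthetic per-dataset RMSE values standing in for the 195 test results.
rng = np.random.default_rng(4)
rmse_values = rng.lognormal(mean=-6.2, sigma=0.4, size=195)

# Box-plot style summary of the distribution:
median = np.median(rmse_values)
q1, q3 = np.percentile(rmse_values, [25, 75])
iqr = q3 - q1
# Conventional box-plot outlier rule: beyond 1.5 * IQR from the quartiles.
outliers = rmse_values[(rmse_values < q1 - 1.5 * iqr) |
                       (rmse_values > q3 + 1.5 * iqr)]
```

A narrow interquartile range with few or no outliers then indicates uniformly accurate predictions across all test datasets, which is the criterion applied to the results below.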

Results and Discussion
Our goal was to assess the predictive accuracy of LSTM networks using datasets generated by numerical transient simulations of the VGF-GaAs growth process. A performance comparison between the various LSTM networks with the combinations of learning hyperparameters described in Table 1 was carried out, and the results shown in Figures 4-9 are summarized below.
The identification of suitable LSTM learning hyperparameters and data volume started with a comparison of RMSE values in cases where one feature was varied while the rest were kept constant. The reference combination 1 corresponds to 1950 datasets, the Adam optimizer, an initial learning rate of 1 × 10⁻³, a mini-batch size of 225, a maximum number of epochs of 100, and a validation frequency of 50.
The influence of data volume and elapsed time for network training on the LSTM prediction performance was obtained by comparing combinations 1-4. The results are shown in Figure 4a. Tripling the data volume from 500 to 1500 datasets decreased the RMSE fivefold, from ~5% to ~1%. However, a further increase in the data volume to 1950 datasets improved the RMSE value only insignificantly. The elapsed time showed the opposite trend and was in all cases less than 200 s.
The influence of mini-batch size values in the range from 25 to 550 on the RMSE was studied via comparison of combinations 1 and 5-8. The results are shown in Figure 4b. It was observed that batch sizes larger than 100 degrade the prediction accuracy in terms of the RMSE. This finding is a consequence of the fact that large-batch methods tend toward poorer generalization.
The influence of the number of epochs in the range 100-500 on the RMSE value was obtained by comparison of combinations 1 and 9-19. The results are shown in Figure 4c. A strong decrease in the RMSE value was observed with an increase in the number of epochs from 100 to 200. A further increase in this parameter influenced the RMSE value only a little.
For the remaining combinations 1-4 and 8, an order of magnitude higher RMSE values (median values in the range 0.009-0.05) were obtained. In the RMSE box plots in Figure 7, the red line marks the median, the box spans the first and third quartiles, red '+' symbols mark outliers, and the whiskers span the range of values that are not outliers.
Figure 9. Comparison of normalized LSTM-predicted outputs (p-1 to p-6) and normalized test outputs (t-1 to t-6) for 4 randomly selected datasets out of the 195 datasets for testing. The learning hyperparameters and data volume are given by combination 8.
As mentioned before, the predictive power of LSTM in GaAs growth must be assessed using the whole distribution of RMSE values. This approach is obvious if different properties of that distribution are plotted for different datasets and hyperparameters.
In Figures 5 and 6, RMSE values for the randomly selected test datasets are shown, with the learning hyperparameters corresponding to combinations 20 and 8, respectively. The RMSE median value for combination 20 was one order of magnitude lower than for combination 8 (2.0076 × 10⁻³ vs. 3.845 × 10⁻²). The variance of the RMSE values was similarly lower.
The RMSE box plots for the studied combinations are given in Figure 7. Cases 1-4 and 8 show an order of magnitude higher RMSE values than the rest. The best overall performance was obtained for combination 20, with one of the lowest median RMSE values (2.0076 × 10⁻³), a narrow interquartile range and no outliers. In the LSTM layer, the input size and number of hidden units were 20 and 100, respectively.
Another way of assessing the quality of network predictions is to compare LSTM predicted versus targeted temporal profiles of output variables y 1 -y 6 . For cases 20 and 8 and several selected datasets, results are given in Figures 8 and 9, respectively.
In both figures, the results are shown for four exemplary datasets differing in crystal growth rate. For example, among the four datasets in Figure 8 (combination 20), dataset 59 corresponds to the lowest growth rate and dataset 1940 to the highest. Consequently, the crystal length, i.e., the value of the output y_6 in the last time step, was highest in dataset 1940. For all datasets, the predicted output values accurately follow the targeted values.
Similarly, in Figure 9 (combination 8), the growth rate increased from dataset 22 through 395 and 1428 to 1913. In this case, the predicted outputs differed strongly from the target outputs at the beginning of the growth process, i.e., in time steps 1-15, for all growth rates. Additionally, predictions were inaccurate at the time steps when the crystallization front passed the location of a monitoring point and the temperature profile suddenly changed slope (see also Figure 2).
Summarizing the obtained results, the LSTM network with learning hyperparameters and data volume according to combination 20 accurately predicted the VGF-GaAs dynamics, without the difficulties in learning high crystal growth rates that we previously observed when using NARX networks. In addition, the absence of outliers in the predicted outputs reduces the risk of polycrystalline growth in the real process.
Considering the computational burden, i.e., the average training time for the same datasets, LSTM networks again significantly outperformed NARX networks.
For example, in our previous study, training of a NARX network on 500 datasets using the Bayesian Regularization and Levenberg-Marquardt training algorithms on a personal computer with an Intel Core i7-7700K CPU at 4.20 GHz, 64 GB RAM and an x64-based processor took 48 and 14 days, respectively. LSTM training on the same 500 datasets took 1 min on a Windows server with an Intel® Xeon® 6234 CPU at 3.30 GHz (2 processors), 384 GB RAM and an x64-based processor. LSTM training on the extended 1950 datasets took up to 19 min on the same Windows server. A direct comparison of the two studies is not possible due to differences in the hardware used. Still, the significantly shorter training time of the LSTM network is obvious.

Conclusions
This study demonstrated the capabilities of an LSTM neural network in the fast forecasting of crystal growth recipes, using VGF-GaAs growth as an example. Such real-time predictions are indispensable, e.g., in process automation and control.
The accuracy of predictions of growth recipes by LSTM neural networks was superior to NARX predictions.
An accurate and efficient LSTM architecture was identified, and the hyperparameters and required data volume were determined for the studied parameter space. Based on the statistical performance criteria and training results, the LSTM trained with the Adam optimizer, 3 time delays, 450 epochs, a mini-batch size of 50 and an initial learning rate of 1 × 10⁻³ was accurate for our database consisting of 1950 growth recipes generated by 1D CFD modeling. The LSTM accuracy was assessed using the distribution of RMSE values. For the most accurate predictions of the VGF-GaAs growth dynamics, a median RMSE value of 2 × 10⁻³ was obtained.
In all cases, the LSTM training was several orders of magnitude faster than NARX training on the same datasets.
Since an LSTM network relies heavily on the availability and quality of training data, our further efforts have to be directed toward generating datasets from transient 2D CFD models.

Acknowledgments: Proofreading of the article by Klaus Böttcher is gratefully acknowledged.

Conflicts of Interest:
The authors declare no conflict of interest.