GPU-Accelerated Cellular Automaton Model for Grain Growth during Directional Solidiﬁcation of Nickel-Based Superalloy

: To accelerate the large-scale cellular automaton (CA) simulation for grain growth, a parallel CA model for grain growth was developed. The model was implemented based on the compute uniﬁed device architecture (CUDA) parallel computing platform. The model was veriﬁed by the grain growth of a single crystal and the columnar-to-equiaxed transition (CET) of an Al-7wt% Si specimen of uniform undercooling with a constant cooling rate. The grid independence of the model was veriﬁed. The grain growth of a plate-like casting of nickel-based superalloy during directional solidiﬁcation process was simulated and the obtained results of grain density at each section with different heights were compared with the experimental data. The CET transition of directional solidiﬁed Al-7wt% Si cylindrical ingot was simulated. The grain texture and cooling curves were in good agreement with experimental results from the literature. Finally, high parallel performance of the CA model was obtained and evaluated.


Introduction
Nickel-based superalloy has been widely used to produce blades which are used in aero engines and industrial gas turbines. To improve the high-temperature mechanical properties of blades, the directional solidification (DS) technique has been used to fabricate superalloy blades. The high-temperature performance of blades is closely related to the microstructure, grain size and crystallographic orientation. Therefore, the grain growth during solidification process has been widely investigated over the last two decades.
Many numerical models have been proposed to simulate the grain growth process during DS process. Models for grain growth during solidification process were originally investigated by Rappaz and Thévoz [1,2]. Then, the decentered cellular automaton (CA) algorithm was proposed by Gandin et al. [3] and it was widely accepted to predict the dendritic grain growth. Based on the analytical predictions of dendritic grain envelopes, the cellular automaton finite element (CAFE) model [4] was developed to predict grain structure, such as growth of equiaxed dendritic grains and columnar dendritic grains. Nastac et al. [5] proposed a stochastic modeling of microstructure formation to predict the grain structure in castings and the influences of various neighborhood configurations on nucleation and grain growth were evaluated. Wang et al. [6] proposed a modified decentered CA method to study the influence of perturbing the withdrawal velocity upon the stability of the primary dendrite spacing. Zhang et al. [7] developed a cellular automaton finite difference (CAFD) model to simulate the grain growth coupled with the temperature evolution of turbine blades during DS process. Then, Zhang et al. [8] investigated the grain growth and grain selection behavior in a spiral selector for nickelbased superalloy based on the developed CAFD model. Viardin et al. [9] developed a In this work, a parallel algorithm was proposed for both 2D and 3D CA model of grain growth using the graphic processing unit (GPU). A mapping strategy between the CA cells and threads on GPU was illustrated. The parallel performance of the developed algorithm was evaluated by the comparison with the program parallelized with Open Multi-Processing (OpenMP) technique. Then, the CET of Al-7wt% Si specimen was simulated. The grain growth was simulated during DS process for a plate-like casting of nickel-based superalloy and the obtained results of grain density with different heights were compared with the experimental data. Finally, the parallel performance of the developed CA model was evaluated.

Nucleation Model and Grain Growth Algorithm
To simulate grain growth in superalloy solidification, a CA model [32] was adopted. A continuous nucleation model proposed by Rappaz and Gandin et al. [33] was used to describe the heterogeneous nucleation. The total nucleation density n(∆T) is calculated as where ∆T is the undercooling; n max is the maximum nucleation site density; ∆T σ is the standard deviation of undercooling; ∆T N is the mean nucleation undercooling. The relation between the tip growth velocity with the given undercooling is given by the polynomial formulation.
The kinetic coefficients a 2 and a 3 are fitted on the predictions of Kurz-Giovanola-Trivedi (KGT) model [34]. The CA model is restricted to the face center cubic (FCC) crystal. In Figure 1, three orthogonal axes, which are the half-diagonals of the octahedron represent the growth directions of the primary dendritic, which are labeled as the crystallographic directions ([100]/[010]/[001]). The orientation of the [100] direction with respect to the global coordinate (X, Y, Z) was characterized by a set of Euler angles (φ 1 , φ 2 , φ 3 ).
dendrite growth simulation using phase field model, due to massive computation capacity and high memory bandwidth, where high efficiency and scalability were demonstrated [29][30][31].
In this work, a parallel algorithm was proposed for both 2D and 3D CA model of grain growth using the graphic processing unit (GPU). A mapping strategy between the CA cells and threads on GPU was illustrated. The parallel performance of the developed algorithm was evaluated by the comparison with the program parallelized with Open Multi-Processing (OpenMP) technique. Then, the CET of Al-7wt% Si specimen was simulated. The grain growth was simulated during DS process for a plate-like casting of nickelbased superalloy and the obtained results of grain density with different heights were compared with the experimental data. Finally, the parallel performance of the developed CA model was evaluated.

Nucleation Model and Grain Growth Algorithm
To simulate grain growth in superalloy solidification, a CA model [32] was adopted. A continuous nucleation model proposed by Rappaz and Gandin et al. [33] was used to describe the heterogeneous nucleation. The total nucleation density ( ) where T Δ is the undercooling; max n is the maximum nucleation site density; T σ Δ is the standard deviation of undercooling; N T Δ is the mean nucleation undercooling. The relation between the tip growth velocity with the given undercooling is given by the polynomial formulation.

( )
The kinetic coefficients a2 and a3 are fitted on the predictions of Kurz-Giovanola-Trivedi (KGT) model [34]. The CA model is restricted to the face center cubic (FCC) crystal. In Figure 1, three orthogonal axes, which are the half-diagonals of the octahedron represent the growth directions of the primary dendritic, which are labeled as the crystallographic directions ( , ,  The growth of the dendrite tip follows the rule above. Then, the grain envelope is determined by the crystallographic orientation and the dendrite tip length. For 2D CA model, the envelope is square shape. The capture rule can be seen in Figure 2. The grain envelope associated with cell A (smaller blue square) grow large enough to cover the center of cell B, which means that cell B is captured and will become interface cell at the next time step. Moore neighborhoods of cell A were considered for cell capture. The extension of the capture rule to 3D is straightforward, which can be referred to [32]. determined by the crystallographic orientation and the dendrite tip length. For 2D CA model, the envelope is square shape. The capture rule can be seen in Figure 2. The grain envelope associated with cell A (smaller blue square) grow large enough to cover the center of cell B, which means that cell B is captured and will become interface cell at the next time step. Moore neighborhoods of cell A were considered for cell capture. The extension of the capture rule to 3D is straightforward, which can be referred to [32].

Algorithm Implementation on GPU
To accelerate the CA model, we developed a program based on the compute unified device architecture (CUDA) parallel computing platform using a GPU. GPU is more suitable for compute-intensive computation compared with CPU, because there are several thousand cores on a single GPU.
In our model, the finite difference method (FDM) was used to solve the temperature field with coarse grids and the CA model was used for grain growth calculation with fine grids. One FDM grid consists of several CA cells depending on the grid size ratio. Due to the locality of computation of both FDM and CA model, the calculation performed on each cell can be easily mapped to the thread on GPU. Considering only the domain designated as alloy attribute needs the memory used for CA model, the memory for computation of CA model was allocated for these areas. A global index array was used as the indicator of the neighboring configuration of each CA cell. Hence, memory requirement was reduced in the simulation of casting with complex shape. The memory arrangement of CA cells stored on GPU is shown in Figure 3. The memory is only allocated for active cells, which are alloy cells and shell cells. In CA model, the computation requires the index of Moore neighboring cells. In Figure 3, it shows the step to search the index of the neighboring cell at (0, 1) direction for the 10th active cell. Firstly, the global index (5,4) was obtained by the global index of active cells. With the direction (0, 1) of the Moore neighboring cell, the global index of the neighboring cell is calculated as (5,5). Then, the local index of neighboring cell can be obtained by the value of index helper in global index, as shown in the left side of Figure 3.
The subroutine for each module, such as grain nucleation, cell capture process, grain growth, is defined in the kernel function. These kernel functions, invoked by CPU, run by threads on GPU. The index of each thread is labeled by the built-in variables on GPU, which help to find the index of each cell. Therefore, data stored on each cell, such as temperature, cell status, are accessible to the kernel functions.

Algorithm Implementation on GPU
To accelerate the CA model, we developed a program based on the compute unified device architecture (CUDA) parallel computing platform using a GPU. GPU is more suitable for compute-intensive computation compared with CPU, because there are several thousand cores on a single GPU.
In our model, the finite difference method (FDM) was used to solve the temperature field with coarse grids and the CA model was used for grain growth calculation with fine grids. One FDM grid consists of several CA cells depending on the grid size ratio. Due to the locality of computation of both FDM and CA model, the calculation performed on each cell can be easily mapped to the thread on GPU. Considering only the domain designated as alloy attribute needs the memory used for CA model, the memory for computation of CA model was allocated for these areas. A global index array was used as the indicator of the neighboring configuration of each CA cell. Hence, memory requirement was reduced in the simulation of casting with complex shape. The memory arrangement of CA cells stored on GPU is shown in Figure 3. The memory is only allocated for active cells, which are alloy cells and shell cells. In CA model, the computation requires the index of Moore neighboring cells. In Figure 3, it shows the step to search the index of the neighboring cell at (0, 1) direction for the 10th active cell. Firstly, the global index (5, 4) was obtained by the global index of active cells. With the direction (0, 1) of the Moore neighboring cell, the global index of the neighboring cell is calculated as (5,5). Then, the local index of neighboring cell can be obtained by the value of index helper in global index, as shown in the left side of Figure 3.
The subroutine for each module, such as grain nucleation, cell capture process, grain growth, is defined in the kernel function. These kernel functions, invoked by CPU, run by threads on GPU. The index of each thread is labeled by the built-in variables on GPU, which help to find the index of each cell. Therefore, data stored on each cell, such as temperature, cell status, are accessible to the kernel functions.
In CUDA programming, data can be stored in the global memory and the shared memory. The global memory is large, usually several gigabytes, while the shared memory, a low-latency memory near the processor core, is only 64 kilobytes for most GPU. The shared memory is expected to be much faster than the global memory. However, using the shared memory in a kernel function will increase the number of occupied registers, which reduces the number of cores that can be launched on GPU. The efficiency of the program is usually affected by the utilization of the shared memory. Therefore, only the global memory is used in the program. In addition, the constants such as the process parameters, and variables related to the index offset of Moore neighborhood's configuration are stored In CUDA programming, data can be stored in the global memory and the shared memory. The global memory is large, usually several gigabytes, while the shared memory, a low-latency memory near the processor core, is only 64 kilobytes for most GPU. The shared memory is expected to be much faster than the global memory. However, using the shared memory in a kernel function will increase the number of occupied registers, which reduces the number of cores that can be launched on GPU. The efficiency of the program is usually affected by the utilization of the shared memory. Therefore, only the global memory is used in the program. In addition, the constants such as the process parameters, and variables related to the index offset of Moore neighborhood's configuration are stored in the constant memory, which is a low-latency memory with the size of 64 kilobytes for frequent access.
The implementation of the model on GPU is given as following. Firstly, the initialization of the CA model was performed on CPU, then the data for the CA model and FDM calculation were transferred to GPU. Each thread on GPU performs a group of computation based on the kernel function after the thread configuration was set. During the calculation process, the data of grain orientation and temperature were transferred from GPU to CPU with a given interval of time step. Since no other extra data transfer between CPU and GPU, the consumed time mainly comes from the iteration calculation and the computation efficiency was ensured by the locality of computation. Different time steps can be used for the calculation of temperature field and CA model to reduce the total iteration times.
The algorithm of the heat transfer and grain growth process can be briefly summarized as follows.
(1) Heat transfer calculation on grids by FDM.
(2) Temperature interpolation from FDM grids to CA cells.  The implementation of the model on GPU is given as following. Firstly, the initialization of the CA model was performed on CPU, then the data for the CA model and FDM calculation were transferred to GPU. Each thread on GPU performs a group of computation based on the kernel function after the thread configuration was set. During the calculation process, the data of grain orientation and temperature were transferred from GPU to CPU with a given interval of time step. Since no other extra data transfer between CPU and GPU, the consumed time mainly comes from the iteration calculation and the computation efficiency was ensured by the locality of computation. Different time steps can be used for the calculation of temperature field and CA model to reduce the total iteration times.
The algorithm of the heat transfer and grain growth process can be briefly summarized as follows.
(1) Heat transfer calculation on grids by FDM.
(2) Temperature interpolation from FDM grids to CA cells.

Single Grain Growth
The predictions of a single grain growth with a given orientation under different temperature gradients are shown in Figure 4. The corresponding Euler angle is (10 • , 20 • , 30 • ) and the growth kinetics is given by the dendrite tip growth velocity υ = A · ∆T 2 , with In order to verify the accuracy of model, the tip growth velocity of the envelope was analyzed by comparison between the numerical results and the theoretical results.
The error caused by grid anisotropy is an expected defect of the CA model [35], which is associated with the corresponding transition rule of cell state and time step used in the numerical calculation. Hence, the error should be suppressed to ensure the accuracy of

Single Grain Growth
The predictions of a single grain growth with a given orientati temperature gradients are shown in Figure 4. The corresponding Eule 30°) and the growth kinetics is given by the dendrite tip growth velo with 4 1 2 In order to verify the accuracy of model, the tip growth velocity o analyzed by comparison between the numerical results and the theore The error caused by grid anisotropy is an expected defect of the CA is associated with the corresponding transition rule of cell state and tim numerical calculation. Hence, the error should be suppressed to ensu numerical calculation. The effects of orientation angle of the grain enve on the tip growth velocity were investigated in this section.
For simplicity, a two-dimensional case was performed. The comp 300 × 300 with a cell size of 15 µm. A nucleation site was positioned domain with a given orientation angle. A uniform undercooling of 3 the whole simulation. The growth kinetics of Al-7 wt% Si alloy is giv tip growth velocity  value. The error of tip length with different orientation angles and para in Figure 5. The results indicated that the error of tip length decreases of the time step (the parameter λ ) when orientation angle is small (le imately). The error of tip length is small when the parameter λ is which is an acceptable accuracy for grain growth simulation, as a singl not grow too large in a given thermal condition. The effect of orientati For simplicity, a two-dimensional case was performed. The computation domain is 300 × 300 with a cell size of 15 µm. A nucleation site was positioned in the center of the domain with a given orientation angle. A uniform undercooling of 3 K was kept during the whole simulation. The growth kinetics of Al-7 wt% Si alloy is given by the dendrite tip growth velocity υ = A · ∆T n with n = 2.7 and A = 2.9 × 10 −6 m · s −1 · K −2.7 [36]. In order to determine the time step with a given condition, a parameter λ is defined as λ = V m · ∆t/∆x, where V m is the maximum tip growth velocity at the whole domain, ∆t is the time step and ∆x is the cell size. Cases with envelope orientation angles from 0 • to 45 • (due to the four-fold symmetry) with an interval of 5 • and the parameter λ of 0.1, 0.01, and 0.001 were simulated. To ensure the same reference tip length of the envelope at the end time, the calculation was finished while the solid fraction of the growing envelope reaches 0.4 in the computation domain. The error of the tip length ε is defined as ε = |L c − L t |/L t , where L c is the tip length obtained by simulation, and L t is the theoretical value. The error of tip length with different orientation angles and parameter λ is shown in Figure 5. The results indicated that the error of tip length decreases with the decrease of the time step (the parameter λ) when orientation angle is small (less than 20 • approximately). The error of tip length is small when the parameter λ is smaller than 0.01, which is an acceptable accuracy for grain growth simulation, as a single grain usually will not grow too large in a given thermal condition. The effect of orientation angle on the tip length is not large, which indicates that the developed CA model is applicable to the grain growth with arbitrary orientation angle.

Grain Growth with a Uniform Undercooling
Then, the grain growth of Al-7 wt% Si specimen was simulated by 2D simulation. The computation domain is 300 × 300 with a cell size of 15 µm. The simulation started with a uniform undercooling and the cooling rate was −2.3 K/s. The growth kinetics of the alloy is the same as that of aforementioned parameters. The cell size is 15 µm, which is suitable for the description of dendrite growth. The volumetric nucleation site density n v is 5.5 × 10 10 m −3 and the surface nucleation site density n s is 2.5 × 10 8 m −2 . The corresponding parameters used in 2D simulation can be obtained by the stereological relationships in the Reference [33]. The standard deviations of volumetric nucleation undercooling ∆T s,σ and surface nucleation undercooling ∆T v,σ are both 0.1 K. The mean surface nucleation undercooling ∆T s,m is 0.5 K. Cases with different mean volumetric nucleation undercooling ∆T v,m were simulated. The CA model is used to describe the final grain structure or the grain texture after solidification, hence only the primary FCC aluminum dendrite is considered in the model. The formation of diamond silicon facet phase is not considered in the CA  Figure 6. Grains with crystallographic orientation aligned with the normal to mold surface were selected corresponding to the grains with orientation angle close to 0 • and 90 • . The columnar grains formed at the mold surface grow up to the center of the specimen as shown in Figure 6a. As the mean volumetric nucleation undercooling decreases, equiaxed grains nucleated before the columnar grains grow up to the center domain as shown in Figure 6b,c. The CET occurs as the mean volumetric nucleation undercooling becomes small.

Grain Growth with a Uniform Undercooling
Then, the grain growth of Al-7 wt% Si specimen was simulated by 2D simu The computation domain is 300 × 300 with a cell size of 15 µm. The simulation s with a uniform undercooling and the cooling rate was −2.3 K/s. The growth kinetics alloy is the same as that of aforementioned parameters. The cell size is 15 µm, wh suitable for the description of dendrite growth. The volumetric nucleation site den is 5.5 × 10 10 m −3 and the surface nucleation site density ns is 2.5 × 10 8 m −2 . The correspo parameters used in 2D simulation can be obtained by the stereological relationships Reference [33]. The standard deviations of volumetric nucleation undercooling and surface nucleation undercooling  Figure 6. Grains with crystallographic orientation aligned with the normal to mold s were selected corresponding to the grains with orientation angle close to 0° and 90 columnar grains formed at the mold surface grow up to the center of the specim shown in Figure 6a. As the mean volumetric nucleation undercooling decreases, equ grains nucleated before the columnar grains grow up to the center domain as sho Figure 6b,c. The CET occurs as the mean volumetric nucleation undercooling be small. is the theoretical tip length; (b) Error of tip length with different orientation angles and parameter λ.

Grain Growth with a Uniform Undercooling
Then, the grain growth of Al-7 wt% Si specimen was simulated by 2D simulation. The computation domain is 300 × 300 with a cell size of 15 µm. The simulation started with a uniform undercooling and the cooling rate was −2.3 K/s. The growth kinetics of the alloy is the same as that of aforementioned parameters. The cell size is 15 µm, which is suitable for the description of dendrite growth. The volumetric nucleation site density nv is 5.5 × 10 10 m −3 and the surface nucleation site density ns is 2.5 × 10 8 m −2 . The corresponding parameters used in 2D simulation can be obtained by the stereological relationships in the Reference [33]. The standard deviations of volumetric nucleation undercooling  Figure 6. Grains with crystallographic orientation aligned with the normal to mold surface were selected corresponding to the grains with orientation angle close to 0° and 90°. The columnar grains formed at the mold surface grow up to the center of the specimen as shown in Figure 6a. As the mean volumetric nucleation undercooling decreases, equiaxed grains nucleated before the columnar grains grow up to the center domain as shown in Figure 6b,c. The CET occurs as the mean volumetric nucleation undercooling becomes small.  To verify the grid independence of the CA model, a domain with a size of 24 × 24 mm 2 was used for simulation. The mean volumetric nucleation undercooling was 4 K to ensure large ratio of equiaxed grains' nucleation. The cell size of 15 µm, 20 µm and 25 µm were selected. Other parameters are the same as the aforementioned condition. The grain area size distribution of each case is shown in Figure 7a. The cumulative grain area size distribution of each case is shown in Figure 7b for comparison. In Figure 7a, results of the three cases show that the grain area size mainly ranges from 0.2 × 10 5 µm 2 to 1.0 × 10 5 µm 2 . The cumulative grain size distributions of the cases were consistent with each other and Figure 7b clearly shows that the grain size distributions of cases with cell size of 15 µm and 20 µm are almost the same. distribution of each case is shown in Figure 7b for comparison. In Figure 7a, results of the three cases show that the grain area size mainly ranges from 0.2 × 10 5 µm 2 to 1.0 × 10 5 µm 2 . The cumulative grain size distributions of the cases were consistent with each other and Figure 7b clearly shows that the grain size distributions of cases with cell size of 15 µm and 20 µm are almost the same.

Grain Growth during Directional Solidification
The model was applied to simulate grain growth of a plate-like casting of nickelbased superalloy during DS process. The directional solidification experiment of the platelike casting was carried out in the ALD furnace with a withdraw rate of 6 mm/min. A plate-like casting with dimension of 25 × 7 × 160 mm was used. The chemical composition of the superalloy is Ni-7.82Cr-5.34Co-2.25Mo-4.88W-6.02Al-1.94Ti-3.49Ta (wt%). A multicomponent pseudo-binary alloy method [37] was used to obtain the physical parameters of this superalloy. Then, the coefficients of growth kinetics were fitted by the results according to the Lipton-Glicksman-Kurz (LGK) growth model [34,38], as shown in Figure 8.

Grain Growth during Directional Solidification
The model was applied to simulate grain growth of a plate-like casting of nickel-based superalloy during DS process. The directional solidification experiment of the plate-like casting was carried out in the ALD furnace with a withdraw rate of 6 mm/min. A plate-like casting with dimension of 25 × 7 × 160 mm was used. The chemical composition of the superalloy is Ni-7.82Cr-5.34Co-2.25Mo-4.88W-6.02Al-1.94Ti-3.49Ta (wt%). A multicomponent pseudo-binary alloy method [37] was used to obtain the physical parameters of this superalloy. Then, the coefficients of growth kinetics were fitted by the results according to the Lipton-Glicksman-Kurz (LGK) growth model [34,38], as shown in Figure 8. The values of a 2 and a 3 are 9.478 × 10 −7 m s −1 K −2 and 2.323 × 10 −6 m s −1 K −3 . The grid size for the temperature field calculation was 0.5 mm. In the CA model, the cell size was 250 µm. To satisfy the numerical stability, the time step Δt was determined as The grid size for the temperature field calculation was 0.5 mm. In the CA model, the cell size was 250 µm. To satisfy the numerical stability, the time step ∆t was determined as where ρ is alloy density, C p is the specific heat, ∆x T is the grid size used for the temperature field, k is the thermal conductivity, D l is the liquid diffusion coefficient, ∆x CA is the grid size used for CA model and V m is the maximum speed of grain growth. The parameter λ is set as 0.01 to ensure acceptable accuracy. The grain structure of the specimen and grain structure at each section with different heights are shown in Figure 9a,b. Grain density decreases as the height increases due to the competitive growth as indicated in Figure 9b. The grain density obtained by simulation was compared with the experimental data and the simulation results were in good agreement with the experimental data as shown in Figure 9c. The grid size for the temperature field calculation was 0.5 mm. In the CA model, the cell size was 250 µm. To satisfy the numerical stability, the time step Δt was determined as ( ) ( ) 2 2 min , , 6 6 where ρ is alloy density, p C is the specific heat, T x Δ is the grid size used for the temperature field, k is the thermal conductivity, l D is the liquid diffusion coefficient, is the grid size used for CA model and m V is the maximum speed of grain growth.
The parameter λ is set as 0.01 to ensure acceptable accuracy.
The grain structure of the specimen and grain structure at each section with different heights are shown in Figure 9a,b. Grain density decreases as the height increases due to the competitive growth as indicated in Figure 9b. The grain density obtained by simulation was compared with the experimental data and the simulation results were in good agreement with the experimental data as shown in Figure 9c.

Grain Growth in Directional Solidified Al-7 wt% Si Ingot
The developed CA model was applied to the solidification process of an Al-7 wt% Si cylindrical ingot which has a detailed description in [36] and [39]. Both 2D and 3D simulation were conducted to demonstrate the capability of the model. The size of the ingot was φ70 × 170 mm. The bottom of the ingot was cooled by a copper chill. The other face was adiabatic boundary condition. Temperature curves of the points at the center line with the height of 20, 40, 600, 80, 100, 120 and 140 mm were recorded by the corresponding thermal couples. Temperature at the bottom of the ingot was deduced by the extrapolation of the temperature curves of the 20, 40, 60 and 80 mm thermal couples, which was imposed as the boundary condition of the bottom surface as indicated in literature [40]. The detailed parameters used in simulation can be found in [36]. The cell size was 100 µm and the grid size for the temperature field was 0.5 mm for 2D simulation. In the 3D simulation, a cell size of 250 µm and a grid size of 1.0 mm for the temperature field were used due to memory restriction of the GPU. A good agreement was observed between the grain texture from experiment [39] and 2D simulation as shown in Figure 10. The simulated temperature curves were compared with the experimental data; the temperature curves obtained by simulation were consistent with the experimental data except a little deviation during the mushy zone as shown in Figure 11. As the release of latent heat for eutectic reaction is not considered in the current model, the difference of the temperature curve is acceptable. The columnar grains grow from the bottom to the height of 110 mm and then the equiaxed grains formed on the top zone of the ingot. The height where the CET transition happens is consistent with the experimental grain texture. Similar 3D simulation results at different times are shown in Figure 12. ture from experiment [39] and 2D simulation as shown in Figure 10. The simulated temperature curves were compared with the experimental data; the temperature curves obtained by simulation were consistent with the experimental data except a little deviation during the mushy zone as shown in Figure 11. As the release of latent heat for eutectic reaction is not considered in the current model, the difference of the temperature curve is acceptable. The columnar grains grow from the bottom to the height of 110 mm and then the equiaxed grains formed on the top zone of the ingot. The height where the CET transition happens is consistent with the experimental grain texture. Similar 3D simulation results at different times are shown in Figure 12.

Parallel Performance Evaluation
The parallel performance was evaluated in detail by 2D CA model, where the grain growth of Al-7 wt% Si specimen with imposed temperature field of constant cooling rate was simulated. The million lattice unit per second (MLUPS) was adopted to evaluate both CPU-and GPU-based computing performance. The parallelization on CPU-based calculation was implemented with the Open Multi-Processing (OpenMP) technique. The parallelization method on GPU was CUDA. The tested CPU was Intel Core i7-7700 (3.6 GHz, Intel Corporation, Santa Clara, CA, USA) with eight cores, and a single NVIDIA RTX 2070 GPU (NVIDIA Corporation, Santa Clara, CA, USA) and 8 gigabyte memory was used for testing. The speedup ratio was computed based on the reference of a serial CPU code on the same CPU. The cell number for performance evaluation ranges from 1.0 × 10 6 to 4.9 × 10 7 and the largest cell number tested in GPU-based parallelization is limited to 3.6 × 10 7 due to the memory restriction of the used GPU. The parallel performance and speedup ratio are shown in Figure 13. The CPU-based parallelization shows approximately 4 times speedup ratio compared with the use of a single CPU core. The maximum parallel performance of the GPU-based parallelization reaches 213.45 MLUPS and the corresponding speedup ratio is 37, approximately. The maximum speedup ratio obtained on a single

Parallel Performance Evaluation
The parallel performance was evaluated in detail by 2D CA model, where the grain growth of Al-7 wt% Si specimen with imposed temperature field of constant cooling rate was simulated. The million lattice unit per second (MLUPS) was adopted to evaluate both CPU-and GPU-based computing performance. The parallelization on CPU-based calculation was implemented with the Open Multi-Processing (OpenMP) technique. The parallelization method on GPU was CUDA. The tested CPU was Intel Core i7-7700 (3.6 GHz, Intel Corporation, Santa Clara, CA, USA) with eight cores, and a single NVIDIA RTX 2070 GPU (NVIDIA Corporation, Santa Clara, CA, USA) and 8 gigabyte memory was used for testing. The speedup ratio was computed based on the reference of a serial CPU code on the same CPU. The cell number for performance evaluation ranges from 1.0 × 10 6 to 4.9 × 10 7 and the largest cell number tested in GPU-based parallelization is limited to 3.6 × 10 7 due to the memory restriction of the used GPU. The parallel performance and speedup ratio are shown in Figure 13. The CPU-based parallelization shows approximately 4 times speedup ratio compared with the use of a single CPU core. The maximum parallel performance of the GPU-based parallelization reaches 213.45 MLUPS and the corresponding speedup ratio is 37, approximately. The maximum speedup ratio obtained on a single GPU card is higher compared with the performance of Lian's work [28], where a maximum speedup ratio of 29.3 was obtained using 64 CPU processors with an efficiency of 45.85%. In MPI-based parallelization models, the efficiency of each processor usually decreases as the process number increases, due to the difficulty in load balance and the cost of communication between processors. These restrictions were avoided by the computation based on GPU, where the global memory in GPU card was continuous. By utilizing the GPU card with larger memory, large-scale simulation can be performed based on the proposed model. The GPU-based CA model shows high speedup ratio and stable performance over a wide range of cell number compared with the CPU-based parallelization. In addition, the performance does not decrease significantly as the cell number increases due to the computational locality of the CA model. tion based on GPU, where the global memory in GPU card was continuous. By utilizing the GPU card with larger memory, large-scale simulation can be performed based on the proposed model. The GPU-based CA model shows high speedup ratio and stable performance over a wide range of cell number compared with the CPU-based parallelization. In addition, the performance does not decrease significantly as the cell number increases due to the computational locality of the CA model.

Conclusions
In this work, a GPU-based parallel CA model for grain growth was developed to accelerate the grain growth simulation. The accuracy of the developed model was verified by detailed comparison of the grain texture, grain size distribution and the CET phenomenon during grain growth for both 2D and 3D simulation. The testing demonstrated that a maximum performance of 213.45 MLUPS and a speedup ratio of 37 can be obtained by utilization of a single GPU. The proposed GPU-based parallelization of the CA model can be extended to CA model for dendritic growth.

Conclusions
In this work, a GPU-based parallel CA model for grain growth was developed to accelerate the grain growth simulation. The accuracy of the developed model was verified by detailed comparison of the grain texture, grain size distribution and the CET phenomenon during grain growth for both 2D and 3D simulation. The testing demonstrated that a maximum performance of 213.45 MLUPS and a speedup ratio of 37 can be obtained by utilization of a single GPU. The proposed GPU-based parallelization of the CA model can be extended to CA model for dendritic growth.