Accurate Computation of Airfoil Flow Based on the Lattice Boltzmann Method

The lattice Boltzmann method (LBM) is an important numerical algorithm for computational fluid dynamics. This study designs a two-layer parallel model for the SW26010 many-core processor of the Sunway TaihuLight supercomputer, implements the LBM algorithm on it, and performs optimization. Numerical experiments with different problem sizes show that the proposed model has better parallel performance and scalability than previous implementations. We then performed numerical simulations of the flow around the two-dimensional (2D) NACA0012 airfoil and obtained results for a series of flows at different angles of attack. The computed pressure coefficients and lift coefficients were in good agreement with those in the literature.


Introduction
With the development of supercomputer theory and computer hardware technology, numerical computation has become an important tool in scientific research. Computational fluid dynamics (CFD) is a significant research field. Among the many numerical methods in the CFD field, the lattice Boltzmann method (LBM) [1,2] is a numerical simulation method with great potential. It is derived from the lattice gas automaton and is based on the microscopic thermal motion of fluid molecules; statistical averaging at the mesoscopic scale describes the macroscopic motion of the fluid. The method treats the fluid as a collection of small particles that have mass but no volume. These particles move on a regular grid, collide with surrounding particles, and exchange statistical information. Averaging over the movement of a large number of particles yields the macroscopic properties of the fluid, such as density, velocity, and pressure.
The collision-migration scheme of the LBM's elementary particles, together with its use of Cartesian mesh division, enables simple boundary-condition processing and natural parallelism, which makes the method well suited to large-scale numerical computation on supercomputers [3]. Ulrich Rüde et al. presented a novel multi-physics coupled algorithm based on the LBM, which employs an Eulerian description of the fluid and ions combined with a Lagrangian representation of moving charged particles, and achieved excellent performance and scaling on up to 65,536 cores of a current supercomputer [4]. For large-scale grids, Manfred Krafczyk et al. discussed the modeling and computational aspects of their approach and presented computational results, including experimental validations [5]. G. Wellein et al. introduced the ...

Lattice Boltzmann Method
With the single-relaxation-time (BGK) collision model, the distribution function f_\alpha evolves according to

f_\alpha(\mathbf{x} + \mathbf{e}_\alpha \delta t, t + \delta t) - f_\alpha(\mathbf{x}, t) = -\frac{1}{\tau}\left[f_\alpha(\mathbf{x}, t) - f_\alpha^{eq}(\mathbf{x}, t)\right], \quad (1)

where \tau is the relaxation time and f_\alpha^{eq} is the equilibrium distribution function. According to the DnQm series model proposed by Qian et al. [19] (where n and m denote the spatial dimension and the number of discrete velocities, respectively), the equilibrium distribution function is unified and expressed as

f_\alpha^{eq} = \omega_\alpha \rho \left[1 + \frac{\mathbf{e}_\alpha \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{e}_\alpha \cdot \mathbf{u})^2}{2 c_s^4} - \frac{\mathbf{u}^2}{2 c_s^2}\right], \quad (2)

where c_s = \sqrt{RT} is the lattice sound velocity, whose value is 1/\sqrt{3}; \mathbf{u} is the velocity; \omega_\alpha is the weight coefficient; and \mathbf{e}_\alpha is the direction vector. In this paper, the D2Q9 model is used (Figure 1), and the parameters are selected as follows:

\omega_0 = 4/9, \quad \omega_{1\text{-}4} = 1/9, \quad \omega_{5\text{-}8} = 1/36; \qquad \mathbf{e}_0 = (0, 0), \quad \mathbf{e}_{1\text{-}4} = c(\pm 1, 0), c(0, \pm 1), \quad \mathbf{e}_{5\text{-}8} = c(\pm 1, \pm 1),

where c = \delta x / \delta t, and \delta x and \delta t are the grid step size and time step, respectively. Velocity and density can be obtained from the following formulas:

\rho = \sum_\alpha f_\alpha, \qquad \rho \mathbf{u} = \sum_\alpha \mathbf{e}_\alpha f_\alpha.
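To make the update concrete, the following is a minimal sketch of one D2Q9 BGK collision-and-streaming step on a small periodic domain. The array layout, sizes, and names are illustrative and are not taken from the paper's implementation.

```c
#define NX 64
#define NY 64
#define Q  9

/* Standard D2Q9 directions and weights */
static const int ex[Q] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
static const int ey[Q] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
static const double w[Q] = { 4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36 };

#define IDX(x,y,a) (((y)*NX + (x))*Q + (a))

/* One BGK collision + streaming step on a periodic domain. */
void lbm_step(const double *f, double *f_new, double tau) {
    for (int y = 0; y < NY; y++) {
        for (int x = 0; x < NX; x++) {
            /* macroscopic density and velocity at this site (Eq. above) */
            double rho = 0.0, ux = 0.0, uy = 0.0;
            for (int a = 0; a < Q; a++) {
                double fa = f[IDX(x,y,a)];
                rho += fa;
                ux  += ex[a] * fa;
                uy  += ey[a] * fa;
            }
            ux /= rho;  uy /= rho;

            for (int a = 0; a < Q; a++) {
                /* equilibrium distribution, c = 1 and c_s^2 = 1/3 */
                double eu  = ex[a]*ux + ey[a]*uy;
                double u2  = ux*ux + uy*uy;
                double feq = w[a]*rho*(1.0 + 3.0*eu + 4.5*eu*eu - 1.5*u2);
                /* collide, then stream to the neighbor in direction a */
                double fpost = f[IDX(x,y,a)] - (f[IDX(x,y,a)] - feq)/tau;
                int xn = (x + ex[a] + NX) % NX;
                int yn = (y + ey[a] + NY) % NY;
                f_new[IDX(xn,yn,a)] = fpost;
            }
        }
    }
}
```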

Generation of Airfoil, Judgment of Grid Point Type
As shown in Figure 2, we select the NACA0012 airfoil as the computation object. The airfoil consists of a series of line segments connecting scattered points. After partitioning the Cartesian grid, a known point A is fixed inside the airfoil, and a type judgment is then made for each grid point in turn. For a grid point B to be judged, it is only necessary to determine whether the line segment AB intersects any of the line segments constituting the airfoil. If it does intersect, point B is considered an external point, i.e., a fluid point; if it does not, point B is an internal point, i.e., a solid point. After all the solid points have been found, the ring of grid points immediately surrounding them is taken as the boundary points.
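The classification test can be sketched as follows. The paper checks whether segment AB intersects the airfoil contour; the sketch below uses the slightly more robust crossing-parity variant of the same idea. The names (Point, is_fluid_point) are illustrative, and degenerate cases (a segment grazing a vertex) are ignored for brevity.

```c
#include <stdbool.h>

typedef struct { double x, y; } Point;

/* Sign of the cross product (p2-p1) x (p3-p1): >0 left turn, <0 right turn. */
static double cross(Point p1, Point p2, Point p3) {
    return (p2.x - p1.x)*(p3.y - p1.y) - (p2.y - p1.y)*(p3.x - p1.x);
}

/* True if segments ab and cd properly intersect (endpoint touches ignored). */
static bool segments_intersect(Point a, Point b, Point c, Point d) {
    double d1 = cross(c, d, a), d2 = cross(c, d, b);
    double d3 = cross(a, b, c), d4 = cross(a, b, d);
    return ((d1 > 0) != (d2 > 0)) && ((d3 > 0) != (d4 > 0));
}

/* Classify grid point b against the closed airfoil polyline foil[0..n-1]:
   count crossings of segment A-B with every airfoil edge; since A is inside,
   an odd crossing count means b lies outside, i.e., b is a fluid point. */
bool is_fluid_point(Point a, Point b, const Point *foil, int n) {
    int crossings = 0;
    for (int i = 0; i < n; i++)
        if (segments_intersect(a, b, foil[i], foil[(i + 1) % n]))
            crossings++;
    return crossings % 2 == 1;
}
```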

Data Partition and Communication
When using MPI (Message Passing Interface) to accelerate the program in parallel, there are two ways to partition the mesh, namely, one-dimensional (1D) partitioning and 2D partitioning, as shown in Figure 3a,b. According to the literature [20], 2D partitioning is more advantageous than 1D partitioning; therefore, this paper adopts the 2D partitioning method. At each peripheral boundary of each MPI process, a one-layer data buffer (halo) is added to store the boundary data transmitted by the adjacent process. The computation area of original size NX × NY thus grows to (NX + 2) × (NY + 2) with the buffer (NX and NY are the numbers of grid points in the horizontal and vertical directions, respectively).
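A minimal sketch of such a 2D decomposition using MPI's Cartesian topology routines is shown below; the paper does not show its decomposition code, so this is only one standard way to realize it.

```c
#include <mpi.h>

/* Set up a 2D process grid and find the four halo-exchange neighbors. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[2] = {0, 0};              /* let MPI choose a balanced 2D grid */
    MPI_Dims_create(nprocs, 2, dims);

    int periods[2] = {0, 0};           /* non-periodic domain */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    int left, right, down, up;
    MPI_Cart_shift(cart, 0, 1, &left, &right);  /* x-direction neighbors */
    MPI_Cart_shift(cart, 1, 1, &down, &up);     /* y-direction neighbors */

    /* Each process then owns an (NX/dims[0]) x (NY/dims[1]) sub-block,
       padded by one halo layer on each side, and exchanges halo rows and
       columns with left/right/down/up after every streaming step. */

    MPI_Finalize();
    return 0;
}
```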

Many-Core Structure and Communication of the Sunway TaihuLight
In the SW26010 heterogeneous many-core processor, each core group (CG) contains one management processing element (MPE) and one computing processing element (CPE) array, which consists of 64 CPEs, an array controller, and a secondary instruction cache. Therefore, MPI parallelism exists only between the MPEs, and the computational tasks on each MPE need to be further divided among its CPEs.

Since each CPE has only 64 KB of local memory, the MPE cannot transfer all of the computation data to the CPEs at one time. Instead, the MPE transfers one part of the data to the CPEs; after the CPEs finish computing on it, the result is written back into the MPE's memory, and the next part of the computation data is processed (Figure 4a). However, this results in frequent data communication between the MPE and the CPEs, which increases the time cost. To solve this problem, this paper adopts a double-buffer mechanism (Figure 4b, Table 1). About half of each CPE's local memory is reserved as a second buffer, and the next part of the data is transferred while the current part is being computed on the CPEs. After the computation is complete, the CPEs check whether the next transfer has finished; if it has, the next computation starts immediately and the transfer of the part after that begins; if it has not, the CPEs wait. The double-buffer model thus hides most of the communication time behind computation and improves computational efficiency.
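The mechanism can be sketched as follows. dma_get_async, dma_put_async, dma_wait, and dma_flush are illustrative stand-ins for the Sunway athread DMA primitives, not the real API names; compute_chunk stands for the LBM kernel on one chunk.

```c
#include <stddef.h>

/* Hypothetical DMA primitives (illustrative names only). dma_wait() on a
   buffer with no pending transfer returns immediately. */
extern void dma_get_async(void *local, const void *main_mem, size_t bytes);
extern void dma_put_async(void *main_mem, const void *local, size_t bytes);
extern void dma_wait(const void *local);
extern void dma_flush(void);
extern void compute_chunk(double *chunk, int nsites);

#define CHUNK 256            /* lattice sites per transfer: 2 buffers of   */
double buf[2][CHUNK * 9];    /* 256*9 doubles = ~37 KB, under the 64 KB cap */

void cpe_worker(const double *main_mem, double *out_mem, int nchunks) {
    int cur = 0;
    dma_get_async(buf[cur], main_mem, sizeof buf[0]);    /* prefetch chunk 0 */
    for (int i = 0; i < nchunks; i++) {
        int nxt = 1 - cur;
        if (i + 1 < nchunks) {
            dma_wait(buf[nxt]);                          /* put from it done? */
            dma_get_async(buf[nxt], main_mem + (size_t)(i + 1) * CHUNK * 9,
                          sizeof buf[0]);                /* start next load   */
        }
        dma_wait(buf[cur]);                              /* current arrived?  */
        compute_chunk(buf[cur], CHUNK);                  /* overlaps the load */
        dma_put_async(out_mem + (size_t)i * CHUNK * 9, buf[cur], sizeof buf[0]);
        cur = nxt;
    }
    dma_flush();                                         /* drain all puts    */
}
```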
The computation of the grid-point migration (streaming) step requires information from the eight neighboring grid points around the current grid point. However, these neighbors are stored discretely in the MPE's memory and cannot be transferred to a CPE's memory by direct memory access (DMA) in one operation; this leads to multiple accesses to the MPE's memory and a higher time cost. To solve this problem, the MPE repacks each grid point's neighbor data from discrete storage into contiguous storage before the data are transmitted between the MPE and the CPEs. The MPE performs this repacking while the CPEs are computing on the previous batch of data, so no additional time is required (as shown in Figure 5).
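A sketch of this repacking is given below, assuming a row-major grid padded to width NXp; the layout and names are illustrative, not the paper's data structures.

```c
#define Q 9

/* Pack the eight neighbors' distributions for a block of grid points into
   one contiguous buffer, so a single DMA transfer can move them to a CPE. */
void pack_neighbors(const double *f, int NXp /* padded row width */,
                    const int *sites, int nsites, double *packed) {
    /* offsets of the 8 neighbors in a row-major (NXp-wide) grid */
    const int off[8] = { 1, -1, NXp, -NXp, NXp+1, NXp-1, -NXp+1, -NXp-1 };
    for (int i = 0; i < nsites; i++) {
        int s = sites[i];
        for (int n = 0; n < 8; n++)
            for (int a = 0; a < Q; a++)
                packed[(i*8 + n)*Q + a] = f[(s + off[n])*Q + a];
    }
}
```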

In order to further reduce the communication time, we divide the data in each MPE into the two parts shown in Figure 6, namely internal data and boundary data, where the boundary data are the part that must be exchanged with other processes. While the CPEs compute on the internal data, the MPE communicates the boundary data with the related processes.
In addition to the double-buffering mode, we can further reduce the communication time by using register communication between CPEs, as shown in Figure 7 (only the first row of CPEs is shown; rows 2 to 8 are similar).
Step 1: The data of CPEs 1, 3, 5, and 7 are transmitted to CPEs 0, 2, 4, and 6 through register communication.
Step 2: The data of CPEs 2 and 6 are transmitted to CPEs 0 and 4 through register communication.
Step 3: The data of CPE 4 are transmitted to CPE 0 through register communication.
Steps 4-6: Similar to Steps 1 to 3, but operating on the columns; finally, all the data that need to be passed on are gathered from the 64 CPEs into CPE 0.
Step 7: CPE 0 communicates with the main memory by DMA.
With this approach, CPE 0 is mainly responsible for the communication between the CPEs and the MPE, so CPE 0 is assigned fewer computation tasks than the other CPEs in order to reduce the communication time. Since register communication is much faster than access to the main memory, this method can greatly reduce the communication time.
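The row-then-column pattern of Steps 1-7 can be sketched as a binary-tree gather. The register-communication functions below are illustrative stand-ins for the SW26010 intrinsics, not the real names, and the sketch accumulates values (a sum reduction); the actual kernel may forward data blocks instead of summing.

```c
/* Hypothetical register-communication and DMA primitives (names illustrative).
   Sizes are in elements. */
extern void reg_send_row(int dst_col, const double *v, int n);
extern void reg_recv_row(int src_col, double *v, int n);
extern void reg_send_col(int dst_row, const double *v, int n);
extern void reg_recv_col(int src_row, double *v, int n);
extern void dma_put(void *main_mem, const void *local, int n);

/* (row, col) is this CPE's position in the 8x8 mesh. */
void gather_to_cpe0(int row, int col, double *data, int n, double *main_mem) {
    /* Steps 1-3: binary-tree gather along each row onto column 0 */
    for (int stride = 1; stride < 8; stride *= 2) {
        if (col % (2*stride) == stride) {          /* sender this round   */
            reg_send_row(col - stride, data, n);
        } else if (col % (2*stride) == 0) {        /* receiver: combine   */
            double tmp[n];
            reg_recv_row(col + stride, tmp, n);
            for (int i = 0; i < n; i++) data[i] += tmp[i];
        }
    }
    /* Steps 4-6: the same pattern along column 0 onto row 0 */
    if (col == 0) {
        for (int stride = 1; stride < 8; stride *= 2) {
            if (row % (2*stride) == stride) {
                reg_send_col(row - stride, data, n);
            } else if (row % (2*stride) == 0) {
                double tmp[n];
                reg_recv_col(row + stride, tmp, n);
                for (int i = 0; i < n; i++) data[i] += tmp[i];
            }
        }
    }
    /* Step 7: CPE 0 writes the gathered block to main memory by DMA */
    if (row == 0 && col == 0) dma_put(main_mem, data, n);
}
```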

Experimental Environment
The Sunway TaihuLight computer system uses the SW26010 heterogeneous many-core processor, which uses the independent 64-bit SW (Sunway) instruction set and provides 260 cores per chip, a standard chip operating frequency of 1.5 GHz, and a peak computing speed of 3.168 TFLOPS per chip. The Sunway TaihuLight high-speed computing system has a peak computing speed of 125.436 PFLOPS, a total memory capacity of 1024 TB, a total memory-access bandwidth of 4473.16 TB/s, a high-speed interconnection network bisection bandwidth of 70 TB/s, and an input/output (I/O) aggregation bandwidth of 341 GB/s. The measured LINPACK sustained speed is 93.015 PFLOPS, the LINPACK efficiency is 74.153%, the system power consumption is 15.371 MW, and the performance-to-power ratio is 6051.131 MFLOPS/W; the auxiliary computing system's peak speed is 1.085 PFLOPS, its total memory capacity is 154.5 TB, and the total disk capacity is 20 PB.
The Sunway TaihuLight computer system supports the mainstream parallel programming models, including MPI 3.0, OpenMP 3.1, Pthreads, and OpenACC 2.0; that is, it supports message-passing, shared-memory, and accelerator parallel programming models. It also provides an accelerated thread library programming interface customized for the structural characteristics of the SW26010 heterogeneous many-core processor.

Parallel Efficiency
In order to test the acceleration effect under different computational loads, this paper selects grid sizes of 50 million, 100 million, and 200 million points; the maximum number of MPEs is 2048, and the maximum total number of cores is 133,120. The acceleration effects of the MPE-only mode and the MPE-plus-CPEs mode are measured separately. Figure 8 shows the speedup and parallel efficiency of the MPE-only mode at the three problem sizes, and Figure 9 shows the corresponding percentage of time spent on communication. As can be seen in Figure 8, both the speedup and the efficiency decrease as the number of computational cores increases. This is because, for the same problem size, the number of computational grid points on each core decreases as the number of cores increases. For one computational core, if we assume the size of its computation area to be NX × NY, the number of grid points for which the current core must send data to adjacent cores is NX × NY − (NX − 2) × (NY − 2), and the ratio R of these grid points to the total number of grid points on the core is

R = \frac{NX \cdot NY - (NX - 2)(NY - 2)}{NX \cdot NY}. \quad (3)

It can be seen from Equation (3) that, as the number of computational cores increases, both NX and NY decrease and the ratio R increases; that is, the percentage of communication time in the total time increases (Figure 9). Therefore, the parallel efficiency gradually declines.
Similarly, for the same number of computational cores, as the problem size increases, the number of grid points allocated to each core increases and the percentage of total time spent on communication decreases (Figure 9); therefore, the speedup and parallel efficiency increase (Figure 8). When the problem size is 200 million, the parallel efficiency is over 94%.
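As an illustration of Equation (3) (the numbers below are ours, not the paper's): splitting a 200-million-point grid across 2048 cores gives sub-blocks of roughly 312 × 312 points if the split is square, for which

```latex
% Illustrative evaluation of Equation (3) for an assumed 312 x 312 sub-block:
R = \frac{312 \times 312 - 310 \times 310}{312 \times 312}
  = \frac{97344 - 96100}{97344}
  = \frac{1244}{97344} \approx 1.3\%
```

so the halo is small at this scale, and R grows as the sub-blocks shrink.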
Since the 200-million problem size gives the MPE-only mode its highest parallel efficiency, this paper uses the 200-million problem size to test the acceleration effect of the CPEs, testing up to 133,120 cores.
Based on a single CG (1 MPE and 64 CPEs, 65 cores in total), the number of CGs is increased gradually. The speedup and parallel efficiency are shown in Table 2 and Figure 10a, and the mega lattice-site updates per second (MLUPS) are shown in Figure 10b. As the number of CGs increases, the parallel efficiency and MLUPS decrease, and the speedup and parallel efficiency are also lower than in the MPE-only mode. This is mainly attributed to the startup and shutdown of the CG thread group and to the communication between the MPE and CPEs. However, when the number of CGs is 2048 (133,120 cores in total), the efficiency still reaches more than 72%, and our simulation performs at 285,600 MLUPS on 2048 CGs, corresponding to 54.8% of the effective memory-bandwidth performance.
Table 2. Speedup and efficiency based on a single core group (CG), up to 2048 CGs (a total of 133,120 cores).
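For reference, MLUPS (mega lattice-site updates per second) is the standard throughput metric for LBM codes; under the usual definition it is computed as follows (the symbol names are the conventional ones, not taken from the paper):

```latex
% N_sites: number of grid points; N_steps: number of time steps;
% T: wall-clock time in seconds.
\mathrm{MLUPS} = \frac{N_{\mathrm{sites}} \times N_{\mathrm{steps}}}{T \times 10^{6}}
```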

Airfoil Computation Results
The computation area for calculating the flow around the 2D NACA0012 airfoil is shown in Figure 11. C is the chord length of the airfoil, and C = 400δx, where δx is the grid spacing. The entire computation comprises 200 million grid points. A velocity boundary condition and a non-equilibrium extrapolation boundary condition are used on the inlet and outlet boundaries, respectively, and the standard bounce-back boundary condition is used for the upper and lower boundaries and the airfoil surface [14]. The inflow Mach number is M = 0.1, the Reynolds number is Re = 1000, and a total of 1 million time steps are computed for angles of attack of 0°, 4°, 6°, 8°, 10°, 12°, 15°, 17°, 20°, 22°, 25°, and 28°.
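The standard bounce-back rule cited above reflects each distribution at a solid node back along its arrival direction. A minimal sketch of the technique (consistent with the direction numbering used in the earlier D2Q9 sketch, not the paper's own code) is:

```c
#define Q 9

/* opp[] maps each D2Q9 direction to its opposite, using the numbering
   e1=(1,0), e2=(0,1), e3=(-1,0), e4=(0,-1), e5..e8 = diagonals. */
static const int opp[Q] = { 0, 3, 4, 1, 2, 7, 8, 5, 6 };

/* Full bounce-back: at every solid node, swap each distribution with the
   one in the opposite direction, so incoming particles are reflected. */
void bounce_back(double *f, const int *is_solid, int nsites) {
    for (int s = 0; s < nsites; s++) {
        if (!is_solid[s]) continue;
        double tmp[Q];
        for (int a = 0; a < Q; a++) tmp[a] = f[s*Q + a];
        for (int a = 0; a < Q; a++) f[s*Q + a] = tmp[opp[a]];
    }
}
```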

Figure 11. Computation area. The left side is the inflow and the right side is the outflow.
Figures 12 and 13 show the pressure and velocity contours at several representative angles of attack: 0°, 4°, and 8°, and 12°, 15°, and 20°, respectively. It can be seen from Figure 12 that the flow tends to be stable at angles of attack of 0° and 4°, with no flow-separation phenomenon, which is consistent with the literature [21-23]. When the angle of attack is 8° or higher, unsteady vortex shedding occurs behind the airfoil tail, and, as the angle of attack increases, the vortex shedding becomes more pronounced and the flow instability grows. It can also be seen from Figures 12 and 13 that, as the angle of attack increases, the pressure difference between the lower and upper surfaces of the airfoil gradually increases; it can therefore be qualitatively inferred that the lift coefficient will also increase, which can be understood as vortex lift generated by the shedding vortices.
In order to further quantify the main characteristics of the flow and evaluate the aerodynamic performance of the airfoil, the lift coefficient C_L and the pressure coefficient C_p of the upper and lower surfaces of the airfoil are calculated, respectively, as

C_L = \frac{F_L}{\frac{1}{2} \rho_0 u_0^2 C}, \quad (4) \qquad C_p = \frac{p - p_0}{\frac{1}{2} \rho_0 u_0^2}, \quad (5)

where the subscript 0 denotes values under far-field conditions, C is the chord length of the airfoil, and F_L is the lift, i.e., the vertical component of the aerodynamic force acting on the airfoil. Figure 14a shows the average pressure-coefficient distribution of the upper and lower surfaces at an angle of attack of 8°. Owing to the angle of attack, the average pressure coefficients of the upper and lower surfaces differ at the front of the airfoil and gradually change with position along the chord in a consistent trend. The trend and the extreme value of the computed result agree well with those in the literature [22], with a mean error of less than 1.5%. Figure 14b shows the lift coefficient as a function of the angle of attack. This result also agrees well with the literature [24], with a mean error of less than 5%. Once the angle of attack exceeds 25°, the vortex may break down and the lift coefficient decreases; this agrees well with the critical stall angle reported in [25].
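A minimal sketch of evaluating C_p from the LBM fields, assuming the standard lattice equation of state p = c_s² ρ with c_s² = 1/3 (the function and array names are illustrative):

```c
/* Pressure coefficient along the airfoil surface from the density field.
   rho0 and u0 are the far-field density and speed; surface[] lists the
   indices of the boundary sites at which C_p is evaluated. */
void pressure_coefficient(const double *rho, const int *surface, int nsurf,
                          double rho0, double u0, double *cp) {
    const double cs2 = 1.0 / 3.0;              /* lattice sound speed squared */
    const double q = 0.5 * rho0 * u0 * u0;     /* far-field dynamic pressure  */
    for (int i = 0; i < nsurf; i++)
        cp[i] = (cs2 * rho[surface[i]] - cs2 * rho0) / q;   /* Eq. (5) */
}
```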

Conclusions
This paper discussed the implementation and optimization of the LBM on the Sunway TaihuLight SW26010 heterogeneous many-core processor architecture and the simulation of the flow around the 2D NACA0012 airfoil. In parallel computing, the focus was on the two-layer parallel design: the parallelism between CGs and the parallelism between the MPE and CPEs. Parallelism between CGs is achieved with MPI, and that between the MPE and CPEs is achieved with the accelerating thread library. We adopted the double-buffer mode to reduce the communication time and adjusted the discrete storage structure, along with other optimization methods; these achieved a good parallel efficiency and speedup on the Sunway TaihuLight. In terms of airfoil computation, the pressure and velocity distributions at different angles of attack were calculated, and the average pressure coefficient and lift coefficient were obtained for an angle of attack of 8°. The results were in good agreement with the literature.
Future work will focus on optimizing the procedures and improving efficiency (especially the efficiency of the CPEs), exploring 3D models, performing larger-scale numerical computations of airfoils, and increasing the number of MPEs to one million. We will also explore the application of the LBM to flow noise. A common approach to flow noise is to solve the Ffowcs Williams-Hawkings (FW-H) equation, which requires highly accurate flow-field information. The LBM is a discrete form of the Boltzmann equation, and its results can capture the fine structure of the flow field, providing accurate numerical input for the computation of aerodynamic noise. Long-term research shows that the LBM is indeed suitable for the computation of aerodynamic noise [26,27].