A Code for Simulating Heat Transfer in Turbulent Channel Flow

: One numerical method was designed to solve the time-dependent, three-dimensional, incompressible Navier–Stokes equations in turbulent thermal channel ﬂows. Its originality lies in the use of several well-known methods to discretize the problem and its parallel nature. Vorticy-Laplacian of velocity formulation has been used, so pressure has been removed from the system. Heat is modeled as a passive scalar. Any other quantity modeled as passive scalar can be very easily studied, including several of them at the same time. These methods have been successfully used for extensive direct numerical simulations of passive thermal ﬂow for several boundary conditions.


Introduction
Turbulence is probably the open problem in physics with most applications in daily life. Turbulent flows are intrinsic to almost any flow in engineering, but they are also extremely important in meteorology or the dispersion of contaminants. It is well known that we still even lack an existence and uniqueness theorem about the solution of the governing equations of turbulent flows, the Navier-Stokes equations. Moreover, these equations cannot be solved analytically apart from some trivial examples. It is generally admitted that the simulation can be considered at three levels of detail, Reynolds Average Navier Stokes (RANS), Large Eddy Simulation (LES), and Direct Numerical Simulations (DNS) [1] which also corresponds to different levels of precision. Both RANS and LES models can simulate typical industrial flows with the necessary accuracy and in practical computation times [2][3][4][5], but DNS is until now the only trustworthy method to investigate physical properties of flows which are little understood.
However, DNS are extremely expensive as every scale of turbulence, both temporal and spatial ones, has to be properly resolved. This implies very fine grids and small temporal steps. Moreover, and as a consequence of Kolmogorov's scales, [6,7], the number of points needed to carry out the simulation grows extremely fast, as Re 9/4 , where Re is the Reynolds number. Thus, to use DNS to understand turbulent flows, very simple geometries are employed, allowing very precise and fast numerical tools that will be described in this paper. Canonical wall-bounded flows are pipes, channel flows, and the flat-plate boundary layer. These flows are in essence theoretical abstractions and do not appear as such in reality. However, they constitute the basic blocks of more complete real flows and have been studied for 80 years. For isothermal flows, experiments of these canonical flows have been instrumental in the development of the theory since the birth of boundary layer research. In particular, they remain important for the study of very high Reynolds numbers, as DNS is still not able to reach Reynolds numbers as high as those of experiments. High Reynolds numbers are needed to obtain a clear separation of scales related to the near-wall turbulence cycle. The main control parameter is the friction Reynolds Number Re τ = hu τ /ν, where ν is the viscosity, h is a characteristic length and u τ is the friction velocity at the wall. The largest experiment ever done has been published in 2018 for a friction Reynolds number of 2e4 [8], a value that is absolutely out of reach for DNS. That said, simulations of wall-bounded turbulence are particularly helpful in identifying physical processes occurring in near-wall turbulence as the whole velocity field (and its derivatives) is available. Moreover, experiments showing the 3D thermal field of a turbulent flow are almost impossible to do. For thermal flows, DNS can give insight where experimental flows are impossible.
The first wall-bounded simulation was run by Kim, Moin, and Moser in 1987 [9] for channel flows and in 1988 by Spalart [10] for Turbulent Boundary Layers. Since then, the friction Reynolds number simulated has been continuously growing [11][12][13][14][15][16][17][18][19]. Some of these simulations, Ref. [12,16,18,19] were made with an earlier version of this code. The history is similar for thermal passive flows, but the Reynolds number reached grew up at a slower pace. The first DNS of a thermal flow was carried out by Kim and Moin in 1987 [20]. Kim and Moin obtained first-order turbulence statistics for Re τ = 180 and several Prandtl numbers, Pr = 0.1, 1 and 2. The Prandtl number is the ratio between momentum diffusivity to thermal diffusivity. They also computed a very important number for modeling, the turbulent Prandtl number. In addition, for the Prandtl number of air, Pr = 0.71, correlations between the velocity and the temperature were also calculated. A somewhat artificial boundary condition was imposed in which heat was generated internally and removed from both cold isothermal walls. This condition for the thermal field plays an analogous role to that of the pressure gradient does for the velocity field. Later, Lyons et al. [21] performed a simulation for Re τ = 150 and Pr = 1. The boundary condition used in this later work consisted of both walls kept at different temperatures. Finally, Kasagi et al. [22] performed a DNS for Re τ = 150 and Pr = 0.71 with a more realistic boundary condition, the Mixed Boundary Condition (MBC from now on). For this condition, the average heat flux over both heating walls is constant and the temperature increases linearly in the streamwise direction. The instantaneous heat flux may vary with respect to time and position. In the work done by Piller [23], it was shown that the MBC acts as an ideal isothermal boundary condition in the inner layer and as an ideal isoflux boundary condition (fixed heat flux) in the outer layer.
After these simulations were made, the trend has been to increase the friction Reynolds number for different molecular Prandtl numbers. However, values of Re τ and Pr are limited by the computational cost. Yano and Kasagi [24] stated that the computational cost can be approximated as L 2 x L y Re 4 τ Pr 3/2 . Moreover, even if only one equation has to be added to the system, this equation is also nonlinear. Typically, the cost of accurately computing the nonlinear term accounts for 80 to 90% of the total time of the simulation. Thus, to add the energy equation we need to double the computational time, and the Reynolds number achieved is still low for the majority of Prandtl numbers.
In this paper, we will restrict ourselves to explain the numerical code and its implementation. Since 2018, several works have been published using this algorithm, which was written in Fortran90. In Lluesma et al. [25], the authors validated the code and found the optimal size of the computational box for Reynolds numbers up to Re τ = 2000. The range of Prandtl numbers, below Pr = 0.7, was increased in Alcántara-Ávila [26]. Later, in [27], the large structures of Couette flow were analyzed. In [28,29], the authors were able to increase the study for Pr > 0.7, reaching Pr = 10, and Re τ = 5000 for Pr = 0.7. Finally, this code has been used to study the large structures found in stratification problems, cite [30].
The structure of the paper is as follows. In the next section, the numerical problem is described. The third section is devoted to explain the numerical techniques used for discretization both in time and space. The fourth section describes the parallelization techniques used. Finally, conclusions and future works are outlined.

Methods
The flow considered in this work is a turbulent channel flow driven by a pressure gradient (Poiseuille flow). The flow is treated as incompressible. The thermal field is treated as a passive scalar. As mentioned before, the boundary condition used for the thermal field is the MBC. For this boundary condition, an uniform heat flux is heating both walls, which introduces the heat into the flow. The temperature of these walls increases linearly in the streamwise direction and does not depend on time.
In Figure 1, a schematic representation of the lower half of the computational box used can be observed. In this plot, contours of an instantaneous snapshot of the streamwise velocity are represented, colored by the magnitude of the velocity. The flow moves from left to right. Periodic conditions are imposed in the streamwise and spanwise boundaries. As it was said above, the computational box was optimized in [25]. The dimensions of the box are L x = 2πh, L y = 2h, and L z = πh in the streamwise, wall-normal, and spanwise directions, respectively. Coordinates in these directions are denoted by x, y, and z. The corresponding velocities are U, V, and W, or, using index notation, U i . This is also the case for vorticity, Ω = ∇ × U = (Ω 1 , Ω 2 , Ω 3 ) = Ω x , Ω y , Ω z and helicity H = U × Ω = (H 1 , H 2 , H 3 ) = H x , H y , H z . Temperature is represented by T. However, the transformed temperature, Θ (defined below), will be used during the entire paper. Uppercase letters denote instantaneous flow magnitudes. Using the Reynolds decomposition, one can obtain the averaged value, denoted by an overbar, and the fluctuating part, denoted by a lowercase letter, of the flow magnitudes, i.e., U = U + u. The superscript + indicates normalization in wall units, using ν and u τ = τ w /ρ, where τ w is the mean shear stress and ρ is the fluid density.
The behavior of turbulent flows is described by the Navier-Stokes equations, which are composed by the continuity and momentum equations, and the energy equation, where U + xyz is the average of U + in time and in the three spatial directions, i.e., is the bulk velocity. In the energy equation, the transformed temperature, Θ = T w − T is used. Since the MBC is used, the temperature in the channel increases linearly in the streamwise direction and periodic conditions cannot be used for T. T w contains the nonperiodic part of the temperature, which makes Θ periodic in the streamwise direction. Notice that the method described below is also useful for any other passive scalar, as it could be NOx concentration. It is also important to point out that for clarity, only the system with one energy equation will be explained here. However, as heat transfer is treated as a passive scalar, several different Prandtl numbers can be run simultaneously. In what follows and to facilitate the discussion, + superscripts will be omitted.
Using some algebra, which involves taking the rotational of Equations (1) and (2) twice, these equations can be written in velocity-vorticity form obtaining a fourth-order equation for V and a second-order equation for Ω y where h v and h g collect the nonlinear part of the Equations (4) and (5), and are given by The main advantage of this system is that pressure is not present, simplifying the problem. Notice that one can easily recover velocities and vorticities from the continuity equation and the definition of Ω y , Taking derivatives, and reorganizing the terms, we get As the flow is periodic in both x and z, it is natural to use Fourier methods in these directions. The Fourier transform of any field ϕ(x, y, z) is defined as where the hat indicates the Fourier coefficient, and k x and k z are the wavenumbers in x and z, respectively. Equations (10) and (11) can be trivially solved as they becomê Notice that this technique cannot be used for the (0, 0)-modes of U and W. Using the Navier-Stokes equations, one can write and solve these equations instead. The pressure gradient in the spanwise direction is negligible. However, the pressure gradient in the streamwise component can be used to keep the flow mass constant. Using Fourier transforms, Equations (3)-(5) are transformed into Fourier space in x and z, obtaining three decoupled problems, whereĥ t,k x k z is the Fourier transform of the energy equation nonlinear term.
Problem (14)- (16) is usually solved splittingV k x k z in three parts, (wavenumbers are omitted)V =V p + aV a + bV b , where a and b are chosen in order to fulfill the homogeneous Neumann condition: and although, due to the symmetry of the problem,V b (y) =V a (−y), so only Equations (21), (22), (24) and (25) have to be solved. Notice that 3D PDE have been transformed into N x × N z independent problems.

CFD Techniques
As it is said above, Fourier method in x and z is the best option to solve this problem. This, in fact, decouples the problems (21)- (25) in a large amount of problems in y. The numerical technique used for y is thus critical to obtain an accurate and fast algorithm. This kind of problem had been addressed mostly by spectral methods [9,31]. A typical technique used in channels is Chebyshev polynomials [11,32]. This has also been applied to thermoconvective problems [33][34][35]. However, in this case, it is more efficient the use of Compact Finite Differences, CFD. The CFD method was introduced in a groundbreaking article by SK Lele in 1992 [36]. Lele's idea was to use finite differences to solve problems presenting a range of spatial scales, generalizing some Padé schemes that had been used early [37,38]. Lele's CFD main advantage is that maintaining the freedom in choosing the mesh points of typical finite difference methods (FD), offering very high precision, comparable to spectral methods. This is critical in turbulent problems, as one typically needs many more points close to the wall than in the outer regions of the flow.
Throughout this part of this work, the following notation is going to be used. Let u(y) be a real evaluated function. Supposing that u is differentiable enough, the first and second derivative of u at the point y i will be denoted by u i = u (y i ) and u i = u (y i ). In the case of a derivative of grade n, we will use u n) i = d n dy n u(y i ). Here, y i is a point belonging to a certain discretization of the interval [a, b], where a and b are finite and y 0 = a, y n = b. Without loss of generality, we will focus on schemes for the second derivative. FD schemes aim to compute an approximation of u i trough a linear combination of the values of the function close to y i . CFD, instead, relates a linear combination of the second derivatives with the values of the function. To explain the practical use of CFD schemes, we are going to work with one specific example, the scheme used in [29]. In that work, the authors used a stencil of seven points in the function and five in the second derivative. Let us assume that the following relation holds for some unknowns coefficients, α j and a j . Without loss of generality, it is possible to assume that one of these coefficients is one, so from now on, α 0 = 1. Defining h j = (y i − y i+j ), Taylor's Theorem states that The relations between the coefficients a i and α i are derived by matching the Taylor series coefficients. In this case, the formal truncation error of the approximation is tenth order. Translating this information into an algebraic equation leads to the system Please note that this is the direct matrix coming from matching Taylor's expansions. Usually, this matrix is ill-conditioned, so results can be very inaccurate. A typical procedure to overcome this problem consists of normalizing each row by its absolute value maximum.
As the mesh can be nonuniform, it is necessary to solve this system for each point in the mesh. When approaching the boundaries, no ghost points are used but the stencils are adapted, removing the points which lay outside the interval. This reduces the formal truncation error of the system. Once the coefficients for every point have been computed, we obtain two sparse matrices, one containing the a i coefficients, A yy , and another one made by the α i coefficients, B yy , such that B yy (u 0 , ..., u n ) t = A yy (u 0 , ..., u n ) t , where the superscript () t denotes the transpose. Note that (28) can be used in both ways, to derive a function or to integrate it.

Time Discretization
To get good accuracy in reasonable computation times, a third-order Runge-Kutta method has been chosen, derived in [39]. For any problem with the form where L is a linear operator and N is nonlinear, the equation is discretized as follows: with ϕ i = ϕ t i , for t n = t 0 , t 1 , t 2 , t 3 = t n+1 . This method presents two problems. The first one is a problem of memory, due to the necessity of saving two nonlinear terms in steps two and three. The second problem is a possible loss of accuracy near the wall, due to the explicit computation of the linear operator in the right hand, Both problems are solved in the algorithm, using a little bit of algebra. Suppose that (30) is solved, and ϕ 2 is to be computed. Let us denote and we can compute and store in a single buffer of memory the right hand side of (31) except the nonlinear term N 1 . This tactic can be also applied in the computation of ϕ 1 but only in the cases where ∆t does not change. If the time step changes, the derivative of ϕ 0 has to be computed explicitly. To avoid this loss of accuracy, it is a good idea to recompute the maximum ∆t every few steps, using the Courant-Friedichs-Lewy condition The simulations made with this code ran with a CFL of 0.7, showing remarkable stability. Equations (30)-(32) are solved for φ, Ω y and Θ using CFD. In what follows we will work with the energy equation, given that the linear operator is L = Re −1 τ ∇ 2 for φ and Ω y , and L = (PrRe τ ) −1 ∇ 2 for Θ. The operator L in Fourier space can be written as Denoting by RHS i the right-hand side of (30)-(32), this system becomes In Fourier space, one gets Now using the two matrix described in the CFD section (28), and defining η as the first constant of the previous equation, Defining a new matrix M, as the final problem to solve is Mϕ i = B yy RHS.
Notice that this matrix is banded, which allows low-storage schemes and the use of very efficient LAPACK routines.

Parallelization Strategy
The most expensive part of the algorithm is the computation of the nonlinear termsĥ v , h g andĥ t . This is because it is necessary to compute these terms in physical space, due to the aliasing problem [31]. Notice that there are two global operations: 1.
Direct and inverse Fourier transforms in x and z, as the nonlinear term has to be computed in physical space.
As the data is distributed throughout the supercomputer, it is necessary to perform several all-to-all communications, which are critical and extremely demanding of the fast network of the supercomputer. The code demands a total of comm = 3 × (9 + 3h p ) global communications per step, where h p is the total number of heat transfer problems studied.
The number of points of the problem, and thus memory, depend on the Reynolds number studied. Another constraint is the efficiency of the fast Fourier transforms (FFT), so typically numbers made up of powers of 2 and 3 are chosen to increase the velocity of the FFT. Assuming a mesh with mx × my × mz points, generally mx is the largest one. It is then natural to start each substep of the Runge-Kutta scheme with the data set distributed in y-z planes. To avoid load imbalances, the number of nodes must be a divisor of the number of complex planes, which is mx/2 after dealiasing. This number is also the maximum number of nodes that can be used. The maximum number of OpenMP processes at each node is mx/2/nprocs, where nprocs is the number of processors of the node.
The code to implement the algorithm described above, where Figure 2 can serve as roadmap, is as follows 1.
Read data and configuration files: HDF5 2.
Runge-Kutta, data: φ, Ω y and Θ, in Fourier Space, distributed through the supercomputer with a (y, z, x) shape.
Compute statistical quantities of the flow if needed. iii.
Perform inverse transforms in z of U and Ω.
(b) Calculation on nonlinear terms (II Move to the step 2a.

3.
End of program.
The code spends 90% of the time and almost all communication time computing the nonlinear term and only 10% to actually solve the equations. A large share of this 90% is used to do global communications. This is a critical step and requires an in-deep study of the supercomputer where the simulation is going to run.
Finally, notice that while global communications uses MPI routines, the granularity of the code makes the use of OpenMP routines for the Fourier transforms, derivatives, and the viscous step. This is an important advantage due to the foreseen evolution of supercomputation.

Conclusions
The main goal of this work is to develop a code to perform direct numerical simulations of distinctive wall-bounded thermal canonical flows. This code has run for roughly 50 CPU-M hours in several supercomputers, already producing several articles. FFT and CFD techniques have been described-in particular, how to use a nonequispaced grid for CFD. In addition, the combination of the Runge-Kutta time-stepper and the previous techniques have been explained. Finally, the parallelization scheme has been outlined. Its main advantage is the granularity of the data, allowing a large amount of small problems in each step that can be solved efficiently using OpenMP techniques.
The raw data that support the findings of this study are available from the corresponding author upon reasonable request. One point statistics can be downloaded from the web page of our group : http://personales.upv.es/serhocal/ (accessed on 24 March 2021).

Abbreviations
The following abbreviations are used in this manuscript: