4.2. The Structure of the VM2D Code
The VM2D code consists of three libraries called VM2D, VMcuda and VMlib. The VMlib library contains descriptions of auxiliary data structures (not directly related to the vortex particle methods) and implementations of some general algorithms, as well as descriptions of universal (abstract) data structures that can also be used for 3D flow simulations according to the closed vortex loops method [75,76]. The names of the latter are marked with the suffix “Gen”. The main classes defined in the VMlib library and the VM2D core are shown schematically in Figure 11 and described below.
The VMcuda library defines functions that are responsible for calculations performed on GPUs by using the Nvidia CUDA technology. Data structures and classes from the VM2D core library represent descriptions of specific objects (airfoil, vortex wake, vortex sheet, etc.) and algorithms used in vortex methods (solution of the boundary integral equation, computation of vortex particle velocities, etc.).
The VMlib Library
The VMlib library contains classes that describe the computational pipeline: a queue of tasks to be solved, their distribution among MPI processes, etc. In addition, this library contains auxiliary data structures that are then used in the core VM2D library (e.g., a 2D vortex particle, a geometric vector, a tensor, etc.), as well as abstract parent classes, which are later inherited by specific implementations for 2D problems.
Classes defined in VMlib that provide the computational pipeline are listed below.
Queue stores the list of problems to be solved and organizes their solution in MPI parallel mode according to the number of required and available processors; it provides the “external” level of MPI-parallelization: different problems can be solved simultaneously on the available cluster nodes;
Task stores the state (in the queue) of the particular problem and its full description (called hereinafter “passport”);
Parallel stores the properties of the MPI-communicator created for the particular problem and provides the “internal” level of MPI-parallelization according to which the particular problem is solved in parallel mode on several cluster nodes;
Preprocessor is the tool for input file preprocessing; the result is used as input data for StreamParser;
StreamParser contains a set of tools for parsing the input files (after preprocessing); it is used for reading all the parameters and initial data stored in text files;
LogStream provides interfaces for outputting the necessary information (including in parallel mode); note that it is more useful for debugging than for typical computations;
defs defines the namespace that contains default values for some parameters, the necessary mathematical functions, etc.
The following data structures, together with the necessary operations on them, are also defined in the VMlib library (a simplified C++ sketch is given after the list):
numvector is a template class for a geometric vector that inherits the standard wrapper class std::array<type, n> and defines the most common operations on vectors (including “&” for the scalar product, “^” for the vector product, “|” for the outer product); its inheritors nummatrix and numtensorX define a fixed-size matrix and higher-rank tensors, respectively, together with the necessary operations;
Point2D inherits numvector<double, 2> and has the necessary MPI-descriptor;
Vortex2D stores properties of a vortex particle (its position and circulation) and has the MPI-descriptor.
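To make the relations between these structures clearer, a minimal C++ sketch is given below. It follows the conventions described above (“&” for the scalar product, “^” for the 2D vector product), but constructors, MPI descriptors and most operations of the actual VMlib classes are omitted, and details may differ from the real sources.

#include <array>
#include <cstddef>

// Fixed-size geometric vector inheriting std::array (simplified sketch)
template <typename T, std::size_t n>
class numvector : public std::array<T, n>
{
public:
    // scalar (dot) product, written as a & b
    T operator&(const numvector<T, n>& other) const
    {
        T res{};
        for (std::size_t i = 0; i < n; ++i)
            res += (*this)[i] * other[i];
        return res;
    }
};

// In 2D the vector product a ^ b reduces to a scalar (the z-component)
template <typename T>
T operator^(const numvector<T, 2>& a, const numvector<T, 2>& b)
{
    return a[0] * b[1] - a[1] * b[0];
}

// 2D point: a double-precision 2D vector (MPI descriptor omitted)
class Point2D : public numvector<double, 2>
{
public:
    Point2D(double x = 0.0, double y = 0.0) { (*this)[0] = x; (*this)[1] = y; }
};

// Vortex particle: position and circulation (MPI descriptor omitted)
class Vortex2D
{
public:
    Point2D r;      // position
    double gamma;   // circulation
};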
Three abstract classes are introduced:
WorldGen—the “sandbox” for each problem of flow simulation being solved;
PassportGen—full definition of the particular problem of flow simulation;
TimesGen—structures for assembling time statistics and tools for saving them to the timestat file.
4.3. The Core Library VM2D
The VM2D library implements the algorithms for the simulation of 2D flows and the solution of FSI problems by using the vortex particle method, namely the Viscous Vortex Domains (VVD) method. Although the general structure of the algorithm is the same as for the VVD method described in Section 3, some parts of the algorithm can obviously be implemented in different ways (for example, the subroutines for solving the boundary integral equation and for computing the velocities of vortex particles). To provide the possibility of using various numerical algorithms, abstract classes and their specific implementations have been developed. The classes defined in VM2D are listed in Table 2.
Four main abstract classes are defined in VM2D, namely Airfoil, Boundary, Velocity and Mechanics, whose implementations correspond to different modifications of the VPM algorithm; some of them are briefly described at the beginning of this paper.
The inheriting classes have names that consist of the name of the parent class and additional words that specify the particular implemented method. A list of the most important implementations of the abstract classes is given in Table 3. Implementations that are not yet complete at the moment are marked with an asterisk; the groundwork for them, as well as for a wide range of other methods and approaches, is already in place.
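This naming and inheritance pattern can be illustrated by the following simplified sketch (the method name and signature are assumptions made for illustration; the actual VM2D interfaces are richer): an abstract class declares the operation, and the concrete class appends the name of the implemented method, e.g., VelocityBiotSavart for the direct Biot–Savart summation.

// Hypothetical simplified interface, not the actual VM2D headers
class Velocity
{
public:
    virtual ~Velocity() = default;
    // compute the convective velocities of the vortex particles
    virtual void CalcConvVelo() = 0;
};

// Direct Biot-Savart summation over all particle pairs
class VelocityBiotSavart : public Velocity
{
public:
    void CalcConvVelo() override
    {
        // O(N^2) pairwise summation over the vortex wake (body omitted)
    }
};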
The VMcuda Library
This library contains several subroutines that are structurally identical to the ones introduced in the VM2D library but are executed on the GPU, as well as the necessary functions that provide data exchange between the host (CPU) and the device (GPU). The implementations of the “computational” kernels are in most cases quite different from the CPU ones due to the specific architecture of GPUs.
The most time-consuming functions are currently implemented for GPUs:
Computation of the convective velocities induced at a set of points by the vortex particles in the vortex wake, as well as by the free and attached vortex sheets and the attached source sheet;
Calculation of the diffusive velocities for the vortex particles in the vortex wake and for the “virtual vortices”—vortex particles introduced to model the vorticity transfer from the free vortex sheet into the flow domain;
Computation of the right-hand side of the linear system that arises after discretization of the boundary integral equation on the airfoils’ contour lines, which differs depending on the applied numerical scheme;
Recognition of vortex pairs placed at a rather small distance from each other in the vortex wake, for their subsequent merging in the framework of the vortex wake restructuring algorithm.
The other operations are much less time-consuming, so they are performed on CPUs; however, their transfer to GPU can allow for some additional speedup.
Special data structures are introduced for the global memory of GPUs that provide optimized performance, especially for computations in problems with several airfoils in the flow, and also for CPU/GPU data transfer. In most of the computational subroutines, shared memory is actively used, which allows for a significant increase in performance.
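As an illustration of this pattern, the following sketch shows a direct Biot–Savart velocity summation kernel with tiling through shared memory, the standard approach for such pairwise particle interactions. It is not the actual VMcuda kernel: the kernel name, the argument layout and the smoothing radius eps2 are assumptions made for the example.

#define BLOCK 256   // assumed thread block size; the kernel expects blockDim.x == BLOCK

__global__ void convVeloSketch(int n, const double2* pos, const double* gam,
                               double eps2, double2* vel)
{
    __shared__ double2 shPos[BLOCK];
    __shared__ double  shGam[BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    double2 ri = (i < n) ? pos[i] : make_double2(0.0, 0.0);
    double vx = 0.0, vy = 0.0;

    // process the wake tile by tile, staging each tile in shared memory
    for (int tile = 0; tile < n; tile += BLOCK)
    {
        int j = tile + threadIdx.x;
        shPos[threadIdx.x] = (j < n) ? pos[j] : make_double2(0.0, 0.0);
        shGam[threadIdx.x] = (j < n) ? gam[j] : 0.0;
        __syncthreads();

        for (int k = 0; k < BLOCK && tile + k < n; ++k)
        {
            double dx = ri.x - shPos[k].x;
            double dy = ri.y - shPos[k].y;
            double d2 = dx * dx + dy * dy + eps2;   // smoothed squared distance
            // 2D Biot-Savart: v += Gamma / (2*pi) * (-dy, dx) / |r|^2
            vx += shGam[k] * (-dy) / d2;
            vy += shGam[k] *   dx  / d2;
        }
        __syncthreads();
    }

    if (i < n)
        vel[i] = make_double2(vx * 0.15915494309189535,   // 1/(2*pi)
                              vy * 0.15915494309189535);
}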
Note that the CUDA implementation is currently available for the direct algorithm only, not for fast methods. It seems that, for this reason, the implemented possibility of working with pinned memory (the necessary memory allocator for the host machine is implemented for the std::vector container) leads only to a very small improvement in performance; the same holds for asynchronous concurrent execution of CUDA kernels, which makes it possible to perform computations simultaneously with data transfer between device and host. Certainly, fast methods will also be implemented for CUDA in the future (e.g., for the computation of vortex particle velocities, the algorithm can be developed as a generalization, with some modifications, of the well-known implementation of the classical Barnes–Hut method proposed in [77]), so the described features may become more significant.
As on the CPU, all computational subroutines for the GPU are implemented in double precision, which significantly decreases the performance of GPUs, especially for the GeForce and Quadro families. Numerical experiments show, however, that some time-consuming operations in vortex particle methods can be performed in single precision without notable loss of accuracy for the whole simulation. Nevertheless, in the current version of the code, single-precision computations are not implemented.
4.5. Problems Description in VM2D
It is possible to use VM2D for the solution of one particular problem as well as for the solution of a set of similar (or dissimilar) problems. Each problem is denoted by a label (a text string without spaces), and a separate subdirectory whose name coincides with the problem’s label should be created in the working directory. All the problems should then be listed in the problems file placed in the working directory, where, if necessary, some parameters can be specified that will be “passed” to the corresponding problems. The typical structure of the problems file is shown in Figure 13.
In the simplest case, it is enough to specify just empty brackets after the problem label (they can even be omitted), but there are two parameters (pspfile and np) that are always necessary. The pspfile parameter defines the name of the passport file with the description of the problem in its subdirectory; np defines the number of MPI processes that should be run in parallel for the corresponding problem. These parameters can be specified either explicitly, as for np in the example in Figure 13, or implicitly, since default values are defined: pspfile = passport and np = 1. Note that on every multicore processor, the OpenMP technology is used for parallelization of the algorithm in shared-memory mode.
All other parameters in parentheses are definitions of arbitrary variables of arbitrary type (integer, double, boolean, string, as well as a list), which the user can later use inside the “passport” files for their unification, for notational convenience and, as a result, for automating the solution of similar problems.
In the considered example, where the user intends to solve 3 different problems of flow simulation around a wing airfoil (as follows from the labels of the problems) at different angles of incidence, the passport file can be the same for all the problems and can have the structure shown in Figure 14.
All the input files, including airfoil geometries and vortex wakes, should be written as dictionaries using a “quasi-C++” syntax: a double slash “//” denotes an inline comment and “/∗ ... ∗/” a multiline comment; line breaks mean the same as spaces; the semicolon “;” is the separator between different lines in a file; the comma “,” is the separator in lists and also between parameters in parentheses; spaces and tabs are ignored. Parameter names are case-insensitive.
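Following these rules, the problems file for the three wing cases mentioned above could look, for example, as follows (a hypothetical illustration; the actual file shown in Figure 13 may use other labels, variable names and values):

wing05( np = 4, angle = 5.0,  dt = 0.001 );
wing10( np = 4, angle = 10.0, dt = 0.001 );
wing15( np = 4, angle = 15.0, dt = 0.001 );

Here pspfile is omitted, so the default passport file name is used; angle and dt are user-defined variables that can be referenced inside the passport, as explained below.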
The meaning of the parameters in the passport should be more or less clear from their names and the short comments in the example in Figure 14. Not all of them have to be specified explicitly, since default values are defined for some of them. The default values, having the lowest priority, are introduced directly in the source code; such parameters are marked with an asterisk in the comments in the above-given example in Figure 14. Their default values are the following:
timeStart = 0.0
accelVel = RampLin(1.0)
saveVtx = ascii(100)
saveVP = ascii(0)
nameLength = 5
distFar = 10.0
delta = 1.0 × 10
vortexPerPanel = 1
maxGamma = 0.0
Defaults with higher priority for these and other parameters can be specified by the user in the defaults file, which should be placed in the working directory. Verbal labels of some options (such as velocityBiotSavart, boundaryConstantLayerAverage and others) can be defined in the switchers file, also placed in the working directory.
Let us give short descriptions of some parameters whose meaning may not be obvious:
vRef is the reference velocity magnitude; it is required in problems without incident flow as a velocity scale for dimensionless parameters; otherwise, it is equal to the magnitude of the incident flow velocity;
accelVel defines the way the incident flow is accelerated from zero to the vInf value (plausible explicit forms of these laws are sketched after this list); it can take the following values:
Impulse means that the flow starts instantly (impulsively);
RampLin(T) means that the flow is accelerated linearly from zero to vInf during T seconds;
RampCos(T) means that the flow accelerates according to the cosine law from zero to vInf during T seconds;
zero values of saveVtx and saveVP mean that the corresponding files are not saved at all;
a zero value of maxGamma means no limitation on the maximal value of the vortex particle intensity.
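For the accelVel options, plausible explicit forms of the acceleration laws are the following (the linear ramp follows directly from its description above, while the exact cosine law implemented in the code is an assumption):

\[
v(t) = v_\infty \min\!\left(\frac{t}{T},\, 1\right) \quad \text{(RampLin(T))}, \qquad
v(t) = \frac{v_\infty}{2}\left(1 - \cos\frac{\pi t}{T}\right), \; t \le T \quad \text{(RampCos(T))}.
\]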
In order to save the velocity and pressure fields to files, it is necessary to list the points for their computation in the file pointsVP in the problem’s directory.
Files with airfoil geometry are stored in the airfoils directory; they are text files with a very simple format that is clear from the tutorial examples. For the airfoils, the following parameters should be specified after the file name inside the brackets (a hypothetical example is given after the list):
basePoint—point at which the airfoil center should be placed;
scale—scale factor for the airfoil;
inverse—boolean switch for internal flow simulation inside the airfoil;
angle—angle of incidence;
mechanicalSystem—numerical scheme for coupling strategy implementation in coupled FSI problems.
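A hypothetical entry illustrating these parameters might look like the following line (the file name, the values and the brace notation for the point are made up for illustration; the actual syntax is shown in Figures 14 and 16):

wing( basePoint = {0.0, 0.0}, scale = 1.0, inverse = false, angle = 10.0 );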
Note that in the example of the passport shown in Figure 14, two parameters are not defined explicitly: the time step dt and the angle of incidence of the airfoil angle. Such templates are marked with the symbol “$”, which means that their values are equal to user-defined variables, defined inside the parentheses after the labels of the corresponding problems in the problems file shown previously in Figure 13.
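For illustration, the corresponding lines of the passport could contain fragments like the following (the parameter names are assumptions; the actual passport is shown in Figure 14):

dt = $dt;                      // time step taken from the problems file
wing( ..., angle = $angle );   // angle of incidence taken from the problems file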
Moreover, in order to solve several similar problems that differ only in the values of some parameters, it is convenient to specify the copyPspFile option in the problems file, for example, as shown in Figure 15. As a result, all the necessary subdirectories for the listed problems are created automatically, and the files contained in the specified folder are copied there (there should be at least a passport file, and a pointsVP file if necessary); obviously, the passport file for such problems should contain templates for some parameters.
If the flow around a system of airfoils is simulated, more than one airfoil file should be specified. In this case, the corresponding section of the passport file has the structure of a list,
Figure 16:
In this example, the interference phenomenon for two immovable airfoils is simulated: a small (2 times scaled) circular airfoil is placed in the vortex wake behind the square airfoil (installed “rhombically”, at the angle of incidence ). All other parameters are set to their default values.
The mechanicalSystem parameter can take different values that label different types of mechanical systems. At the moment, a few different types are implemented, and their labels are decoded in the mechanics file (Figure 17), where each label is assigned the name of the class that inherits the abstract Mechanics class (a brief description of these classes is given in Table 3).
A user can also pass the specific parameters of the mechanical system in parentheses. For translatory motion with two degrees of freedom, the object mechanicsRigidOscillPart takes the following parameters: the dimensionless eigenfrequencies sh of oscillations in the horizontal and vertical directions and the airfoil’s mass m. For rotational oscillations, the object mechanicsRigidRotatePart takes the moment of inertia J, the dimensionless eigenfrequency shw and the load torque Mz. These values are equal to user-defined variables defined in the problems file.
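For the rotational case, the underlying equation of motion has the standard rigid-body form (written here in a simplified way; the nondimensionalization and the restoring term corresponding to the eigenfrequency shw used in the code are not reproduced):

\[
J\,\frac{d\omega}{dt} = M_{\mathrm{hydro}} - M_z ,
\]

where M_hydro is the hydrodynamic torque acting on the airfoil and M_z is the load torque; the translational case (mechanicsRigidOscillPart) is governed by analogous oscillator equations for the two degrees of freedom.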
As an example of a simulation using the mechanical system mechanicsRigidRotatePart, we consider the flow around a rotating Savonius rotor. The Savonius rotor is an example of one of the simplest types of turbines, and it is well investigated both experimentally and numerically [78,79,80].
A Savonius rotor with two blades is considered with the following parameters: diameter , blade thickness , polar inertia moment , incident flow velocity , Reynolds number . The motion of the Savonius rotor was prescribed according to the following scheme: the rotor angular velocity was linearly increased from 0 to 2 during the first 15 s; then, up to 45 s, the rotor rotated under the action of hydrodynamic loads only; from 45 s, the rotor was loaded with a torque . As a result, after a while the angular velocity of the Savonius rotor stabilized at approximately 1.35. The angular velocity of the Savonius rotor rotating according to the described scheme is shown in Figure 18.
Figure 19 shows the vortex wake around the rotating Savonius rotor at the beginning of the numerical simulation ( ) and the developed vortex wake after some simulation time ( ).
A comparison with OpenFOAM has been performed for a similar problem of rotor autorotation simulation; the unsteady lift and drag forces are shown in Figure 20; at , the rotor had zero angular velocity. It is seen that the results are in acceptable agreement.
If the user wants to use a previously simulated vorticity distribution, it can be loaded by specifying the corresponding file name in the fileWake section of the passport. Files with vortex wake descriptions should be stored in the wakes directory.
Some tutorial examples can be found in the tutorials folder on GitHub; in the run folder, examples of the files that should be placed in the working directory are also given.
4.8. Parallelization of Computations
As mentioned earlier, all time-consuming operations in VM2D are parallelized by using the OpenMP and MPI technologies, which allow performing calculations on multi-core/multiprocessor systems, including ones with distributed memory. The most time-consuming subroutines are also parallelized by using the Nvidia CUDA technology. Its usage in VM2D turns out to be rather efficient, since the particles can be processed independently in many computational blocks of the algorithm, and the number of particles is usually sufficiently high in simulations.
Let us estimate the efficiency of parallelization. In the first model problem, the development of the vortex wake behind the circular airfoil (Figure 5) was simulated for  on a shared-memory computational node with two 18-core Intel Xeon Gold 6254 processors. The circular airfoil was discretized into 1000 panels, and the step execution time was measured at different time steps with different numbers of vortices in the wake. The achieved speedup is shown in Figure 21 for different numbers of vortices in the wake. For a rather high number of vortex particles, a close to linear speedup is observed.
In order to estimate the efficiency of the MPI parallelization, let us consider the flow around a circular cylinder for . The calculations were carried out on a different number of cluster nodes, each equipped with 28 cores (2× Intel Xeon E5-2690v4). The circular airfoil was discretized into 2000 panels, and the time step was chosen as . In this simulation, due to the small value of the maximal circulation of the vortex elements, , the number of vortex elements increases rather quickly, and after the first 10 steps the value of 200,000 is already reached.
The speedup for different numbers of 28-core nodes with the MPI technology was examined for flow simulations with different resolutions; N is the average number of vortex particles. To estimate the speedup of the algorithm, the execution time of the first 500 time steps was analyzed for different numbers of cores. Figure 22 shows the speedup of the algorithm for the described problem, where the maximum number of vortex elements was approximately 300,000, as well as the speedup for a similar problem where the number of vortex elements was approximately twice as large. It can be seen that the speedup of the problems with 300,000 and 600,000 vortices approximately coincides with the speedup corresponding to Amdahl’s law with  and  % of the sequential code, respectively.
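For reference, Amdahl’s law gives the speedup on n cores when a fraction f of the code is executed sequentially, and the corresponding parallelization efficiency is S(n)/n:

\[
S(n) = \frac{1}{f + (1 - f)/n} .
\]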
The main time-consuming operations in the VM2D algorithm are shown above in Figure 12. Italic type below indicates the name of the column in the timestat file that contains the execution time of the corresponding operation for every time step of the simulation.
tMatrRhs—calculating the coefficients of matrix and right-hand side of linear system;
tSolve—solving the linear system (by Gaussian elimination);
tConvVelo—calculating convective velocities of vortices;
tDiffVelo—calculating diffusive velocities of vortices;
tForce—calculating the hydrodynamic forces acting on the airfoils;
tVelPres—calculating the velocity and pressure fields in the specified points;
tMove—calculating new positions of vortex particles;
tInside—detecting the vortices trapped inside the airfoil after movement;
tRestr—vortex wake restructuring (merging closely spaced vortices, removing from the simulation of vortices that are too far from the airfoils, etc.);
tSave—saving data to files.
In Table 4, the execution times of the most time-consuming operations are given for simulation on 1 node ( ) and on 84 nodes ( ). In the second row, the fractions of the execution time corresponding to the mentioned operations are given for simulation on one node. The execution time taken by all other operations is shown in the last column, labeled tOthers. The last row shows the speedup of the operations obtained using 84 nodes in comparison with 1 computational node.
It can be seen from Table 4 that the most significant speedup is observed for the operations of calculating the convective and diffusive velocities and restructuring the vortex wake; it should be noted that these operations are also the most time-consuming.
For the same problem, time tests were performed on computer systems with various GPUs in order to estimate the efficiency of parallelization of computations in VM2D using graphics accelerators. The following GPUs were used for the computational experiments: Quadro P2000, GeForce GTX 970, Tesla C2050, Tesla V100, Tesla A100. Table 5 shows the properties of the GPUs used and the execution time T of one time step, averaged over the first 700 steps of the simulation.
The VM2D code provides the possibility of using several graphics accelerators to perform one simulation; the communication between the video cards is carried out using the MPI technology. The efficiency of this feature was estimated in tests with two types of GPUs: Tesla C2050 and Tesla A100.
Figure 23 shows the speedup of one algorithm time step when using a different number of Tesla C2050 GPUs compared to one Tesla C2050 GPU (each node of the cluster was equipped with 3 graphics accelerators). Separate lines correspond to different numbers of vortex elements in the wake.
Figure 24 shows the speedup with Tesla A100 GPUs depending on the number of vortex particles in the wake. Colored lines correspond to different numbers of GPUs.
The results of the performed numerical experiments have shown that the VM2D algorithm is quite scalable: on a cluster system with 2500 CPU cores, the parallelization efficiency reaches approximately 0.7. Since the vortex method implemented in VM2D belongs to the class of particle-based methods, the algorithm is quite efficiently adapted to graphics accelerators: it has been shown that, in terms of VM2D performance, one Tesla V100 graphics card can replace 84 28-core CPU nodes. In addition, it is possible to use several GPUs in one simulation; although the use of several cards is not very efficient, it provides an additional speedup of calculations when the computation time is critical.