Abstract
With the development of engineering technology, engineering has higher requirements for the accuracy and the scale of simulation calculation. The computational efficiency of traditional serial programs cannot meet the requirements of engineering. Therefore, reducing the calculation time of the temperature control simulation program has important engineering significance for real-time simulation of temperature field and stress field, and then adopting more reasonable temperature control and crack prevention measures. GPU parallel computing is introduced into the temperature control simulation program of massive concrete to solve this problem and the optimization is carried out. Considering factors such as GPU clock rate, number of cores, parallel overhead and Parallel Region, the improved GPU parallel algorithm analysis indicator formula is proposed. It makes up for the shortcomings of traditional formulas that focus only on time. According to this formula, when there are enough threads, the parallel effect is limited by the size of the parallel domain, and when the parallel domain is large enough, the efficiency is limited by the parallel overhead and the clock rate. This paper studies the optimal Kernel execution configuration. Shared memory is utilized to improve memory access efficiency by 155%. After solving the problem of bank conflicts, an accelerate rate of 437.5× was realized in the subroutine of the matrix transpose of the solver. The asynchronous parallel of data access and logical operation is realized on GPU by using CUDA Stream, which can overlap part of the data access time. On the basis of GPU parallelism, asynchronous parallelism can double the computing efficiency. Compared with the serial program, the accelerate rate of inner product matrix multiplication of the GPU asynchronous parallel program is 61.42×. This study further proposed a theoretical formula of data access overlap rate to guide the selection of the number of CUDA streams to achieve the optimal computing conditions. The GPU parallel program compiled and optimized by the CUDA Fortran platform can effectively improve the computational efficiency of the simulation program for concrete temperature control, and better serve engineering computing.
1. Introduction
In the finite element calculation of temperature field and stress field of hydraulic concrete structures [1,2,3], especially the concrete structure with cooling water pipe [4], in order to get high accuracy [5,6], the calculators usually build a dense model, which leads to the increase of calculation scale [7]. In addition, reasonable temperature control calculation should be carried out in real time with the construction [8,9], so that the boundary conditions and parameters can be continuously adjusted and corrected with reference to the measured data, so as to obtain more reasonable results. Due to the low computational efficiency of the traditional serial program, it cannot meet the actual needs of engineering under the condition of ensuring the calculation accuracy. All the above problems show that the existing serial simulation calculation method needs comprehensive improvement to improve the computational efficiency.
In recent years, the growth of the CPU clock rate has been slowing down year by year. In the field of engineering computing [10], the exploration of improving computing efficiency is gradually transferred to the parallelization implementation of programs [11]. Many explorations have been made for the implementation of multi-core CPU parallelism. Mao parallelized the joint optimization scheduling of multiple reservoirs in the upper reaches of Huaihe River, which took 5671.1 s for 1 CPU core and 2104.6 s for 6 CPU cores, with an accelerate rate of 2.704, and an efficiency of 45% [12]. Jia proposed a master-slave parallel MEC based on MPI, and analyzed the effects of task allocation, communication overhead, sub-population size, individual evaluation time and the number of processors on the parallel speedup [13]. CPU parallelism relies on adding more CPU cores to achieve speedup. Whether buying a multicore CPU or cluster parallelism [14], an order of magnitude increase in the number of CPU cores is definitely very expensive. The GPU of a home grade graphics card contains hundreds or thousands of processing cores [15,16,17].
The full name of GPU is the graphics processing unit, which is designed for image computing, and the research in the field of image computing has been very mature. As a fine-grained parallel method, GPU was designed for compute-intensive, highly parallel computing, which enabled more transistors to be used for data processing, rather than data caching or flow control. GPU has an absolute advantage over CPU in the number of processors and threads. Therefore, GPU computing is very suitable for the field that requires large-scale parallel computing [18]. In 2009, Portland Group (PGI) and NVDIA jointly launched the CUDA Fortran compiler platform, which greatly expanded the ication range of GPU general purpose computing. The release of this platform made GPU computing widely used in medicine, meteorology, fluid calculation, hydrological prediction and other numerical computing fields [19,20,21,22]. T Mcgraw presented a Bayesian formulation of the fiber model which showed that the inversion method can be used to construct plausible connectivity [23]. An implementation of this fiber model on the graphics processing unit (GPU) is presented. Qin proposed a fast 3D registration technique based on CUDA architecture, which improves the speed by an order of magnitude while maintaining the registration accuracy, which is very suitable for medical clinical ications [24]. Takashi uses GPU to implement ASUCA, the next generation weather prediction model developed by the Japan Meteorological Agency, and achieves significant performance speedup [25]. Taking pres-tack time migration and Gazdag depth migration in seismic data processing as a starting point, Liu introduced the idea, architecture and coding environment of GPU and CPU co-processing with CUDA [26]. GPU acceleration for OpenFMO, a fragment molecular orbital calculation program, has been implemented and its performance was examined. The GPU-accelerated program shows 3.3× speedups from CPU. MA Otaduy presented a parallel molecular dynamics algorithm for on-board multi-GPU architectures, parallelizing a state-of-the-art molecular dynamics algorithm at two levels [27]. Liang tried the GPU platform and FDTD method to solve Maxwell equations with complex boundary conditions, large amount of data and low data correlation [28]. Wen presented a new particle-based SPH fluid simulation method based completely on GPU. By this method, a hash-based uniform grid is constructed firstly on GPU to locate the neighbor particles faster in arbitrary scale scenes [29]. Lin presented a new node reordering method to optimize the bandwidth of sparse matrices, resulting in a reduction of the communication between GPUs [30]. JC Kalita presented an optimization strategy for the BiCGStab iterative solver on GPU for computing incompressible viscous flows governed by the unsteady N-S equations on a CUDA platform [31]. Tran harnessed the power of accelerators such as graphics processing units (GPUs) to accelerate numerical simulations up to 23× times faster [32]. Cohen used a GPU-based sparse matrix solver to improve the solving speed of the flood forecasting model, which is of great benefit to the early warning and operation management during the flood [33]. Ralf used CUDA to numerically solve the equations of motion of a table tennis ball and performed statistical analysis of the data generated by the simulation [34]. The Mike (Mike11, Mike21 and Mike3) series of commercial software developed by the Danish Institute of Hydraulics has also introduced GPUs to improve the computational efficiency of models [35,36,37].
GPU parallel computing is rarely used in the field of concrete temperature and stress field simulation. Concrete simulation calculation is to simulate the construction process [38], environmental conditions [39], material property changes and crack prevention measures and other factors as accurately as possible. It is of great significance to GPU parallel computing to concrete temperature control simulation calculation for real-time inversion of temperature field and stress field and shortening calculation time. In this paper, the CUDA Fortran compiler platform is used to transform the large massive concrete temperature control simulation program into the GPU parallel program. An improved analysis formula of the GPU parallel algorithm is proposed. Aiming at the extra time consumption of GPU parallel in the indicators, two measures are proposed to optimize the GPU parallel program: one is to use shared memory to reduce the date access time, and the other is to hide the date access time through asynchronous parallelism.
2. Improved Analytical Formula for GPU Parallel Algorithms
The computation time of GPU parallel programs is affected by many factors. In order to better analyze and measure the advantages and disadvantages of parallel algorithms, a new analysis indicator based on the traditional parallel algorithm speedup is proposed. Through this theory, the relationship between the various elements of parallel computing can be better analyzed, so as to find the factors that restrict the improvement of parallel efficiency, and better serve the optimization of efficient GPU parallel computing programs.
2.1. CUDA Fortran Parallel Experimental Platform
The program is implemented in PGI compiler by the CUDA Fortran compiler platform [40]. The parameters of the experimental platform are shown in Table 1.
Table 1.
Parameters of experimental platform.
2.2. The Traditional Analytical Formula
The only indicator to evaluate the computational efficiency of traditional serial programs is the computation execution time. The formula for calculating the accelerate rate of the traditional parallel algorithm is as follows:
where is an accelerate rate, is the serial computation running time, n is the number of processors, is the parallel execution time using n processors. . The only index of the formula is time; although the formula can objectively reflect the acceleration effect of the calculation program, it cannot reflect the various factors that affect the calculation time.
2.3. Improved Analytical Formula
During the execution of a program, not all processes and commands can be merged. The speedup of the code is limited by the proportion of the whole program that can be parallelized. In view of this, this paper divides the computation time of the program into serial domain and parallel domain .
where is parallel time for execution by n processors in parallel domain, is the serial time by a single processor in serial domain. The total time the program takes to run is
Insert Equations (2) and (3) into Equation (1), new parallel computing accelerate rate:
The above formula is the speed ratio of parallel computation under ideal conditions, but the execution of the GPU parallel code also causes some other time-consuming events. Firstly, compared with the CPU parallel computing, it also costs time to transfer data between the Host and the Device (H2D and D2H). Secondly, if the processors are not allocated the same amount of work, the load will be unbalanced, and some threads will be idle, which will affect the execution speed. Finally, there is a time penalty for calling the GPU Kernel subroutine. To investigate the effects of these costs attributed to parallel computing, we define these extra time costs as .
At present, the computing power of a GPU single core is lower than that of a CPU single core, so the attenuation coefficient λ of GPU computing power is used to represent the ratio of GPU and CPU computing power. The actual operation time of computing the parallel domain by the GPU core is obtained as follows:
Replace in Equation (4) with , the total time of parallel operations at the GPU core, and insert Equation (2) into Equation (4). After collation, the parallel accelerate rate calculation formula considering computing power attenuation and extra time costs is as follows.
Formula (6) makes up for the shortcomings of the traditional calculation formula that only focuses on the calculation time, considering the accelerate rate of parallel computing, the core clock rate loss λ of GPU, the number of processors , the extra time cost and the proportion of serial domain .
In general, the speedup increases with the number of threads n participating in the computation for the same task, but the rate of increase slows down gradually due to the increase in communication overhead between threads. As approaches 0, which means that GPU computing power is infinite, . The speedup of the program is restricted by the serial duration and the extra time cost , and the value eventually tends to a constant . Similarly, as approaches 0, the speedup of the program is restricted by the serial duration and the extra time cost . The value of the speedup ratio is proportional to and inversely proportional to . In addition, the speedup will increase if the computational part of a task is larger. This is because the communication overhead at the device side and the host side is reduced in proportion to the computation time.
The acceleration effect of the program is restricted by many conditions. The more cores participate in the calculation, the larger the proportion of parallel fields, the smaller the attenuation of computing power and the less additional consumption, the more computationally efficient the program is. According to the improved parallel analytical formula (Equation (6)) and the above analysis, it can be seen that in order to obtain a better acceleration effect of the GPU parallel program, we should pay attention to the following aspects: 1. Using the GPU core with more computing power and more frequency to reduce the attenuation coefficient λ in GPU parallel computing. 2. Using more cores of the GPU to increase the value of n to reduce the operation time. 3. By improving the problem solving algorithm and parallel program, the proportion of parallel operation can be increased and the extra cost of parallel program can be reduced.
Aspects 1 and 2 are hardware level optimization; this paper mainly focuses on the third point, the optimization of parallel computing programs.
4. Research on Asynchronous Parallelism in GPU Computing
An important concept in CUDA is a stream, which represents the execution queue of a series of instructions. GPU asynchronous parallelism is that the data and resources are divided into multiple parts, and the Host starts each execution sequence for the CUDA stream to process separately. Thus, it has the effect of overlapping data transmission and device calculation. Compared with thread parallelism, asynchronous parallelism using the CUDA stream is a higher level of parallelism. Thread parallelism is a level of parallelism for data, and streams are a level of parallelism for instructions.
4.1. Comparison and Analysis of Different Asynchronous Parallel Methods
For a subroutine that needs to be computed in parallel on the GPU, two types of operations (transfer of data between Host and Device and Kernel calculation) are required. Taking GTX 1050 as an example, this section visualizes the time of Memcpy D2H, kernel function and Memcpy H2D through the Nvprof tool, so as to analysis the time of the CUDA stream under different conditions. According to the different tasks divided, GPU computing is divided into three versions.
Version 1: For computing tasks, one stream is used to realize the transmission of data from Host to Device, kernel execution, and then transferring data from Device to Host. The code is in Listing 2:
| Listing 2. Code of Version 1. | |
| 1 2 3 4 5 6 7 | !CPU Subroutine Call Function_1 (a) !GPU parallel Subroutine a_d = a call kernel <<<n/blockSize,blockSize>>>(a_d,0) a = a_d |
From code 1, compared to CPU serial computation, we can see that GPU parallel computation generates additional data transfer consumption (a_d = a,a = a_d). As can be seen from Figure 6, data transmission takes a lot of time; MemcpyH2D takes 1.2685 and MemcpyD2H takes 1.2581, accounting for 52.31% of the total computing time of version 1. This section investigates how to hide the time of data transmission: Version 2 and Version 3.
Figure 6.
Time analysis of Version 1 (nstreams = 1, RTX 3080).
Version 2: The whole computing tasks are divided into several subtasks. Each CUDA stream completes the whole process of subtask. The code is in Listing 3:
| Listing 3. Code of Version 2. | |
| 1 2 3 4 5 6 | do i = 1,nStreams offset = (i − 1)*streamSize istat = cudaMemcpyAsync(a_d(offset + 1),a(offset + 1),streamSize,stream(i)) call kernel <<<streamSize/blockSize,blockSize,0,stream(i)>>>(a_d,offset) istat = cudaMemcpyAsync(a(offset + 1),a_d(offset + 1),streamSize,stream(i)) end do |
As shown in Figure 7 and Figure 8 for Version 2, the object of the stream is the whole process of subtasks. The CUDA stream operates on the whole process, so the Memcpy H2D and Memcpy D2H cannot be parallel, as indicated by the red square in Figure 7 and Figure 8. When nstreams is 4, the computation is split into 4 parts, we can see that Memcpy can overlap with the Kernel in three of the four parts, and the data transfer overlap rate is 75%. When nstreams is 10, the computation is split into 10 parts, we can see that 9 parts of Memcpy H2D can overlap with the Kernel and 6.75 parts of Memcpy D2H can overlap with the Kernel. The total overlap of data transmission is 79%.
Figure 7.
Time analysis of Version 2 (nstreams = 4, RTX 3080).
Figure 8.
Time analysis of Version 2 (nstreams = 10, RTX 3080).
Version 3: The whole computing tasks are divided into three parts (Memcpy H2D, Kernel and Memcpy D2H). Then each part is divided into several subtasks to be executed by CUDA streams. The code is in Listing 4:
| Listing 4. Code of Version 3. | |
| 1 2 3 4 5 6 7 8 9 10 11 12 | do i = 1,nStreams offset = (i − 1)* streamSize istat = cudaMemcpyAsync(a_d(offset + 1),a(offset + 1),streamSize,stream(i)) end do do i = 1,nStreams offset = (i − 1)* streamSize call kernel <<<streamSize/blockSize,blockSize,0,stream(i)>>>(a_d,offset) end do do i = 1,nStreams offset = (i − 1)* streamSize istat = cudaMemcpyAsync(a(offset+1),a_d(offset+1),streamSize,stream(i)) end do |
As shown in Figure 9 and Figure 10 for Version 3, the object of stream processing is the three parts (Memcpy H2D, Kernel and Memcpy D2H) that are divided. Thus, Memcpy D2H do not need to wait for all Memcpy H2D to complete before executing. For example, Memcpy D2H of Stream14 and Memcpy H2D of Stream16 in Figure 9 have a certain overlap time. When nstream is 4, although the overlap of memcpyD2H and Memcpy H2D saves some time, the time saved is wasted in waiting for kernel operation. Therefore, Version 2 and Version 3 take almost the same amount of time. When nstream is 10, whether MemcpyD2H or MemcpyH2D, 9 parts can overlap. The overlap rate of the total data transmission is 90%. Therefore, the time consumption of version 3 is lower than that of version 2.
Figure 9.
Time analysis of Version 3 (nstreams = 4, RTX 3080).
Figure 10.
Time analysis of Version 3 (nstreams = 10, RTX 3080).
4.2. Overlap Rate Theory of Memcpy
In this paper, an asynchronous parallel overlap rate theory of Memcpy is proposed to explore the impact of different methods and different stream numbers on the overlap rate. The Memcpy version 1 cannot overlap, so it will not be discussed here.
- Overlap rate formula of Memcpy of version 2
For version 2, the time of Memcpy H2D and MemcpyD2H cannot overlap each other. Because the first and last Memcpy cannot be covered, when nstreams is n, the number of Memcpy that can be covered is 2(n − 1). The overlap rate of Memcpy depends on whether the n kernel operations can cover the time of 2(n − 1) Memcpy. The calculation time formula of version 2 is as follows:
where , is total time of kernel execution, is kernel time of subtask. is the time of Memcpy H2D, is the operation time of MemcpyD2H, is the number of CUDA streams. After extracting the common factor:
When is greater than 1, the computation time is constrained by the Kernel execution time, and vice versa, the computation time is constrained by the Memcpy time. It can be seen from Table 1, and are approximately equal. For the Memcpy time to be covered, must satisfy . Similarly, the number of fully covered Memcpy:
- 2.
- Overlap rate formula of Memcpy of version 3
For version 3, when the program executes the second kernel operation, the output value of the first kernel executes the MemcpyD2H command, and the input value of the third kernel executes the Memcpy H2D command at the same time. There are three commands going on at the same time. Similarly, for a task divided into n parts, the time of Memcpy H2D and MemcpyD2H can overlap except for the first and last Memcpy, so the computation time after overlapping is determined by the time of n kernels and (n − 1) Memcpy. The calculation time formula of version 2 is as follows:
Similarly, for the Memcpy time to be covered, must satisfy . the number of fully covered Memcpy:
Comparing the calculation time formulas of version 2 and version 3, we find that when both of them are in the optimal condition, version 2 can overlap more than version 3. This can show that version 3 is better than version 2 in theory. In the actual calculation process, we should analyze the relationship between Kernel and Memcpy, and select the appropriate number of CUDA streams according to to achieve the optimal working condition, as shown in Figure 11.
Figure 11.
The lowest in the optimal working condition.
4.3. Analysis of Theoretical Operation Time and Practical Operation Time
Taking GTX 1050 as an example, it can be seen from nvprof that the execution time of the kernel function is 2.1325 ms, the execution time of MemcpyH2D is 1.2633 ms, and the ratio is 1.69. Figure 12 shows that version 2 can reach the optimal working condition when nstream is 7. As nstream becomes larger, cannot meet the requirements of the optimal operating condition, and Memcpy cannot achieve sufficient overlap. This is consistent with the phenomenon in Figure 12a that the actual overlap rate reaches the highest at nstream = 7. For version 3, is greater than 1, which meets the conditions of the optimal working condition. Figure 12b shows that the actual overlap rate and the theoretical overlap rate are consistent for any nstream number.
Figure 12.
Actual and theoretical overlap rates (RTX 3080). (a) Version 2. (b) Version 3.
Under the condition of array size 4 × 106, the total time of CUDA stream operation is 4.971 ms, and the time of Memcpy(H2D) is 2.5266 ms. The theoretical computation time of the program can be calculated using Formulas (8) and (10), as shown in Figure 13.
Figure 13.
Actual and theoretical execution time (RTX 3080).
Through the comparison of the actual calculation time and the theoretical calculation time, although we can see that there is a certain gap between the two, it is very small and the change trend is consistent. The overlap rate of Memcpy can analyze the influence of the number of streams on the calculation time to a certain extent, and can be used to predict the execution time of the task, which is of guiding significance for parallel compilation and optimization analysis of programs.
As shown in Figure 13, the actual calculation time is greater than the theoretical calculation time for both version 2 and version 3. This is because the use of CUDA streams not only brings the gain of Memcpy time overlap to reduce the execution time, but also GPU parallel computing itself comes with additional time consumption, as shown in the red box in Figure 14. These include the processes of cuCtxDetach, cuMemcpyH2Dasync, launch kernel, etc., of which cuMemcpyH2Dasync and launch kernel will increase with the increase in the number of nstreams.
Figure 14.
Whole time of Version 2 (nstreams = 10, RTX 1050).
Despite the consumption of other extra time, asynchronous parallelism is better than traditional parallelism, and Version 3 takes the least time. The parallel accelerate rate of each version is shown in Table 5. Version 2 can improve the speed by 65% on the basis of ordinary GPU parallelism, and Version 3 can improve 100%. Compared with serial calculation, the accelerate rates are 50.90 and 61.42.
Table 5.
Accelerate ratio of each version.
4.4. Limitations of Parallel Algorithms
On the one hand, not all code is suitable for parallelization. When the amount of data to be transmitted is large and the calculation amount is relatively small, if parallel processing is adopted, the time consumption required to open up parallel fields and thread scheduling is even greater than the time consumption of computing itself, and the gain is not worth the loss.
On the other hand, not all loops can be parallelized. Before parallelizing loops, we must ensure that there is no data correlation (loop dependency or data competition) between the loops. When two threads operate on a variable at the same time and one of the operations is a write operation, the two threads have a data race. In this case, the data read out is not necessarily the data of the previous write operation, and the data written may not be required by the program.
For a loop with cyclic dependence, the calculation of the loop index J is related to the calculation result of the loop index I, or the loop index I iterates with itself, such as Listing 5. This loop can only be executed sequentially, otherwise the result of parallel execution will give the wrong result, as shown in Figure 15.
Figure 15.
Diagram of data errors caused by loop dependencies.
| Listing 5. Code of loop dependencies. | |
| 1 2 3 4 5 6 7 8 | do i = 1, n a(i) = a(i + 1) + c(i) end do do i = 1, n do j = 1, n a(i,j) = a(i + 1,j + 1) + c(j) end do end do |
As can be seen from Figure 15, for a dependent loop, the value of a(i) is related to the value of a(i + 1). If the calculation is serial, the loop index i is executed sequentially. The assignment of a(i + 1) is after the assignment of a(i), so the change of a(i + 1) will not affect the assignment of a(i). However, in parallel computation, we cannot determine and predict the execution order of the loop index i. For example, in parallel computation of Test-1, a(9) = a(10) + c(9) executed by thread 2 precedes a(8) = a(9) + c(8) executed by thread 1. Therefore, in the process of execution, a(8) used a(9) after the assignment change, thus obtaining the wrong calculation result. At the same time, by comparing the parallel computation of Test-1 and Test-2, we can see that for the same calculator, the thread number of the execution of the loop index i on each execution cannot be determined. In addition, the execution end times of different loop indexes i cannot be determined. Therefore, we cannot parallelize the code with loop dependence. When the program is reformed and optimized in parallel, the optimization effect of processing should be analyzed according to different situations.
5. Application of GPU Parallel Computing in Engineering
In the field of concrete temperature control simulation, many factors such as environmental conditions, materials and crack prevention measures should be considered in predicting cracks during construction. In order to obtain more accurate results, the element size of the finite element model is getting smaller and smaller, which leads to the increase of the calculation amount. In order to improve the calculation efficiency, GPU parallel is introduced into the temperature control simulation of massive concrete.
According to the GPU parallel optimization method proposed above, the Fortran finite element program for water pipe cooling temperature and stress field of our research group was reconstructed and optimized in parallel. After the program is parallelized, serial and parallel tests are carried out with an engineering example, and the calculation results of serial and parallel are compared.
5.1. Simulation Calculation Model and Material Parameters
According to the drawing of a project of a concrete gravity dam, the finite element model of 211,374 elements and 233,983 nodes was established to compare the efficiency and precision of parallel computing and serial computing. The finite element model is shown in Figure 16. In the simulation calculation of the temperature field, the surrounding and bottom surface of the foundation is the adiabatic boundary, and the upper surface is the heat dissipation boundary. The symmetry plane of the structure is the adiabatic boundary. The construction of a temporary seam surface, and the structure of a permanent seam surface, when not covered as the heat dissipation boundary, after covering is the adiabatic boundary. The other surfaces are heat dissipation boundaries. The foundation material below the floor of the structure is mainly andesitic breccia. C15 concrete is used for the central of the dam, C30 concrete is mainly used for the dam body and the overflow surface and other parts are made of C35 anti-impact and wear-resistant concrete. The thermal parameters of this paper are derived from the inversion of the measured temperature data of the project, and the mechanical parameters are derived from the laboratory mechanical tests.
Figure 16.
Finite element model of a project bottom hole section. (a) Vertical profile of dam bottom. (b) Integral finite element model. (c) Finite element model of dam section.
Calculation Condition: The concrete will be cooled by water from an early age with a long time and slow temperature drop rate. Moreover, enhancing surface insulation and a setting construction joint are applied.
The thermodynamic parameters of various materials are shown in Table 6. The temperature data of the dam site in recent years are analyzed. In order to facilitate calculation, the mean monthly temperature of many years is synthesized into a cosine curve.
Table 6.
Thermal and mechanical parameters of materials.
5.2. Experimental Platform Parameters
The debugging environment of the author’s component program adopts the Windows operating system. A Cuda-based hybrid compiler language is implemented in the PGI Fortran compiler platform. The GPU used in the calculation is NVIDIA® GeForce RTX™ 3080, and the CPU is Intel®Core™i7-12700k CPU @ 3.6 GHz. Details of the experimental platform are shown in Table 7.
Table 7.
Experimental platform parameters.
5.3. Comparison of Calculation Results
This section takes a concrete gravity dam as the research object, and simulates the temperature field and stress field of the concrete dam by using the GPU parallel algorithm proposed in this paper. The results of GPU parallel computation are compared with those of CPU, and the accuracy and efficiency of the GPU parallel algorithm are verified. Due to space constraints, we only select the middle section of the dam for temperature and stress description. The tensile stress is positive and the compressive stress is negative. The calculation results of temperature and stress are shown in Figure 17, and the calculation time is shown in Table 8.
Figure 17.
Temperature and stress envelope graphs for different calculation methods. (a) Temperature—CPU serial calculation. (b) Temperature—GPU parallel computation. (c) Stress—CPU serial calculation. (d) Stress—GPU serial calculation.
Table 8.
Calculation time and accelerate rate of parallel.
It can be seen from the above temperature and stress envelope diagrams that the results of GPU parallel calculation are basically consistent with those of CPU serial calculation, indicating that the accuracy of the optimized GPU parallel algorithm meets the needs of engineering calculation.
It can be seen from the Table 8, compared with serial computing, the total time of CPU parallel computing and GPU parallel computing decreased by 33.61% and 64.37%. GPU parallel computing significantly shortened the total time of computing. GPU parallel computation can effectively improve the efficiency of the concrete temperature control simulation program, so that the simulation program can better serve the prediction of concrete cracks during the construction period. Therefore, more reasonable temperature control and crack prevention measures are taken to protect the safety of the concrete structure. At the same time, we can adopt the principle and method of GPU parallelization to analyze and transform other computing programs in parallel. The theory of GPU parallel computing is universal for code that can be parallelized, but needs to be adjusted for its own characteristics (amount of data, amount of computation, number of iterations, etc.).
There is a certain gap between the acceleration efficiency of the overall program and the GPU acceleration ratio shown by shared memory optimization and asynchronous parallelism. Because on the one hand, there is code that can only be executed in serial in the program, which cannot be modified in parallel, and on the other hand, for the subroutine with a small amount of computation, parallelization will bring a large additional cost, which outweighs the gain. After adopting the parallel transformation and optimization method described in this paper, in the case of ensuring the accuracy of calculation, the GPU computing acceleration ratio of the whole program reaches 2.81 times.
The program is optimized for shared memory and CUDA stream’s hiding of memory access time, and the computing programs in other fields are not studied. Because each application targets different fields and solves different problems, it is impossible to give a detailed classification. At the same time, the performance of different models of GPU processors, the number of nested layers per Do loop, the number of loops, the amount of data, the amount of computation and the data type are different, which will affect the additional overhead of the program. Therefore, it is necessary for the programmer to make adjustments according to the characteristics of their own program under the guidance of GPU parallel computing theory and methods. And as mentioned above, the additional cost of parallel computing involves a wide range of factors. Therefore, it is of great significance to further study the effect of extra cost on the efficiency of parallel computing, and it is a worthy research direction.
6. Conclusions
Aiming at the problem that the current conventional serial computing efficiency is low and cannot meet the engineering requirements, GPU parallel computing is introduced into the large massive concrete temperature control simulation calculation program. An improved analytical formula for GPU parallel algorithms is proposed; it makes up for the shortcomings of the traditional algorithms that only focuses on time, which is conducive to finding out the direction of program optimization. Optimization of the parallel program is studied through two aspects: shared memory and CUDA stream. The optimized program obtains a better speedup ratio.
- From the improved analytical formula, GPU parallel programs should be optimized from the following aspects: Hardware level: replace the GPU with stronger performance to obtain more threads and higher clock rate. Algorithm level: modify the algorithm to increase the proportion of parallel operations, improve the running efficiency of kernel functions and overlap more data transmission time;
- The data access mode of parallel programs is optimized by using shared memory, and the problem of bank conflicts is further solved. For matrix transpose operations of finite element operations, 437.5× acceleration is achieved;
- This paper implements asynchronous parallelism on the GPU through CUDA streams, which can hide the time of data access. Overlap rate theory of Memcpy is proposed to guide the optimization of asynchronous parallel programs. For GPU kernel subroutines of matrix inner products, compared with ordinary GPU parallel programs that do not use asynchronous parallelism, it can achieve nearly twice the acceleration. Compared with serial programs, it can achieve 61.42× acceleration. Not all programs are suitable for parallelization and need to be analyzed on a case-by-case basis;
- The Fortran finite element program for temperature and stress fields of concrete is reconstructed and optimized in parallel. The GPU parallelization of the program plays a role in improving computational efficiency while ensuring the computational accuracy.
Author Contributions
Conceptualization, X.Z., J.J. and S.Q.; methodology, X.Z., Y.W. and S.Q.; software, X.Z., Y.W. and S.Q.; formal analysis, X.Z., Y.W. and S.Q.; investigation, X.Z. and M.Y.; resources, X.Z. and J.J.; data curation, X.Z., J.J. and M.Y.; writing—original draft preparation, X.Z. and J.J.; writing—review and editing, X.Z., Y.W. and S.Q.; visualization, X.Z., Y.W. and S.Q. supervision, Y.W. and S.Q.; project administration, Y.W. and S.Q.; funding acquisition, X.Z. and S.Q. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China, grant number 52079049, Water Conservancy Science and Technology Project of Henan Province, China in 2022.
Data Availability Statement
Data are contained within the article.
Acknowledgments
The authors acknowledge the Fundamental Research Funds for the Central Universities and the Postgraduate Research & Practice Innovation Program of Jiangsu Province.
Conflicts of Interest
Authors Jiping Jin and Yajun Wang were employed by the company The First Engineering Bureau of Henan Water Conservancy. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CUDA | Compute Unified Device Architecture |
| CPU | Central Processing Unit |
| GPU | Graphics Processing Unit |
| H2D | Host to Device |
| D2H | Device to Host |
| Dg/Db | Dim of Girde/ Dim of Block |
| Memcpy | Memory Copy |
References
- Aniskin, N.; Nguyen, T.C. Influence factors on the temperature field in a mass concrete. E3S Web Conf. 2019, 97, 05021. [Google Scholar] [CrossRef]
- Briffaut, M.; Benboudjema, F.; Torrenti, J.M. Numerical analysis of the thermal active restrained shrinkage ring test to study the early age behavior of massive concrete structures. Eng. Struct. 2011, 33, 1390–1401. [Google Scholar] [CrossRef]
- Silva, C.M.; Castro, L.M. Hybrid-mixed stress model for the nonlinear analysis of concrete structures. Comput. Struct. 2005, 83, 2381–2394. [Google Scholar] [CrossRef]
- Yuan, M.; Qiang, S.; Xu, Y. Research on Cracking Mechanism of Early-Age Restrained Concrete under High-Temperature and Low-Humidity Environment. Materials 2021, 14, 4084. [Google Scholar] [CrossRef] [PubMed]
- Wang, H.B.; Qiang, S.; Sun, X. Dynamic Simulation Analysis of Temperature Field and Thermal Stress of Concrete Gravity Dam during Construction Period. Appl. Mech. Mater. 2011, 90–93, 2677–2681. [Google Scholar] [CrossRef]
- Zhang, G.; Liu, Y.; Yang, P. Analysis on the causes of crack formation and the methods of temperature control and crack prevention during construction of super-high arch dams. J. Hydroelectr. Eng. 2010, 29, 45–51. [Google Scholar] [CrossRef]
- Yin, T.; Li, Q.; Hu, Y.; Yu, S.; Liang, G. Coupled Thermo-Hydro-Mechanical Analysis of Valley Narrowing Deformation of High Arch Dam: A Case Study of the Xiluodu Project in China. Appl. Sci. 2020, 10, 524. [Google Scholar] [CrossRef]
- Zheng, X.; Shen, Z.; Wang, Z.; Qiang, S.; Yuan, M. Improvement and Verification of One-Dimensional Numerical Algorithm for Reservoir Water Temperature at the Front of Dams. Appl. Sci. 2022, 12, 5870. [Google Scholar] [CrossRef]
- Chang, X.; Liu, X.; Wei, B. Temperature simulation of RCC gravity dam during construction considering solar radiation. Eng. J. Wuhan Univ. 2006, 39, 26–29. [Google Scholar] [CrossRef]
- Nishat, A.M.; Mohamad, S.J. Efficient CPU Core Usage and Balanced Bandwidth Distribution using Smart Adaptive Arbitration. Indian J. Sci. Technol. 2016, 9, S1. [Google Scholar] [CrossRef][Green Version]
- Lastovetsky, A.; Manumachu, R.R. Energy-Efficient Parallel Computing: Challenges to Scaling. Information 2023, 14, 248. [Google Scholar] [CrossRef]
- Mao, R.J.; Huang, L.S.; Xu, D.J.; Chen, G.L. Joint optimization scheduling algorithm and parallel implementation of group Base in Middle and upper reaches of Huaihe River. Small Microcomput. Syst. 2000, 21, 5. [Google Scholar]
- Jia, M. Master-slave Parallel Mind Evolutionary Computation Based on MPI. J. North Univ. China Nat. Sci. Ed. 2007, 4, 66–69. [Google Scholar]
- Wu, W.; Wang, X. Multi-Core CPU Parallel Power Flow Computation in AC/DC System Considering DC Control. Electr. Power Compon. Syst. 2017, 45, 990–1000. [Google Scholar] [CrossRef]
- Bocci, A.; CMS Collaboration. CMS High Level Trigger performance comparison on CPUs and GPUs. J. Phys. Conf. Ser. 2023, 2438, 012016. [Google Scholar] [CrossRef]
- Zhang, S.; Zhang, L.; Guo, H.; Zheng, Y.; Ma, S.; Chen, Y. Inference-Optimized High-Performance Photoelectric Target Detection Based on GPU Framework. Photonics 2023, 10, 459. [Google Scholar] [CrossRef]
- Misbah, M.; Christopher, D.C.; Robert, B.R. Enabling Parallel Simulation of Large-Scale HPC Network Systems. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 87–100. [Google Scholar] [CrossRef]
- Xu, W.; Jia, M.; Guo, W.; Wang, W.; Zhang, B.; Liu, Z.; Jiang, J. GPU-based discrete element model of realistic non-convex aggregates: Mesoscopic insights into ITZ volume fraction and diffusivity of concrete. Cem. Concr. Res. 2023, 164, 107048. [Google Scholar] [CrossRef]
- Wang, J.; Kuang, C.; Ou, L.; Zhang, Q.; Qin, R.; Fan, J.; Zou, Q. A Simple Model for a Fast Forewarning System of Brown Tide in the Coastal Waters of Qinhuangdao in the Bohai Sea, China. Appl. Sci. 2022, 12, 6477. [Google Scholar] [CrossRef]
- Hu, H.; Zhang, J.; Li, T. Dam-Break Flows: Comparison between Flow-3D, MIKE 3 FM, and Analytical Solutions with Experimental Data. Appl. Sci. 2018, 8, 2456. [Google Scholar] [CrossRef]
- Umeda, H.; Hanawa, T.; Shoji, M. Performance Benchmark of FMO Calculation with GPU-Accelerated Fock Matrix Preparation Routine. J. Comput. Chem. Jpn. 2015, 13, 323–324. [Google Scholar] [CrossRef]
- Umeda, H.; Hanawa, T.; Shoji, M. GPU-accelerated FMO Calculation with OpenFMO: Four-Center Inter-Fragment Coulomb Interaction. J. Comput. Chem. Jpn. 2015, 14, 69–70. [Google Scholar] [CrossRef][Green Version]
- Mcgraw, T.; Nadar, M. Stochastic DT-MRI Connectivity Mapping on the GPU. IEEE Trans. Vis. Comput. Graph. 2007, 13, 1504–1511. [Google Scholar] [CrossRef]
- Qin, A.; Xu, J.; Feng, Q. Fast 3D medical image rigid registration Technology based on GPU. Ication Res. Comput. 2010, 3, 1198–1200. [Google Scholar] [CrossRef]
- Ishida, J.; Aranami, K.; Kawano, K.; Matsubayashi, K.; Kitamura, Y.; Muroi, C. ASUCA: The JMA Operational Non-hydrostatic Model. J. Meteorol. Soc. Japan Ser. II 2022, 100, 825–846. [Google Scholar] [CrossRef]
- Liu, G.F.; Liu, Q.; Li, B.; Tong, X.L.; Liu, H. GPU/CPU co-processing parallel computation for seismic data processing in oil and gas exploration. Prog. Geophys. 2009, 24, 1671–1678. [Google Scholar] [CrossRef]
- Novalbos, M.; Gonzalez, J.; Otaduy, M.A.; Sanchez, A. On-Board Multi-GPU Molecular Dynamics. In Euro-Par 2013 Parallel Processing; Wolf, F., Mohr, B., Mey, D., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8097, pp. 862–873. [Google Scholar] [CrossRef]
- Liang, B.; Wang, S.; Huang, Y.; Liu, Y.; Ma, L. F-LSTM: FPGA-Based Heterogeneous Computing Framework for Deploying LSTM-Based Algorithms. Electronics 2023, 12, 1139. [Google Scholar] [CrossRef]
- Wen, C.; Ou, J.; Jinyuan, A.J. GPGPU-based Smoothed Particle Hydrodynamic Fluid Simulation. J. Comput.-Aided Des. Comput. Graph. 2010, 22, 406–411. [Google Scholar] [CrossRef]
- Lin, S.Z.; Zhi, Q. A Jacobi_PCG solver for sparse linear systems on multi-GPU cluster. J. Supercomput. 2016, 73, 433–454. [Google Scholar] [CrossRef]
- Kalita, J.C.; Upadhyaya, P.; Gupta, M.M. GPU accelerated flow computation by the streamfunction-velocity (ψ-ν) formulation. In AIP Conference Proceedings; AIP Publishing: Melville, NY, USA, 2015; Volume 1648. [Google Scholar] [CrossRef]
- Thi, K.T.; Huong, N.T.M.; Huy, N.D.Q.; Tai, P.A.; Hong, S.; Quan, T.M.; Bay, N.T.; Jeong, W.-K.; Phung, N.K. Assessment of the Impact of Sand Mining on Bottom Morphology in the Mekong River in An Giang Province, Vietnam, Using a Hydro-Morphological Model with GPU Computing. Water 2020, 12, 2912. [Google Scholar] [CrossRef]
- Nesti, A.; Mediero, L.; Garrote, L. Probabilistic calibration of the distributed hydrological model RIBS ied to real-time flood forecasting: The Harod river basin case study (Israel). Egu General Assembly 2010, 12, 8028. [Google Scholar] [CrossRef]
- Schneider, R.; Lewerentz, L.; Lüskow, K.; Marschall, M.; Kemnitz, S. Statistical Analysis of Table-Tennis Ball Trajectories. Appl. Sci. 2018, 8, 2595. [Google Scholar] [CrossRef]
- Lee, T.L. Prediction of Storm Surge and Surge Deviation Using a Neural Network. J. Coast. Res. 2008, 24, 76–82. [Google Scholar] [CrossRef]
- Fonseca, R.B.; Gonçalves, M.; Guedes Soares, C. Comparing the Performance of Spectral Wave Models for Coastal Areas. J. Coast. Res. 2017, 33, 331–346. [Google Scholar] [CrossRef]
- Ganesh, R.; Gopaul, N. A predictive outlook of coastal erosion on a log-spiral bay (trinidad) by wave and sediment transport modelling. J. Coast. Res. 2013, 65, 488–493. [Google Scholar] [CrossRef]
- Zhu, Z.Y.; Qiang, S.; Liu, M.Z. Cracking Mechanism of RCC Dam Surface and Prevention Method. Adv. Mater. Res. 2011, 295–297, 2092–2096. [Google Scholar] [CrossRef]
- Zakonnova, A.V.; Litvinov, A.S. Analysis of Relation of Water Temperature in the Rubinsk Reservoir with Income of Solar Radiation. Hydrobiol. J. 2017, 53, 77–86. [Google Scholar] [CrossRef]
- Windisch, D.; Kaever, C.; Juckeland, G.; Bieberle, A. Parallel Algorithm for Connected-Component Analysis Using CUDA. Algorithms 2023, 16, 80. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
















