A Novel GPU-Based Acceleration Algorithm for a Longwave Radiative Transfer Model

: Graphics processing unit (GPU)-based computing for climate system models is a longstanding research area of interest. The rapid radiative transfer model for general circulation models (RRTMG), a popular atmospheric radiative transfer model, can calculate atmospheric radiative ﬂuxes and heating rates. However, the RRTMG has a high calculation time, so it is urgent to study its GPU-based efﬁcient acceleration algorithm to enable large-scale and long-term climatic simulations. To improve the calculative efﬁciency of radiation transfer, this paper proposes a GPU-based acceleration algorithm for the RRTMG longwave radiation scheme (RRTMG_LW). The algorithm concept is accelerating the RRTMG_LW in the g - point dimension. After implementing the algorithm in CUDA Fortran, the G-RRTMG_LW was developed. The experimental results indicated that the algorithm was effective. In the case without I/O transfer, the G-RRTMG_LW on one K40 GPU obtained a speedup of 30.98 × over the baseline performance on one single Intel Xeon CPU core. When compared to its counterpart running on 10 CPU cores of an Intel Xeon E5-2680 v2, the G-RRTMG_LW on one K20 GPU in the case without I/O transfer achieved a speedup of 2.35 × .


Introduction
The radiative process, one of the important atmospheric physics processes, is often used for calculating atmospheric radiative fluxes and heating rates [1]. To simulate the radiative process, several radiative transfer models were developed, such as the line-by-line radiative transfer model (LBLRTM) [2], and the rapid radiative transfer model (RRTM) [3]. The RRTM that is a validated model computing longwave and shortwave radiative fluxes and heating rates, uses the correlated-k method to provide the required accuracy and computing efficiency [4], but it still demands enormous computing resources for long-term climatic simulation. To address this issue, as an accelerated version of RRTM, the rapid radiative transfer model for general circulation models (RRTMG) provides improved efficiency with minimal loss of accuracy for atmospheric general circulation models (GCMs) [5]. As a coupled climate system model comprising eight separate component models and one central coupler, the Chinese Academy of Sciences-Earth System Model (CAS-ESM) [6,7] was developed by the Institute of Atmospheric Physics (IAP) of Chinese Academy of Sciences (CAS). As the atmospheric component model of the CAS-ESM, the IAP Atmospheric General Circulation Model Version 4.0 (IAP AGCM4.0) [8,9] used the RRTMG as its radiative parameterization scheme. (1) To further accelerate the RRTMG_LW with a massively parallel computing technology, a GPU-based accelerating algorithm in the g-point dimension is proposed. The aim is to explore the parallelization of the RRTMG_LW in the g-point dimension. The proposed algorithm adapts well to the advances in multi-threading computing technology of GPUs and can be generalized to accelerate the RRTMG shortwave radiation scheme (RRTMG_SW). (2) The G-RRTMG_LW was implemented in CUDA (NVIDIA's Compute Unified Device Architecture) Fortran and shows excellent computational capability. To some extent, the more efficient computation of the G-RRTMG_LW supports real-time computing of the CAS-ESM. Moreover, the heterogeneous computing of the CAS-ESM is implemented.
The remainder of this paper is organized as follows. Section 2 presents representative approaches that aim at improving the computational efficiency of physical parameterization schemes. Section 3 introduces the RRTMG_LW model and GPU environment. Section 4 details the CUDA-based parallel algorithm in the g-point dimension for the RRTMG_LW. Section 5 describes the parallelization implementation of the G-RRTMG_LW. Section 6 evaluates the performance of the G-RRTMG_LW in terms of runtime efficiency and speedup, and discusses some of the problems arising in the experiment. The last section concludes the paper with a summary and proposal for future work.

Related Work
There were many successful attempts at using GPUs to accelerate physical parameterization schemes and climatic system models. This section describes the most salient work along this direction.
The WRF Goddard shortwave radiance was accelerated on GPUs using CUDA C [24]. Via double precision arithmetic and with data I/O, the shortwave radiance obtained a 116× speedup on two NVIDIA GTX 590 s [25]. The WRF five-layer thermal diffusion scheme was accelerated using CUDA C, and a 311× speedup was obtained on one Tesla K40 GPU [26]. The WRF Single Moment 6-class microphysics scheme was also accelerated using CUDA C, and obtained a 216× speedup on one Tesla K40 GPU [27].
The WRF long-wave RRTM code was ported to GPUs using CUDA Fortran [28]. The RRTM on Tesla C1060 GPUs attained a 10× speedup [29]. The RRTM longwave radiation scheme (RRTM_LW) on the GTX480 obtained a 27.6× speedup compared with the baseline wall-clock time [1]. The CUDA Fortran version of the RRTM_LW in the GRAPES_Meso model was developed. It adopted some optimization methods for enhancing the computational efficiency, and obtained a 14.3× speedup [10].
The Fortran code of the WRF RRTMG_LW was rewritten in the C programming language, and then its GPU parallelization was implemented using CUDA C. With I/O transfer, the RRTMG_LW achieved a 123× speedup on one Tesla K40 GPU [30]. The Fortran code of the RRTMG_SW was also rewritten in the C programming language. Furthermore, the RRTMG_SW achieved a 202× speedup on one Tesla K40 GPU compared with its single-threaded Fortran counterpart running on Intel Xeon E5-2603 [31].
In a significantly different approach from the previous work, this study first proposes a new and detailed parallel algorithm in the g-point dimension for the CAS-ESM RRTMG_LW. Rewriting the original Fortran code of the RRTMG_LW would take considerable time, so CUDA Fortran rather than CUDA C was adopted in the parallelization implementation. The major concerns addressed by the proposed algorithm include the following: (a) runtime efficiency, and (b) common processes and technologies of GPU parallelization.

RRTMG_LW Model
As a critical process affecting our planet's climate, radiative transfer is the transport of energy by electromagnetic waves through a gas. Atmospheric and Environmental Research (AER) developed the RRTM and RRTMG. Their calculations exhibit an effective accuracy equivalent to that provided by the LBLRTM, but their computational cost is lower. In view of these advantages, the RRTM or RRTMG was adopted as the radiative transfer schemes of many climate models, such as the WRF and GRAPES [32].
The RRTM uses the correlated k-distribution method to calculate the broad-band radiative fluxes. In this method, the radiative spectrum is first divided into bands. Because of the rapid variation of absorption lines within the bands of gas molecules, the values of the absorption intensities within each band are further binned into a cumulative distribution function of the intensities. This distribution function is then discretized by using g intervals for integration within each band to obtain the band radiative fluxes, which are further integrated across the bands to obtain the total radiative flux to calculate atmospheric radiative heating or cooling. The g points are the discretized absorption intensities within each band.
The RRTM_LW is the RRTM for infrared radiation (longwave, LW); RRTM_SW is the RRTM for solar radiation (shortwave, SW). RRTM_LW has 16 bands and 256 g points. RRTM_SW has 16 bands and 224 g points. To speed up the calculations for climate and weather models, the spectral resolutions of RRTM_LW and RRTM_SW are further coarsened for applications in GCMs as RRTMG_LW and RRTMG_SW. RRTMG_LW has 16 bands and 140 g points. RRTMG_SW has 16 bands and 112 g points [33].
Below, we briefly describe the radiation flux and heating/cooling rate for calculating radiative transfer through a planetary atmosphere. The spectrally averaged outgoing radiance from an atmospheric layer is expressed using the following formula: In this expression, µ is the zenith direction cosine; v is the wavenumber; θ is temperature; v 1 and v 2 are the beginning and ending wavenumbers of the spectral interval, respectively; B(v, θ) is the Planck function at v and θ; I 0 is the radiance incoming to the layer; T v is the transmittance for the layer optical path; T v is the transmittance at a point along the optical path in the layer.
Equation (1) can derive the following Equation (2) under the necessary assumptions. In Equation (2), g is the fraction of the absorption coefficient; Be f f (g, T g ) is an effective Planck function for the layer; ρ is the absorber density in the layer; P is layer pressure; ∆z is the vertical thickness of the layer; ϕ is the angle of the optical path in the azimuthal direction; k(g, P, θ) is the absorption coefficient at P and θ.
The monochromatic net flux is where The total net flux is obtained by integrating over v The radiative heating (or cooling) rate is expressed as where c p is the specific heat at a constant pressure; P is pressure; g is the gravitational acceleration; ρ is the air density in a given layer [30]. Figure 1 indicates the profiling graph of the original Fortran RRTMG_LW. Here, the subroutine rad_rrtmg_lw is the driver of longwave radiation code. The subroutine mcica_subcol_lw is used to create Monte-Carlo Independent Column Approximation (McICA) stochastic arrays for cloud physical or optical properties. The subroutine rrtmg_lw is the driver subroutine for the RRTMG_LW. The subroutine rrtmg_lw (a) calls the subroutine inatm to read in the atmospheric profile from the GCM for use in the RRTMG_LW, and to define other input parameters; (b) calls the subroutine cldprmc to set cloud optical depth for the McICA based on the input cloud properties; (c) calls the subroutine setcoe f to calculate information needed by the radiative transfer routine that is specific to this atmosphere, especially some of the coefficients and indices needed to compute the optical depths by interpolating data from stored reference atmospheres; (d) calls the subroutine taumol to calculate the gaseous optical depths and Planck fractions for each of the 16 spectral bands; (e) calls the subroutine rtrnmc (for both clear and cloudy profiles) to perform the radiative transfer calculation using the McICA to represent sub-grid scale cloud variability; and (f) passes the necessary fluxes and heating rates back to the GCM. Algorithm 1 shows the computing procedure of rrtmg_lw. As depicted in Figure 1, rrtmg_lw took most computing time of rad_rrtmg_lw, so the study target was to use GPUs for accelerating the inatm, cldprmc, setcoe f , taumol, and rtrnmc subroutines.

GPU and CUDA Fortran
There are an array of streaming multiprocessors (SMs) inside a GPU. Several streaming processors inside each SM share the control logic and instruction cache. CUDA, a general purpose parallel computing architecture, fosters a software environment to make full use of many cores of GPUs in a massively parallel fashion. Functions or subroutines are defined as "kernels" by CUDA, and then they are executed on the GPU. Before running on the GPU, these kernels need to be invoked by the CPU. In the three-level hierarchy of CUDA, each kernel has a grid that is at the highest level. Each grid consists of thread blocks. Inside each thread block, data is efficiently shared by a group of threads through a fast shared memory [34]. CUDA supports many high-level languages, such as C/C++ and Fortran. NVIDIA and PGI jointly developed CUDA Fortran [28]. CUDA Fortran extends Fortran in memory management statements, declaration statements, CUDA runtime APIs, and kernel execution syntaxes [35]. Algorithm 1: Computing procedure of original rrtmg_lw. subroutine rrtmg_lw(parameters) //ncol is the number of horizontal columns 1. do iplon=1, ncol 2. call inatm(parameters) 3. call cldprmc(parameters) 4. call setcoef (parameters) 5. call taumol(parameters) 6. if aerosol is active then //combine gaseous and aerosol optical depths 7.

GPU-Enabled Acceleration Algorithm
In this section, the 2D acceleration algorithm of the RRTMG_LW is introduced. Then, the parallel strategy of the RRTMG_LW in the g-point dimension is described. Finally, a CUDA-based acceleration algorithm is proposed.

2D Acceleration Algorithm
The RRTMG_LW uses a collection of three-dimensional (3D) cells to describe the atmosphere. Its 1D acceleration algorithm with a domain decomposition in the horizontal direction assigns the workload of one "column," shown in Figure 2, to each CUDA thread. Here, the x-axis represents longitude, the y-axis represents latitude, and the z-axis represents the vertical direction. The RRTMG_LW in spatial structure has three dimensions, but the x and y dimensions in its CUDA code implementation are merged into one dimension to easily write the code. In the CAS-ESM, the IAP AGCM4.0 has a 1.4 • × 1.4 • horizontal resolution and 51 levels in the vertical direction, so the RRTMG_LW has nx×ny = 256 × 128 horizontal grid points. Thus, the first dimension of 3D arrays in its code has 256 × 128 elements at most.
To make full use of the GPU performance, the 2D acceleration algorithm with a domain decomposition in the horizontal and vertical directions for the RRTMG_LW was proposed in our previous study [23]. Figure 3 illustrates the 2D domain decomposition for the RRTMG_LW accelerated on the GPU. The 2D acceleration algorithm is illustrated in Algorithm 2. Because of data dependency, the acceleration of cldprmc and rtrnmc in the vertical direction is unsuitable, while inatm, setcoe f , and taumol are able to accelerate in the vertical direction. In the 1D acceleration, n is the number of threads in each thread block, while m = (real)ncol/n is the number of blocks used in each kernel grid. In the 2D acceleration, tBlock defines the number of threads used in each thread block of the x, y, and z dimensions by the derived type dim3. Furthermore, grid defines the number of blocks in the x, y, and z dimensions by dim3.

Parallel Strategy
In the RRTMG_LW, the total number of g points, ng, is 140. Therefore, there are iterative computations for each g point in inatm, taumol, and rtrnmc. For example, the computation of 140 g points is executed by a do-loop in the GPU-based acceleration implementation of 1D rtrnmc_d, as illustrated in Algorithm 3. To achieve more fine-grained parallelism, 140 CUDA threads can be assigned to run the kernels inatm_d, taumol_d, and rtrnmc_d. Thus, the parallel strategy is further accelerating inatm_d, taumol_d, and rtrnmc_d in the g-point dimension. Moreover, the parallelization between the kernels should also be considered in addition to that within the kernels.
It is noteworthy that the first dimension of 3D arrays in the CUDA code represents the number of horizontal columns, the second dimension represents the number of model layers, and the third dimension represents the number of g points. If one GPU is applied, in theory, nx×ny×nz×ng = 256 × 128 × 51 × 140 CUDA threads will be required for each kernel in the new parallel method. Figure 4 illustrates the domain decomposition in the g-point dimension for the RRTMG_LW accelerated on the GPU. The acceleration algorithm in the g-point dimension for the RRTMG_LW, is illustrated in Algorithm 4. The algorithm is described as follows: (1) In the acceleration algorithm, inatm consists of five kernels (inatm_d1, inatm_d2, inatm_d3, inatm_d4, and inatm_d5). Due to data dependency, a piece of code in inatm can be parallel only in the horizontal or vertical direction, so the kernel inatm_d4 uses a 1D decomposition. The kernels inatm_d1, inatm_d2, and inatm_d5 use a 2D decomposition in the horizontal and vertical directions. The kernel inatm_d3 uses a composite decomposition in the horizontal and vertical directions and g-point dimension. Due to the requirement of data synchronization, inatm_d1 and inatm_d2 cannot be merged into one kernel.

Acceleration Algorithm
(2) The kernel cldprmc_d still uses a 1D decomposition. (3) Similarly, the kernel setcoe f _d1 uses a 2D decomposition, and the kernel setcoe f _d2 uses a 1D decomposition. (4) The kernel taumol_d uses a composite decomposition in the horizontal and vertical directions and g-point dimension. In taumol_d, 16 subroutines with the device attribute are invoked. (5) Similarly, rtrnmc consists of 11 kernels (rtrnmc_d1-rtrnmc_d11). Here, rtrnmc_d1, rtrnmc_d4, rtrnmc_d8, rtrnmc_d10, and rtrnmc_d11 use a 1D decomposition. Furthermore, rtrnmc_d2 and rtrnmc_d9 use a 2D decomposition in the horizontal and vertical directions. In addition, rtrnmc_d5 and rtrnmc_d6 use a 2D decomposition in the horizontal direction and g-point dimension. Finally, rtrnmc_d3 and rtrnmc_d7 use a composite decomposition in the horizontal and vertical directions and g-point dimension.

Algorithm Implementation
In this section, the acceleration algorithm implementation is described. The implementations of 1D cldprmc_d and 2D setcoe f _d were described in our previous paper, so this section only considers the implementations of inatm_d, taumol_d, and rtrnmc_d with an acceleration in the g-point dimension.

Inatm_d
The implementations of CUDA-based 2D inatm_d1, inatm_d2, and inatm_d5 are similar to the procedure in Algorithm 5. Here, threadIdx%x identifies a unique thread inside a thread block in the x dimension, blockIdx%x identifies a unique thread block inside a kernel grid in the x dimension, and blockDim%x identifies the number of threads in a thread block in the x dimension. In the same way, threadIdx%y, blockIdx%y, and blockDim%y identify the corresponding configuration in the y dimension. Please note that the "%" symbol in Fortran is the access to fields of a structure and not the modulus operator. In addition, iplon, the coordinate of the horizontal grid points, represents the ID of a global thread in the x dimension, which can be expressed as iplon=(blockIdx%x-1)×blockDim%x+threadIdx%x. The coordinate of the vertical direction, lay, also represents the ID of a global thread in the y dimension, which can be expressed as lay=(blockIdx%y-1)×blockDim%y+threadIdx%y. Thus, a grid point in the horizontal and vertical directions can be identified by the iplon and lay variables as 2D linear data.
The implementation of inatm_d3 is illustrated in Algorithm 6. Here, threadIdx%z identifies a unique thread inside a thread block in the z dimension, blockIdx%z identifies a unique thread block inside a kernel grid in the z dimension, and blockDim%z identifies the number of threads in a thread block in the z dimension. The coordinate of the g points, ig, also represents the ID of a global thread in the x dimension, which can be expressed as ig=(blockIdx%x-1)×blockDim%x+threadIdx%x. Furthermore, inatm_d3 is used to assign a value to the four 3D arrays.
The implementation of 1D inatm_d4 is illustrated in Algorithm 7.

rtrnmc_d
The implementations of 1D rtrnmc_d1, rtrnmc_d4, rtrnmc_d8, rtrnmc_d10, and rtrnmc_d11 are similar to the procedure in Algorithm 3. The 2D rtrnmc_d2 and rtrnmc_d9 are used to assign a value to some arrays; their implementations are not described further here. The implementations of rtrnmc_d3, 2D rtrnmc_d5, and 2D rtrnmc_d6 are illustrated in Algorithm 9. The implementation of rtrnmc_d7 is similar to that of rtrnmc_d3.
do some corresponding work 7. end do 8. end if 9. When iband=11 ∼ 16, the algorithms are similar to that in the case of iband=10. end subroutine

Result and Discussion
Experimental studies were conducted to evaluate and compare the performance of the proposed algorithm with the solutions above. The results are described below.

Experimental Setup
To fully investigate the proposed algorithm, this paper conducted an ideal global climate simulation for one model day. The time step of the RRTMG_LW was 1 h.
The experiment platforms include two GPU clusters (K20 and K40 clusters). The K20 cluster at the Computer Network Information Center of CAS has 30 GPU nodes. Each GPU node has two Intel Xeon E5-2680 v2 processors and two NVIDIA Tesla K20 GPUs. Twenty CPU cores in each GPU node share 64 GB DDR3 system memory through QuickPath Interconnect. The basic compiler is the PGI Fortran compiler Version 14.10 that supports CUDA Fortran. The K40 cluster at China University of Geosciences (Beijing) has four GPU nodes, each with two NVIDIA Tesla K40 GPUs. Table 1 lists their detailed configurations. The serial RRTMG_LW was executed on an Intel Xeon E5-2680 v2 processor of K20 cluster. The G-RRTMG_LW was executed on one K20 or K40 GPU.

Influence of Block Size
The serial runtime of the subroutine rrtmg_lw on one core of an Intel Xeon E5-2680 v2 processor, which accounts for 66.07% of the total computing time of the subroutine rad_rrtmg_lw, is 647.12 s in this simulation, as shown in Table 2. Here, the computing time of the RRTMG_LW on the CPU or GPU, T rrtmg_lw , is calculated with the following formula: where T inatm is the computing time of the subroutine inatm or kernel inatm_d; moreover, T cldprmc , T setcoe f , T taumol , and T rtrnmc are the corresponding computing times of the other kernels. For exploring whether/how the number of threads in a thread block may affect the computation performance, the execution time of the G-RRTMG_LW with a tuned number of threads was measured over one GPU. Taking the case of no I/O transfer as an example, Figure 5 portrays the runtime of the G-RRTMG_LW on one K20 GPU. Indeed, the number of threads per block, or block size, affected the computation performance to some extent. The G-RRTMG_LW achieved optimal performance when the block size was 128. The G-RRTMG_LW on one K40 GPU resulted in a similar rule, as shown in Figure 6. Some conclusions and analysis are drawn as below. (1) Generally, increasing the block size can hide some memory access latency to some extent and improve the computational performance of parallel algorithms. Therefore, kernels with simple computation usually show optimal performance when the block size is 128 or 256. Thus, with a large amount of calculation, the kernel taumol with an acceleration in the g-point dimension achieved optimal performance on one K20 GPU when the block size was 128.
The runtime and speedup of the RRTMG_LW 2D acceleration algorithm on the K20 and K40 GPUs are shown in Table 2. From Table 2, Figures 5 and 6, the kernel taumol with an acceleration in the g-point dimension costs more computational time than its 2D version did. This is because there is much redundant computing in the current taumol. For example, each thread in 2D taumol ran the code shown in Table 3, but each thread in the current taumol still had to run the code although there were more threads. Therefore, the current taumol, with more threads than its 2D counterpart, did not take full advantage of having plenty of threads.
The rtrnmc with an acceleration in the g-point dimension on one K20 GPU and one K40 GPU both achieved optimal performance when the block size was 512. During its current acceleration, more threads were assigned, so that each thread had fewer calculations and required fewer hardware resources. When the block size was 512, the assigned hardware resources of each thread were in a state of equilibrium, so in this case, the current rtrnmc showed better performance.   Table 3. A piece of code in taumol.

Evaluations on Different GPUs
The runtime and speedup of the G-RRTMG_LW on the K20 and K40 GPUs are shown in Table 4. Due to the poor performance of the taumol with an acceleration in the g-point dimension, its 2D version was still used here. The speedups of the inatm and rtrnmc with an acceleration in the g-point dimension on one K20 GPU were 87.37× and 13.20×, respectively. Using one K20 GPU in the case without I/O transfer, the G-RRTMG_LW achieved a speedup of 25.47× as compared to its counterpart running on one CPU core of an Intel Xeon E5-2680 v2. Whereas, using one K40 GPU in the case without I/O transfer, the G-RRTMG_LW achieved a speedup of 30.98×. Some conclusions and analysis are drawn below. (1) The K40 GPU had a higher core clock and memory clock, more cores, and stronger floating-point computation power than the K20 GPU did. Thus, the G-RRTMG_LW on the K40 GPU showed better performance. (2) According to the testing, it was found that the transposition of four 3D arrays in the 1D or 2D inatm required most of the computational time. Because of discontinuous access for these arrays using a do-loop form, the transposition cost too much time. However, the inatm with an acceleration in the g-point dimension has no do-loops, as shown in Algorithm 6, so it can show an excellent performance improvement.
There are 11 kernels in the current rtrnmc, but only two of them have a composite decomposition in the horizontal, vertical and g-point dimensions. The two kernels only cost about 17% computational time, so the current rtrnmc did not achieve a noticeable performance improvement compared with its 1D version. When compared to its counterpart running on 10 CPU cores of an Intel Xeon E5-2680 v2, the speedup of the G-RRTMG_LW on the K20 GPU is shown in Table 5. In this case, using one K20 GPU in the case without I/O transfer, the G-RRTMG_LW achieved a speedup of 2.35×. This shows that running the RRTMG_LW on one K20 GPU still has a better computing performance than on 10 cores of a CPU. Table 5. Runtime and speedup of the G-RRTMG_LW on one GPU when compared to its counterpart running on 10 CPU cores of an Intel Xeon E5-2680 v2. Here, the block size = 512 and ncol = 2048; inatm and rtrnmc are with an acceleration in the g-point dimension; cldprmc is with a 1D decomposition; setcoef and taumol are with a 2D decomposition. (1)

Subroutines CPU Time (S) K20 Time (S) Speedup
In the simulation with one model day, the RRTMG_LW required integral calculations 24 times. During each integration for all 128×256 grid points, the subroutine rrtmg_lw had to be invoked 16 (128×256/ncol) times when ncol is 2048. Due to the memory limitation of the GPU, the maximum value of ncol on the K40 GPU was 2048. This means that the G-RRTMG_LW was invoked repeatedly 16×24 = 384 times in the experiment. For each invocation, the input and output of the 3D arrays must be updated between the CPU and GPU, so the I/O transfer incurs a high communication cost. (2) The computational time of the G-RRTMG_LW in the case without I/O transfer was fairly short, so the I/O transfer cost exhibited a huge bottleneck for the maximum level of performance improvement of the G-RRTMG_LW. In the future, compressing data and improving the network bandwidth will be beneficial for reducing this I/O transfer cost.

Error Analysis
During accelerating the computational performance of a climate system model, it is of vital importance to ensure that the model on one GPU can generate the same results within a small tolerance threshold. In this simulating experiment, Figure 7 illustrates the impact on the longwave flux at the top of the atmosphere in a clear sky. The outgoing longwave flux by running entirely the CAS-ESM RRTMG_LW on the CPU is shown in Figure 7a. The longwave flux differences between the simulations running the CAS-ESM RRTMG_LW only on the CPU and running the CAS-ESM RRTMG_LW (G-RRTMG_LW) on the K20 GPU is shown in Figure 7b. The results show that there are minor and negligible differences. Besides the impact of running the G-RRTMG_LW on the GPU, the impact of the slight physics change by running the G-RRTMG_LW code on the GPU also results in these differences.

Conclusions and Future Work
It is exceedingly challenging to use GPUs to accelerate radiation physics. In this work, a GPU-based acceleration algorithm of the RRTMG_LW in the g-point dimension was proposed. Then, the acceleration algorithm was implemented using CUDA Fortran. Finally, G-RRTMG_LW, the GPU version of the RRTMG_LW, was developed. The results indicated that the algorithm was effective. During the climate simulation for one model day, the G-RRTMG_LW on one K40 GPU achieved a speedup of 30.98× as compared with a single Intel Xeon E5-2680 CPU-core counterpart. Its runtime decreased from 647.12 s to 20.8876 s. Running the G-RRTMG_LW on one GPU presented a better computing performance than on a CPU with multiple cores. Furthermore, the current acceleration algorithm of the RRTMG_LW displayed better calculation performance than its 2D algorithm did.
The future work will include three aspects. First, the proposed algorithm will be further optimized. For example, read-only arrays in CUDA code will be considered for inclusion in the CUDA texture memory, rather than in the global memory. Second, the I/O transfer in the current G-RRTMG_LW still costs a great deal of time, so the methods of reducing the I/O transfer between the CPU and GPU will be studied. Third, the G-RRTMG_LW currently only runs on one GPU. The CAS-ESM often runs on several CPUs and nodes, so the acceleration algorithm on multiple GPUs will be studied. An MPI+CUDA hybrid paradigm will be adopted to run the G-RRTMG_LW on multiple GPUs.