Using a GPU to Accelerate a Longwave Radiative Transfer Model with Efﬁcient CUDA-Based Methods

Climate simulations rely heavily on high-performance computing. As one of the atmospheric radiative transfer models, the rapid radiative transfer model for general circulation models (RRTMG) is used to calculate the radiative transfer of electromagnetic radiation through a planetary atmosphere. Radiation physics is one of the most time-consuming physical processes, so large-scale, long-term simulation with the RRTMG challenges the development of efficient parallel algorithms that fit well into multicore clusters. This paper presents a method for improving the computational efficiency of radiation physics by accelerating the RRTMG long-wave radiation scheme (RRTMG_LW) on a graphics processing unit (GPU). First, a GPU-based acceleration algorithm with one-dimensional domain decomposition is proposed. Then, a second acceleration algorithm with two-dimensional domain decomposition is presented. After the two algorithms were implemented in Compute Unified Device Architecture (CUDA) Fortran, a GPU version of the RRTMG_LW, namely the G-RRTMG_LW, was developed. The results demonstrate that the proposed acceleration algorithms are effective and that the G-RRTMG_LW achieves a significant speedup. In the case without I/O transfer, the 2-D G-RRTMG_LW on one K40 GPU achieved a speedup of 18.52× over the baseline performance on a single Intel Xeon E5-2680 CPU core.


Introduction
With the rapid development of computer technology, high-performance computing (HPC) is employed in a wide range of real-world applications [1][2][3]. Earth system models (ESMs) involve a huge amount of computation and run at high resolution, so HPC is widely used to accelerate their computing and simulation [4]. In the past few years, the modern graphics processing unit (GPU), which combines high parallelism, a multi-threaded many-core architecture, high memory bandwidth, general-purpose programmability, low cost, and compact size, has grown far beyond a graphics engine and has substantially outpaced its central processing unit (CPU) counterpart in dealing with data-intensive, computing-intensive, and time-intensive problems [5][6][7][8][9]. In the era of green computing, the booming capability of GPUs has attracted more and more scientists and engineers to use GPUs instead of CPUs to accelerate climate system models or ESMs [10,11]. With the advent of NVIDIA's Compute Unified Device Architecture (CUDA) [12], GPU support has been added to many scientific and engineering applications. At present, CUDA, a general-purpose parallel computing architecture, supports many high-level languages, such as C/C++ and Fortran.

Related Work
A number of successful attempts have been made to accelerate the RRTM or the RRTMG with multicore and multi-thread computing techniques. In this section, the most salient work along this direction is described.
Ruetsch et al. used CUDA Fortran [29] to port the long-wave RRTM code of the Weather Research and Forecasting (WRF) model to GPU [30]. In the porting, data structures were modified, the code was partitioned into different kernels, and these kernels were configured. The RRTM attained a 10× performance improvement on Tesla C1060 GPUs. However, the RRTM relied heavily on lookup tables, so the performance optimization became extremely data dependent.
Lu et al. accelerated the RRTM long-wave radiation scheme (RRTM_LW) on the GTX470, GTX480, and C2050 and obtained 23.2×, 27.6×, and 18.2× speedups, respectively, compared with the baseline wall-clock time. Furthermore, they analyzed its performance with regard to GPU clock rates, execution configurations, register file utilizations, and characteristics of the RRTM_LW [13]. Afterwards, they continued to accelerate the RRTM_LW by exploiting CPUs and GPUs on a Tianhe-1A supercomputer and proposed a workload distribution scheme based on the speedup feedback [16]. Zheng et al. developed the CUDA Fortran version of the RRTM_LW in the GRAPES_Meso model [27]. Some optimization methods such as code tuning, asynchronous memory transfer, and a compiler option were adopted to enhance the computational efficiency. After the optimization, a 14.3× speedup was obtained.
Mielikainen et al. rewrote the Fortran code of the RRTMG shortwave radiation scheme (RRTMG_SW) in C to implement its GPU-compatible version [31]. Compared to its single-threaded Fortran counterpart running on an Intel Xeon E5-2603, the RRTMG_SW based on CUDA C achieved a 202× speedup on a Tesla K40 GPU.
Bertagna et al. also rewrote the Fortran code of the High-Order Methods Modeling Environment (HOMME) in C++ and used the Kokkos library to express on-node parallelism. Then, HOMME achieved good performance on the GPU [32].
Rewriting the Fortran code of a radiative transfer model or climate model in C or C++ can achieve good performance on the GPU, but it takes considerable time. In a significantly different approach from the previous work, the current study proposes a systematic, detailed, and comprehensive parallel algorithm on a GPU for the RRTMG_LW in the CAS-ESM, resulting in improved runtime performance. The parallelization is implemented in CUDA Fortran rather than CUDA C. Because CUDA now supports Fortran, it is not necessary to adopt CUDA C to implement the GPU computation of the CAS-ESM RRTMG_LW. Major concerns addressed by the proposed algorithms include (a) run-time efficiency; (b) common processes and technologies of GPU parallelization; and (c) the feasibility of applying the research outputs to other physical parameterization schemes.

RRTMG Radiation Scheme
The RRTM is not fast enough for GCMs. To improve its computational efficiency without significantly degrading its accuracy, the RRTM was modified to produce the RRTMG [33,34]. The RRTMG and the RRTM have the same basic physics and absorption coefficients, but there are several modifications in the RRTMG [35]. (1) The total number of quadrature points (g points) in the RRTMG_LW is 140, while it is 256 in the RRTM_LW. In the shortwave, the number of g points is reduced from 224 in the RRTM shortwave radiation scheme (RRTM_SW) to 112 in the RRTMG_SW.
(2) The RRTMG_LW includes McICA (Monte-Carlo Independent Column Approximation) capability to represent sub-grid cloud variability, with random, maximum-random, and maximum options for cloud overlap [36]; the RRTM_LW does not have the McICA, but it does include representations for random and maximum-random cloud overlap. (3) The RRTMG_LW performs radiative transfer only for a single (diffusivity) angle (a secant of 1.66, i.e., an angle of about 53°) and varies this angle to improve accuracy in profiles with high water vapor; the RRTM_LW can use multiple angles for radiative transfer. (4) The RRTMG_LW coding has been reformatted to use many Fortran 90 features. (5) The RRTMG_LW includes an aerosol absorption capability. (6) The RRTMG_LW can be used as a callable subroutine and can be adapted for use within global or regional models. (7) The RRTMG_LW can optionally read the required input data either from a netCDF file or from the original RRTM_LW source data statements. (8) The RRTMG_LW can provide the change in upward flux with respect to surface temperature, dF/dT, by layer, for total sky and clear sky.
A further description of the g points in the RRTM and RRTMG is as follows. The RRTM uses the correlated k-distribution method to calculate the broadband radiative fluxes. In this method, the radiative spectrum is first divided into bands. Because of the rapid variation of the absorption lines of gas molecules within the bands, the values of the absorption intensities within each band are further binned into a cumulative distribution function of the intensities. This distribution function is then discretized using g intervals for integration within each band to obtain the band radiative fluxes, which are further integrated across the bands to obtain the total radiative flux used to calculate atmospheric radiative heating or cooling. The g points are the discretized absorption intensities within each band. The RRTM_LW has 16 bands and 256 g points. The RRTM_SW has 16 bands and 224 g points. To speed up the calculations for climate and weather models, the spectral resolutions of the RRTM_LW and RRTM_SW are further coarsened for application in GCMs as the RRTMG_LW and RRTMG_SW. The RRTMG_LW has 16 bands and 140 g points. The RRTMG_SW has 16 bands and 112 g points [35].
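To make the role of the g points concrete, the correlated k-distribution integration within a band can be written schematically as a weighted sum over the discrete g points; the notation below is illustrative and does not reproduce the exact RRTM formulation.

$$\bar{\mathcal{T}}_{\mathrm{band}}(u) = \int_{0}^{1} e^{-k(g)\,u}\, dg \;\approx\; \sum_{j=1}^{N_g} w_j\, e^{-k(g_j)\,u}, \qquad \sum_{j=1}^{N_g} w_j = 1,$$

where u is the absorber amount along the optical path, k(g) is the absorption coefficient reordered by the cumulative distribution variable g, g_j and w_j are the discrete g points and their quadrature weights, and N_g is the number of g points in the band (140 in total over the 16 bands of the RRTMG_LW).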
The spectrally averaged outgoing radiance from an atmospheric layer is calculated according to the following formula:

$$\bar{R}(\mu) = \frac{1}{v_2 - v_1}\int_{v_1}^{v_2}\left[I_0(v)\,\mathcal{T}_v(\mu) + \int_{\mathcal{T}_v(\mu)}^{1} B\big(v, \theta(\mathcal{T}'_v)\big)\, d\mathcal{T}'_v\right] dv, \quad (1)$$

where v is the wavenumber; θ is the temperature; µ is the zenith direction cosine; v_1 and v_2 are the beginning and ending wavenumbers of the spectral interval, respectively; I_0 is the radiance incoming to the layer; B(v, θ) is the Planck function at v and θ; $\mathcal{T}_v$ is the transmittance for the layer optical path; and $\mathcal{T}'_v$ is the transmittance at a point along the optical path in the layer. Under some assumptions, Equation (1) becomes

$$\bar{R} = \int_0^1 \left\{B_{\mathrm{eff}}(g, T_g) + \big[I_0(g) - B_{\mathrm{eff}}(g, T_g)\big]\exp\!\left[-\frac{k(g, P, \theta)\,\rho\,\Delta z}{\cos\phi}\right]\right\} dg,$$

where g is the fraction of the absorption coefficient, P is the layer pressure, ρ is the absorber density in the layer, ∆z is the vertical thickness of the layer, ϕ is the angle of the optical path in the azimuthal direction, k(g, P, θ) is the absorption coefficient at P and θ, and B_eff(g, T_g) is an effective Planck function for the layer. The monochromatic net flux is the difference between the upward and downward fluxes,

$$F_v = F_v^{\uparrow} - F_v^{\downarrow},$$

where $F_v^{\uparrow}$ and $F_v^{\downarrow}$ are obtained by integrating the radiance over the corresponding hemispheres. The total net flux is obtained by integrating over v:

$$F = \int_0^{\infty} F_v\, dv.$$

The radiative heating (or cooling) rate is expressed as

$$\frac{\partial T}{\partial t} = \frac{g}{c_p}\,\frac{\partial F}{\partial P} = -\frac{1}{\rho\, c_p}\,\frac{\partial F}{\partial z},$$

where c_p is the specific heat at constant pressure, P is the pressure, g is the gravitational acceleration, and ρ is the air density in a given layer [37].

RRTMG_LW Code Structure
A profiling graph of the original RRTMG_LW Fortran code is shown in Figure 1. The subroutine rad_rrtmg_lw is the driver of the long-wave radiation code. The subroutine mcica_subcol_lw is used to create McICA stochastic arrays for cloud physical or optical properties. The subroutine rrtmg_lw is the driver subroutine for the RRTMG_LW that has been adapted from the RRTM_LW for improved efficiency. The subroutine rrtmg_lw (a) calls the subroutine inatm to read in the atmospheric profile from the GCM for use in the RRTMG_LW and to define other input parameters; (b) calls the subroutine cldprmc to set the cloud optical depth for the McICA based on the input cloud properties; (c) calls the subroutine setcoef to calculate the information needed by the radiative transfer routine that is specific to this atmosphere, especially some of the coefficients and indices needed to compute the optical depths, by interpolating data from stored reference atmospheres; (d) calls the subroutine taumol to calculate the gaseous optical depths and Planck fractions for each of the 16 spectral bands; (e) calls the subroutine rtrnmc (for both clear and cloudy profiles) to perform the radiative transfer calculation using the McICA to represent sub-grid scale cloud variability; and (f) passes the necessary fluxes and heating rates back to the GCM.
As depicted in Figure 1, rrtmg_lw takes most of the computation time of rad_rrtmg_lw, so the study target was to use the GPU to accelerate rrtmg_lw. The computing procedure and code structure of rrtmg_lw are shown in Algorithm 1. More specifically, the subroutines inatm, cldprmc, setcoef, taumol, and rtrnmc are accelerated on the GPU.

Algorithm 1: Computing procedure of the original rrtmg_lw.
subroutine rrtmg_lw(parameters)
//ncol is the number of horizontal columns
1. do iplon=1, ncol
2.    call inatm(parameters)
3.    call cldprmc(parameters)
4.    call setcoef(parameters)
5.    call taumol(parameters)
6.    if aerosol is active then
         //Combine gaseous and aerosol optical depths
7.       ...

Experimental Platform
Experiments in this paper were conducted on two GPU clusters (a K20 cluster and a K40 cluster). The K20 cluster is at the Computer Network Information Center of the Chinese Academy of Sciences and has 30 GPU nodes, each with two CPUs and two NVIDIA Tesla K20 GPUs. The CPU is the Intel Xeon E5-2680 v2 processor. In each GPU node, the CPU cores share 64 GB of DDR3 system memory through QuickPath Interconnect. The PGI Fortran compiler Version 14.10, which supports CUDA Fortran, was used as the basic compiler in the tests. The K40 cluster is at the China University of Geosciences (Beijing). The detailed configurations of the two clusters are listed in Table 1. The serial RRTMG_LW implementation was executed on an Intel Xeon E5-2680 v2 processor of the K20 cluster. The G-RRTMG_LW implementation was tested on GPUs of the K20 and K40 clusters.

Parallel Strategy
To run the subroutines inatm, cldprmc, setcoef, taumol, and rtrnmc on a GPU, each of them is implemented as a separate kernel. taumol and rtrnmc are too large to run efficiently as single kernels, so each of them can be split into more than one kernel. The RRTMG_LW uses a collection of three-dimensional (3-D) cells to describe the atmosphere. The computation in the RRTMG_LW is independent in the horizontal direction, so for these kernels, each CUDA thread is assigned the workload of one "column" shown in Figure 2, where the x-axis represents longitude, the y-axis represents latitude, and the z-axis represents the vertical direction. To further exploit fine-grained data parallelism, the computation in the vertical dimension is also divided evenly. Moreover, parallelization between kernels should be considered in addition to parallelization within kernels.
Figure 2. Three-dimensional spatial structure of the RRTMG_LW: its physical dimensions include longitude, latitude, and model layers.
In the CAS-ESM, the IAP AGCM4.0 has a 1.4° × 1.4° horizontal resolution and 51 levels in the vertical direction, so here, the RRTMG_LW has 128 × 256 = 32,768 horizontal grid points. If one GPU is applied, the GPU will finish computing the 32,768 grid points in the horizontal direction at each time step. Due to the limitation of global memory on a GPU, a K20 GPU can only compute ncol = 2048 horizontal grid points at a time, starting 2048 CUDA threads. Thus, the 32,768 grid points are divided into 32,768/2048 = 16 chunks, each having 2048 points. In other words, a K20 GPU finishes all the computations of the 32,768 points in 16 iterations at each time step.
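To illustrate this chunking concretely, the CUDA Fortran sketch below shows a host-side loop that processes the horizontal grid in chunks of 2048 columns. The program, array, and kernel names are hypothetical placeholders rather than the actual G-RRTMG_LW driver code.

! Illustrative host-side chunking loop (hypothetical names, not the actual driver).
program chunk_driver
  use cudafor
  implicit none
  integer, parameter :: ngrid  = 128*256       ! total horizontal grid points
  integer, parameter :: nchunk = 2048          ! columns computed per GPU pass (ncol)
  integer, parameter :: nlay   = 51            ! model layers
  real, allocatable          :: play(:,:)      ! host input, e.g., layer pressures
  real, allocatable, device  :: play_d(:,:)    ! device copy for one chunk
  integer :: ichunk, first, istat

  allocate(play(ngrid, nlay), play_d(nchunk, nlay))
  play = 0.0                                   ! placeholder input values

  do ichunk = 1, ngrid / nchunk                ! 16 passes per radiation time step
    first = (ichunk - 1) * nchunk + 1
    play_d = play(first:first+nchunk-1, :)     ! host-to-device copy for this chunk
    ! kernels (inatm_d, cldprmc_d, ...) for this chunk would be launched here
    istat = cudaDeviceSynchronize()
    ! results for this chunk would be copied back to host arrays here
  end do
end program chunk_driver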

Acceleration Algorithm with One-Dimensional Domain Decomposition
The computation of the RRTMG_LW in the horizontal direction is independent, so each CUDA thread is assigned the computation of one column. This means that the RRTMG_LW can be parallelized in the horizontal direction. It is noteworthy that the first dimension of the 3-D arrays in the CAS-ESM RRTMG_LW CUDA code represents the number of horizontal columns, the second dimension represents the number of model layers, and the third dimension represents the number of g points. Figure 3 illustrates the one-dimensional (1-D) domain decomposition for the RRTMG_LW accelerated on the GPU. The acceleration algorithm in the horizontal direction, i.e., with a 1-D decomposition, is illustrated in Algorithm 2. Here, n is the number of threads in each thread block and m = ncol/n (rounded up to an integer) is the number of blocks used in each kernel grid. In Algorithm 1, inatm, cldprmc, setcoef, taumol, and rtrnmc are each called ncol times. However, because ncol CUDA threads execute these subroutines, they are only called once in Algorithm 2. In theory, their computing time will be reduced by a factor of ncol. The kernels in Algorithm 2 are described as follows: (1) The implementation of the kernel inatm_d based on CUDA Fortran is illustrated in Algorithm 3.
Within it, threadIdx%x identifies a unique thread inside a thread block, blockIdx%x identifies a unique thread block inside a kernel grid, and blockDim%x identifies the number of threads in a thread block. iplon, the coordinate of a horizontal grid point, also represents the ID of a global thread in CUDA, which can be expressed as iplon = (blockIdx%x − 1) × blockDim%x + threadIdx%x. According to the design shown in Algorithm 3, ncol threads can execute inatm_d concurrently (a minimal sketch of this indexing pattern is given below). (2) The other kernels are designed in the same way, as shown in Algorithms A1-A4 of Appendix A.1.
To avoid needless repetition, these algorithms are described only in rough form.
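A minimal CUDA Fortran sketch of the 1-D pattern follows. The module, kernel, and array names are hypothetical placeholders (the real kernels carry out the inatm, cldprmc, setcoef, taumol, and rtrnmc computations), but the index calculation is the one described above.

! Illustrative 1-D kernel: one CUDA thread per horizontal column (hypothetical names).
module demo_1d_mod
  use cudafor
  implicit none
contains
  attributes(global) subroutine demo_1d(a, ncol, nlay)
    integer, value :: ncol, nlay
    real :: a(ncol, nlay)                                  ! device array: (column, layer)
    integer :: iplon, lay
    iplon = (blockIdx%x - 1) * blockDim%x + threadIdx%x    ! global thread ID = column index
    if (iplon <= ncol) then
      do lay = 1, nlay                                     ! each thread sweeps its own column
        a(iplon, lay) = real(iplon)
      end do
    end if
  end subroutine demo_1d
end module demo_1d_mod

A corresponding launch with n threads per block and m = ncol/n (rounded up) blocks would then be call demo_1d<<<(ncol + n - 1)/n, n>>>(a_d, ncol, nlay), where a_d is a device array of shape (ncol, nlay).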

Acceleration Algorithm with Two-Dimensional Domain Decomposition
nlay, the number of model layers in the vertical direction of the RRTMG_LW, is 51 in the CAS-ESM. Besides parallelization in the horizontal direction, parallelization in the vertical direction is also considered for the RRTMG_LW to make full use of the performance of the GPU. Figure 4 illustrates the two-dimensional (2-D) domain decomposition for the RRTMG_LW accelerated on the GPU. The acceleration algorithm in the horizontal and vertical directions, i.e., with a 2-D domain decomposition, is illustrated in Algorithm 4. Here, tBlock defines the number of threads used in each thread block in the x, y, and z dimensions through the derived type dim3. grid defines the number of blocks in the x, y, and z dimensions through dim3. Because of data dependency, cldprmc and rtrnmc are unsuitable for parallelization in the vertical direction. The kernels with 2-D decomposition in Algorithm 4 are described as follows (a launch-configuration sketch is given after this description): (1) The implementation of CUDA-based inatm_d with 2-D decomposition is illustrated in Algorithm 5. Here, threadIdx%x identifies a unique thread inside a thread block in the x dimension, blockIdx%x identifies a unique thread block inside a kernel grid in the x dimension, and blockDim%x identifies the number of threads in a thread block in the x dimension. In the same way, threadIdx%y, blockIdx%y, and blockDim%y identify the corresponding configuration in the y dimension. lay, the coordinate in the vertical direction, also represents the ID of a global thread in the y dimension, which can be expressed as lay = (blockIdx%y − 1) × blockDim%y + threadIdx%y. Therefore, a grid point in the horizontal and vertical directions can be identified by the iplon and lay variables as two-dimensional linear data.
Due to data dependency, a piece of code in inatm_d can be parallelized only in the horizontal direction, so the kernel inatm_d3 in Algorithm 4 uses a 1-D decomposition. The other code in inatm_d is able to use a 2-D decomposition. Due to the requirement of data synchronization, inatm_d with the 2-D decomposition is divided into four kernels (inatm_d1, inatm_d2, inatm_d3, and inatm_d4). The 2-D parallelization of inatm_d1, inatm_d2, and inatm_d4 is similar to Algorithm 5. Their detailed implementations will not be described further here. According to the design shown in Algorithm 5, ncol × nlayers threads can execute the 2-D inatm_d concurrently.
(2) Similarly, due to data dependency, a piece of code in setcoef_d can be parallelized only in the horizontal direction, so the kernel setcoef_d2 in Algorithm 4 uses a 1-D decomposition. The other code in setcoef_d is able to use a 2-D decomposition, as shown in the kernel setcoef_d1. The 2-D parallelization of setcoef_d1 is described in rough form in Algorithm A5 of Appendix A.2.
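For concreteness, the sketch below shows a 2-D launch configuration and index computation in CUDA Fortran. The kernel and array names are hypothetical stand-ins for kernels such as inatm_d1 or setcoef_d1, and the block dimensions are example values only.

! Illustrative 2-D kernel: one CUDA thread per (column, layer) pair (hypothetical names).
module demo_2d_mod
  use cudafor
  implicit none
contains
  attributes(global) subroutine demo_2d(a, ncol, nlay)
    integer, value :: ncol, nlay
    real :: a(ncol, nlay)                                  ! device array: (column, layer)
    integer :: iplon, lay
    iplon = (blockIdx%x - 1) * blockDim%x + threadIdx%x    ! horizontal column index
    lay   = (blockIdx%y - 1) * blockDim%y + threadIdx%y    ! vertical layer index
    if (iplon <= ncol .and. lay <= nlay) then
      a(iplon, lay) = real(iplon + lay)                    ! one (column, layer) cell per thread
    end if
  end subroutine demo_2d
end module demo_2d_mod

! Host-side launch fragment (a_d is assumed to be a device array of shape (ncol, nlay)):
!   type(dim3) :: tBlock, grid
!   tBlock = dim3(16, 4, 1)                                ! example block shape
!   grid   = dim3((ncol + tBlock%x - 1)/tBlock%x, (nlay + tBlock%y - 1)/tBlock%y, 1)
!   call demo_2d<<<grid, tBlock>>>(a_d, ncol, nlay)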

Result and Discussion
To fully investigate the parallel performance of the proposed acceleration algorithms described above, an ideal climate simulation experiment for one model day was conducted. The time step of the RRTMG_LW in the experiment was 1 h.

Influence of Block Size
The run-time of the serial RRTMG_LW on one core of an Intel Xeon E5-2680 v2 processor is shown in Table 2. Here, the computing time of the RRTMG_LW on the CPU or GPU, T_rrtmg_lw, is calculated by the following formula:

$$T_{\mathrm{rrtmg\_lw}} = T_{\mathrm{inatm}} + T_{\mathrm{cldprmc}} + T_{\mathrm{setcoef}} + T_{\mathrm{taumol}} + T_{\mathrm{rtrnmc}},$$

where T_inatm is the computing time of the subroutine inatm or kernel inatm_d, and T_cldprmc, T_setcoef, T_taumol, and T_rtrnmc are the corresponding computing times of the other subroutines or kernels. To explore whether and how the number of threads in a thread block affects the computational performance at a given scale, the execution time of the G-RRTMG_LW with a tuned number of threads was measured on one GPU. Taking the case of no I/O transfer as an example, Figure 5 portrays the run-time of the 1-D G-RRTMG_LW on one K20 GPU. Indeed, the number of threads per block, or block size, affected the computational performance to some extent. The G-RRTMG_LW achieved optimal performance when the block size was 16. The G-RRTMG_LW on one K40 GPU showed a similar pattern, as shown in Figure 6. Some conclusions and analysis are drawn as follows. (1) Generally speaking, increasing the block size can hide memory access latency to some extent and can improve the computational performance of parallel algorithms. Therefore, kernels with simple computations usually show optimal performance when the block size is 128 or 256. (2) The kernel taumol, with its large amount of calculation, achieved optimal performance when the block size was 16. As the block size increased, its computational time increased overall. During the 1-D acceleration of taumol, each thread, carrying a large amount of calculation, produced numerous temporary and intermediate values, which consumed a great deal of hardware resources. Because hardware resources are limited, each thread has fewer resources when the block size is larger. Therefore, as the block size increases, taumol becomes slower.
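For reference, per-kernel times such as T_taumol can be measured by bracketing each kernel launch with CUDA events. The CUDA Fortran fragment below shows a generic pattern of this kind; it assumes "use cudafor" and is not the instrumentation actually used in this study.

! Generic host-side timing fragment using CUDA events (illustrative only).
  type(cudaEvent) :: tstart, tstop
  real :: elapsed_ms
  integer :: istat

  istat = cudaEventCreate(tstart)
  istat = cudaEventCreate(tstop)
  istat = cudaEventRecord(tstart, 0)
  ! call taumol_d<<<grid, tBlock>>>(...)                   ! kernel being timed
  istat = cudaEventRecord(tstop, 0)
  istat = cudaEventSynchronize(tstop)
  istat = cudaEventElapsedTime(elapsed_ms, tstart, tstop)  ! elapsed time in milliseconds
  istat = cudaEventDestroy(tstart)
  istat = cudaEventDestroy(tstop)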

Evaluations on Different GPUs
The run-time and speedup of the 1-D G-RRTMG_LW on the K20 and K40 GPUs are shown in Table 2. The speedups of the 1-D inatm, cldprmc, setcoef, taumol, and rtrnmc on one K20 GPU were 7.36×, 2635.00×, 14.65×, 9.22×, and 11.83×, respectively. Using one K20 GPU in the case without I/O transfer, the 1-D G-RRTMG_LW achieved a speedup of 10.57× compared to its counterpart running on one CPU core of an Intel Xeon E5-2680 v2, whereas, using one K40 GPU in the case without I/O transfer, it achieved a speedup of 14.62×. The K40 GPU had a higher core clock and memory clock, more cores, and stronger floating-point computation power than the K20 GPU, so the 1-D G-RRTMG_LW showed better performance on the K40 GPU.

Influence of Block Size
Taking the case of no I/O transfer as an example, Figure 7 portrays the run-time of the 2-D G-RRTMG_LW on one K20 GPU. The G-RRTMG_LW on one K20 GPU achieved optimal performance when the block size was 16. However, the 2-D G-RRTMG_LW on one K40 GPU achieved optimal performance when the block size was 32, as shown in Figure 8. The 2-D setcoef and taumol achieved a better speedup than their 1-D versions, but the 2-D inatm ran more slowly. From Figures 7 and 8, some conclusions and analysis are drawn as follows. (1) The 2-D inatm cost more computational time than its 1-D version, which indicated that inatm was not suited to 2-D acceleration. According to the testing, the assignment of four three-dimensional arrays in the 2-D inatm required about 95% of its computational time. In the 2-D acceleration, more CUDA threads were started, but each CUDA thread still needs to finish the assignment of the four arrays in a do-loop form, as shown in Algorithm 5, so the computation costs too much time.
(2) The 2-D taumol achieved optimal performance when the block size was 256 or 512. During the 2-D acceleration, a larger number of threads was assigned, so each thread had fewer calculations and required fewer hardware resources. When the block size was 256 or 512, the hardware resources assigned to each thread were well balanced, so in this case, the 2-D taumol showed better performance.

Evaluations on Different GPUs
The run-time and speedup of the 2-D G-RRTMG_LW on the K20 and K40 GPUs are shown in Table 3. Due to the poor performance of the 2-D inatm, its 1-D version was still used here. The speedups of the 2-D setcoef and taumol on one K20 GPU were 43.21× and 40.54×, respectively. Using one K20 GPU in the case without I/O transfer, the 2-D G-RRTMG_LW achieved a speedup of 14.63× compared to its counterpart running on one CPU core of an Intel Xeon E5-2680 v2, whereas, using one K40 GPU in the case without I/O transfer, it achieved a speedup of 18.52×. Some conclusions and analysis are drawn as follows. (1) As before, the K40 GPU had stronger computing power than the K20 GPU, so the 2-D G-RRTMG_LW showed better performance on the K40 GPU.
(2) The 2-D setcoef and taumol showed excellent performance improvements compared to their 1-D versions, especially taumol. There is no data dependence in taumol, which is computationally intensive, so a finer-grained parallelization of taumol was executed when more threads were used.

I/O Transfer
(1) In the simulation of one model day, the RRTMG_LW required 24 integral calculations. During each integration over all 128 × 256 grid points, the subroutine rrtmg_lw must be invoked 16 (= 128 × 256/ncol) times when ncol is 2048. Due to the memory limitation of the GPU, the maximum value of ncol on the K40 GPU was 2048. This means that the 2-D G-RRTMG_LW was invoked 16 × 24 = 384 times in the experiment. For each invocation, the input and output three-dimensional arrays must be updated between the CPU and GPU, so the I/O transfer incurred a high communication cost.
(2) After the 2-D acceleration of the RRTMG_LW, its computational time in the case without I/O transfer was considerably shorter, so the I/O transfer cost became the main bottleneck limiting further performance improvement of the G-RRTMG_LW. In the future, compressing data and improving network bandwidth will be beneficial for reducing this I/O transfer cost.

Discussion
(1) The WRF RRTMG_LW on eight CPU cores achieved a speedup of 7.58× compared to its counterpart running on one CPU core [28]. Using one K40 GPU in the case without I/O transfer, the 2-D G-RRTMG_LW achieved a speedup of 18.52×. This shows that the RRTMG_LW on one GPU can still obtain a better performance improvement than on one CPU with eight cores. (2) The RRTM_LW on the C2050 obtained an 18.2× speedup [13]. This shows that the 2-D G-RRTMG_LW obtained a speedup similar to that of the RRTM_LW accelerated in CUDA Fortran.
(3) The CUDA Fortran version of the RRTM_LW in the GRAPES_Meso model obtained a 14.3× speedup [27], whereas the CUDA C version of the RRTMG_SW obtained a 202× speedup [31].

Conclusions and Future Work
High-efficiency parallel computing for radiation physics is exceedingly challenging. This paper presents the acceleration of the RRTMG_LW on GPU clusters. First, a GPU-based acceleration algorithm with a one-dimensional domain decomposition was proposed. Then, a second GPU-based acceleration algorithm with a two-dimensional domain decomposition was put forward. The two acceleration algorithms of the RRTMG_LW were implemented in CUDA Fortran, and the G-RRTMG_LW, the GPU version of the RRTMG_LW, was developed. The experimental results indicated that the 2-D G-RRTMG_LW displayed better computational performance. During the computation of one model day, the 2-D G-RRTMG_LW on one K40 GPU obtained a speedup of 18.52× compared to its single Intel Xeon E5-2680 CPU-core counterpart, with the run-time decreasing from 647.12 s to 34.9384 s. Due to the acceleration of the G-RRTMG_LW, the CAS-ESM achieved faster computing.
The current implementation of the G-RRTMG_LW has not yet achieved its full potential speedup. Future work to address this includes the following two aspects. (1) The G-RRTMG_LW will be further accelerated. It may be beneficial to accelerate the G-RRTMG_LW in the g-point dimension, and a GPU-based acceleration algorithm with a three-dimensional domain decomposition for the G-RRTMG_LW will be attempted to explore this idea. (2) The G-RRTMG_LW currently runs on only one GPU. However, on the K20 cluster, one node is equipped with 20 CPU cores and 2 GPU cards. To fully use the CPU cores and GPUs in this type of installation, an MPI+OpenMP+CUDA hybrid paradigm [38] will be considered. Implementing such a programming model will be a considerable challenge, but the G-RRTMG_LW on multiple GPUs is expected to obtain even better computational performance.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: