CUDA-Optimized GPU Acceleration of 3GPP 3D Channel Model Simulations for 5G Network Planning

Abstract: The simulation of massive multiple-input multiple-output (MIMO) channel models is becoming increasingly important for testing and validation of fifth-generation new radio (5G NR) wireless networks and beyond. However, simulation performance tends to be limited when modeling a large number of antenna elements combined with a complex and realistic representation of propagation conditions. In this paper, we propose an efficient implementation of a 3rd Generation Partnership Project (3GPP) three-dimensional (3D) channel model, specifically designed for graphics processing unit (GPU) platforms, with the goal of minimizing the computational time required for channel simulation. The channel model is highly parameterized to encompass a wide range of configurations required for real-world optimized 5G NR network deployments. We use several compute unified device architecture (CUDA)-based optimization techniques to exploit the parallelism and memory hierarchy of the GPU. Experimental data show that the developed system achieves an overall speedup of about 240× compared to the original C++ model executed on an Intel processor. Compared to a design previously accelerated on a datacenter-class field programmable gate array (FPGA), the GPU design has a 33.3% higher single-precision performance but a 7.5% higher power consumption. The proposed GPU accelerator can provide fast and accurate channel simulations for 5G NR network planning and optimization.


Introduction
Mobile communications have become an increasingly important part of modern life. They make it possible to perform routine tasks such as making video calls while traveling or staying in touch with family and the office from anywhere in the world [1][2][3][4]. The process of digitization began in the early 1990s with the introduction of second-generation cellular technologies [5]. Cellular technologies are undergoing continuous service development to improve customer experience and provide a higher level of service. The number of mobile devices, such as smartphones, tablets, Internet of Things (IoT) devices, and laptops, is expected to reach around 28.5 billion networked devices by the end of 2023 [6]. This increases the demand for higher throughput, lower latency, and higher quality of service, which are the main drivers for the introduction of the fifth-generation (5G) wireless communication standard [7]. The radio link is by far the most important part of any cellular technology. Its simulation must closely take into account propagation in real-world environments in order to predict and optimize network coverage and minimize post-deployment on-field measurements.
The COST 2100 model [9] is a geometry-based stochastic model (GSCM) for MIMO channels with cluster power and delays derived from fixed geometry, and therefore has limited support for dual mobility propagation scenarios. Weiler et al. [10] presented a quasi-deterministic channel model where the channel impulse response (CIR) is a combination of a few strong quasi-deterministic rays, flashing rays, and weak random rays. METIS [11] is a ray-tracing-based channel model that supports propagation scenarios such as blocking, diffraction, specular reflection, and spherical wave propagation.
However, existing channel model simulators are either too simple to accurately replicate the propagation environment or too computationally expensive to produce meaningful results in a reasonable amount of time. In addition, most existing simulators are designed for central processing unit (CPU) platforms, which have limited parallelism and throughput. Therefore, there is a need for efficient and accurate channel model simulators that can run on parallel platforms such as graphics processing units (GPUs), which offer high performance and scalability. This article proposes a GPU-based hardware acceleration for the 3rd Generation Partnership Project (3GPP) three-dimensional (3D) channel model, which is a highly parameterized and realistic channel model for fifth-generation new radio (5G NR) networks. It shows that the proposed GPU accelerator can significantly improve the simulation speed and accuracy over a CPU-based C++ model, and that it also has higher single-precision performance than a previously designed field programmable gate array (FPGA)-based accelerator.
The rest of the paper is organized as follows. Section 2 discusses previous related work. Section 3 discusses the importance of 3D channel modeling in MIMO communication techniques and briefly describes the 3GPP channel model. Section 4 presents an introduction to GPU and compute unified device architecture (CUDA) programming, followed by optimization methods for GPU-based hardware acceleration and its key benefits. The architecture of the proposed accelerator is outlined in Section 5, and the impact of various optimizations on the final result is analyzed in Section 6. Finally, conclusions are drawn in Section 7.

Related Work
The field-testing and validation of wireless systems is expensive both in terms of equipment cost and time to market. This step can be replaced by fast and accurate software-based models with high repeatability. Several channel simulators have been reported in the literature [13][14][15][16][17]. Sun et al. [13] proposed a geometry-based channel model simulator for the link and physical layers. Jaeckel et al. [14] added ground reflection components to the existing geometry-based channel models. Ju et al. [15] presented a model simulator for spatially consistent channel realizations using pedestrian measurements for human blockage. Pessoa et al. [17] presented a simulator with support for dual mobility.
General-purpose CPU-based channel simulators are either too simple to accurately replicate the propagation environment or too computationally complex to produce meaningful results in a reasonable time. Channel model accelerators minimize simulation time. Acceleration technologies include application-specific integrated circuits, massively parallel GPUs, and FPGAs.
Several hardware-based channel accelerators have been reported in the literature. A GSCM emulator on FPGA, which considers only discrete time segments, was introduced in [18]. For an FPGA implementation of the 3GPP 3D channel model, a variety of high-level synthesis (HLS)-based optimizations required to achieve acceleration are discussed in [19]. A CUDA-based multipath fading accelerator has been proposed [20], as well as another wireless channel emulator [21]. It lacks complex-valued channel coefficient emulation, which reduces accuracy. Buscemi and Sass [22] emulated a scalable wireless channel architecture on a cluster of 64 FPGAs, but its high hardware cost limits its application. Recently, Endovitskiy et al. [23] proposed a technique to reduce the complexity of the 3GPP channel model for 5G NR by reducing the number of sub-paths, thus reducing the computational cost, but it analyzes only a subset of wireless channel propagation characteristics, limiting its application.
We investigate various optimization techniques for the effective deployment on GPU platforms of a 3D GSCM channel for frequencies from 0.5 GHz to 100 GHz, as proposed by ETSI [12]. Despite its computational complexity, this model simulates the propagation environment more accurately. The goal is to maximize the planning quality of 5G mobile networks by leveraging the generality and accuracy of the 3GPP channel model given in TR 38.901 [12]. The channel model was first designed for CPU platforms and then optimized for NVIDIA GPUs using CUDA-based techniques.
We developed a GPU-based simulator that is more accurate than state-of-the-art programs and delivers data to network designers quickly. When larger antenna arrays and user mobility are included, CPU-oriented 5G simulation stack executions can take days or weeks. The accelerated channel model is integrated into our implementation of the 5G simulation stack using a socket-based client/server architecture for shared use by multiple network planners. Our acceleration efforts have resulted in a speedup of about 240×, allowing the simulation to be completed in hours instead of weeks.
The main contributions of the article are:
• proposing a GPU-based hardware acceleration for the 3GPP 3D channel model, which is a highly parameterized and realistic channel model for 5G NR networks;
• application of various CUDA-based optimization techniques to efficiently utilize GPU resources and increase the overall performance of the channel model simulator;
• evaluation of the performance and accuracy of the GPU accelerator using benchmark parameters and comparison with both a CPU-based C++ model and a previous design on an FPGA based on the same 16 nm technology node as the GPU;
• showing that the GPU accelerator can achieve an overall speedup of about 240× compared to the CPU model and 33.3% higher single-precision performance than a comparable FPGA design, while maintaining high accuracy and flexibility.

The 3GPP Channel Model for 5G NR
Currently, 5G mobile networks have a high device and base station density, low latency, and high data rates. MIMO channels and multi-antenna transmission increase radio link reliability and efficiency. Two-dimensional spatial channel models capture wireless channel behavior and performance with low computational complexity [24]; however, by considering only a two-dimensional plane, they poorly capture transmission channel characteristics and limit MIMO techniques (beamforming, spatial multiplexing, and precoding) to azimuth. Three-dimensional channel models include both channel azimuth and elevation. A GSCM calculates channel parameters using randomly distributed scatterers [25].
The 3GPP specifications [12] propose an accurate and reliable stochastic channel model for building, optimizing, and evaluating 5G systems. The GSCM consists of two parts: (1) a large-scale fading model that includes path loss, line-of-sight (LOS) probability, and additional losses, combined with (2) a small-scale fading model characterized by the CIR (also called "channel coefficients" in the following). In the context of multipath propagation, the received signal is composed of various attenuated replicas of the original transmitted signal. To calculate the channel coefficients, a step-by-step procedure is recommended [12] (Figure 7.5-1). A simplified representation of multipath scattering is shown in Figure 1 for the propagation of $n$ clusters, each resolvable into $m$ subpaths. The azimuth and elevation angles at the base station (BS) and user side are $\phi$ and $\theta$, respectively. The small-scale parameters include cluster powers, delays, and arrival and departure angles in elevation and azimuth. Channel coefficients include the LOS and non-LOS (NLOS) propagation subpaths. The NLOS component is calculated as in Equation (1), where $F_{rx,u,\theta}$ and $F_{rx,u,\phi}$ are the field patterns of the $u$th element of the receiving antenna in the direction of the spherical basis vectors, $F_{tx,s,\theta}$ and $F_{tx,s,\phi}$ are the field patterns of the $s$th element of the transmitting antenna, $(\theta, \phi)_{\alpha,\beta,\gamma}$ are the elevation and azimuth angles of type $\gamma$ for ray $\beta$ in cluster $\alpha$, and $\hat{r}$ and $\bar{d}$ are the spherical unit vector and the location vector of the rx and tx antenna elements, respectively. Similarly, $\hat{r}_{tx,n,m}$ is the spherical unit vector with azimuth departure angle $\phi_{n,m,AOD}$ and elevation departure angle $\theta_{n,m,ZOD}$, and XPR is the cross-polarization power ratio matrix. Rays from the same cluster have identical power levels, denoted by $P_n$. The LOS component of the CIR is calculated as in Equation (2). The Doppler frequency component depends on the UE speed $v$ with velocity vector $\bar{v}$, the arrival angles (AOA, ZOA), the travel elevation angle $\theta_v$, and the travel azimuth angle $\phi_v$.
The CIR can be expressed as the sum of the LOS channel coefficient and the NLOS channel impulse response, as in Equation (4). In communication over the wireless channel, a transmitted signal $x(t)$ arrives at the receiver with a time delay $\tau$, i.e., as $x(t - \tau)$. The received signal comprises multiple reflections or refractions, resulting in delayed duplicates of the originally transmitted signal.
This represents the convolution of the CIR and the transmitted signal, i.e., a finite impulse response (FIR) filter in which the CIR represents the set of coefficients, as in Equation (6), where $*$ denotes convolution. User equipment (UE) spatial coordinates in 3D space affect the CIR. Mobile devices require continuous location calculations, so this variability is critical. This model uses delays and ray mapping from [12] (Table 7.5-5). To test the channel model, we used the values in Table 1. The total execution time to compute these coefficients and the respective speedup over the CPU baseline are reported in Section 6.
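For reference, the NLOS and LOS coefficients, their combination into the CIR, and the FIR form of the received signal can be sketched as follows. This is a reconstruction based on the fast-fading procedure of TR 38.901 [12] and may differ from the authors' exact notation. The symbols $M$ (rays per cluster), $\Phi_{n,m}$ (random initial phases), $\kappa_{n,m}$ (per-ray cross-polarization power ratio), $\lambda_0$ (carrier wavelength), $d_{3D}$ (3D distance), $K_R$ (Ricean K-factor), $\tau_n$ (cluster delays), $h$, and $y$ are assumptions taken from the specification or introduced here for illustration. The sub-cluster splitting that the specification applies to the two strongest clusters is omitted for brevity.

```latex
\begin{align*}
H^{\mathrm{NLOS}}_{u,s,n}(t) &= \sqrt{\frac{P_n}{M}} \sum_{m=1}^{M}
\begin{bmatrix} F_{rx,u,\theta}(\theta_{n,m,ZOA},\phi_{n,m,AOA}) \\ F_{rx,u,\phi}(\theta_{n,m,ZOA},\phi_{n,m,AOA}) \end{bmatrix}^{T}
\begin{bmatrix} e^{j\Phi^{\theta\theta}_{n,m}} & \sqrt{\kappa_{n,m}^{-1}}\, e^{j\Phi^{\theta\phi}_{n,m}} \\
\sqrt{\kappa_{n,m}^{-1}}\, e^{j\Phi^{\phi\theta}_{n,m}} & e^{j\Phi^{\phi\phi}_{n,m}} \end{bmatrix}
\begin{bmatrix} F_{tx,s,\theta}(\theta_{n,m,ZOD},\phi_{n,m,AOD}) \\ F_{tx,s,\phi}(\theta_{n,m,ZOD},\phi_{n,m,AOD}) \end{bmatrix} \\
&\quad \times \exp\!\left(\frac{j 2\pi\, \hat{r}_{rx,n,m}^{T}\, \bar{d}_{rx,u}}{\lambda_0}\right)
\exp\!\left(\frac{j 2\pi\, \hat{r}_{tx,n,m}^{T}\, \bar{d}_{tx,s}}{\lambda_0}\right)
\exp\!\left(j 2\pi\, \frac{\hat{r}_{rx,n,m}^{T}\, \bar{v}}{\lambda_0}\, t\right) \\[6pt]
H^{\mathrm{LOS}}_{u,s,1}(t) &=
\begin{bmatrix} F_{rx,u,\theta}(\theta_{LOS,ZOA},\phi_{LOS,AOA}) \\ F_{rx,u,\phi}(\theta_{LOS,ZOA},\phi_{LOS,AOA}) \end{bmatrix}^{T}
\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
\begin{bmatrix} F_{tx,s,\theta}(\theta_{LOS,ZOD},\phi_{LOS,AOD}) \\ F_{tx,s,\phi}(\theta_{LOS,ZOD},\phi_{LOS,AOD}) \end{bmatrix} \\
&\quad \times \exp\!\left(-\frac{j 2\pi\, d_{3D}}{\lambda_0}\right)
\exp\!\left(\frac{j 2\pi\, \hat{r}_{rx,LOS}^{T}\, \bar{d}_{rx,u}}{\lambda_0}\right)
\exp\!\left(\frac{j 2\pi\, \hat{r}_{tx,LOS}^{T}\, \bar{d}_{tx,s}}{\lambda_0}\right)
\exp\!\left(j 2\pi\, \frac{\hat{r}_{rx,LOS}^{T}\, \bar{v}}{\lambda_0}\, t\right) \\[6pt]
\bar{v} &= v \begin{bmatrix} \sin\theta_v \cos\phi_v & \sin\theta_v \sin\phi_v & \cos\theta_v \end{bmatrix}^{T} \\[6pt]
H_{u,s}(\tau,t) &= \sqrt{\frac{1}{K_R+1}} \sum_{n} H^{\mathrm{NLOS}}_{u,s,n}(t)\, \delta(\tau-\tau_n)
+ \sqrt{\frac{K_R}{K_R+1}}\, H^{\mathrm{LOS}}_{u,s,1}(t)\, \delta(\tau-\tau_1) \\[6pt]
y(t) &= h(t) * x(t) = \sum_{n} h_n(t)\, x(t-\tau_n)
\end{align*}
```

In the last line, $h_n(t)$ stands for the coefficient of the tap at delay $\tau_n$ for a given antenna pair, which is what makes the channel an FIR filter whose taps are the CIR.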
The clustered delay line (CDL) model serves as a modeling tool in scenarios where the received signal comprises several delayed clusters. Each cluster is composed of multipath components that share a common delay, albeit exhibiting slight variations in angles of departure and arrival. Various CDL profiles have been defined by 3GPP for link-level simulations. For NLOS, three CDL profiles, namely, CDL-A, CDL-B, and CDL-C, are defined, whereas CDL-D and CDL-E are constructed for LOS clusters.

GPU-Based Acceleration Using NVIDIA CUDA
GPUs are a type of single-instruction-multiple-data (SIMD) architecture where the same instruction is executed repeatedly on different data in parallel. GPUs are specifically designed to run thousands of threads in parallel for higher throughput and use multi-threading to hide memory latency. Efficient management of GPU resources can be achieved through high-level programming languages based on the underlying computing architectures, resulting in improved performance.
Popular parallel programming frameworks in the industry include the Open Computing Language (OpenCL) [26], Open Multi-Processing (OpenMP), and CUDA [27], a parallel programming platform for managing computations on NVIDIA GPUs. Several code optimization techniques, both generic to GPU code and specific to CUDA, are required to efficiently utilize the on-chip resources and increase the overall performance.
CUDA-based acceleration code consists of two components: the host code, which runs on the general-purpose CPU and is responsible for memory and device management, and a collection of functions called the kernel code, which runs on the GPU accelerator device.
Threads in CUDA are the units of computation and are modeled as functions in the kernel code. They are completely concurrent unless synchronized by the hardware or by the designer. In order to efficiently map threads to the architecture of the GPU, they are arranged in 3D groups called blocks. These blocks are then combined into a 3D grid. The CUDA programming model groups a set of 32 threads into a single entity known as a warp. Concurrent threads (1) within a warp are automatically synchronized in lockstep by the hardware, whereas (2) threads within a block can be synchronized via barriers by the designer, e.g., to enable all threads to complete data transfers before starting a computation on those data, and (3) thread blocks cannot be synchronized with each other at all.
When a designer has to port an application that was originally written for a CPU to a GPU, the code must be completely restructured to explicitly expose parallel computations and optimize memory accesses, as the implicit optimizations provided by compilers are usually insufficient.
The GPU architecture for acceleration and the GPU from the perspective of the CUDA programming model are shown in Figure 2. The main characteristics of GPU programming languages, and of CUDA in particular, are discussed below: thread synchronization, register-based parallel reduction, and the allocation of arrays to explicit levels in the memory hierarchy.

Thread Synchronization
Explicit designer-driven thread group synchronization via barriers is the most commonly used synchronization mechanism between otherwise independent threads. It allows, for example, kernel code to transfer data between (1) large and slow off-chip memory and (2) smaller and faster on-chip memory, ensuring that:

• All threads involved in a concurrent set of memory transfers, where each thread copies one or a few words of a large off-chip memory buffer to an on-chip memory one, are finished when computations using the transferred data begin;
• All threads performing parallel computations are finished when the results begin to be transferred back from on-chip memory to off-chip memory.
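The two guarantees above can be illustrated with a small CPU-side Python sketch. It is not the authors' kernel code: Python threads stand in for the threads of one block, threading.Barrier plays the role that __syncthreads() plays in CUDA, and the buffer names and the block size of 8 are hypothetical.

```python
import threading

NUM_THREADS = 8                          # hypothetical block size
global_mem = list(range(NUM_THREADS))    # stands in for off-chip memory
shared_mem = [0] * NUM_THREADS           # stands in for on-chip memory
results = [0] * NUM_THREADS
barrier = threading.Barrier(NUM_THREADS)

def block_thread(tid):
    # Phase 1: each thread copies one word from "global" to "shared" memory.
    shared_mem[tid] = global_mem[tid]
    # Barrier: no thread starts computing until every copy has completed.
    barrier.wait()
    # Phase 2: each thread also reads a word that another thread copied.
    neighbor = (tid + 1) % NUM_THREADS
    results[tid] = shared_mem[tid] + shared_mem[neighbor]
    # Barrier: all computations finish before results are written back.
    barrier.wait()
    global_mem[tid] = results[tid]

def run_block():
    threads = [threading.Thread(target=block_thread, args=(t,))
               for t in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(global_mem)

block_output = run_block()
```

Without the first barrier, a thread could read its neighbor's slot before the neighbor had filled it, which is exactly the hazard the text describes.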
Implicit automatic thread synchronization occurs in programs with divergent control flows, i.e., where conditional branches in the code may have different outcomes for different threads in a warp. Programmers must carefully consider using conditionals (if-then-else and switch statements) in kernel code, because they may cause significant performance losses in a GPU architecture. If a thread has two nested if-then-else statements with independent conditions, then typically only 25% of each GPU processor can be exploited, because all four combinations of the condition values must be executed in sequence, rather than in parallel.
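The 25% figure can be reproduced with a simplified model of warp execution. The Python sketch below assumes that the warp runs one serialized pass per distinct control-flow path, that all paths have equal length, and that only the lanes on the current path are active; the lane-to-condition assignment is a hypothetical example.

```python
WARP_SIZE = 32

def warp_utilization(path_of_lane):
    """Return (serialized passes, average fraction of active lanes)."""
    paths = [path_of_lane(lane) for lane in range(WARP_SIZE)]
    distinct = sorted(set(paths))
    # Simplified SIMT model: one pass per distinct path, during which
    # only the lanes that took that path do useful work.
    active_per_pass = [paths.count(p) for p in distinct]
    utilization = sum(active_per_pass) / (len(distinct) * WARP_SIZE)
    return len(distinct), utilization

def nested_branches(lane):
    # Two independent conditions, derived here from two bits of the lane id.
    outer = (lane & 1) == 0
    inner = (lane & 2) == 0
    return (outer, inner)

passes, utilization = warp_utilization(nested_branches)
uniform_passes, uniform_utilization = warp_utilization(lambda lane: 0)
```

With two independent conditions the model yields four passes and a utilization of 0.25, whereas a warp whose lanes all follow the same path needs a single pass at full utilization.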
As mentioned above, the CUDA programming model employs three types of thread parallelism:
• Parallelism between thread blocks, where synchronization is impossible;
• Parallelism within a thread block, where synchronization can be requested by the designer;
• Parallelism within a thread warp, where synchronization is automatically ensured by the GPU hardware.
From a hardware perspective, there are three execution hierarchies: the streaming multiprocessor (SM), which executes cooperative thread arrays (CTAs, i.e., thread blocks), the warp, and the SIMD lanes. At kernel startup, each thread block is assigned to an SM and each thread is assigned to a SIMD lane. If block-level explicit synchronization barriers are used, then the SM hardware will wait for all threads in a block to reach the barrier before any thread is allowed to continue beyond it. Using the warp-level synchronization feature of CUDA cooperative groups, threads are synchronized only at the warp level, and other warps can continue to execute. This is especially important in our case, because we can map the elements in a cluster to the threads in a warp and partition each block into tiles of size equal to the warp size. Because each cluster is modeled independently, we can synchronize threads at the warp level and avoid frequent block-level synchronizations.
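The partition of a block into warp-sized tiles amounts to simple index arithmetic, sketched below in Python. The policy of assigning cluster k to tile k is a hypothetical choice made for illustration and is not taken from the authors' listings; in CUDA, a partition of this kind is what the cooperative groups tiled_partition facility provides.

```python
WARP_SIZE = 32  # warp size stated in the text

def tile_and_lane(thread_idx):
    """Split a linear thread index within a block into (tile, lane)."""
    return divmod(thread_idx, WARP_SIZE)

def clusters_to_tiles(block_size, num_clusters):
    """Assign cluster k to tile k (hypothetical policy); list lanes per cluster."""
    mapping = {}
    for thread_idx in range(block_size):
        tile, lane = tile_and_lane(thread_idx)
        if tile < num_clusters:
            mapping.setdefault(tile, []).append(lane)
    return mapping

# A 128-thread block holds four tiles; three of them receive a cluster.
mapping = clusters_to_tiles(block_size=128, num_clusters=3)
```

Because every cluster lives entirely inside one tile, its threads only ever need to wait for each other, never for the threads of other tiles in the block.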

Register-Based Parallel Reduction
This programming technique allows a thread to read a register directly from another thread within the same warp, so that threads can exchange or broadcast data among each other very efficiently. The idea of parallel reduction is illustrated in Figure 3, where the __shfl_down() CUDA instruction calculates the source and the destination of each reduction step, so that in $N$ steps $2^N$ data elements are reduced via an associative operation (e.g., addition) within a warp, without the need for expensive explicit synchronization barriers (the maximum value of $N$ for which this can be performed with warps of size 32 is 5). The final stage of reduction, beyond the five iterations supported by a warp, is performed less efficiently in shared or global memory for all warps belonging to the same block via explicit barrier synchronization. These two kinds of reduction are both exploited in our model to optimize the final accumulation of the results computed by each warp to generate the total CIR.
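The data movement of the shuffle-based reduction can be emulated on the CPU. In the Python sketch below, shfl_down is a stand-in that mimics the lane-to-lane read of the CUDA instruction (out-of-range lanes keep their own value), and block_reduce_sum models the second, block-level stage as a plain sum over per-warp partial results. Both function names are hypothetical.

```python
WARP_SIZE = 32

def shfl_down(values, offset):
    """Lane i reads the value held by lane i + offset (own value if out of range)."""
    return [values[i + offset] if i + offset < WARP_SIZE else values[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(values):
    """Reduce 32 values in log2(32) = 5 shuffle steps; lane 0 holds the sum."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    steps = 0
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
        steps += 1
    return vals[0], steps

def block_reduce_sum(values):
    """Second stage: combine the per-warp sums (done in shared memory on a GPU)."""
    partial = [warp_reduce_sum(values[i:i + WARP_SIZE])[0]
               for i in range(0, len(values), WARP_SIZE)]
    return sum(partial)

warp_total, num_steps = warp_reduce_sum(list(range(32)))
block_total = block_reduce_sum(list(range(64)))
```

Only lane 0 is guaranteed to hold the full sum after the last step; the other lanes accumulate partially meaningless values, which is also the case for the hardware instruction.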

Global Memory
Global memory is the off-chip dynamic RAM (DRAM) available on the GPU board, and it is typically separate from the CPU memory. It is used as a communication buffer for large amounts of data between the CPU and the GPU. It has high latency and relatively low bandwidth, similar to CPU main memory, compared with the on-chip levels of the memory hierarchy. The host code is in charge of transferring data between the host memory space and the global memory. Arrays (less frequently scalars) statically allocated in global memory must be tagged as __device__ in CUDA.

Shared Memory
Shared memory is an on-chip memory with low latency and very high bandwidth (similar to an L1 cache), local to each streaming multiprocessor and accessible only by threads in the same block. Developers must explicitly specify shared memory data, using the __shared__ storage attribute to allocate arrays in shared memory, and move data between global and shared memory using kernel code. In our work, threads compute the channel response for each transmitter-receiver antenna port in a cluster and require repeated reading of the input data of Equation (1). Because the CUDA global memory is not fast enough to provide data to all processing elements, a two-step loading mechanism is used. First, the input data are loaded into the on-chip shared memory in a coalesced fashion, and then the data are accessed from there for the CIR computation.
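The benefit of the two-step loading can be quantified by counting accesses to the slow memory. The following Python sketch is a simplified model under stated assumptions: the tile size, the reuse count, and the summation used as the "computation" are hypothetical, and coalescing is not modeled beyond reading each tile as one contiguous slice.

```python
def naive_global_reads(num_inputs, reuse_count):
    """Every reuse of every input goes to global memory."""
    return num_inputs * reuse_count

def staged_compute(global_data, tile_size, reuse_count):
    """Load each tile once into 'shared' memory, then reuse it on chip."""
    global_reads = 0
    total = 0.0
    for start in range(0, len(global_data), tile_size):
        # Step 1: one contiguous read of the tile from "global" memory.
        shared_tile = global_data[start:start + tile_size]
        global_reads += len(shared_tile)
        # Step 2: repeated accesses are served from the on-chip copy.
        for _ in range(reuse_count):
            total += sum(shared_tile)
    return total, global_reads

data = [1.0] * 64
total, staged_reads = staged_compute(data, tile_size=16, reuse_count=4)
naive_reads = naive_global_reads(len(data), reuse_count=4)
```

In this model the staged version touches global memory once per input, so the saving grows linearly with the number of times each input is reused.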

Channel Emulator Acceleration on GPU
The channel model output computation is a set of FIR filters, one per path. Consequently, the sampled signal at the receiver can be expressed as the sum over paths of a convolution between the taps of this FIR filter and the channel model input signal. In this study, we used a two-kernel acceleration:
1. A less computation-intensive kernel computes the FIR coefficients, i.e., the CIR, according to Equation (4). Its pseudocode is shown in Listing 1.
2. A more computation-intensive FIR kernel applies the coefficients to each input symbol, as in Equation (6). Its pseudocode is shown in Listing 2.
In addition to the two kernels, our accelerator also includes host code that is written in C++ and executed on the host CPU. It is responsible for interacting with the simulation clients via sockets, performing preliminary model configurations, and handling data transfers with the GPU.
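The division of work between the two kernels can be sketched on the CPU as follows. This Python fragment is an illustration, not a transcription of Listings 1 and 2: compute_taps reduces the coefficient computation to one complex gain per path, and the gains, phases, and integer sample delays are hypothetical inputs.

```python
import cmath

def compute_taps(path_gains, phases):
    """Stand-in for the first kernel: one complex coefficient per path."""
    return [g * cmath.exp(1j * p) for g, p in zip(path_gains, phases)]

def fir_filter(taps, delays, x):
    """Stand-in for the second kernel: y[k] = sum_p taps[p] * x[k - delays[p]]."""
    y = [0j] * len(x)
    for k in range(len(x)):
        acc = 0j
        for h, d in zip(taps, delays):
            if k - d >= 0:          # samples before the start are zero
                acc += h * x[k - d]
        y[k] = acc
    return y

taps = compute_taps([1.0, 0.5], [0.0, 0.0])   # two paths, zero phase
y = fir_filter(taps, delays=[0, 2], x=[1, 2, 3, 4])
```

The coefficient step runs once per channel update, while the filtering step runs for every input sample, which is consistent with the second kernel being the more computation-intensive one.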
The architecture of the proposed accelerated channel model is shown in Figure 4.
The design uses CUDA cooperative groups to eliminate the need for block-level synchronization, because each cluster is computed independently. For efficient use of GPU resources, the long chain of computations is split into parts, as shown in Listing 1, where SpeedVect and ClusterVect are computed in shared memory. This allows threads to remain active because there is no penalty for context switching. The register-based warp-wise parallel reduction of the FIR taps helps improve latency and resource utilization.

Results and Discussion
The baseline CPU performance was determined using an Intel Core i7-6900K @3.2 GHz CPU. The baseline channel model is implemented in C++ and runs as a MEX C++ function within a MATLAB R2021a environment. The performance of the channel model is evaluated using the benchmark values in Table 1. To evaluate the performance for link-level simulations, we consider two CDL profiles, i.e., CDL-B for NLOS clusters and CDL-D for LOS clusters. Figure 5 illustrates various MIMO antenna element configurations for the CDL-B profile, with single-polarized antennas in Figure 5a,b and dual-polarized arrays in Figure 5c,d at the transmitter and receiver ends. Similarly, the same is reported for CDL-D in Figure 6, where Figure 6a,b show antenna patterns for single-polarized arrays and Figure 6c,d illustrate dual-polarized antennas at the transmitting and receiving ends.
The accelerator discussed in this paper was developed using the CUDA development tools [27], targeting the NVIDIA GeForce GTX 1070 GPU [28], which features 1920 CUDA cores, 120 texture mapping units (TMUs), 1.5 MB of shared memory, 4 MB of local memory, 8 GB of GDDR5 memory, and 15 SMs. We compare the performance of the GPU accelerator with an FPGA implementation [19], which was developed using the Vitis Unified Software Platform [29] for the AMD Alveo U280 [30]. The FPGA used in [19] is based on the same 16 nm technology node as the GPU and contains 9024 digital signal processing (DSP) blocks, 41 MB of on-chip static RAM, 1,303,680 look-up tables, and 8 GB of high bandwidth memory (HBM2). Thus, its computational power is comparable to that of the GPU used in this work, because (1) a DSP unit can be used to implement a single-precision (SP) multiply and add, and (2) only 1/3 of the total FPGA resources are used in [19], so that the kernel can fit on one chiplet to avoid routing problems.
The primary goal of this work is to reduce the overall execution time of the channel model under resource constraints. We report the achieved performance for the kernels in Listings 1 and 2 on GPU platforms. To analyze the performance for both LOS and NLOS scenarios, we consider CDL-B and CDL-D profiles and uplink and downlink connection types (Tables 2 and 3). Table 2 reports the execution latency for various combinations of Rx and Tx antenna elements considering NLOS clusters in the CDL-B profile for the parameters listed in Table 1.
Figure 7 illustrates a comparison of link-level simulation latency on CPU and GPU platforms in the two CDL profiles. The values on the horizontal axis represent the number of receiving and transmitting antenna elements, whereas the vertical axis denotes the total execution time in logarithmic scale. It can be inferred from Figure 7 that the GPU implementation greatly reduces the simulation time and enables network planners to simulate more complex propagation scenarios with higher Doppler shift and even more antenna elements.
To analyze the effect of both transferring larger amounts of data (64 bits vs. 32 bits per value) and performing computations with greater precision, the accelerators were designed for both double-precision (DP) and SP on both acceleration platforms. Table 4 reports the achieved performance and energy consumption for the FPGA and GPU acceleration platforms for the CDL-B delay profile. Overall, the optimizations result in a large speedup of 240× in comparison to the baseline CPU implementation. The achievable performance is memory bound due to the limited on-chip shared memory of the GPU, hence the need to repeatedly read large amounts of data from the DRAM rather than storing it on-chip as was performed on the FPGA. For power analysis, CPU results are calculated based on its thermal design power (TDP) because we have no way to measure its power consumption in real time. The CPU energy consumption is very high due to the high execution latency on the CPU platform. The energy consumption of the FPGA is lower than that of the GPU because the data are copied only once into on-chip buffers (our FPGA has more on-chip memory than our GPU). The DP version consumes more power in both cases due to more data being copied from DRAM and higher execution latency due to both memory access and on-chip computation. The GPU and FPGA power consumption is measured using the respective runtime support. In both cases, they are lower than their respective TDPs because only one-third of the on-chip compute resources were utilized due to memory bandwidth constraints, as shown in Table 5. Table 5 reports resource usage for the acceleration platforms. For the FPGA, only one of the three chiplets (also called SLRs) in the package was used to achieve a good clock period.
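The metrics discussed above are related by simple arithmetic: speedup is a ratio of latencies, energy is average power multiplied by latency, and the percentage comparisons are relative differences. The Python helpers below make these relations explicit. The numbers passed to them are round placeholders chosen for illustration and are not the measured values of Table 4.

```python
def speedup(baseline_latency_s, accelerated_latency_s):
    """Ratio of baseline latency to accelerated latency."""
    return baseline_latency_s / accelerated_latency_s

def energy_joules(avg_power_w, latency_s):
    """Energy as average power (or TDP, for the CPU estimate) times latency."""
    return avg_power_w * latency_s

def relative_gain_percent(new, reference):
    """Relative difference of 'new' with respect to 'reference', in percent."""
    return 100.0 * (new - reference) / reference

# Placeholder inputs only; not taken from the paper's tables.
example_speedup = speedup(480.0, 2.0)
example_energy = energy_joules(100.0, 2.0)
example_gain = relative_gain_percent(4.0, 3.0)
```

Because energy scales with latency, a platform with a high TDP-based power estimate and a long run time, such as the CPU baseline, ends up with a much larger energy figure than either accelerator.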
Coding Style: CUDA vs. HLS
Although both FPGA and GPU provide parallel computation, writing source code to efficiently program them is very different. In the case of the GPU, it is necessary to explicitly exploit the multi-threaded nature of the platform by exploiting the three-dimensional parallel loop structure of the thread blocks, as shown in Listings 1 and 2 (note the absence of any explicit loop construct). On the other hand, the accelerated code for the FPGA is actually more similar to the CPU version, with only the addition of (1) loops to transfer data from DRAM to on-chip memory, and (2) loop pipelining, loop unrolling, and array partitioning directives to expose parallelism in the computation and memory architecture in a form appropriate for HLS.
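The difference between the two styles can be shown with a toy element-wise operation. In the Python sketch below, scale_loop follows the loop-based CPU/HLS style, whereas scale_kernel contains no loop and handles the single element selected by its block and thread indices; the launch helper emulates sequentially what the GPU would run in parallel. All names are hypothetical, and a one-dimensional grid is used for brevity.

```python
def scale_loop(data, gain):
    """CPU/HLS style: an explicit loop walks over the data."""
    out = [0.0] * len(data)
    for i in range(len(data)):
        out[i] = gain * data[i]
    return out

def scale_kernel(block_idx, thread_idx, block_dim, data, gain, out):
    """GPU style: no loop; each thread derives its own element index."""
    i = block_idx * block_dim + thread_idx
    if i < len(data):               # guard threads beyond the data size
        out[i] = gain * data[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Sequential emulation of a 1D kernel launch."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

data = [1.0, 2.0, 3.0, 4.0, 5.0]
out = [0.0] * len(data)
launch(scale_kernel, 2, 4, data, 2.0, out)   # 8 threads cover 5 elements
```

The loops that disappear from the kernel body reappear in the launch configuration, which is why porting CPU code to a GPU requires restructuring rather than annotation.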

Conclusions
In this paper, we have presented an efficient implementation of a 3GPP 3D channel simulation model for GPU platforms, using various CUDA optimization techniques to exploit the parallelism and memory hierarchy of the GPU. The channel model is highly parameterized and can simulate a wide range of configurations required for real-world optimized 5G network deployments. The proposed GPU accelerator can provide fast and accurate channel simulations for 5G NR network planning and optimization, reducing the simulation time by about 240× compared to a CPU-based C++ model. The degree of performance improvement is limited by the amount of on-chip memory, which limits concurrency. The GPU design also has higher single-precision performance than a previous FPGA design, but at the cost of higher power consumption. It is interesting to note that although FPGAs typically have lower floating-point performance than GPUs, in this case, the FPGA has higher performance due to the larger amount of on-chip memory used to store the data and reduce DRAM accesses, thus offsetting the lower computational performance compared to a GPU for this very memory-intensive channel model. This work demonstrates the feasibility and benefits of using GPU-based hardware acceleration for 5G NR channel model simulations, and provides a valuable tool for network designers and researchers. Future work could include extending the channel model to support more propagation scenarios and antenna configurations, as well as integrating the channel model with other components of the 5G simulation stack, such as physical layer and link layer models.

Figure 5. MIMO antenna configuration in CDL-B profile for NLOS clusters.

Figure 6. MIMO antenna configuration in CDL-D profile for LOS clusters.

Figure 7. Execution time on CPU and GPU platforms.

Table 1. Summary of channel model emulator parameters.

Table 2. Kernel latency for a combination of MIMO elements in CDL-B NLOS.

Table 3. Kernel latency for a combination of MIMO elements in CDL-D LOS.

Table 4. Accelerated kernel latency and energy consumption.

Table 5. Resource utilization of accelerated designs.