GPU Accelerated PIC and SIC for OFDM-NOMA

Abstract: Non-orthogonal multiple access (NOMA) is a candidate multiple access scheme for fifth-generation (5G) cellular networks. In NOMA systems, all users operate at the same frequency and time, which makes decoding at the receiver side challenging. In this work, the two most popular receiver structures for the NOMA reverse channel, successive interference cancellation (SIC) and parallel interference cancellation (PIC), are implemented on a graphics processing unit (GPU) and compared. Orthogonal frequency division multiplexing (OFDM) is considered. The high computational complexity of interference cancellation receivers undermines the potential deployment of NOMA systems. GPU acceleration, however, mitigates this weakness, and our numerical results show speedups of about 75 to 220 times compared to a multi-thread implementation on a central processing unit (CPU). The SIC and PIC multi-thread execution times on different platforms reveal the potential of GPUs in wireless communications. Furthermore, the successful decoding rates of SIC and PIC are evaluated and compared in terms of bit error rate.


Introduction
Non-orthogonal multiple access (NOMA) in the power domain has been proposed as a promising multiple access technique for the upcoming fifth-generation cellular networks. In theory, NOMA fulfills the spectral efficiency requirements of 5G by serving multiple users simultaneously in the same frequency band [1]. For example, in the reverse link of a NOMA system, all users transmit at the same time and in the same frequency band, so that the base station receives the superposition of the signals transmitted by the users. This is in contrast to the multiple access methods used in conventional cellular systems, where one user is allocated per time slot or frequency unit [2]. Early works on NOMA demonstrated that it can significantly improve the sum capacity and cell-edge user throughput [3,4].
In NOMA systems, since each user is seen as interference by the others, advanced interference cancellation is required for successful decoding [5]. In the literature, successive interference cancellation (SIC) and parallel interference cancellation (PIC) techniques are considered for NOMA. However, both techniques require massive computation power. For example, in [6], the authors reported that their NOMA testbed could support only nine users due to the limited computation power of their personal computers. 5G networks are expected to boost data rates while significantly reducing latency, which requires powerful and efficient baseband processing at the receiver. In this work, the two most popular interference cancellation schemes, PIC and SIC, are implemented and compared on a graphics processing unit (GPU) using CUDA to speed up the most time-consuming process of a NOMA receiver. Numerical results show that PIC, due to its parallel architecture, is more suitable for GPU implementation and outperforms SIC with a speedup of over 70 times. Moreover, interference cancellation processing times as low as 7 ms, even for a large number of users, are observed with PIC on GPU. The results also reveal the feasibility of interference cancellation techniques for NOMA and their competitiveness in commercial solutions, particularly when accelerated by GPU using CUDA.
In addition to their computational performance, the transmission reliability of SIC and PIC receivers may differ [7]. In SIC receivers, multi-user interference is removed successively, so that the reliability of the system depends heavily on the correct decoding of the first user. Moreover, in SIC receivers, the received powers of the users should differ so that the receiver can distinguish the users in the power domain and decode them successfully. In PIC receivers, on the other hand, all the users are decoded at once in parallel, and the performance does not depend on the difference between the received powers as it does in SIC receivers. However, in order to support a large number of users, PIC-based NOMA systems may require signature signals for each user, as in code division multiple access (CDMA), and are more suitable for code domain NOMA [8,9]. In this work, we further present the relevant signal models for both SIC and PIC receivers and discuss their bit-error-rate (BER) performance.
This work is organized as follows. In the next section, we present the system model, where the SIC and PIC receivers for OFDM are described. In Section 3, the CUDA implementation of the receivers is described. Numerical results are presented and discussed in Section 4. We summarize the study in Section 5.

System Model
The considered NOMA system model in the reverse link consists of a single cell with K users. Each user uses all N subcarriers of the orthogonal frequency division multiplexing (OFDM) symbol, and the users are distinguished at the base station by interference cancellation techniques. Let α_k be the channel attenuation between the k-th user and the base station. We assume that the K users are distributed in the coverage area of the base station such that |α_1|^2 > |α_2|^2 > ... > |α_K|^2, so that the first user is the closest to the base station and the K-th user is the farthest from it. The signal received by the base station is the superposition of all the transmitted signals and can be written as: where P is the transmitted power, which is equal for each user, n(t) is the additive white Gaussian noise term with variance σ_n^2, and s_k(t) is the OFDM symbol transmitted by the k-th user. The discrete-time domain OFDM symbol transmitted by each user can be written as: where S_k[n] is a complex symbol at subcarrier n (either a phase shift keying (PSK) or a quadrature amplitude modulation (QAM) symbol) and N is the OFDM size. Then, the continuous-time OFDM waveform transmitted by the k-th user, s_k(t), can be obtained as: where T_s is the baud rate and p(t) is the pulse-shaping filter.
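The three expressions referred to in this paragraph can be sketched from the surrounding definitions as follows. This is a reconstruction, not a verbatim copy of the original equations: the amplitude scaling \sqrt{P} (chosen so that the received power of user k is P|\alpha_k|^2), the 1/\sqrt{N} normalization, the sample index m, and the use of T_s as the spacing between samples are all assumptions.

```latex
\begin{align*}
  r(t)   &= \sum_{k=1}^{K} \sqrt{P}\,\alpha_k\, s_k(t) + n(t), \\
  s_k[m] &= \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} S_k[n]\, e^{\,j 2\pi n m / N},
            \qquad m = 0, 1, \ldots, N-1, \\
  s_k(t) &= \sum_{m=0}^{N-1} s_k[m]\, p(t - m T_s).
\end{align*}
```

With this scaling, the ordering |\alpha_1|^2 > \ldots > |\alpha_K|^2 translates directly into an ordering of the received powers, which is what the SIC receiver relies on.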
The signal received by the base station in Equation (2) contains the signals from all users, and in order to decode each user successfully, the base station implements interference cancellation. The block diagrams of the two considered receivers for OFDM, SIC and PIC, are given in Figures 1 and 2, respectively. Both techniques include OFDM demodulation and modulation operations. Specifically, the fast Fourier transform (FFT) obtains the complex symbols at each subcarrier, the modulation/demodulation blocks map bit sequences to complex symbols and back, and the inverse fast Fourier transform (IFFT) obtains the time domain OFDM signal from the frequency domain complex symbols. The MATLAB functions used for FFT and IFFT computations use multiple CPU cores, while the rest of the computations, including loops, are not parallelized.
In the SIC receiver (see Figure 1), the information signal of each user is decoded sequentially in an iterative manner [10]. The signal received by the base station is essentially the sum of the OFDM signals received from each user. The first signal the receiver decodes (i.e., in the first iteration) belongs to the user closest to the base station, whose signal has the highest weight in the received signal, while the other users' signals are treated as interference. After the bit sequences are obtained (ideally without any error), the OFDM signal of that particular user is regenerated by following the exact procedure of an OFDM transmitter and adjusting its amplitude and phase to match the channel from that user to the base station. The regenerated signal is then subtracted from the received signal, and the process is repeated until all the users are decoded. At iteration k, assuming perfect cancellation at each previous iteration and perfect phase/amplitude adjustment during regeneration [11], the time domain OFDM signal for the k-th user, ŝ_k(t), becomes: The signal-to-interference-plus-noise ratio (SINR) of the k-th user per subcarrier can be written as: Then, the bit error rate (BER) for the k-th user per subcarrier with SIC can be written for binary transmission as [12]. In this work, the channel is assumed to be frequency flat, so that the per-subcarrier SNR and BER values become the overall values for that particular user k.
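The SIC loop described above (FFT, demodulation, regeneration, subtraction, repeated once per user) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes QPSK, a noise-free frequency-flat channel, a naive DFT in place of the FFT, and hypothetical complex gains `gains[k]` that stand for sqrt(P)·α_k and are spaced widely enough for every decision to be correct.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive unitary DFT/IDFT; a stand-in for the FFT and IFFT blocks."""
    n = len(x)
    sign = 1 if inverse else -1
    return [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n)) / n ** 0.5
            for k in range(n)]


def modulate(bits):
    """Map bit pairs (b0, b1) to unit-energy QPSK symbols."""
    return [complex(1 - 2 * bits[i + 1], 1 - 2 * bits[i]) / 2 ** 0.5
            for i in range(0, len(bits), 2)]


def demodulate(symbols):
    """Hard-decision QPSK demapping, the inverse of modulate()."""
    bits = []
    for s in symbols:
        bits += [int(s.imag < 0), int(s.real < 0)]
    return bits


def ofdm_transmit(bits):
    """Bits -> QPSK symbols -> time-domain OFDM samples (IFFT)."""
    return dft(modulate(bits), inverse=True)


def sic_receive(received, gains):
    """Decode users one at a time; gains must be ordered strongest first."""
    residual = list(received)
    decoded = []
    for g in gains:
        # FFT, then undo this user's gain and phase before the hard decision
        bits = demodulate([y / g for y in dft(residual)])
        decoded.append(bits)
        # Regenerate this user's signal and strip it from the residual
        regenerated = ofdm_transmit(bits)
        residual = [y - g * s for y, s in zip(residual, regenerated)]
    return decoded


def simulate(n_subcarriers=8,
             gains=(1.0, 0.4 * cmath.exp(0.3j), 0.15 * cmath.exp(-1.1j)),
             seed=1):
    """Superimpose all users' OFDM signals (no noise) and run SIC on the sum."""
    rng = random.Random(seed)
    user_bits = [[rng.randint(0, 1) for _ in range(2 * n_subcarriers)] for _ in gains]
    signals = [ofdm_transmit(b) for b in user_bits]
    received = [sum(g * s[m] for g, s in zip(gains, signals)) for m in range(n_subcarriers)]
    return user_bits, sic_receive(received, gains)


sent, decoded = simulate()
print("all users decoded correctly:", decoded == sent)
```

The loop over `gains` is inherently sequential, because each iteration needs the residual left by the previous one; only the per-subcarrier work inside an iteration is independent.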
In the PIC receiver (see Figure 2), interference cancellation occurs in two stages. In the first stage, the information signal of each user is decoded collectively [13]. For example, in order to decode the k-th user's signal, the other users' signals are decoded and regenerated in parallel, and their sum is then subtracted from the received signal. The stripped signal contains only the signal of the k-th user, so that, in the second stage, it can be decoded using OFDM demodulation. Once the set of regenerated OFDM signals is obtained in the first stage, the signal of each user can easily be obtained. For PIC, the time domain OFDM signal for the k-th user, ŝ_k(t), can be written as: Derivation of a closed-form SINR for PIC is mathematically intractable, since it depends on the decoding errors in the first stage. The BER for the k-th user per subcarrier for a binary transmission can still be expressed as [7,14,15]: where s is the number of stages in the PIC receiver (two in our case). Note that Equation (6) is obtained with an approach similar to that used for multi-stage receivers in code division multiple access (CDMA) with unit processing gain [7]. In CDMA systems, however, each user has a signature code that helps to distinguish it from the others. In power domain NOMA systems, on the other hand, these unique codes may not be present, and such systems can fail to support a large number of users, since the interference grows stronger as the number of users increases. Code domain NOMA can be considered in order to obtain comparable reliability for SIC and PIC; however, the comparison of computational performance will be analogous to that in the power domain [16,17]. In both techniques, any decoding error that occurs in an intermediate stage propagates to the other stages of the receiver. Furthermore, both receivers rely on accurate power allocation among the users to ensure successful interference cancellation. These factors affect the reliability of the receivers (e.g., their bit error rate performance); however, they do not change the computational complexity.
In this work, without loss of generality, equal power allocation among the users and power domain NOMA are considered.
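The two-stage PIC structure described in this section can be sketched as follows. This is an illustrative simplification under stated assumptions, not the authors' code: it works directly on the frequency-domain symbols of one OFDM symbol (equivalent to time-domain cancellation for a frequency-flat channel, because the FFT is linear), uses QPSK hard decisions, ignores noise, and the function names and the example gains are hypothetical.

```python
import random


def qpsk_decide(y):
    """Return the unit-energy QPSK point nearest to y."""
    return complex(1 if y.real >= 0 else -1, 1 if y.imag >= 0 else -1) / 2 ** 0.5


def pic_two_stage(received, gains):
    """Two-stage PIC on per-subcarrier symbols; gains[k] stands for sqrt(P) * alpha_k."""
    n_users = len(gains)
    # Stage 1: tentative decisions for every user from the same received symbols.
    # These decisions are independent of each other, so they can run in parallel.
    tentative = [[qpsk_decide(y / g) for y in received] for g in gains]
    # Stage 2: for each user, subtract the regenerated signals of all other users,
    # then decide again on the stripped signal.
    final = []
    for k in range(n_users):
        stripped = [y - sum(gains[i] * tentative[i][n] for i in range(n_users) if i != k)
                    for n, y in enumerate(received)]
        final.append([qpsk_decide(y / gains[k]) for y in stripped])
    return final


def random_qpsk(n, rng):
    """Draw n random unit-energy QPSK symbols."""
    return [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) / 2 ** 0.5 for _ in range(n)]


rng = random.Random(3)
gains = [1.0, 0.4]                       # two users, well separated in power
sent = [random_qpsk(16, rng) for _ in gains]
received = [sum(g * s[n] for g, s in zip(gains, sent)) for n in range(16)]
decided = pic_two_stage(received, gains)
print(all(abs(d - s) < 1e-9 for du, su in zip(decided, sent) for d, s in zip(du, su)))
```

In this power-domain setting without signature codes, the stage-one decision for a weak user is dominated by the strongest user's symbol, so with more users or closer power levels the second stage can inherit wrong tentative decisions. This is the error propagation effect discussed above.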

CUDA Implementation
The SIC and PIC techniques were implemented on both a central processing unit (CPU) and a GPU, and their computational speeds were then compared. The CUDA codes were compiled on a machine running Ubuntu with an NVIDIA TITAN Xp graphics card that had 12 GB of memory, 3840 CUDA cores, and a clock rate of 1582 MHz. The NVCC compiler of the CUDA 9.2 platform and the GCC compiler for C++ were used. The CPU codes were compiled on a machine with a four-core 2.3-GHz Intel Core i7 and 16 GB of 1600-MHz DDR3L memory running macOS Mojave. The CPU results were obtained in MATLAB R2018b.
Figure 3 summarizes the functions and kernels used for the implementation of SIC and PIC on GPU using CUDA. For the FFT and IFFT tasks, the cuFFT library functions were called with forward and inverse parameters; these functions have O(n log n) time complexity [18]. One CUDA thread was assigned per subcarrier, and our GPU allowed up to 1024 threads per block. For SIC, the chain of operations for each user included FFT, demodulation, modulation, and IFFT with phase/amplitude adjustment (see Figure 1). CUDA kernels were developed for the demodulation, modulation, and subtraction tasks. Due to the nature of the SIC receiver, the same chain of operations had to be repeated for each user and could not run in parallel across users. The overall SIC computation time on GPU, however, can still be decreased by processing the demodulation, modulation, and subtraction operations one after the other while computing each of them in parallel across the subcarriers of the OFDM symbol of each user, following a divide and conquer approach [19]. Therefore, a grid of N_1 blocks, each with N_2 threads, was created so that the total number of parallel tasks for each kernel was equal to the number of subcarriers, i.e., N_1 × N_2 = N. These parallel tasks performed demodulation, modulation, and subtraction on the OFDM symbol, and this process constitutes one SIC iteration. The process was then repeated K times in order to decode the signal of each user.
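The grid arithmetic behind N_1 × N_2 = N can be illustrated with a short sketch. The helper names below are hypothetical and the block-size selection rule (the largest divisor of N that does not exceed 1024) is an assumption; the index formula mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def launch_config(n_subcarriers, max_threads_per_block=1024):
    """Return (N1 blocks, N2 threads per block) such that N1 * N2 == n_subcarriers."""
    n2 = min(n_subcarriers, max_threads_per_block)
    while n_subcarriers % n2:      # shrink the block until it divides N evenly
        n2 -= 1
    return n_subcarriers // n2, n2


def global_index(block_idx, thread_idx, block_dim):
    """Subcarrier handled by one thread of the grid."""
    return block_idx * block_dim + thread_idx


for fft_size in (2048, 4096):
    n1, n2 = launch_config(fft_size)
    print(fft_size, "subcarriers ->", n1, "blocks of", n2, "threads")
```

For the two FFT sizes used later in the paper this gives 2 and 4 full blocks of 1024 threads, and each SIC iteration launches its kernels over that same grid once per user.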
In the implementation of PIC on GPU, all the functions and kernels per subcarrier of each user can be executed in parallel. In this case, the number of threads per block was set equal to the number of users K, and each thread handled the tasks of one subcarrier for one user. The grid was then created with the number of blocks equal to the FFT size N, which made the total number of parallel tasks K × N.
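As a rough sketch of this mapping, the snippet below computes the PIC grid and the resulting task count. It reads the description above as "block index selects the subcarrier, thread index selects the user", which is an interpretation rather than something the text states explicitly, and the function names are hypothetical.

```python
MAX_THREADS_PER_BLOCK = 1024  # per-block thread limit quoted for the GPU used here


def pic_grid(n_subcarriers, n_users):
    """Return (blocks, threads_per_block): one block per subcarrier, one thread per user."""
    if n_users > MAX_THREADS_PER_BLOCK:
        raise ValueError("more users than threads available in one block")
    return n_subcarriers, n_users


def pic_task(block_idx, thread_idx):
    """Map a (block, thread) pair to the (subcarrier, user) it processes."""
    return block_idx, thread_idx


blocks, threads = pic_grid(4096, 350)
print("parallel tasks:", blocks * threads)
```

Under this reading, the largest configuration considered in the results (350 users with an FFT size of 4096) still fits within the 1024-thread block limit.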

Numerical Results and Discussion
In this section, we first discuss the BER performance of PIC and SIC and then present their implementation on GPU. A single-cell NOMA system with different numbers of active users was considered. The K users were distributed in the coverage area of the base station such that |α_1|^2 > |α_2|^2 > ... > |α_K|^2. The user locations were assumed to be fixed. The power received by the base station from the closest user (P|α_1|^2) was set at −90 dBm, and the received power difference between consecutive users was 2 dB, i.e., 10 log10(|α_k|^2 / |α_{k+1}|^2) = 2 dB. Figure 4 shows the BER of the first user versus the total number of users (K) for both PIC and SIC receivers. The results were obtained for different SNR levels of the first user, SNR = P|α_1|^2 / σ_n^2. When the SNR was taken as 15 dB, σ_n^2 was set at −105 dBm, and similarly, when the SNR was 20 dB, σ_n^2 was −110 dBm. The results in Figure 4 show that, for a large number of users in a cell, an increase in the SNR had an insignificant impact on the BER performance of the SIC receiver. This is because the performance was mainly limited by the interference from the other users (see Equation (4)). For the PIC receiver, on the other hand, an increase in the SNR had a significant impact on the BER performance even for a high number of users. It should be noted that the BER performance of the PIC receiver was obtained using Equation (6), which assumes that signature signals are present for each user and help in the detection.
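The interference-limited behaviour of the first user under SIC can be checked with a small link-budget calculation. The sketch below assumes that the SINR of the first user is its received power divided by the sum of all other users' received powers plus the noise power; this form is an assumption standing in for the expression referenced in the text, and the function names are illustrative.

```python
import math


def dbm_to_mw(p_dbm):
    """Convert a power level from dBm to milliwatts."""
    return 10 ** (p_dbm / 10)


def received_powers_dbm(n_users, strongest_dbm=-90.0, step_db=2.0):
    """Received power of each user, dropping by step_db from one user to the next."""
    return [strongest_dbm - step_db * k for k in range(n_users)]


def first_user_sinr_db(n_users, snr_db, strongest_dbm=-90.0, step_db=2.0):
    """SINR of the strongest user when all other users are treated as interference."""
    powers = [dbm_to_mw(p) for p in received_powers_dbm(n_users, strongest_dbm, step_db)]
    noise = dbm_to_mw(strongest_dbm - snr_db)   # e.g. -105 dBm for an SNR of 15 dB
    return 10 * math.log10(powers[0] / (sum(powers[1:]) + noise))


for snr in (15, 20):
    print(snr, "dB SNR ->", round(first_user_sinr_db(50, snr), 2), "dB SINR with 50 users")
```

With 50 users spaced 2 dB apart, raising the SNR from 15 dB to 20 dB changes this SINR by well under 0.1 dB, which is consistent with the observation that the SIC curves in Figure 4 barely move with SNR.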
Next, we discuss the implementation of the two interference cancellation methods on the GPU platform. Tables 1 and 2 summarize the computation times obtained on both GPU and CPU for the two interference cancellation techniques. The GPU results include only the execution times of SIC and PIC; in other words, the time spent copying global variables between GPU and CPU was neglected. The MATLAB functions used for FFT and IFFT computations used multiple CPU cores, while the rest of the computations, including loops, were not parallelized. The OFDM size was taken as 2048 in Table 1 and 4096 in Table 2, and quadrature phase shift keying (QPSK) with maximum likelihood (ML) decoding was considered [20][21][22]. A typical NOMA cell is expected to support about 50 users [16]; however, this can be increased by clustering, grouping, or using multiple input multiple output (MIMO) techniques [23,24]. In order to observe the trend with a large number of users, the number of users was varied from 50 to 350. The time spent on CPU differed sharply between the SIC and PIC schemes for both OFDM FFT sizes. In addition to the iterated tasks it shares with SIC, PIC performed a summation (see Figure 2) and a further round of FFT computation and demodulation. Consequently, PIC took about twice as long as SIC to execute on CPU in both tables. It was also observed that the FFT size had a marginal effect on the SIC execution time on the GPU platform, which ranged from about 69 ms for 50 users to 515 ms for 350 users. The SIC scheme ran slower on GPU than on CPU, although it was still faster than PIC on CPU. SIC ran slower on GPU because the process was iterative and depended on the frequency (clock rate) of each called core rather than on the number of cores. As mentioned earlier, our CPU had four cores running at 2.3 GHz each, while the frequency of our GPU cores was approximately 1.6 GHz (1582 MHz). Moreover, PIC on CPU with an FFT size of 4096 was the slowest of all the configurations, owing to the serial execution of the scheme and of the FFT computation on CPU. Fifty users were decoded in about 88 ms, while the time spent to decode 350 users reached an impractical 616 ms.
As seen in Tables 1 and 2, the SIC execution time on CPU started from only about 28 ms and 47 ms for 50 users and gradually increased to 157 ms and 330 ms for 350 users in a cell, for FFT sizes of 2048 and 4096, respectively. PIC on GPU, however, was the fastest. The execution time of PIC on GPU was almost 90 and 138 times faster than on CPU for the two OFDM FFT sizes and was less sensitive to the number of users. It took only 2.54 ms for 50 users and 6.88 ms for 350 users with an FFT size of 4096, and approximately 2 ms for any number of users with an FFT size of 2048. Furthermore, it was observed that, for a large number of users, PIC was nearly 75 times faster than SIC on GPU with an FFT size of 4096 and 220 times faster with an FFT size of 2048. This was because SIC had data dependencies between users across its iterations and had to be executed serially on both CPU and GPU.
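The quoted speedup factors can be cross-checked against the execution times reported in this section. The short calculation below uses only figures quoted in the text for 350 users; it assumes that the 515 ms SIC-on-GPU figure applies to the FFT size of 4096, which the text does not state explicitly.

```python
def speedup(baseline_ms, accelerated_ms):
    """Ratio of two execution times, i.e. how many times faster the second one is."""
    return baseline_ms / accelerated_ms


# Execution times for 350 users as quoted in the text (FFT size 4096)
pic_gpu_ms = 6.88
sic_gpu_ms = 515.0   # assumed to refer to the FFT size of 4096
pic_cpu_ms = 616.0

print("PIC vs SIC on GPU:", round(speedup(sic_gpu_ms, pic_gpu_ms), 1))
print("PIC on GPU vs PIC on CPU:", round(speedup(pic_cpu_ms, pic_gpu_ms), 1))
```

Both ratios land close to the factors of about 75 and 90 reported above, so the quoted times and speedups are mutually consistent for the 4096 case.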

Conclusions
In this work, a CUDA platform was proposed to implement computationally challenging interference cancellation schemes in the reverse channel of OFDM-NOMA networks. Both SIC and PIC schemes were implemented, and their algorithms, along with OFDM, were illustrated. Finally, their computation times were compared. Numerical results showed significant speedups of the PIC scheme with the GPU implementation as compared to CPU. Furthermore, for a large number of users, PIC was found to be approximately 75 times and 220 times faster than SIC on GPU for FFT sizes of 4096 and 2048, respectively.
5G base stations are expected to face computationally heavy baseband processing tasks. In addition to interference cancellation, other techniques such as MIMO at improved data rates are going to put a tremendous baseband processing load on 5G base stations. The results here demonstrated that PIC on GPU took only about 3 ms to decode 50 users, leaving room for other computationally heavy processes within the strict latency requirements of 5G networks.

Figure 1. Block diagram of a typical successive interference cancellation (SIC).

Figure 2. Block diagram of a typical parallel interference cancellation (PIC).

Figure 3. Functions and kernels for SIC and PIC using CUDA.

Figure 4. Bit error rate (BER) versus the number of users (K) for PIC and SIC.

Table 1. Comparison of the computational times of SIC and PIC in ms with an FFT size of 2048.

Table 2. Comparison of the computational times of SIC and PIC in ms with an FFT size of 4096.