1. Introduction
Non-orthogonal multiple access (NOMA) in the power domain has been proposed as a promising multiple access technique for the upcoming fifth-generation (5G) cellular networks. In theory, NOMA fulfills the spectral efficiency requirements of 5G by serving multiple users simultaneously in the same frequency band [1]. For example, in the reverse link of a NOMA system, each user transmits at the same time and in the same frequency band, so that the base station receives a superimposed version of the signals transmitted by the users. This is in contrast to the multiple access methods used in conventional cellular systems, where one user is allocated per time slot or frequency unit [2]. Early works demonstrated that NOMA can significantly improve the sum capacity and cell-edge user throughput [3,4].
In NOMA systems, since each user's signal appears as interference to the others, advanced interference cancellation is required for successful decoding [5]. In the literature, successive interference cancellation (SIC) and parallel interference cancellation (PIC) techniques are considered for NOMA. However, both techniques require massive computational power. For example, in [6], the authors reported that their NOMA testbed could support only nine users due to the limited computational power of their personal computers.
5G networks are expected to boost data rates while significantly reducing latency, which requires powerful and efficient baseband processing at the receiver. In this work, the two most popular interference cancellation schemes, PIC and SIC, are implemented and compared on a graphics processing unit (GPU) using CUDA to speed up the most time-consuming process of a NOMA receiver. Numerical results show that PIC, owing to its parallel architecture, is more suitable for GPU implementation and outperforms SIC with a speedup of over 70-times. Moreover, interference cancellation processing times as low as 7 ms, even for a large number of users, are observed with PIC on the GPU. The results also reveal the feasibility of interference cancellation techniques for NOMA and their competitiveness for commercial solutions, particularly when accelerated by a GPU using CUDA.
In addition to their computational performance, the reliability of transmission with SIC and PIC receivers may differ [7]. In SIC receivers, multi-user interference is removed in a successive manner, so that the reliability of the system heavily depends on the correct decoding of the first user. Moreover, in SIC receivers, the received powers from the users should differ so that the receiver can distinguish the users in the power domain and perform successful decoding. In PIC receivers, on the other hand, all the users are decoded at once in parallel, and the performance does not depend on the difference between the received powers as it does in SIC receivers. However, in order to support a large number of users, PIC-based NOMA systems may require signature signals for each user as in code division multiple access (CDMA) and are more suitable for code domain NOMA [8,9]. In this work, we further present the relevant signal models for both SIC and PIC receivers and discuss their bit error rate (BER) performances.
This work is organized as follows. In the next section, we present the system models, where the SIC and PIC receivers for OFDM are described. In Section 3, the CUDA implementation of the receivers is described. Numerical results are presented and discussed in Section 4. We summarize the study in Section 5.
2. System Model
The considered NOMA system model in the reverse link consists of a single cell with $K$ users. Each user uses all the $N$ subcarriers of the orthogonal frequency division multiplexing (OFDM) symbol, and the users are distinguished at the base station by interference cancellation techniques. Let $h_k$ be the channel attenuation between the $k$th user and the base station. We assume that the $K$ users are distributed in the coverage area of the base station such that

$$ |h_1|^2 \geq |h_2|^2 \geq \cdots \geq |h_K|^2, \quad (1) $$

so that the first user is the closest to the base station and the $K$th user is the farthest from the base station. The signal received by the base station is the superposition of all the transmitted signals and can be written as:

$$ y(t) = \sqrt{P} \sum_{k=1}^{K} h_k x_k(t) + w(t), \quad (2) $$
where $P$ is the transmitted power, which is equal for each user, $w(t)$ is the additive white Gaussian noise term with variance $\sigma^2$, and $x_k(t)$ is the OFDM symbol transmitted by the $k$th user. The discrete-time domain OFDM symbol transmitted by each user can be written as:

$$ x_k[m] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} d_k[n] \, e^{j 2 \pi n m / N}, \quad m = 0, \ldots, N-1, \quad (3) $$
where $d_k[n]$ is the complex symbol at subcarrier $n$ (either a phase shift keying (PSK) or quadrature amplitude modulation (QAM) symbol) and $N$ is the OFDM size. Then, the continuous-time OFDM waveform transmitted by the $k$th user, $x_k(t)$, can be obtained as

$$ x_k(t) = \sum_{m=0}^{N-1} x_k[m] \, p(t - mT), $$

where $1/T$ is the baud rate and $p(t)$ is the pulse-shaping filter.
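The superposition model above can be illustrated numerically. Below is a minimal NumPy sketch (the authors' own implementations used MATLAB and CUDA, not this code); the values of K, N, P, and the noise variance, as well as the Rayleigh channel draw, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 4, 64             # users and OFDM size (illustrative values)
P = 1.0                  # per-user transmit power (equal for all users)
sigma2 = 0.01            # noise variance

# Channel attenuations sorted so that user 1 is the strongest (closest)
h = np.sort(rng.rayleigh(size=K))[::-1]

# Random unit-power QPSK symbols per user and subcarrier
bits = rng.integers(0, 2, size=(K, N, 2))
d = ((1 - 2 * bits[..., 0]) + 1j * (1 - 2 * bits[..., 1])) / np.sqrt(2)

# Discrete-time OFDM symbol per user: IFFT of the subcarrier symbols,
# scaled to match the 1/sqrt(N) normalization of the model above
x = np.fft.ifft(d, axis=1) * np.sqrt(N)

# Superimposed signal at the base station plus AWGN
w = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = np.sqrt(P) * (h[:, None] * x).sum(axis=0) + w
```

The array `y` plays the role of the received superposition that the SIC and PIC receivers described next must resolve into per-user signals.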
The received signal at the base station in Equation (2) includes the signals from all users, and in order to decode each user successfully, the base station implements interference cancellation. The block diagrams of the two considered SIC and PIC receivers for OFDM are given in Figure 1 and Figure 2, respectively. Both techniques include OFDM demodulation and modulation operations. To be specific, the fast Fourier transform (FFT) operation obtains the complex symbols at each subcarrier, the modulation/demodulation blocks deal with the mapping/demapping of bit sequences to complex symbols, and the inverse fast Fourier transform (IFFT) obtains the time domain OFDM signal from the frequency domain complex symbols.
In the SIC receiver (see Figure 1), the information signal of each user is decoded sequentially in an iterative manner [10]. The received signal at the base station is basically the sum of the OFDM signals received from each user. The first signal the receiver decodes (i.e., in the first iteration) belongs to the user that is closest to the base station, as its signal has the highest weight in the received signal, while the other users' signals are treated as interference. After the bit sequences are obtained (ideally without any error), the OFDM signal of that particular user is regenerated by following the exact procedure of an OFDM transmitter and adjusting its amplitude and phase as it passes through the channel from that particular user to the base station. The regenerated signal is then subtracted from the received signal, and the process is repeated until all the users are decoded. At iteration $k$, the time domain OFDM signal for the $k$th user, $y_k(t)$, becomes, assuming perfect cancellation at each previous iteration and perfect phase/amplitude adjustment during regeneration [11],

$$ y_k(t) = \sqrt{P} \, h_k x_k(t) + \sqrt{P} \sum_{i=k+1}^{K} h_i x_i(t) + w(t). \quad (4) $$
The signal-to-interference and noise ratio (SINR) of the $k$th user per subcarrier can be written as:

$$ \mathrm{SINR}_k = \frac{P |h_k|^2}{\sigma^2 + P \sum_{i=k+1}^{K} |h_i|^2}. \quad (5) $$

Then, the bit error rate (BER) for the $k$th user per subcarrier with SIC can be written for binary transmission as

$$ P_k = Q\!\left(\sqrt{2 \, \mathrm{SINR}_k}\right), $$

where $Q(\cdot)$ is the standard Q-function [12]. In this work, the channel is assumed to be frequency flat, so that the SINR and BER per subcarrier values become the overall values for that particular user $k$.
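The per-user SINR and BER expressions above can be evaluated directly. The following is a minimal sketch (not the authors' code) assuming perfect cancellation of the already-decoded users, equal transmit power, and illustrative channel gains; the Q-function is expressed through the complementary error function.

```python
import numpy as np
from math import erfc, sqrt

def q_func(z):
    # Standard Q-function: Q(z) = 0.5 * erfc(z / sqrt(2))
    return 0.5 * erfc(z / sqrt(2.0))

def sic_sinr_ber(h, P=1.0, sigma2=0.01):
    """Per-user SINR and binary BER for an idealized SIC receiver.

    h must be ordered strongest-first; user k sees only users k+1..K
    as residual interference (perfect cancellation of users 1..k-1).
    """
    h = np.asarray(h, dtype=float)
    K = len(h)
    sinr = np.empty(K)
    for k in range(K):
        interference = P * np.sum(h[k + 1:] ** 2)
        sinr[k] = P * h[k] ** 2 / (sigma2 + interference)
    ber = np.array([q_func(sqrt(2.0 * s)) for s in sinr])
    return sinr, ber

# Illustrative ordered channel gains
sinr, ber = sic_sinr_ber([1.0, 0.8, 0.6, 0.4])
```

Note that the last user, once all the others have been cancelled, is interference-free and sees a pure SNR, which is why the SIC reliability chain hinges on the earlier decoding steps being correct.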
In the PIC receiver (see Figure 2), interference cancellation occurs in two stages. In the first stage, the information signal of each user is decoded collectively [13]. For example, in order to decode the $k$th user's signal, the other users' signals are decoded and regenerated in parallel, and then their sum is subtracted from the received signal. The stripped signal includes the signal of the $k$th user only, so that, in the second stage, it can be decoded using OFDM demodulation. Once the set of regenerated OFDM signals is obtained in the first stage, the signals of each user can easily be obtained. For PIC, the time domain OFDM signal for the $k$th user, $y_k(t)$, can be written as:

$$ y_k(t) = y(t) - \sqrt{P} \sum_{i=1, \, i \neq k}^{K} h_i \hat{x}_i(t), $$

where $\hat{x}_i(t)$ is the OFDM signal of the $i$th user regenerated in the first stage.
Derivation of a closed-form SINR for PIC is mathematically intractable, since it depends on the decoding errors in the first stage. The BER for the $k$th user per subcarrier for binary transmission can still be expressed as in [7,14,15], where $s$ is the number of stages in the PIC receiver (two in our case). Note that Equation (6) is obtained based on a similar approach as in multi-stage receivers in code division multiple access (CDMA) with unit processing gain [7]. In CDMA systems, however, each user has a signature code that helps to distinguish it. In power domain NOMA systems, on the other hand, these unique codes may not be present, and such systems can fail to support a large number of users, since the interference grows stronger as the number of users increases. Code domain NOMA can be considered in order to have comparable reliability for both SIC and PIC; however, the comparison of computational performance will be analogous to that in the power domain [16,17].
In both techniques, any decoding error that occurs in the intermediate stages propagates to the later stages of the receiver. Furthermore, both receivers rely on accurate power allocation among the users to ensure successful interference cancellation. These factors affect the reliability of the receivers (e.g., their bit error rate performance); however, they do not change the computational complexity. In this work, without loss of generality, equal power allocation among the users and power domain NOMA are considered.
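The two-stage PIC data flow described above can be sketched as follows. This is a minimal NumPy sketch (not the authors' implementation), using BPSK per subcarrier and an ideal channel for brevity; as noted above, without per-user signature codes the tentative first-stage decisions in the pure power domain are unreliable, so this sketch only illustrates the structure of the two stages, not their reliability.

```python
import numpy as np

def pic_two_stage(y, h, P, N):
    """Two-stage PIC sketch for K superimposed BPSK-OFDM users.

    Stage 1: tentatively decode every user directly from y in parallel
    (the other users act as noise) and regenerate its OFDM signal.
    Stage 2: for each user, subtract the sum of the other regenerated
    signals from y and demodulate the stripped signal.
    """
    K = len(h)
    # Stage 1: tentative per-user decisions on the subcarrier symbols
    Y = np.fft.fft(y) / np.sqrt(N)
    d_hat = np.sign(np.real(Y[None, :] / (np.sqrt(P) * h[:, None])))
    x_hat = np.fft.ifft(d_hat, axis=1) * np.sqrt(N)

    # Stage 2: strip the regenerated interference and re-demodulate
    decisions = np.empty((K, N))
    for k in range(K):
        others = np.sqrt(P) * (h[:, None] * x_hat).sum(axis=0) \
                 - np.sqrt(P) * h[k] * x_hat[k]
        y_k = y - others
        decisions[k] = np.sign(np.real(np.fft.fft(y_k) / np.sqrt(N)))
    return decisions

# Illustrative noiseless example with 3 users and 16 subcarriers
rng = np.random.default_rng(1)
h = np.array([1.0, 0.7, 0.4]); N = 16; P = 1.0
d = 1.0 - 2.0 * rng.integers(0, 2, size=(3, N)).astype(float)
x = np.fft.ifft(d, axis=1) * np.sqrt(N)
y = np.sqrt(P) * (h[:, None] * x).sum(axis=0)
decisions = pic_two_stage(y, h, P, N)
```

Unlike the SIC loop, the per-user work inside each stage here has no mutual data dependency, which is what makes PIC attractive for the GPU implementation discussed next.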
3. CUDA Implementation
The SIC and PIC techniques were implemented on both a central processing unit (CPU) and a GPU, and their computational speeds were then compared. The CUDA codes were compiled on a machine running Ubuntu OS with an NVIDIA TITAN Xp graphics card with 12 GB of memory, 3840 CUDA cores, and a clock rate of 1582 MHz. The NVCC compiler for the CUDA 9.2 platform and the GCC compiler for the C++ programming language were used. The CPU codes were run on an Intel Core i7 (four cores) 2.3-GHz machine with 16 GB of DDR3L 1600-MHz memory running Mac OS Mojave. The CPU results were obtained with MATLAB R2018b.
Figure 3 summarizes the functions and kernels used for the implementation of SIC and PIC on the GPU using CUDA. For the FFT and IFFT tasks, the cuFFT library functions with the forward and inverse parameters were called; these functions have $O(N \log N)$ computation time complexity [18]. One CUDA thread was assigned per subcarrier, and a CUDA block of our GPU allowed calling up to 1024 threads per block. For SIC, the chain of operations for each user included FFT, demodulation, modulation, and IFFT with phase/amplitude adjustment (see Figure 1). CUDA kernels were developed for the demodulation, modulation, and subtraction tasks. Due to the nature of the SIC receiver, the same chain of operations had to be repeated for each user and could not run in parallel. The overall SIC computational time on the GPU, however, can still be decreased by processing the demodulation, modulation, and subtraction operations one after the other per subcarrier, while computing the OFDM operations in parallel using the divide and conquer algorithm [19]. Therefore, a grid of $N/1024$ blocks, each having 1024 threads, was created so that the total number of parallel tasks for each kernel was equal to the number of subcarriers, i.e., $N$. These parallel tasks computed the demodulation, modulation, and subtraction operations, and this process constitutes one SIC iteration. The process was then repeated $K$ times in order to decode the signal of each user.
In the implementation of PIC on the GPU, all the functions and kernels per subcarrier of each user can be executed in parallel. In this case, the number of threads per block was set equal to the number of users $K$, and each thread handled the tasks per subcarrier for each user. The grid of blocks was then created with the number of blocks equal to the FFT size $N$, which made the total number of parallel tasks $N \times K$.
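The launch geometry described above can be summarized with a small sketch. The helper names below are ours, not from the paper's code; the 1024-thread limit is the per-block maximum stated above.

```python
MAX_THREADS_PER_BLOCK = 1024  # per-block thread limit of the GPU used above

def sic_launch(N):
    """SIC: one thread per subcarrier; the K user iterations stay serial."""
    threads = min(N, MAX_THREADS_PER_BLOCK)
    blocks = (N + threads - 1) // threads   # ceil(N / threads)
    return blocks, threads

def pic_launch(N, K):
    """PIC: N blocks of K threads -> N*K parallel subcarrier/user tasks."""
    assert K <= MAX_THREADS_PER_BLOCK
    return N, K

# Inside a kernel, the global task index would be computed as
#   idx = blockIdx.x * blockDim.x + threadIdx.x
blocks, threads = sic_launch(4096)   # grid of 4 blocks x 1024 threads
grid_n, grid_k = pic_launch(4096, 50)  # 4096 * 50 parallel tasks
```

The key contrast: the SIC grid parallelizes only across the $N$ subcarriers of one user iteration, while the PIC grid exposes all $N \times K$ subcarrier/user tasks at once.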
4. Numerical Results and Discussion
In this section, we first discuss the BER performances of PIC and SIC and then present their implementation on the GPU. Single-cell NOMA with different numbers of active users was considered. The $K$ users were distributed in the coverage area of the base station such that the first user was the closest and the $K$th user was the farthest, and the user locations were assumed to be fixed. The power received by the base station from the closest user, $P_1$, was set at −90 dBm, and the received power difference between consecutive users was 2 dB, i.e., $P_k - P_{k+1} = 2$ dB.
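The received-power setup above amounts to a simple dB ladder, sketched below (the helper name is ours; the −105 dBm noise floor corresponds to the 15 dB SNR case discussed next).

```python
# Received-power setup used in the experiments: the closest user is
# received at -90 dBm and consecutive users differ by 2 dB.
def received_powers_dbm(K, p1_dbm=-90.0, step_db=2.0):
    return [p1_dbm - step_db * k for k in range(K)]

p = received_powers_dbm(5)       # [-90, -92, -94, -96, -98] dBm
snr_db = p[0] - (-105.0)         # first user's SNR for a -105 dBm noise floor
```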
Figure 4 shows the BER of the first user versus the total number of users ($K$) for both PIC and SIC receivers. The results were obtained for different SNR levels of the first user. When the SNR was taken as 15 dB, the noise power was set at −105 dBm, and similarly, when the SNR was 20 dB, the noise power was −110 dBm. The results in Figure 4 show that, for a large number of users in a cell, an increase in the SNR had an insignificant impact on the BER performance of the SIC receiver. This is because the performance was mainly limited by the interference from the other users (see Equation (4)). For the PIC receiver, on the other hand, an increase in the SNR had a significant impact on the BER performance even for a high number of users. Here, it should be noted that the BER performance of the PIC receiver was obtained using Equation (6), which assumes that signature signals are present for each user, which helps in the detection.
Next, we discuss the implementation of the two interference cancellation methods on the GPU platform. Table 1 and Table 2 summarize the computation times obtained on both GPU and CPU for the two interference cancellation techniques. The results with GPU include only the execution times of SIC and PIC; in other words, the time spent on communication between GPU and CPU to copy global variables back and forth was neglected. MATLAB functions used for the FFT and IFFT computations used multiple CPU cores, while the rest of the computations, including loops, were not parallelized. The OFDM size was taken as 2048 in Table 1 and 4096 in Table 2, and quadrature phase shift keying (QPSK) with maximum likelihood (ML) decoding was considered [20,21,22]. A typical NOMA cell is expected to support about 50 users [16]; however, this number can be increased by clustering, grouping, or using multiple input multiple output (MIMO) techniques [23,24]. In order to observe the trend with a large number of users, the number of users was varied from 50 to 350.
The time spent on the CPU varied sharply between the SIC and PIC schemes for both OFDM FFT sizes. PIC performed the summation (see Figure 2) and also iterated the FFT computation and demodulation tasks in addition to the same iterated tasks as SIC. Consequently, executing PIC took about twice as long as SIC on the CPU in both tables. It was also observed that the FFT size had a marginal effect on the SIC execution time on the GPU platform: it took from about 69 ms for 50 users to 515 ms for 350 users. On the one hand, the SIC scheme was executed more slowly on the GPU than on the CPU; on the other hand, it was faster than PIC on the CPU. SIC ran more slowly on the GPU because the process is iterative and depends on the frequency (clock rate) of each called core rather than on the number of cores. As mentioned earlier, our CPU had four cores at 2.3 GHz each, while the frequency of our GPU cores was 1.5 GHz (1582 MHz). Moreover, PIC on the CPU with an FFT size of 4096 was the slowest of all the results, due to the serial approach of the CPU in running the scheme and the FFT computation: 50 users could be decoded in about 88 ms, while the time spent to decode 350 users reached an impractical 616 ms.
As seen in Table 1 and Table 2, the SIC execution time on the CPU started from only about 28 ms and 47 ms for 50 users and gradually increased to 157 ms and 330 ms for 350 users in a cell, for FFT sizes of 2048 and 4096, respectively. PIC on the GPU was the fastest: its execution time was almost 90- and 138-times faster than on the CPU for the two OFDM FFT sizes and was less sensitive to the number of users. It took only 2.54 ms for 50 users and 6.88 ms for 350 users with an FFT size of 4096, and approximately 2 ms for any number of users with an FFT size of 2048. Furthermore, it was observed that PIC was nearly 75-times faster than SIC on the GPU for a large number of users with a 4096 FFT size and 220-times faster with a 2048 FFT size. This was because SIC had mutual data dependency between users across execution iterations and had to be executed serially on both the CPU and GPU.