This section presents our proposed baseband modulator architecture. First, the modulators for each selected 5G waveform are described; then, the top-level reconfigurable architecture is presented. Our baseband modulator was implemented on a cost-optimized Zynq xc7z020 device.
3.1. Baseband Modulation for 5G Waveform Candidates
The waveforms supported by the baseband modulator are OFDM, FBMC, and UFMC. These multi-carrier waveforms efficiently perform waveform synthesis using the Inverse Fast Fourier Transform (IFFT) operation. The differences between the selected waveforms are mainly related to the techniques adopted for time-domain windowing (pulse-shaping in the frequency domain) and/or time-domain filtering (equivalent to frequency domain windowing).
OFDM is the most prominent waveform in current wireless communications, and it is characterized by the orthogonality between subcarriers, which eliminates inter-carrier interference. Every OFDM symbol is prepended with a Cyclic Prefix (CP), which mitigates inter-symbol interference, but contributes to the degradation of spectral efficiency. Currently, 4G LTE systems improve the frequency response of Cyclic Prefix OFDM (CP-OFDM) by applying time-domain windowing of the CP-extended OFDM symbols and overlapping the edge transition of adjacent symbols: Weighted Overlap and Add (WOLA). The OFDM modulator implemented here follows a CP-OFDM with WOLA approach, and its datapath structure is shown in
Figure 1. The main baseband parameters involved in this waveform are the IFFT size, which is equivalent to the number of subcarriers per OFDM symbol, the CP length, and the number of time-domain samples used for WOLA.
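As an illustration of the CP-OFDM with WOLA datapath described above, the following Python sketch builds windowed symbols and overlaps their edges. It is a minimal floating-point model, not the implemented hardware: the names `n_cp` and `n_w`, the raised-cosine ramp, and the placement of the window (leading ramp inside the CP, trailing ramp on a cyclic suffix of `n_w` samples) are assumptions made here for illustration, and a naive inverse DFT stands in for the IFFT core.

```python
import cmath
import math

def idft(bins):
    """Naive inverse DFT, used here as a stand-in for the IFFT core."""
    n = len(bins)
    return [sum(bins[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def wola_symbol(bins, n_cp, n_w):
    """Return one windowed symbol of length N + n_cp + n_w (assumes n_w <= n_cp)."""
    x = idft(bins)
    n = len(x)
    # Cyclic extension: prefix of n_cp samples and suffix of n_w samples.
    ext = x[n - n_cp:] + x + x[:n_w]
    # Raised-cosine ramp (assumed window shape); ramp[i] + ramp[n_w - 1 - i] == 1.
    ramp = [0.5 * (1.0 - math.cos(math.pi * (i + 0.5) / n_w)) for i in range(n_w)]
    for i in range(n_w):
        ext[i] *= ramp[i]          # rising edge, inside the CP
        ext[-1 - i] *= ramp[i]     # falling edge, on the cyclic suffix
    return ext

def overlap_add(symbols, n_w):
    """Overlap the falling edge of each symbol with the rising edge of the next."""
    out = list(symbols[0])
    for sym in symbols[1:]:
        for i in range(n_w):
            out[len(out) - n_w + i] += sym[i]
        out.extend(sym[n_w:])
    return out
```

With this arrangement, each symbol still occupies N + `n_cp` samples in the output stream, since only the `n_w` edge samples of adjacent symbols overlap.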
Because each OFDM subcarrier has a sinc-shaped spectrum, OFDM does not provide genuinely band-limited signals, and the high side-lobe power levels can cause unwanted interference with adjacent spectrum bands. FBMC achieves better spectral containment by filtering each subcarrier individually. This eliminates the need for a guard interval such as the cyclic prefix and contributes to a higher spectral efficiency. Quite often, the improved spectral efficiency of FBMC systems comes at the cost of relaxed signal orthogonality. In these cases, Offset QAM (OQAM) is employed to ensure real-part orthogonality of FBMC symbols: OQAM-FBMC. In this work, FBMC modulation (
Figure 2) follows the approach from [
30], where frequency spreading is applied before the IFFT. The frequency spreading operation, which is characterized by the overlapping factor
, comprises an up-sampler module and an FIR filter with
non-zero coefficients. Due to the frequency spreading operation, the waveform synthesis is performed with an IFFT of size
, where
is the number of subcarriers. Finally,
IFFT output blocks are overlapped and added to create an FBMC output multi-carrier symbol.
UFMC is a waveform with better out-of-band suppression than CP-OFDM and a better multi-antenna compatibility than FBMC [
28]. This multi-carrier scheme divides the
available subcarriers that represent the whole frequency band into blocks of subcarriers—Physical Resource Blocks (PRB)—that represent individual sub-bands. Usually, only a part of the PRBs is used for transmission: active PRBs. Then, for each active PRB, IFFT and bandpass
-order FIR filtering are performed. The same filter can be applied to all sub-bands, but its center frequency must be shifted. At the end, the filtered sub-bands are superimposed to form the UFMC symbol to be transmitted. The classic UFMC modulation scheme from [
31] considers an
-point IFFT and frequency-shifted FIR filters with complex coefficients for each sub-band. To counteract this increased computational complexity, Knopp et al. [
29] proposed an algorithm that combines a reduced size
-point IFFT with
upsampling and keeps real-valued coefficient FIR filters by performing frequency shifting after filtering (
Figure 3).
The transition from 4G to 5G will not be as abrupt as in previous generations. Instead, 5G should enable the coexistence and tight interworking between different radio access technologies in order to facilitate the gradual penetration of 5G systems [
2]. Thus, the waveform numerologies adopted in our work are based on the 4G LTE standards. In particular, OFDM Modes 1 and 2 correspond to the LTE 5 MHz and 10 MHz channelizations, respectively. Like [
18], we assume that a primary user communicates using OFDM and that secondary users opportunistically transmit using OFDM, FBMC, or UFMC. Thus, the numerologies for FBMC and UFMC should be compatible with the OFDM numerologies.
Table 1 presents the modes of operation and numerologies supported in this work, and
Figure 4 depicts the combination of periodograms for Mode 1 OFDM, FBMC, and UFMC baseband signals, in what would be a scenario where these waveforms coexist by sharing a portion of the spectrum band. In all cases, the 16-QAM constellation scheme was used for digital modulation.
Regarding the hardware implementations for the baseband modulator datapaths, we will describe the UFMC modulator design in more detail, and for OFDM and FBMC, we refer to our previous works [
22,
26]. The implemented modulator datapaths have AXI4-Stream-compatible input/output data interfaces, and all arithmetic operations are done in fixed-point precision, considering real and imaginary parts represented in the Q5.11 format.
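To make the Q5.11 representation concrete, the sketch below quantizes real values to 16-bit fixed-point integers and multiplies them. It assumes that Q5.11 denotes a signed 16-bit word with 5 integer bits (sign included) and 11 fractional bits, and that out-of-range values saturate; the rounding and saturation policies are illustrative assumptions, not details taken from the implemented datapaths.

```python
FRAC_BITS = 11   # Q5.11: 11 fractional bits
WORD_BITS = 16   # assumed: 5 integer bits (sign included) + 11 fractional bits

Q_MIN = -(1 << (WORD_BITS - 1))
Q_MAX = (1 << (WORD_BITS - 1)) - 1

def saturate(raw):
    """Clamp an integer to the signed 16-bit range."""
    return max(Q_MIN, min(Q_MAX, raw))

def to_q5_11(value):
    """Quantize a real number to a Q5.11 integer (round to nearest, saturate)."""
    return saturate(round(value * (1 << FRAC_BITS)))

def from_q5_11(raw):
    """Convert a Q5.11 integer back to a real number."""
    return raw / (1 << FRAC_BITS)

def q_mul(a, b):
    """Multiply two Q5.11 integers; the product is rescaled by a truncating shift."""
    return saturate((a * b) >> FRAC_BITS)
```

Under these assumptions, the representable range is [-16, 16) with a resolution of 2^-11, and a complex sample is simply a pair of such words.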
The UFMC modulator architecture follows the algorithm description from [
29], also illustrated in
Figure 3. The first module of a sub-band branch is the QAM mapper. For the 16-QAM case, the module is simply implemented with a 16:1 multiplexer: a four-bit input signal selects the corresponding complex value out of the 16 values that compose the constellation. The subcarrier mapping module maps the 12 PRB subcarriers to the central bins of a 64-element array and zeroes the remaining 52 bins. This module comprises a double buffer and read/write control engines. The double buffer is implemented using dual-port block RAMs embedded in the FPGA logic fabric and allows for simultaneous read and write of consecutive 64-element arrays without causing any data conflicts.
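The QAM mapper and the subcarrier mapping can be modelled in a few lines. The sketch below is only an illustration: the Gray-coded constellation table and the exact bin range (indices 26 to 37 of a 64-element array, following the array profile given later in this section) are assumptions, since the constellation labelling and bin indices are not specified here.

```python
# Assumed Gray-coded amplitude levels for each 2-bit group: 00, 01, 10, 11.
LEVELS = [-3, -1, 3, 1]

# 16-entry look-up table, playing the role of the 16:1 multiplexer.
CONSTELLATION = [complex(LEVELS[s >> 2], LEVELS[s & 3]) for s in range(16)]

def qam16(nibble):
    """Select one of the 16 constellation points with a four-bit input."""
    return CONSTELLATION[nibble & 0xF]

def map_prb(symbols, n=64):
    """Place one PRB (12 subcarriers) on the central bins of an n-element array."""
    if len(symbols) != 12:
        raise ValueError("a PRB carries exactly 12 subcarriers")
    grid = [0j] * n
    start = (n - 12) // 2            # assumed placement: bins 26..37 for n = 64
    grid[start:start + 12] = symbols
    return grid
```

For example, `map_prb([qam16(b) for b in range(12)])` returns a 64-element array with 12 non-zero bins surrounded by 26 zeros on each side.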
The IFFT computation involves complex arithmetic, and it is replicated for each sub-band processing branch. Consequently, the design choice for the IFFT module should consider a balanced trade-off between performance and resource usage. There are two dominant categories of IFFT/FFT architectures: pipelined and memory-based. In pipelined architectures, the IFFT datapath is tightly synchronized in time and can simultaneously execute transform calculations on the current data frame, load the next input data frame, and unload the results from the previous data frame. This allows for the continuous flow of data along the datapath at the cost of a higher resource utilization. In turn, memory-based architectures are characterized by an iterative processing nature, and the input data loading and results unloading operations cannot occur simultaneously with transform processing. Compared with pipelined architectures, memory-based architectures consume less circuit area/resources, but provide a lower performance.
Although pipelined IFFT modules were adopted in the OFDM and FBMC modulators, we chose a memory-based approach to design the IFFT modules in the UFMC modulator, motivated by three aspects. First, UFMC is a preferred waveform for short-burst transmissions. The iterative nature of memory-based architectures is well adapted to this scenario, where continuous data-stream processing is not a priority requirement. Second, the lower resource utilization of memory-based architectures allows for a more scalable replication of IFFT cores per sub-band branch. Third, memory-based architectures allow for the application of pruning algorithms [
32]. Before the IFFT processing starts, the locations of the non-zero values within the 64-element input array are known. This can be used to prune arithmetic operations on zero values and thus reduce the transform processing time. The implemented IFFT architecture follows a Decimation-In-Frequency (DIF) Radix-2 algorithm, where the processing of an $N$-point IFFT is divided into $\log_2 N$ processing stages of $N/2$ processing steps. The processing steps are executed by a butterfly unit that takes two input values and produces two results: (1) the sum of the two input values; and (2) the difference of the two input values multiplied by a complex twiddle factor.
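The stage and step structure of the DIF Radix-2 algorithm can be sketched in software as follows. This is a floating-point reference model written for illustration, not the memory-based hardware architecture; the function name and the final 1/N scaling are choices made here.

```python
import cmath

def ifft_dif_radix2(x):
    """Radix-2 DIF inverse FFT; returns the result in natural order."""
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(x)
    span = n // 2
    while span >= 1:                      # one iteration per processing stage
        for start in range(0, n, 2 * span):
            for j in range(span):         # n/2 butterflies (steps) per stage in total
                a = data[start + j]
                b = data[start + j + span]
                w = cmath.exp(2j * cmath.pi * j / (2 * span))  # inverse-transform twiddle
                data[start + j] = a + b                # (1) sum
                data[start + j + span] = (a - b) * w   # (2) difference times twiddle
        span //= 2
    # DIF leaves the results in bit-reversed order, so reorder and scale by 1/n.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2)
        out[rev] = data[i] / n
    return out
```

For a 64-point transform, the outer loop runs 6 times and each pass executes 32 butterflies, matching the stage and step counts of the Radix-2 decomposition.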
The memory-based IFFT architecture implemented is depicted in
Figure 5, and its main constituent elements are: a control engine, a Radix-2 butterfly unit, two
-element memory banks (M0 and M1) and a ROM memory used to control IFFT pruning (pruning ROM). Due to the DIF algorithm employed, IFFT results are not produced in natural order. Therefore, a reordering unit is attached to deliver IFFT output results in natural order to the subsequent datapath modules. The operation of the IFFT module can be divided into two phases: load input/unload results and process transform. During the load input/unload results phase, the control engine issues read/write operations on M0 and M1 to fill the memory banks with the incoming data samples, while forwarding the results from previous transform processing to the reordering unit. The process transform phase corresponds to the execution of the processing steps of each Radix-2 IFFT processing stage. The control unit fetches values from M0 and M1 to the butterfly unit that performs a processing step. Then, the results are stored back in M0 and M1. In this architecture, the control engine uses a binary counter to generate all the signals to control the butterfly unit and memory bank addressing. We adopted the butterfly structure and address generation scheme from [
33] and further extended the architecture to support IFFT pruning. The complex multiplier used for twiddle factor multiplications was implemented with three real multipliers, one adder, and two subtractors.
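One way to obtain a complex multiplier with three real multipliers, one adder, and two subtractors is sketched below. The specific factorization, and the assumption that the sum and difference of the twiddle-factor components are precomputed and stored alongside the twiddle factors, are illustrative choices; this section does not state which factorization the implemented design uses.

```python
def cmul3(a, b, c, d, c_plus_d=None, c_minus_d=None):
    """Compute (a + jb) * (c + jd) with three real multiplications.

    c_plus_d and c_minus_d are assumed to be precomputed (for instance, read
    from the twiddle ROM); they are derived here only when not supplied.
    """
    if c_plus_d is None:
        c_plus_d = c + d
    if c_minus_d is None:
        c_minus_d = c - d
    k1 = c * (a + b)        # the single run-time adder feeds this multiplier
    k2 = b * c_plus_d
    k3 = a * c_minus_d
    real = k1 - k2          # first subtractor:  ac - bd
    imag = k1 - k3          # second subtractor: ad + bc
    return real, imag
```

For example, `cmul3(1, 2, 3, 4)` returns `(-5, 10)`, which matches (1 + 2j)(3 + 4j) = -5 + 10j.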
The profile of the IFFT input data array is known in advance: 12 subcarriers are mapped to the central bins of a 64-element array, and the remaining 52 elements are zero. Following the DIF Radix-2 algorithm, it is possible to pre-determine the processing steps that need to be executed and those that can be pruned. The pruning ROM contains information about the number of processing stages where pruning occurs—pruning stages—and for each of these stages, it provides the number of processing steps to be executed, as well as their corresponding control binary counter values to be used by the control engine. For the pruning stages, the control engine fetches the binary counter values from the pruning ROM. When the end of the pruning ROM is reached, the control engine knows that there are no more pruning stages and, thus, internally generates the binary counter values simply by incrementing it from 0 to
. In our case, there are two pruning stages comprising 12 and 24 processing steps, respectively. Therefore, the pruning ROM is made of 39 words: one to indicate the number of pruning stages, two to indicate the number of processing steps of each pruning stage, and the $12 + 24 = 36$ binary counter values, one for each processing step. The binary counter word length is eight bits (
, with
), as indicated in [
33]. As in [
29], the
-point IFFT is followed by upsampling. The upsampler introduces
zeros between consecutive IFFT output samples, and its implementation consists of an FSM alternating between output data and output zero states.
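The pruning figures quoted above can be reproduced with a short simulation that tracks which array positions can hold non-zero values at each DIF stage. In this sketch, the 12 subcarriers are assumed to occupy bins 26 to 37 of the 64-element array, and each butterfly is identified by its pair of memory indices rather than by the binary counter values used by the actual control engine.

```python
def pruning_schedule(nonzero, n=64):
    """For each DIF stage, list the butterflies that have a non-zero input."""
    active = [i in nonzero for i in range(n)]
    schedule = []
    span = n // 2
    while span >= 1:
        steps = []
        nxt = list(active)
        for start in range(0, n, 2 * span):
            for j in range(span):
                lo, hi = start + j, start + j + span
                if active[lo] or active[hi]:
                    steps.append((lo, hi))
                    nxt[lo] = nxt[hi] = True   # both outputs may become non-zero
        schedule.append(steps)
        active = nxt
        span //= 2
    return schedule

schedule = pruning_schedule(set(range(26, 38)))
steps_per_stage = [len(s) for s in schedule]
pruned = [c for c in steps_per_stage if c < 32]    # stages with skipped butterflies
rom_words = 1 + len(pruned) + sum(pruned)          # stage count, step counts, step entries
```

Under these assumptions, only the first two stages are pruned (12 and 24 butterflies instead of 32), and the resulting ROM size is 39 words, in agreement with the text.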
Bandpass FIR filtering for each sub-band is carried out by a Dolph–Chebyshev filter with a filter length $L$ equal to the LTE CP length plus one ($L = N_{CP} + 1$). A FIR filter architecture with a transposed structure was adopted because, unlike the direct-form structure, it requires neither an extra input shift register nor a tree of pipelined adders to achieve high throughput. For the UFMC numerologies from
Table 1, the filter lengths are odd, and the coefficients are symmetric with a single center coefficient equal to one. The multiplications by the center coefficient can be ignored, as they do not affect the input value. However, the remaining $L - 1$ coefficients imply non-trivial multiplications. The number of non-trivial multiplications per FIR filter can be halved to $(L - 1)/2$ by exploiting the coefficient symmetry. As the sub-band signal to be filtered is complex-valued, both real and imaginary parts have to be filtered. Therefore, for each sub-band branch, we have two FIR filters that together require $L - 1$ non-trivial multiplications.
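The transposed structure and the sharing of products between mirrored coefficients can be illustrated with the following behavioural model. It assumes an odd-length symmetric filter whose center coefficient equals one, as stated above; the function name and the floating-point arithmetic are choices made for this sketch.

```python
def fir_transposed_symmetric(samples, coeffs):
    """Transposed-form FIR for odd-length symmetric coefficients with unit center tap."""
    taps = len(coeffs)
    if taps % 2 == 0 or coeffs != coeffs[::-1] or coeffs[taps // 2] != 1:
        raise ValueError("expects odd length, symmetric taps and unit center")
    regs = [0.0] * (taps - 1)        # registers between the adders of the output chain
    out = []
    for x in samples:
        # (taps - 1) / 2 non-trivial products, each shared by two mirrored taps.
        half = [c * x for c in coeffs[:taps // 2]]
        products = half + [x] + half[::-1]    # center tap needs no multiplier
        out.append(products[0] + regs[0])
        for i in range(taps - 2):
            regs[i] = products[i + 1] + regs[i + 1]   # regs[i + 1] still holds its old value
        regs[-1] = products[-1]
    return out
```

Each input sample is broadcast to all multipliers at once, so no input shift register is needed, and the adders are separated by registers, which keeps the critical path to one multiplication and one addition.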
In Xilinx FPGAs, non-trivial multiplications can be efficiently executed by DSP blocks, which are embedded into the logic fabric in a column arrangement. Cost-optimized devices have a smaller number of DSP blocks, and their utilization should be carefully managed. For instance, the xc7z020 device has 220 DSP blocks, while the total number of multiplications for FIR filtering in all three UFMC sub-bands (three branches with two FIR filters each) is 108 for Mode 1 and 216 for Mode 2. The high DSP utilization and its sparse distribution within the logic fabric degrade the scalability of the UFMC modulator. Moreover, it also hampers the place-and-route tasks performed by EDA tools, affecting overall timing closure.
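The multiplication totals can be checked with a short calculation. The filter lengths of 37 and 73 taps used below are inferred here from the stated totals (they are consistent with LTE CP lengths of 36 and 72 samples) and are not quoted from Table 1, so they should be read as assumptions.

```python
DSP_BLOCKS_XC7Z020 = 220

def fir_multiplications(filter_length, sub_bands=3):
    """Non-trivial multiplications for all sub-bands (real and imaginary FIR filters)."""
    per_filter = (filter_length - 1) // 2   # symmetric taps, unit center coefficient
    return sub_bands * 2 * per_filter

mode1 = fir_multiplications(37)   # assumed Mode 1 filter length
mode2 = fir_multiplications(73)   # assumed Mode 2 filter length
share1 = mode1 / DSP_BLOCKS_XC7Z020
share2 = mode2 / DSP_BLOCKS_XC7Z020
```

With these lengths, a DSP-based implementation of the FIR filters alone would occupy roughly half of the available DSP blocks in Mode 1 and nearly all of them in Mode 2.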
In these circumstances, we adopted a multiplier-less architecture for FIR filtering where FIR coefficients in the Q1.5 format are represented using the Canonic Signed Digit (CSD) system with minimum non-zero digits. Multipliers are then substituted by shift-and-add graphs. As an example, for a coefficient equal to 0.90625, we have:
$$0.90625 = 0.11101_2 = 1.00\bar{1}01_{\mathrm{CSD}} = 2^{0} - 2^{-3} + 2^{-5}$$
Figure 6 illustrates the shift-and-add graph to implement the multiplication by this coefficient. This filter design eliminates the use of DSP blocks, but increases slice utilization. Yet, slices are the predominant resource type in the FPGA logic fabric (13,300 slices in the xc7z020 device), which makes our approach viable. After FIR filtering, it is necessary to shift the sub-band signal to the corresponding frequency band. The frequency shift module for each sub-band has a ROM memory to store the complex exponential values and a complex multiplier similar to the one used in the IFFT module. Thus, the overall DSP block utilization in the UFMC modulator consists of three DSPs in the IFFT and three DSPs in the frequency shift module per sub-band branch. Finally, the filtered sub-band responses are summed to create the aggregate UFMC signal.
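The CSD recoding and the resulting shift-and-add multiplication can be sketched as follows. The recoding routine is the standard non-adjacent-form algorithm, and the helper names are chosen here for illustration; the right shifts truncate, which is one possible hardware behaviour and is assumed rather than taken from the implemented design.

```python
def to_csd(value, frac_bits=5):
    """Return the CSD digits of value as {power of two: +1 or -1}."""
    n = round(value * (1 << frac_bits))
    digits = {}
    exp = -frac_bits
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)      # +1 when n % 4 == 1, -1 when n % 4 == 3
            digits[exp] = d
            n -= d
        n >>= 1
        exp += 1
    return digits

def shift_add_multiply(x, digits):
    """Multiply the integer x by a CSD coefficient using shifts, adds and subtracts."""
    acc = 0
    for exp, d in digits.items():
        term = x << exp if exp >= 0 else x >> -exp   # truncating right shift
        acc += d * term
    return acc

coeff = to_csd(0.90625)                   # {-5: 1, -3: -1, 0: 1}
product = shift_add_multiply(2048, coeff) # 2048 is 1.0 in Q5.11
```

The coefficient 0.90625 therefore needs one subtraction and one addition instead of a multiplier, and multiplying the Q5.11 value 1.0 (2048) yields 1856, which is 0.90625 in Q5.11.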
3.2. Top-Level Architecture
From a top-level perspective, our design (
Figure 7) makes use of the hybrid (HW/SW) nature of the Xilinx Zynq architecture, which contains two sections: the Processing System (PS) and the Programmable Logic (PL). The PS comprises an ARM Cortex-A9 processor and a controller for the 512 MB DDR memory. The ARM core manages and triggers reconfiguration procedures and sets up data transfers between the DDR memory and the PL. The PL section is then divided into three domains: the baseband processing domain, the DFS domain, and the DPR domain.
The baseband processing domain contains the Reconfigurable Partitions (RPs) that can be dynamically reconfigured to implement different 5G waveform modulators. In this design, three independent RPs are considered: the first RP implements OFDM modulation modes only; the second RP implements FBMC and OFDM modulation modes; and the third RP implements UFMC and OFDM modulation modes. An alternative system partitioning strategy could assign an RP to each block in the datapaths, or identify hardware modules common to all configurations and keep them in the static part of the system (outside the RPs). However, a finer reconfiguration granularity would increase the number of partial bitstreams to store. Implementing each whole modulator in a single RP also permits the global place-and-route optimization of the processing chain, contributing to an overall smaller reconfigurable area and, consequently, smaller reconfiguration latencies.
When the system is started up, input data files are downloaded from an SD card to the DDR memory. Then, the operation cycle of the baseband processing domain consists of the following steps: (1) fetch input data from the DDR memory and feed it to the baseband modulator(s); (2) perform baseband modulation and send the modulated data back to the DDR memory. In a real application, the reconfigurable baseband processor implemented in this work would be integrated with all-digital transceivers such as the ones proposed in [
34]. The DMA controller alleviates the PS load related to data transfers to/from the DDR and thus improves baseband processing throughput.
The baseband modulators were designed to run at a clock frequency of 100 MHz. However, through DFS, the clock frequency can be changed at run-time in order to adapt the system to different throughput requirements or power consumption constraints. The DFS implementation follows an approach similar to [
23], comprising a Mixed-Mode Clock Manager (MMCM) primitive and a DFS controller engine. The MMCM provides access to the Dynamic Reconfiguration Port (DRP), which allows configuration bits to be written to change the MMCM output clocks at run-time. A 100 MHz input reference clock signal provided by the PS (FCLK0) is used by the MMCM to generate an output clock signal for baseband processing purposes. Four MMCM output clock modes are considered: 100 MHz
,
,
, and
. The 100 MHz clock frequency was the reference frequency for implementation, while the other values were based on the
scaling of the LTE sampling frequency proposed for 5G systems [
8]. Applying it to the sampling frequency for the 10 MHz LTE channelization (15.36 MHz), we have
and
. To change the baseband processing clock frequency, the PS defines the MMCM output clock mode through the DFS controller mode port and writes ‘1’ to the en port. After the locked signal becomes active, the baseband modulators are ready for processing.
The DPR implementation provides an infrastructure to access the FPGA configuration memory. In real-time scenarios like wireless communications, it is necessary to reduce the reconfiguration latency because if the system takes too long to reconfigure, quality-of-service is degraded. Moreover, there is also an energy consumption overhead during the reconfiguration, and shorter reconfiguration latencies are crucial to reduce it. The DPR latency depends on the size of the partial bitstreams and on the configuration port bandwidth.
The configuration interface adopted in this work is the ICAP. This high-bandwidth internal interface permits the FPGA to reconfigure itself. Xilinx sets the maximum ICAP bandwidth at 400 MB/s, for a 100 MHz clock frequency and 32-bit data width [35]. Nevertheless, the ICAP can be overclocked to further enhance the reconfiguration throughput [36]. In the present work, the ICAP is overclocked at 200 MHz, using another clock signal (FCLK1) provided by the PS. Like the input data files, the partial bitstreams for all modulator and demodulator configurations are loaded from an SD card to the DDR memory upon system start-up. To take advantage of ICAP overclocking, a dedicated DMA controller is used to accelerate the partial bitstream transfer to the ICAP.
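The dependence of the reconfiguration latency on the bitstream size and on the configuration port bandwidth can be estimated with a simple calculation. The sketch below gives the theoretical peak only, assuming one 32-bit word is written to the ICAP per clock cycle; the 1 MB partial bitstream size is a hypothetical value chosen for illustration, not a result from this work.

```python
def icap_bandwidth(clock_hz, width_bits=32):
    """Peak ICAP bandwidth in bytes per second (one word per clock cycle assumed)."""
    return clock_hz * width_bits // 8

def reconfig_latency_ms(bitstream_bytes, clock_hz):
    """Lower bound on the reconfiguration latency, in milliseconds."""
    return 1e3 * bitstream_bytes / icap_bandwidth(clock_hz)

HYPOTHETICAL_BITSTREAM_BYTES = 1_000_000   # illustrative size only

nominal = reconfig_latency_ms(HYPOTHETICAL_BITSTREAM_BYTES, 100_000_000)
overclocked = reconfig_latency_ms(HYPOTHETICAL_BITSTREAM_BYTES, 200_000_000)
```

Doubling the ICAP clock from 100 MHz to 200 MHz halves this lower bound, provided that the DMA transfer from the DDR memory can sustain the higher rate.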