A Parallel Timing Synchronization Structure in Real-Time High Transmission Capacity Wireless Communication Systems

In the past few years, parallel digital signal processing (PDSP) architectures have been intensively studied to meet the growing demand for channel capacity in coherent optical communication systems. However, to our knowledge, real-time timing synchronization in such architectures has not yet been implemented on a Field Programmable Gate Array (FPGA). In this article, a parallel timing synchronization architecture is proposed. The architecture adopts a parallel First In First Out (FIFO) structure based on an index-associated rearranging method and a dual feedback loop based on Gardner's algorithm. Thanks to the FIFO structure, 67% of the Look Up Table (LUT) resources are saved in comparison with earlier results, and the Numerically Controlled Oscillator (NCO) is improved to meet the FPGA timing requirements for real-time performance. MATLAB simulations are run to evaluate the Bit Error Rate (BER) deterioration of the architecture. The floating-point and fixed-point simulation results show that the BER deteriorations are less than 0.5 dB and 1 dB, respectively. Further, the architecture is implemented on a Xilinx XC7VX485T FPGA chip, and a 20 gigabit per second (Gbps) 16 Quadrature Amplitude Modulation (16QAM) real-time system is achieved at a system clock of 159.524 MHz. This work opens a new pathway to improving the transmission capacity of real-time wireless communication systems.


Introduction
In modern society, the demand for transmission capacity in wireless communication systems is continuously increasing. According to the Cisco Visual Networking Index (VNI) published in 2017, global mobile data traffic will rise to 49 exabytes per month by 2021, up from only 7.2 exabytes per month at the end of 2016 [1]. To keep up with this trend, high-speed parallel digital signal processing (PDSP) technologies [2] have become prevalent in wireless communication systems in recent years.
In 2013, the Institute of Electronics, Microelectronics and Nanotechnology (IEMN) demonstrated an offline photoelectronic system [3] with a communication rate of 8.2 gigabits per second (Gbps). In 2012, the Fraunhofer Institute for Applied Solid State Physics (IAF) realized an offline electronic system [4] with a 24 Gbps rate, and in 2014 the Nippon Telegraph and Telephone Public Corporation (NTT) of Japan achieved an offline bit rate of about 50 Gbps [5]. Despite their high complexity, PDSP architectures have attracted considerable research interest, albeit offline. In optical fiber communication, recent literature has reported systems based on PDSP architectures [6,7]. However, to the best of our knowledge, only offline (non-real-time) PDSP architectures [8] have been realized so far.
Offline systems can be applied in many areas, such as High Definition (HD) movies. However, many applications require real-time operation. For instance, an HD video call system has a large amount of data to process. If the baseband digital signal processing runs offline, the user on the receiver side has to wait a rather long time before getting the information sent by the user on the transmitter side. To solve this problem, online (real-time) communication systems are becoming increasingly prevalent.
In real-time communication systems, high-order Quadrature Amplitude Modulation (QAM) [9,10] is regarded as the key to improving spectrum efficiency. High-order QAM systems, however, always require quite complicated demodulator architectures, and hardware complexity and resource limitations are the main challenges in implementing such a large architecture in real time. Field Programmable Gate Array (FPGA)-based [2] PDSP architectures are considered the key to solving this problem, and novel architectures continue to emerge. Baseband PDSPs [11][12][13] are necessary in these demodulator architectures, and parallel timing synchronization is essential in baseband PDSPs.
In this article, an improved two-times-oversampling parallel timing synchronization architecture aimed at real-time performance is proposed and then implemented on a Xilinx XC7VX485T FPGA chip. The key technology is PDSP on FPGA, which greatly reduces the system clock frequency and makes it feasible to achieve real-time performance with existing hardware devices. Specifically, a parallel First In First Out (FIFO) structure based on an index-associated rearranging method and a dual feedback loop based on the Gardner algorithm are adopted in our parallel architecture.
The rest of this article is organized as follows. Section 2 describes several shortcomings of two existing parallel structures. The improved parallel structure and its FPGA implementation are presented in Section 3. Section 4 presents the simulation and implementation results. Finally, conclusions are drawn in Section 5.

Shortages of Existing Parallel Architectures
As discussed in Section 1, many offline communication systems with high transmission rates have been developed. Nonetheless, improving the communication rate of real-time systems is still a major open problem, especially in baseband digital signal processing.
Most parallel structures are derived from serial algorithms. The Gardner [14][15][16] and Oerder and Meyr (O&M) [17] algorithms are the two most commonly used serial timing synchronization algorithms. A parallel Gardner architecture has been developed in [8,18,19] for a two-times-oversampling coherent optical system. Nevertheless, this architecture could not achieve real-time performance, mainly because its Numerically Controlled Oscillator (NCO) structure could not meet FPGA timing requirements. Besides, the loop filter discards the information provided by some of the parallel error detectors, which leads to a rather large deviation in the recovered signals. To achieve real-time performance, Lin and his collaborators successfully implemented parallel O&M on an FPGA in [20]. However, its requirement of four-times oversampling costs an extremely large amount of hardware resources and needs higher-speed Analog to Digital Converters (ADCs) in parallel systems.

Non-Real Time
Zhou and her collaborators have explained their parallel structure in [8]. However, their digital signal processing (DSP) structure was not implemented on any existing hardware, although it was verified offline on the MATLAB platform. The main factor limiting real-time performance is that information loss occurs in their structure once the timing error accumulates to a certain amount.
It is also very difficult to achieve real-time performance on Digital Signal Processors (DSPs). Take one of the highest-performance DSPs, the Texas Instruments (TI) C66x series, for example. The highest frequency at which a C66x DSP can work is only 1.2 GHz, which makes it impossible to implement a real-time communication system at a rate over 20 Gbps.

High Hardware Resource Consumption
From [17], it is easy to find out that a structure based on the O&M algorithm will need at least four times the oversampling rate, because the O&M based parallel error detector in [20] needs four sampled points to get one timing error.

Improved Parallel Architecture
The architecture of the serial Gardner algorithm can be found in [14,15]. The improvements for the parallel architecture are presented in this section. The proposed parallel architecture adopts a parallel First In First Out (FIFO) structure based on an index-associated rearranging method and a dual feedback loop based on Gardner's algorithm. The overall architecture is depicted in Figure 1. To generate parallel digital signals, the I and Q analog signals are first sampled by two high-speed ADCs. The relation between the sampling frequencies of the ADC and the parallel structure is shown in Equation (1).
$$f_{adc} = m \cdot f_s \qquad (1)$$
where $f_{adc}$ is the sampling frequency of the high-speed ADC, $m$ is the number of parallel channels, and $f_s$ is the sampling frequency of the parallel structure. The digital signals are stored by the parallel FIFO and then rearranged to ensure the stability of the data flow. Afterwards, the interpolated signals from the interpolators are fed to the timing error detectors (TEDs) to obtain timing errors, which are then filtered by the loop filter. Ultimately, the timing errors are compensated in the interpolators using the fractional interval and basepoint index provided by the NCOs. The FPGA implementation block diagram of the parallel architecture, with delays and bit widths, is depicted in Figure 2. The stability of the feedback loop is quite sensitive to timing error, so the modules related to error processing require bit widths of more than 8 bits, even though the source and output are all 8-bit signals. The FPGA resources consumed and the delay caused by each module will be discussed in detail in Sections 3.1 and 3.2.
Unless otherwise specified, the following descriptions are based on an m-channel parallel module, and the delays introduced on the FPGA by an adder and a multiplier are 1 and 3, respectively.

Parallel Preprocess
Parallel preprocessing, composed of the parallel FIFO, data rearrangement, and data select modules, is responsible for the stability of the data flow.

Parallel FIFO
The source signals need to be stored in the parallel FIFO before all the other procedures in case of any information loss. 2m FIFOs with 8 bits of write/read depth and 512 bits of write width of I Q signal are required on FPGA in an m parallel architecture.

Resource-Saving Data Rearrangement
Timing error is caused by timing frequency offset and phase offset. The error caused by phase offset is a constant value, but the timing error will increase or decrease without bound if a timing frequency offset exists. To restrict the error, source signals need to be deleted or kept once the error has accumulated to a certain amount. However, in a parallel structure, the parallel source signal sequence becomes disordered as soon as a delete or keep operation occurs.
Data rearrangement is adopted to delete or keep the source signal and then restore the disordered signals to the correct order. In [10], the authors used a parallel-FIFO-based delete-keep method to make this adjustment. However, as shown in the first column of Table 1, the subscripts of the indexes are variable. As a result, a Look Up Table (LUT) is consumed in every parallel channel. More seriously, the LUT consumption increases exponentially with the number of parallel channels.
To solve this problem, an index-associated method is proposed. It can be observed that the subscript increments by the same amount as the index value, starting from 1. Thus, the variables can be rewritten as index(i) plus the corresponding subscripts, as exhibited in the second column of Table 1. With the proposed associated index method, only one LUT is consumed because the other LUTs in [10] can be replaced by add operations. Taking an m-parallel system as an example, a LUT is used only for index(i), and the other m − 1 LUTs are replaced with m − 1 adders.
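The index-associated idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' FPGA logic: the table contents in `BASE_LUT`, the state encoding, the lane count `M`, and both function names are hypothetical, because the actual index values of Table 1 are not reproduced here. The sketch only shows that, when every lane's table differs from the first lane's table by a constant offset, one lookup plus additions yields the same read indexes as one lookup per lane.

```python
# Hypothetical sketch: table contents and state encoding are assumptions,
# not the values listed in Table 1 of the article.
M = 8  # number of parallel lanes (illustrative; the article uses 64)

# Assumed base table: read index of lane 0 for each rearrangement state.
BASE_LUT = {0: 1, 1: 2, 2: 0}


def per_lane_lut_indexes(state):
    """Reference approach: every lane owns a full lookup table."""
    tables = [
        {s: base + lane for s, base in BASE_LUT.items()}
        for lane in range(M)
    ]
    return [tables[lane][state] for lane in range(M)]


def associated_indexes(state):
    """Index-associated approach: one lookup, then M - 1 additions."""
    base = BASE_LUT[state]  # the only table lookup
    return [base + lane for lane in range(M)]
```

Both functions return identical index lists for every state, so in this simplified model the per-lane tables carry no information beyond the base table and can be replaced by adders.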

Table 1. Columns: In [10]; Associated Index.
Our work aims at a 20 Gbps communication system, for which a 64-parallel architecture is a wise choice because the FPGA clock then runs at around 156.25 MHz, which is within the limits of current hardware. The comparison of synthesized FPGA resource utilization with a 64-parallel FIFO is exhibited in Table 2. It shows that the proposed method saves about 67% of the LUTs compared with the method in [10].

Data Select
A data select module is used to send the rearranged source signals to the corresponding interpolators, each of which needs four samples. Specifically, in a parallel structure, four extra source signals from the beginning of the next clock cycle need to be attached to the end of the current rearranged queue. Otherwise, the last four interpolators would not have enough source data. A data select module needs 5m registers on the FPGA in a 2m-parallel system.

Parallel Dual Feedback Loop
The improved parallel dual feedback loop is composed of a Parallel Module (PM) and a loop filter. Each PM has two interpolators, one TED and one NCO.

Parallel Module
Every PM needs five source signals and each interpolator uses four of them; the three source signals in the middle are used by both interpolators.
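The sample sharing inside one PM can be illustrated with a short Python sketch. It assumes, as a simplification not stated explicitly in the text, that PM number `pm_index` starts at sample `2 * pm_index` of the rearranged queue (two samples per symbol); the function name and this stride are hypothetical.

```python
def parallel_module_windows(samples, pm_index):
    """Return the two 4-sample interpolator windows of one PM.

    Assumption: PM number pm_index starts at sample 2 * pm_index,
    i.e. consecutive PMs are two samples (one symbol) apart.
    """
    start = 2 * pm_index
    five = samples[start:start + 5]  # the five source signals of this PM
    if len(five) < 5:
        raise ValueError("queue too short: extra samples from the next cycle are needed")
    first = five[:4]   # window of the first interpolator
    second = five[1:]  # window of the second interpolator
    return first, second
```

For example, with `samples = list(range(12))` and `pm_index = 1`, the windows are `[2, 3, 4, 5]` and `[3, 4, 5, 6]`, so the middle three samples `3, 4, 5` feed both interpolators. The `ValueError` branch corresponds to the situation that the data select module avoids by appending samples from the next clock cycle.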

Coefficient Multiplier-Free Interpolator
As summarized in [15], in serial systems a cubic interpolator consumes three multipliers/dividers each time the coefficients are updated, whereas a four-point piecewise-parabolic interpolator (α = 1/2) consumes two. The coefficients are shown in Equation (2).
In Equation (2), x stands for the input signals. Specifically, 3/2 · x in Equation (2) is separated into 1/2 · x + x. All Farrow coefficients of the four-point piecewise-parabolic interpolator with α = 1/2 are then integer powers of 2, so the multipliers/dividers can be implemented with shift operations on the FPGA [21]. In other words, the four-point piecewise-parabolic interpolator with α = 1/2 is the best choice for FPGA, balancing maximum resource savings against minimal performance deterioration. The Farrow structure on FPGA is shown in Figure 3, where D stands for hardware delay and the symbols correspond to those in Equation (2).
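As a hedged illustration of the multiplier-free coefficients, the following Python sketch evaluates a four-point piecewise-parabolic Farrow interpolator with α = 1/2. The coefficient values are the standard ones for this interpolator family from the interpolation literature; the sample naming (`x_m1`, `x_0`, `x_1`, `x_2`) and the function name are assumptions and may not match the notation of Equation (2). Every scaling by 1/2 corresponds to a 1-bit right shift in fixed-point hardware.

```python
def farrow_parabolic(x_m1, x_0, x_1, x_2, mu):
    """Piecewise-parabolic Farrow interpolator, alpha = 1/2.

    Interpolates between x_0 (mu = 0) and x_1 (mu = 1).
    Sample naming is an assumption made for this sketch.
    """
    half = 0.5  # on an FPGA this scaling is a 1-bit right shift
    # Quadratic branch: 1/2*x_2 - 1/2*x_1 - 1/2*x_0 + 1/2*x_m1
    v2 = half * (x_2 - x_1 - x_0 + x_m1)
    # Linear branch: -1/2*x_2 + 3/2*x_1 - 1/2*x_0 - 1/2*x_m1,
    # with 3/2*x_1 split into 1/2*x_1 + x_1 to avoid a true multiplier
    v1 = half * (-x_2 + x_1 - x_0 - x_m1) + x_1
    v0 = x_0
    # Horner evaluation: only the two products with mu need multipliers
    return (v2 * mu + v1) * mu + v0
```

With this arrangement only the two products with the fractional interval `mu` require real multipliers; the coefficient scalings reduce to shifts and additions. On a straight line such as the samples `1, 2, 3, 4`, the quadratic branch vanishes and the output at `mu = 0.25` is `2.25`.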
In particular, one extra delay is caused by the coefficient 3/2, which is separated into 1 + 1/2 as described above. Each adder introduces 1 delay and each multiplier introduces 3, so the total number of delays caused by the interpolator is 11.

Error Detector
Gardner's timing error detector (TED) [16] is shown in Equation (3).
$$e(n) = I(n - 1/2)\,[I(n) - I(n - 1)] + Q(n - 1/2)\,[Q(n) - Q(n - 1)] \qquad (3)$$
where $e(n)$ is the timing error, $I(n)$ and $Q(n)$ are the real and imaginary parts of the interpolator's output, and $n - 1$, $n - 1/2$, $n$ are three consecutive indexes. The TED equation in a parallel system is shown in Equation (4):
$$e(n, i) = I_1(n, i)\,[I_2(n, i) - I_2(n, i - 1)] + Q_1(n, i)\,[Q_2(n, i) - Q_2(n, i - 1)] \qquad (4)$$
where $e(n, i)$ is the timing error of the $i$th PM at time $n$. For the I signal, $I_1(n, i)$ is the first interpolator's output of the $i$th PM at time $n$, $I_2(n, i)$ is the second interpolator's output of the $i$th PM at time $n$, and $I_2(n, i - 1)$ is the second interpolator's output of the $(i - 1)$th PM at time $n$. When $i = 1$, $I_2(n, i - 1)$ stands for the second interpolator's output of the last PM at time $n - 1$. The same holds for the Q signal. For FPGA implementation, each error detector contains two adders and two multipliers. The hardware block diagram is depicted in Figure 4, where D stands for hardware delay and the symbols correspond to those in Equation (4). From Figure 4, it can be seen that the total delay is 5, in which each adder introduces 1 delay and each multiplier introduces 3.
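The parallel error detector can be sketched in Python as follows. This is an illustrative model rather than the FPGA design: it assumes that the first interpolator of each PM delivers the mid-symbol sample and the second delivers the symbol strobe, that both I and Q rails contribute to the error, and that the last PM's second-interpolator outputs from the previous clock (`prev_i2`, `prev_q2`) are held in registers. All names are hypothetical.

```python
def parallel_gardner_ted(i1, i2, q1, q2, prev_i2, prev_q2):
    """Timing errors of all PMs for one clock cycle.

    i1, q1: first-interpolator outputs per PM (mid-symbol samples, assumed)
    i2, q2: second-interpolator outputs per PM (symbol strobes, assumed)
    prev_i2, prev_q2: second-interpolator outputs of the last PM
                      from the previous clock cycle
    """
    errors = []
    for k in range(len(i1)):
        # The first PM borrows its "previous strobe" from the last clock cycle
        i2_before = i2[k - 1] if k > 0 else prev_i2
        q2_before = q2[k - 1] if k > 0 else prev_q2
        e = i1[k] * (i2[k] - i2_before) + q1[k] * (q2[k] - q2_before)
        errors.append(e)
    return errors
```

Because each PM depends only on its own outputs and on its neighbour's strobe, all errors of one clock cycle can be evaluated concurrently; the loop here only stands in for the parallel hardware instances.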

Simplified NCO
In [8], the overflow moments are obtained by comparators. Nevertheless, as discussed in Section 2, this comparison logic has difficulty meeting FPGA timing requirements. To achieve real-time performance, a direct-calculation-based parallel NCO structure is proposed. In [14], Gardner gives Equation (5) to calculate $m_k$:
$$m_k = \mathrm{int}\left[\frac{k T_i}{T_s}\right] \qquad (5)$$
where int[z] means the largest integer not exceeding z, $T_s$ is the sample period before synchronization, and $T_i$ is the synchronized sample period. Equation (5) can be translated into Equation (6), where fix[z] rounds z to the nearest integer toward zero, R is half of the oversampling ratio, and W(n) is the control word of the NCO at time n. Thus, $m_k$ of each parallel module can be calculated directly and accurately, without the comparison logic. When implemented on FPGA, the NCO in a serial structure only needs to locate the initial position of the interpolator with one 8-bit control signal. In a parallel structure, however, not only are the initial positions of each parallel module required, but another 2-bit control signal is also required for rearrangement. Fortunately, in the proposed direct-calculation NCO, this 2-bit signal is the first two bits of the 8-bit control signal in the mth NCO, so no additional hardware resources are required, as depicted in Figure 1.
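The difference between comparator-style and direct calculation can be illustrated with a small Python sketch. It is based only on Gardner's definition of the basepoint index as the integer part of the interpolation position, not on the article's Equation (6); the function names, the `start` and `step` parameters, and the use of exact fractions are assumptions made for the illustration.

```python
from fractions import Fraction
from math import floor


def nco_direct(start, step, count):
    """Directly compute (basepoint index, fractional interval) per interpolant.

    Position of interpolant k is start + k * step, in input-sample units;
    each entry is independent of the others, so all can be computed in parallel.
    """
    out = []
    for k in range(count):
        pos = start + k * step
        m_k = floor(pos)               # largest integer not exceeding pos
        out.append((m_k, pos - m_k))   # fractional interval in [0, 1)
    return out


def nco_sequential(start, step, count):
    """Reference: accumulate step by step and detect overflow by comparison."""
    out = []
    m_k = floor(start)
    mu = start - m_k
    for _ in range(count):
        out.append((m_k, mu))
        mu += step
        while mu >= 1:  # comparator-style overflow check
            mu -= 1
            m_k += 1
    return out
```

With exact arithmetic, for instance `step = Fraction(1001, 1000)`, both functions return the same sequence. The direct form has no dependency between successive interpolants, which is what allows every parallel module to obtain its own basepoint index in the same clock cycle.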

High Precision Loop Filter
A proportional-integral filter is employed in our structure. In [8], the information carried by all TEDs except the last one is discarded and only the last TED's timing error serves as the input of the proportional element, which leads to a large deviation. To guarantee accuracy, the average value of all timing errors is employed as the input of the loop filter in our work. Simulation results confirmed that the performance is better when the average value is used. The equations of the improved loop filter are given in Equations (7) to (9).
Proportional element:
$$P_n = k_1 \times (err_{n,1} + err_{n,2} + \cdots + err_{n,32}) \qquad (7)$$
Integral element:
$$I_n = I_{n-1} + k_2 \times (err_{n,1} + err_{n,2} + \cdots + err_{n,32}) \qquad (8)$$
Then the output of the loop filter is
$$W_n = P_n + I_n \qquad (9)$$
where $k_1$ and $k_2$ are the coefficients of the proportional element and the integral element, respectively, $err_{n,i}$ is the error of the $i$th PM at time $n$, and $W_n$ is the output of the loop filter at time $n$. The structure is depicted in Figure 5.
When implemented on FPGA, an error smoothing module is placed before the first adder, because the input of the loop filter is the average value of the parallel timing errors, as mentioned above. In an m-parallel system there are m/2 TEDs, so a $\log_2(m/2)$-bit shift operation on the FPGA is enough to accomplish the smoothing. Besides, the first adder is an m/2-input adder, so in an m-parallel system the delay introduced by the smoother is $\log_2(m/2)$. To save hardware resources, the multipliers are replaced by shift operations on the FPGA, as mentioned above: $k_1$ and $k_2$ in the loop filter are approximated by the nearest integer power of 1/2. Because multiplication on an FPGA is itself realized by shift and add operations, the approximation is reasonable, even though it changes the closed-loop bandwidth and the damping factor achieved by the system. Therefore, the total number of adders consumed in the loop filter is 1 + m/2, and the corresponding delay is $2 + \log_2(m/2)$.
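A minimal fixed-point sketch of such a loop filter is given below. It assumes integer timing errors, a number of TEDs equal to a power of two, and gains that are exact powers of 1/2; the class name, the parameter names, and the shift amounts used in the example are hypothetical and are not the values used in the article.

```python
class PILoopFilter:
    """Proportional-integral loop filter using only shifts and adds."""

    def __init__(self, avg_shift, k1_shift, k2_shift):
        self.avg_shift = avg_shift  # log2(number of TEDs), e.g. 5 for 32 TEDs
        self.k1_shift = k1_shift    # k1 approximated as 2 ** -k1_shift
        self.k2_shift = k2_shift    # k2 approximated as 2 ** -k2_shift
        self.integral = 0

    def update(self, errors):
        """Consume the timing errors of one clock cycle, return the control word."""
        if len(errors) != 1 << self.avg_shift:
            raise ValueError("number of errors must equal 2 ** avg_shift")
        # Error smoothing: sum of all TED outputs, then a right shift to average
        average = sum(errors) >> self.avg_shift
        proportional = average >> self.k1_shift
        self.integral += average >> self.k2_shift
        return proportional + self.integral
```

For instance, `PILoopFilter(avg_shift=5, k1_shift=2, k2_shift=4)` fed with 32 errors of value 64 returns 20 on the first update (16 from the proportional path plus 4 from the integrator) and 24 on the second. Python's `>>` on negative integers is an arithmetic shift, which matches the usual behaviour of signed shifts in hardware.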

Simulation and FPGA Implementation
As mentioned before, our work aims at a 20 Gbps wireless communication system. A 64-parallel architecture lets the FPGA clock run at around 156.25 MHz. A 32-parallel system would lead to a 312.5 MHz FPGA clock frequency, which makes it very difficult for the FPGA to remain stable while running. A 128-parallel system needs the FPGA circuit to run at only 78.125 MHz, at which stability is quite easy to guarantee with present-day FPGAs, but the system error grows drastically as the number of parallel channels increases. On the other hand, a number of parallel channels that is not a power of two leads to a waste of hardware resources on the FPGA chip during routing, because the FPGA is based on the binary system. Therefore, a 64-parallel architecture is the best choice for a 20 Gbps system.
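The clock frequencies quoted above follow from simple arithmetic, sketched here in Python. The calculation assumes 4 bits per 16QAM symbol and two samples per symbol, as stated elsewhere in the article; the function name is hypothetical.

```python
def fpga_clock_mhz(bit_rate_gbps, bits_per_symbol, oversampling, parallel_channels):
    """FPGA clock in MHz needed to process one sample per channel per cycle."""
    symbol_rate_hz = bit_rate_gbps * 1e9 / bits_per_symbol  # 5 GBaud for 20 Gbps 16QAM
    sample_rate_hz = symbol_rate_hz * oversampling          # 10 GS/s at two samples per symbol
    return sample_rate_hz / parallel_channels / 1e6


for m in (32, 64, 128):
    print(m, fpga_clock_mhz(20, 4, 2, m))
```

The loop prints 312.5, 156.25, and 78.125 MHz for 32, 64, and 128 parallel channels, matching the figures in the text.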
The proposed algorithm is verified in a baseband communication system. The modulation type is 16QAM, the bit rate is 20 Gbps, the roll-off factor is 0.4, the oversampling frequency $f_s$ is 2 times the symbol rate $R_s$ ($f_s = 2R_s$), and the timing frequency offset and phase offset are 32 kHz and π, respectively. The parallel source signal is quantized to 8 bits. The BER performance of the improved parallel architecture simulated in MATLAB reveals its high efficiency. Furthermore, the parallel architecture implemented on the Xilinx XC7VX485T FPGA shows perfect consistency with the simulation.

MATLAB Simulation
The constellation diagrams are shown in Figure 6, where Figure 6a,b are the constellation diagrams before and after timing synchronization respectively. Here, SNR is set to 20 dB. The converged constellation diagram proves that the timing module works correctly.
Figure 6. Constellation diagrams: (a) before timing synchronization; (b) after timing synchronization.
The BER performance for the transmission of 100 frames (16,384 bits each) is shown in Figure 7. The blue curve is the theoretical BER, and * and ∆ represent the MATLAB fixed-point and floating-point simulations, respectively. The BER performance indicates that the algorithm works efficiently, with deterioration of less than 0.5 dB and 1 dB in the floating-point and fixed-point simulations, respectively.

FPGA Implementation
The implementation is demonstrated on a Xilinx XC7VX485T FPGA chip. 128 parallel ROMs are employed to store the source data, equivalent to the 10 GHz ADCs. In order to evaluate the performance difference between simulation and FPGA implementation, a periodic source is embedded into the ROMs. The write and read clocks of the parallel FIFO module are set to 156.268 MHz and 159.524 MHz, respectively, by a Mixed-Mode Clock Manager (MMCM). The device utilization of the whole system is summarized in Table 3.
The output of the fractional interval is depicted in Figure 8a, where * and ∆ represent the MATLAB fixed-point simulation and the hardware behavior simulation, respectively. The two signals overlap completely, which indicates the high efficiency of the hardware design. To further verify the accuracy, the difference between the two signals is exhibited in Figure 8b. The constant zero proves that the NCO output in MATLAB is exactly the same as in the hardware behavior simulation. Moreover, not only the fractional interval but all the signals obtained in behavior simulation are exactly the same as those in MATLAB, which confirms the correctness and effectiveness of the design.
The constellation diagram obtained from the Xilinx XC7VX485T FPGA chip is displayed in Figure 9a. Figure 9b shows the difference (imaginary part) between the interpolator's output in behavior simulation and in FPGA implementation. The difference cannot be guaranteed to always be zero because the initial source signals cannot be precisely controlled on an FPGA chip. However, the difference is always less than 2 from about the 15,000th datum onward. This means that a quantization error of less than 1.5% is introduced for an 8-bit signal, which further confirms that the proposed algorithm works effectively on FPGA.

Conclusions
In this paper, we have proposed an improved parallel timing synchronization architecture to address the urgent problem of enhancing transmission capacity in real-time wireless communication systems. We have also demonstrated that the proposed architecture can be successfully implemented on FPGA. In addition, our work saves 67% of the LUT resources on FPGA compared with earlier results. Meanwhile, the NCO is further improved to meet the FPGA timing requirements by using direct calculation instead of the comparator-based structure in related work. The key technology is parallel signal processing, both in theory and in FPGA implementation. Parallelization of m channels could reduce the system clock to 1/2 m of that required in serial processing, which makes it feasible to achieve real-time performance with existing hardware devices. Accordingly, a theoretical model of parallel digital timing synchronization was established. The simulation result for a 64-parallel-channel, 20 Gbps, 16QAM system shows high consistency with the theoretical model. The BER deterioration is less than 0.5 dB and 1 dB in the floating-point and fixed-point simulations, respectively. The FPGA implementation also shows excellent agreement with the simulation. Furthermore, the proposed algorithm is not limited to 64 parallel channels; higher capacity can be achieved with faster clocks or more parallel channels. The proposed structure could be further optimized in future work on high-capacity wireless communication.