A 1.55-to-32-Gb/s Four-Lane Transmitter with 3-Tap Feed Forward Equalizer and Shared PLL in 28-nm CMOS

This paper presents a fully integrated physical layer (PHY) transmitter (TX) suited to multiple industrial protocols and compatible with different protocol versions. To cover a wide operating range, an LC-based phase-locked loop (PLL) with dual voltage-controlled oscillators (VCOs) is integrated to provide a low-jitter clock. Each lane uses a configurable serialization scheme to adjust the data rate flexibly. To achieve high-speed data transmission, several bandwidth-extension techniques are introduced, and an optimized output driver with a 3-tap feed-forward equalizer (FFE) is proposed to provide high-quality data transmission and equalization. The TX prototype was fabricated in a 28-nm CMOS process; a single-lane TX occupies an active area of only 0.048 mm², and the shared PLL and clock distribution circuits occupy 0.54 mm². The proposed PLL supports a tuning range of 6.2 to 16 GHz. Each lane's data rate ranges from 1.55 to 32 Gb/s, the energy efficiency is 1.89 pJ/bit/lane at a 32-Gb/s data rate, and the equalization can be tuned up to 10 dB.


Introduction
As data centers rapidly evolve to accommodate higher information transfer rates, the high-speed serial interface has become the main candidate for data transmission [1][2][3][4][5][6][7]. Accordingly, multiple industrial standards and protocols have been introduced, such as JESD204B, Thunderbolt, Peripheral Component Interconnect Express, and Universal Serial Bus [8][9][10][11]. The bandwidth requirements of these protocols keep increasing, and the shrinking unit interval (UI) makes the timing budget extremely tight, which becomes a bottleneck in high-speed transmitter (TX) design. Hence, the TX must support a wide range of data rates together with appropriate equalization, which is the most challenging issue in this design.
Despite the availability of high-speed CMOS circuits, high-speed data transmission is still severely restricted by the wireline channel. The bandwidth-limited channel attenuates the high-frequency content of the transmitted data due to the skin effect and dielectric loss [12][13][14][15], resulting in inter-symbol interference. A feed-forward equalizer (FFE) is usually used in the TX to compensate for the channel loss. When an FFE is embedded in a current-mode logic (CML) output driver, the output impedance and the swing can be adjusted by the termination resistor and the bias current, respectively. Moreover, a CML-based output driver can fully exploit the potential of the process, as its compact NMOS driving topology naturally features fast current switching and small parasitic capacitance [16,17]. In addition, as the data rate increases, a low-jitter clock is needed to meet the timing budget, which leads to a phase-locked loop (PLL) based on an LC voltage-controlled oscillator (VCO) [18,19].
This paper reports a 1.55-to-32-Gb/s four-lane TX fabricated in 28-nm CMOS technology. To support a wide operating range and multiple protocols, a wide-operating-range PLL is designed to generate the multi-frequency differential clocks, and multi-rate TX lanes are proposed in which the frequency of the multi-phase clocks can be configured according to the expected data rate. The proposed TX prototype also supports high-speed data transmission; hence, several circuit techniques are adopted to extend the circuit bandwidth and relieve the severe timing constraints. An optimized combiner with a 3-tap FFE is proposed to compensate for the high-frequency channel loss. This paper is organized as follows: Section 2 presents the top-level architecture of the proposed four-lane TX. Sections 3 and 4 present the design details and critical considerations of the on-chip PLL and TX lane. Measurement results are presented in Section 5, and the conclusion is drawn in Section 6.
Figure 1 shows the overall architecture of the proposed four-lane TX, which consists of a BIST module, four TX lanes with the same structure, a shared PLL, clock distribution circuits (CDCs), and a clock buffer. The BIST module integrates pseudo-random binary sequence (PRBS) generation, data coding, and register control. It sends the parallel data on the feedback clock to the TX lane. A multi-rate signal lane is essential for a TX with a wide data-rate range. It is realized through a configurable signal mode and adjustable data serialization. The high-speed data is driven by an optimized CML-based buffer with the 3-tap FFE. Additionally, an LC-PLL offers excellent jitter performance and reduces high-frequency clock routing. A single LC-VCO-based PLL can be shared by 4 to 8 lanes as a central multiplying unit [20]. Thus, the proposed TX adopts a shared LC-VCO-based PLL for the local half-rate clock generation. In this way, the power consumption and chip area are amortized over the TX lanes, improving the overall energy efficiency.
Figure 2 depicts the implementation of the single-lane TX. Each lane employs a half-rate architecture to relax the timing constraint of the critical path, and it is composed of a clock path and a data path. In the clock path, the input differential clocks generated by the shared PLL are first converted to rail-to-rail clocks by the CML-to-CMOS circuits. They are then divided to produce the proper clocks for the serialization trees and latch arrays. The frequency division factor is configured according to the desired transmission data rate. The clock dividers and data multiplexers (MUXs) use CMOS logic to reduce power consumption. In the data path, both the odd and even bits of the 40-bit parallel data are serialized by the MUX trees and retimed by an interleaved latch array to generate differential data sequences, i.e., DO_PRE/MAIN/PST and DE_PRE/MAIN/PST.
The high-speed 2:1 MUXs generate the full-rate data streams applied to the final CML-based output driver. The 3-tap FFE is inserted into the output driver to improve the output eye diagrams. The proposed single lane can be configured for multiple data rates through the serialization process at the front end. The BIST module transfers the encoded parallel data to the PHY layer and decides the effective data bits. In the TX lanes, the divider and MUX tree are configured for the corresponding data rate, so that the half-rate DE/DO streams are consistent with the protocol requirements. In this way, the data rate can be changed flexibly. For example, if the required data rate is DR = 16 Gb/s and the local clock generated by the PLL is locked to f_CK = 8 GHz, then the dividers are set to DIV1 and DIV5, respectively; hence, the final TX OUT is at 16 Gb/s, as desired.
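The relation between the PLL clock and the lane data rate in the example above can be sketched numerically. The following is a minimal illustration only: it assumes a half-rate lane in which two bits leave per clock period and assumes power-of-two rate dividers (1, 2, 4, 8). The divide ratios and the name `clock_options` are hypothetical and do not represent the DIV1/DIV5 register encoding of the chip; only the 6.2-to-16-GHz PLL range is taken from the paper.

```python
# Hypothetical sketch: divide ratios and helper names are assumptions,
# not the DIV1/DIV5 register encoding described in the paper.
PLL_RANGE_GHZ = (6.2, 16.0)   # PLL tuning range reported in the paper
DIVIDE_RATIOS = (1, 2, 4, 8)  # assumed power-of-two rate dividers


def clock_options(data_rate_gbps, eps=1e-9):
    """Return (divide_ratio, pll_freq_ghz) pairs that yield the data rate.

    Assumed half-rate relation: data_rate = 2 * f_pll / divide_ratio.
    """
    lo, hi = PLL_RANGE_GHZ
    options = []
    for n in DIVIDE_RATIOS:
        f_pll = data_rate_gbps * n / 2.0
        # Keep only PLL frequencies inside the reported tuning range.
        if lo - eps <= f_pll <= hi + eps:
            options.append((n, f_pll))
    return options


if __name__ == "__main__":
    for rate in (1.55, 16.0, 32.0):
        print(rate, "Gb/s ->", clock_options(rate))
```

Under these assumptions, the two ends of the reported data-rate range (1.55 and 32 Gb/s) map onto the two ends of the PLL tuning range, and the 16-Gb/s case admits the 8-GHz clock used in the example.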

Multi-Rate Lane Timing
The most critical timing constraint appears at the final 2:1 MUX, and it becomes more severe because the timing constraint changes with the data rate. Hence, a latch array and three 2:1 MUXs are employed to satisfy the changing timing, as shown in Figure 3a. For high-speed data, a CML-based MUX can easily incorporate an active peaking technique in the predriver [21] or charge enhancement techniques [22] to extend the bandwidth, but it consumes a large amount of power. A CMOS-based MUX is employed in this design to minimize the transmission delay in the critical timing path and to reduce the power consumption and area overhead. Figure 3b illustrates the timing diagram of the latch array and MUX. The latch array guarantees the phase relationship between the clock and data paths and outputs the complementary data streams with a 1-UI timing offset for the FFE combiner. Taking the pre-tap MUX as an example, the differential half-rate clock works as the selection signal, and the logic gates accomplish the data serialization. The negative-half and positive-half sides have the same structure and timing constraint. In this way, the phase difference between the CK rising edge and the data transition edge is fixed to Tsetup and is not influenced by changes in the UI. An appropriate Tsetup is set through post-layout simulation, considering the PVT variations. The maximum and minimum Tsetup are 12 ps and 5 ps, respectively, under all PVT variations in the 28-nm CMOS process.
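To put the Tsetup window in context, it can be compared with the UI at several data rates. The short sketch below only restates arithmetic (UI = 1/data rate) using the 5-ps and 12-ps bounds quoted above; the function names are illustrative and no margin analysis from the paper is implied.

```python
T_SETUP_PS = (5.0, 12.0)  # min and max Tsetup over PVT, from the text


def unit_interval_ps(data_rate_gbps):
    """UI in picoseconds for a data rate given in Gb/s."""
    return 1000.0 / data_rate_gbps


def setup_fraction(t_setup_ps, data_rate_gbps):
    """Tsetup expressed as a fraction of one UI."""
    return t_setup_ps / unit_interval_ps(data_rate_gbps)


if __name__ == "__main__":
    for rate in (1.55, 10.0, 28.0, 32.0):
        ui = unit_interval_ps(rate)
        lo = setup_fraction(T_SETUP_PS[0], rate)
        hi = setup_fraction(T_SETUP_PS[1], rate)
        print(f"{rate:5.2f} Gb/s: UI = {ui:7.2f} ps, "
              f"Tsetup = {lo:.3f} to {hi:.3f} UI")
```

Because Tsetup is fixed in absolute time, it takes up a growing share of the UI as the data rate rises, which is why the highest rate is the critical case.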

The High-Speed CML-Based Output Driver
The proposed TX must support a wide data-rate range and high-speed data transmission simultaneously. Thus, the most critical circuit in the signal path is the output driver, which requires both sufficient bandwidth and reasonable gain. The conventional differential CML topology utilizes a tail current source, and the input devices need to be sized sufficiently large to keep the tail current source in the saturation region. Hence, as shown in Figure 4a, a tailless CML-based output driver is proposed to reduce the size of the NMOS input devices and their parasitic capacitance. It is worth mentioning that resistors in parallel with the input transistors are adopted to keep the cascade transistors in the saturation region and to set a low drain-source voltage (around 0.25 V) when the input transistors are turned off. Here, the values of the parallel resistors in the main/post/pre-tap slices are 4/8/12 kΩ, and the total current through the parallel resistors is around 110 µA, which is negligible compared with the current consumption of the output driver. Another primary function of the output driver is to combine the three data sequences to implement signal equalization. The 3-tap FFE is a finite impulse response filter embedded in the proposed output driver. The pre-tap and post-tap data streams are applied through CML-based driver slices connected in parallel with the main-tap slice. They have the opposite polarity to the main-tap signal to realize the addition and subtraction operations. The equalization levels are determined by the ratio of the currents of the CML driver slices, which are tuned by the bias generator shown in Figure 4b. A DAC is used to set the TX output swing. The shunt currents of the pre-tap and post-tap are adjusted by weight control signals (W_PRE and W_PST), and the amplifier generates the bias voltage while ensuring a sufficient phase margin.
The nominal value of the load resistor R_T is 50 Ω to realize impedance matching. To account for PVT variations, R_T can also be adjusted from 39 Ω to 75 Ω. In addition, in AC coupling mode, the driver swing is compressed because the common-mode voltage of the output signal is pulled down. Hence, a parallel current source is added to provide an appropriate bias and increase the driver output level at higher transmission data rates.
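The behavior of a 3-tap FFE as a finite impulse response filter can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the circuit implementation: symbols are ideal ±1 NRZ values, the tap weights (-0.1, 0.7, -0.2) are hypothetical and normalized so that their magnitudes sum to 1, and the function names are illustrative. The pre- and post-tap weights are negative to reflect their opposite polarity relative to the main tap.

```python
import math


def ffe_output(symbols, c_pre, c_main, c_post):
    """Behavioral 3-tap FFE: y[n] = c_pre*x[n+1] + c_main*x[n] + c_post*x[n-1].

    The sequence is extended at both ends by repeating the edge symbol.
    """
    out = []
    last = len(symbols) - 1
    for i, sym in enumerate(symbols):
        prev_sym = symbols[i - 1] if i > 0 else symbols[0]
        next_sym = symbols[i + 1] if i < last else symbols[last]
        out.append(c_pre * next_sym + c_main * sym + c_post * prev_sym)
    return out


def equalization_db(c_pre, c_main, c_post):
    """Ratio of the peak (transition) level to the settled (long-run) level."""
    peak = abs(c_pre) + abs(c_main) + abs(c_post)
    settled = c_pre + c_main + c_post
    if settled <= 0:
        raise ValueError("main tap must dominate the pre- and post-taps")
    return 20.0 * math.log10(peak / settled)


if __name__ == "__main__":
    taps = (-0.1, 0.7, -0.2)  # hypothetical weights, not measured settings
    print(ffe_output([1, 1, 1, -1, -1, -1], *taps))
    print(f"equalization = {equalization_db(*taps):.2f} dB")
```

In this model, bits adjacent to a transition are emphasized while long runs settle to a lower level, which is the high-pass shaping that counteracts the low-pass channel.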

Wide-Operating-Range PLL
As shown in Figure 5a, the designed PLL is composed of a phase-frequency detector (PFD), a charge pump (CP), a second-order loop filter (LF), two parallel VCOs, and frequency dividers. CML-based buffers are used to transfer the half-rate clocks to each lane and to the output PADs for PLL performance measurement. Typically, the reference clock (CK REF
The presented TX supports a wide data-rate range; hence, several techniques are adopted to support a wide operating range. The dual VCOs, combined with switch-controlled capacitors, coarsely cover the required operating range. The CP is designed to support a large V_ctrl range so that the clock frequency can be finely tuned. An LC-VCO is superior to a ring VCO for multi-GHz serial links in terms of noise characteristics such as phase noise and clock jitter. However, its limited tuning range remains a challenge. A single LC-VCO centered at frequency f1 can cover only its own tuning range. To support a wider range of data rates, an additional LC-VCO centered at f2 is adopted. Therefore, multiple standards can be supported at the cost of an acceptable area overhead. In addition, the switch-controlled capacitor array shown in Figure 5b is integrated into each VCO to tune the VCO operating range.
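The coarse band selection between the two VCOs can be sketched as a simple lookup. The ranges below are the measured values reported in the Measurement Results section (VCO1: 6.2 to 10.7 GHz, VCO2: 10.2 to 17.1 GHz); the function name and the selection logic are hypothetical illustrations and do not describe the on-chip control of the VCOs or capacitor arrays.

```python
# Measured tuning ranges reported in the paper, in GHz.
VCO_RANGES_GHZ = {
    "VCO1": (6.2, 10.7),
    "VCO2": (10.2, 17.1),
}


def select_vco(target_ghz):
    """Return the names of the VCOs whose range covers the target frequency.

    Hypothetical helper: an empty list means the target is out of range,
    and two entries mean the target falls in the overlap of both bands.
    """
    return [name for name, (lo, hi) in VCO_RANGES_GHZ.items()
            if lo <= target_ghz <= hi]


if __name__ == "__main__":
    for f in (6.2, 8.0, 10.5, 16.0):
        print(f, "GHz ->", select_vco(f))
```

The overlap between the two bands means that no frequency inside the overall range is left uncovered at the boundary between the VCOs.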

Measurement Results
The proposed four-lane TX was realized in a 28-nm CMOS process. The chip micrograph and block description are shown in Figure 6. The total chip size was 2.97 mm × 1.08 mm, dominated by the input/output testing PADs and the on-chip PLL. A single-lane TX occupied an active area of only 0.048 mm², and the shared PLL and CDCs occupied a core area of 0.54 mm². The prototype chip was wire-bonded to the printed circuit board for all measurements. The four lanes shared a 0.9-V supply for the core circuits and a 1.2-V supply for the bias generator and output driver; the corresponding power consumptions were 26.5 mW and 34.1 mW, respectively. The PLL and shared circuits were operated from a 0.9-V and a 1.8-V supply, and the corresponding power consumptions were 63.2 mW and 8.4 mW, respectively. Note that the independent power supplies help suppress the output jitter effectively. Therefore, the TX dissipated 60.6 mW per lane at a 32-Gb/s data rate, corresponding to an energy efficiency of 1.89 pJ/bit, and the power consumption of the PLL was 71.6 mW. The measurement environments are shown in Figure 7, in which the TX chip was wire-bonded to the demo PCB, the SPI control signal was connected to the FPGA design kit (VCU118) for register configuration, and the input reference clock for the PLL was provided by an analog signal generator (Keysight N5173). The PLL performance was measured with a signal analyzer (Agilent Technologies N9030A). The TX lanes were measured with both the FPGA environment and an oscilloscope (Teledyne LeCroy SDA MCM-ZI-A) to characterize the overall performance.
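The reported power and energy-efficiency figures can be cross-checked with simple arithmetic, since power in mW divided by data rate in Gb/s gives energy in pJ/bit. The sketch below uses only the numbers quoted above; the variable and function names are illustrative.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ: mW / (Gb/s) = pJ/bit."""
    return power_mw / data_rate_gbps


# Per-lane TX power from the 0.9-V and 1.2-V supplies (values from the text).
lane_power_mw = 26.5 + 34.1
# PLL and shared-circuit power from the 0.9-V and 1.8-V supplies.
pll_power_mw = 63.2 + 8.4

if __name__ == "__main__":
    print(f"lane power = {lane_power_mw:.1f} mW")
    print(f"PLL power  = {pll_power_mw:.1f} mW")
    print(f"efficiency = {energy_per_bit_pj(lane_power_mw, 32.0):.2f} pJ/bit")
```

The sums reproduce the 60.6-mW lane power and the 71.6-mW PLL power, and the lane power at 32 Gb/s corresponds to the stated 1.89 pJ/bit.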
The PLL was measured through the clock buffers in the proposed TX prototype. Figure 8a gives the phase noise and spur performance at 10 GHz, where the phase noise is around −92.6 dBc/Hz at a 1-MHz offset, and the spur is better than −56 dBc. Figure 8b shows the tuning range of the designed PLL, where VCO1 supports an operating range of 6.2 to 10.7 GHz and VCO2 supports an operating range of 10.2 to 17.1 GHz. Figure 8c further shows a group of the measured eye diagrams. The designed PLL with dual VCOs can cover a wide frequency range from 6.2 to 16 GHz with low output jitter.
In measurement environment 1, the FPGA design kit received the four-lane differential data signals and checked the TX data against the specified PRBS patterns. All four lanes were checked, the received data were correct, and the BER was less than 1e-10 at data rates up to 25 Gb/s, which confirmed that the TX data were correctly serialized and that the delay mismatch between lanes did not affect the chip function. The oscilloscope was used in measurement environment 2 to observe the eye diagrams and to confirm the high-frequency performance of the proposed TX chip. Figure 9 summarizes the measured eye diagrams of TX lane 0; the TX achieves an eye height of >0.9 V ppd and an eye width of >0.72 UI at different data rates, such as 1.55, 10, 28, and 32 Gb/s. As shown in Figure 10, the four TX lanes achieve consistent performance. Figure 11a displays the measured S-parameter curves of the 0.5-m, 2.2-m, and 3.2-m paired cables. Figure 11b shows the eye diagrams before and after applying the equalization at a data rate of 32 Gb/s. It can be observed that the designed 3-tap FFE improves the eye opening for all cable channels with −4.1/−6.6/−9.7-dB equalization.
Table 1 summarizes the chip performance and compares this work with recent TXs [23][24][25][26] operating at a similar data rate and in a similar CMOS process. Our work shows a wider range of data rates, better energy efficiency, and wider eye width.

Conclusions
A four-lane transmitter suited to multiple serial communication protocols is presented in this study. To support a wide operating range, a dual-VCO scheme is proposed in the designed LC-PLL to produce the local half-rate clock, and a configurable data serialization architecture is adopted in the signal lanes to match the transmission data rate. Additionally, an optimized output driver with a 3-tap FFE is proposed to provide sufficient bandwidth and compensate for the wireline channel loss in high-speed communication.
The transmitter prototype is fabricated in a 28-nm CMOS process; the active areas of the transmitter lane and of the PLL with CDCs are 0.048 mm² and 0.54 mm², respectively. The prototype supports a wide data-rate range of 1.55 to 32 Gb/s and consumes 1.89 pJ/bit/lane at a data rate of 32 Gb/s.
Author Contributions: C.C. and X.Z. designed the circuits, analyzed the measurement data, and wrote the manuscript. D.W. and J.L. assisted with the circuit simulation and implementation. L.Z. and J.W. assisted with the chip package implementation and the PCB design. D.L. performed the chip test and assisted with the chip measurement. Y.C. contributed to the technical discussions and reviewed the manuscript. X.L. gave valuable guidance and confirmed the final version of the manuscript. All authors have read and agreed to the published version of the manuscript.