A 1.93-pJ/Bit PCI Express Gen4 PHY Transmitter with On-Chip Supply Regulators in 28 nm CMOS

Abstract: This paper presents a fully integrated Peripheral Component Interconnect (PCI) Express (PCIe) Gen4 physical layer (PHY) transmitter. The prototype chip is fabricated in a 28 nm low-power CMOS process, and the active area of the proposed transmitter is 0.23 mm². To enable voltage scaling across a wide range of operating rates from 2.5 Gb/s to 16 Gb/s, two on-chip supply regulators are included in the transmitter. At the same time, the regulators maintain the output impedance of the transmitter to meet the PCIe return loss specification, by including replica segments of the output driver and a reference resistance in the regulator loop. Three-tap finite-impulse-response (FIR) equalization is implemented, so that the transmitter provides more than the 9.5 dB of equalization required by the PCIe specification. At 16 Gb/s, the prototype chip achieves an energy efficiency of 1.93 pJ/bit including all the interface, bias, and built-in self-test circuits.


Introduction
Because Peripheral Component Interconnect Express (PCIe) is one of the most rapidly evolving high-speed interface standards (for example, the per-pin data rate doubles every 4 years [1]), meeting the PCIe Gen4 physical layer (PHY) transmitter specification poses many design challenges. They can be summarized as follows. First, the maximum data rate increases to 16 Gb/s, whereas that of the prior PCIe Gen3 is 8 Gb/s. Moreover, because PCIe is required to provide backward compatibility, a PCIe Gen4 transmitter must also cover the lower data rates of 2.5 Gb/s (Gen1), 5 Gb/s (Gen2), and 8 Gb/s (Gen3) [1]. Second, three-tap finite-impulse-response (FIR) equalization with a pre-cursor and a post-cursor tap is required, and the equalization coefficients must be controllable to suppress intersymbol interference (ISI) from variable channel loss. Third, the transmitter output impedance must be well matched, as specified by the return loss requirement, to minimize reflections for better signal integrity. Lastly, the output voltage swing must be adjustable.
In addition to the above essential requirements, supply voltage scalability is another key knob for improving energy efficiency across such a wide range of data rates [1–14]. Ideally, the energy efficiency is expected to be flat across the data rate [6], which is rarely achieved in conventional fixed-voltage designs. Figure 1a shows an example of the energy efficiency of a fixed-voltage design. The energy consumption is categorized into static current consumption, dynamic switching energy consumption, and signaling power over a lossy transmission line [15,16]. At a lower data rate, the signaling and static contributions become less efficient because a fixed power is consumed regardless of the data rate. On the other hand, the dynamic switching energy per bit is fixed at CV², regardless of the data rate. As a result, the overall energy efficiency is worse at a low data rate [17]. The signaling energy increases at a high data rate due to equalization power. In contrast, supply scaling improves the energy efficiency at a low data rate, as shown in Figure 1b, where a linear voltage-frequency scaling is assumed [17]. Because the contribution from the dynamic switching is significantly reduced at a low data rate, the energy efficiency curve becomes flatter than that in Figure 1a, in addition to the improved energy efficiency. Moreover, supply scaling becomes more attractive in recent CMOS process technologies, where the circuits can take advantage of robust low-voltage operation [18]. This work presents design techniques for a PCIe Gen4 transmitter and fabrication results in 28 nm CMOS, offering supply-voltage scalability as well as meeting all the requirements of PCIe Gen4.
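The trend described above can be illustrated with a rough numerical sketch. It assumes a simple model in which the energy per bit is the sum of a dynamic CV² term and a fixed power (static plus signaling) divided by the data rate. The effective capacitance, the fixed power, the voltage limits, and all function names are hypothetical values chosen only for illustration; they are not taken from the prototype. For simplicity the fixed power is kept constant, so only the dynamic term benefits from scaling here.

```python
# Toy energy model; every parameter value below is hypothetical.
RATES_GBPS = [2.5, 5.0, 8.0, 16.0]  # PCIe Gen1 to Gen4 data rates


def model_energy_pj_per_bit(rate_gbps, vdd, c_eff_pf=1.0, p_fixed_mw=4.0):
    """Energy per bit (pJ): dynamic C*V^2 term plus fixed power spread over the bits."""
    dynamic = c_eff_pf * vdd ** 2   # pF * V^2 = pJ per bit, independent of data rate
    fixed = p_fixed_mw / rate_gbps  # mW / (Gb/s) = pJ per bit
    return dynamic + fixed


def scaled_vdd(rate_gbps, v_max=1.0, v_min=0.5, rate_max_gbps=16.0):
    """Linear voltage-frequency scaling, clamped at an assumed minimum voltage."""
    return max(v_min, v_max * rate_gbps / rate_max_gbps)


def sweep():
    """Return (rate, fixed-voltage energy, scaled-voltage energy) for each data rate."""
    return [
        (r, model_energy_pj_per_bit(r, 1.0), model_energy_pj_per_bit(r, scaled_vdd(r)))
        for r in RATES_GBPS
    ]


if __name__ == "__main__":
    for rate, fixed_v, scaled_v in sweep():
        print(f"{rate:5.1f} Gb/s  fixed: {fixed_v:.2f} pJ/bit  scaled: {scaled_v:.2f} pJ/bit")
```

Under these assumptions the two curves coincide at 16 Gb/s, while the scaled-voltage energy is lower at every slower rate because the CV² term shrinks with the supply.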
This paper is organized as follows: Section 2 describes the overall architecture and the details of the circuit implementation for the proposed transmitter. In Section 3, the fabrication results from the prototype chip are discussed. Lastly, Section 4 concludes this paper.

Implementation of the Proposed PCIe Gen4 Transmitter
Figure 2 shows the overall architecture of the proposed PCIe Gen4 transmitter, which is fabricated in a 28 nm low-power CMOS process. In order to support the wide bitrate range without a power penalty, a CMOS logic-based circuit topology and a scalable supply voltage using an on-chip regulator are employed; hence, configurable voltage scaling is implemented in the proposed transmitter [19].
Considering the fanout-of-four (FO4) inverter delay across the process, voltage, and temperature (PVT) corners in the 28 nm CMOS technology, a half-rate clocking structure and a time-multiplexing driver are chosen to relax the required circuit bandwidth [12]. A P-over-N source-series terminated (SST), or voltage-mode (VM), driver topology with stacked NMOS and PMOS is adopted to implement the time-multiplexing driver. Because the final 2:1 multiplexing happens at the very end of the transmitter, no ultra-high-speed (>8 Gb/s) circuit is used; hence, substantial power saving is achieved. In addition, the transistor stacking relaxes the electrostatic discharge (ESD) requirement due to CMOS snapback breakdown [20,21]. Compared to the triple-stack topology with a quarter-rate clock [12], the double stack allows less charge sharing and eases the overhead of multi-phase generation.
Electronics 2021, 10, x FOR PEER REVIEW
An eight-way parallel data stream is generated from the on-chip pseudo-random binary sequence-7 (PRBS-7) generator. Subsequently, the parallel data are serialized to a two-way stream by the half-rate clock. The half-rate clock is externally provided and internally distributed using CMOS buffers. An open-loop duty-cycle corrector (DCC) based on an AC-coupled trans-impedance amplifier (TIA) is used to correct the duty-cycle distortion of the incoming clock [22–24]. The PRBS generator and the serializer are under a separate digital supply (LVDD) domain, to decouple the digital switching noise from the analog circuits.
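As a behavioral illustration of the data path described above, the following sketch generates a PRBS-7 pattern, groups it into eight-way parallel words, serializes the words into two half-rate streams, and interleaves them in a final 2:1 multiplexer. The generator polynomial (x^7 + x^6 + 1, a common PRBS-7 choice), the seed, the even/odd bit ordering, and all function names are assumptions made for this example; the paper does not specify them.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of PRBS-7 with an assumed x^7 + x^6 + 1 feedback (period 127)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero, otherwise the LFSR locks up")
    out = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # XOR of the two oldest bits
        out.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F      # shift the new bit in, keep 7 bits
    return out


def to_parallel_words(bits, width=8):
    """Group a bit list into full words of the given width (eight-way parallel data)."""
    usable = len(bits) - len(bits) % width
    return [bits[i:i + width] for i in range(0, usable, width)]


def serialize_to_two_way(words):
    """Turn parallel words into two half-rate streams (even and odd bit positions)."""
    even, odd = [], []
    for word in words:
        even.extend(word[0::2])
        odd.extend(word[1::2])
    return even, odd


def mux_2to1(even, odd):
    """Final 2:1 time multiplexing: interleave the two half-rate streams."""
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out


if __name__ == "__main__":
    pattern = prbs7(128)
    even, odd = serialize_to_two_way(to_parallel_words(pattern))
    print("full-rate stream recovered:", mux_2to1(even, odd) == pattern)
```

Each of the two streams runs at half the line rate, which mirrors why only the last multiplexing stage has to operate at the full data rate.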
In the pre-driver stage, high-speed custom flip-flops are used to synchronize the 6 bit half-rate data stream for three-tap FIR equalization (D1<1:0>, D0<1:0>, D-1<1:0>) from the serialized stream. The pre-driver stage also includes the data and clock buffers that drive the large transistors of the time-multiplexing output driver, where the P-over-N SST topology provides the high output swing of the PCIe Gen4 specification (>800 mV) with reasonable linearity (return loss < −10 dB) [25].
The pre-driver and the output driver are under the regulated supply voltages, VDDPU and VDDPD, respectively, which are configurable. Two supply regulators are dedicated to generating VDDPU and VDDPD from the 1.8 V supply (HVDD); they are used to match the output impedance of the transmitter, as well as to configure the voltage scaling and the output swing. The 1.8 V supply can be further reduced depending on the link requirements. Replica slices of the output driver (the pull-down path in the VDDPD regulator, the pull-up path in the VDDPU regulator) are used in the regulators to capture the impedance information of the actual output driver. Both the output driver and the replicas are segmented; therefore, the number of active segments is configurable. Because the regulators essentially equalize the impedance of the replicas with the reference resistance by adjusting the regulated voltage, configuring the number of active segments achieves on-chip supply scaling. Specifically, because of its inverter-like structure, the P-over-N SST driver does not allow the pull-up (PMOS) and pull-down (NMOS) branches to be turned on simultaneously. As a result, the pull-up and pull-down impedances can be controlled independently. Assuming linear operation, the impedance is dominantly a function of the gate-overdrive voltage, which is VDDPU − VT,P or VDDPD − VT,N for the pull-up and pull-down paths, respectively.
Therefore, changing the number of active segments in the replica array lets the regulators adjust VDDPU and VDDPD accordingly to match the impedance of the replica array to the reference resistance. For example, once the number of active segments is reduced, the regulator raises the supply voltage to lower the resistance per segment and keep the same overall resistance. The driver thus provides a stable output impedance regardless of the driver configuration, as validated in [12], where the impedance variation is less than 10% of the target impedance.
The segmentation of the output driver also enables the transmitter to configure the FIR coefficients by adjusting the number of segments assigned to the pre-cursor, main, and post-cursor taps. The driver segment selection (NAND and NOR gates) is placed at the first stage of the pre-driver to minimize redundant switching power due to data transitions [12]. On the other hand, the replica segment selection is realized simply by connecting the gate to the regulated supply voltage or to ground, without considering data transitions; as a result, a simple inverter in the regulated supply domain serves as the segment selection.
For both regulators, simple one-stage differential amplifiers are used as the operational transconductance amplifiers. The dominant pole is placed at the gate of the pass transistor in both regulators, to stabilize the negative feedback loop with a reasonably sized stabilizing capacitor placed at the output of the amplifier. The simulated power-supply rejection ratios (PSRR) of both regulators are shown in Figure 3. Because the dominant pole is placed at the gate of the pass transistor, a zero is introduced in the PSRR curve [26]. This is acceptable for this work, where the primary role of the regulators is to control the impedance, whereas the PSRR is a secondary benefit. In addition, the dominant pole is placed at a low enough frequency across all configurations (i.e., at least 10× lower in frequency than the second pole), which confirms that the loop is stable regardless of the configuration.
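The relation between the number of active segments and the regulated voltage can be sketched with a first-order model. It assumes that each segment behaves as a linear-region transistor whose on-resistance is inversely proportional to the gate overdrive, that active segments are simply in parallel, and that the regulator settles where the replica array equals the reference resistance. The transconductance factor, threshold voltage, and reference resistance are hypothetical numbers, and the function names are made up for this example; series termination resistors and second-order device effects are ignored.

```python
def segment_resistance(v_supply, k_per_segment=2e-3, v_t=0.35):
    """On-resistance (ohm) of one segment, assumed to be 1 / (k * overdrive)."""
    overdrive = v_supply - v_t
    if overdrive <= 0:
        raise ValueError("supply must exceed the threshold voltage")
    return 1.0 / (k_per_segment * overdrive)


def array_resistance(n_active, v_supply, k_per_segment=2e-3, v_t=0.35):
    """Resistance of n_active identical segments connected in parallel."""
    return segment_resistance(v_supply, k_per_segment, v_t) / n_active


def regulated_supply(n_active, r_ref=50.0, k_per_segment=2e-3, v_t=0.35):
    """Supply voltage at which the replica array equals the reference resistance."""
    return v_t + 1.0 / (k_per_segment * n_active * r_ref)


if __name__ == "__main__":
    for n in (40, 20, 16):
        v = regulated_supply(n)
        print(f"{n:2d} segments -> supply {v:.3f} V, array {array_resistance(n, v):.1f} ohm")
```

In this model, fewer active segments call for a higher regulated voltage so that the parallel combination stays at the reference value, which is the mechanism that ties segment configuration to supply scaling.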

Measurement Results
The proposed PCIe Gen4 transmitter is fabricated in a 28 nm low-power CMOS technology and occupies an active area of 0.023 mm², while the full chip occupies 1 mm². The chip photomicrograph and block description are shown in Figure 4. The prototype chip is wire-bonded on the test printed circuit board (PCB). The transmitter dissipates 30.8 mW at 16 Gb/s including all the interface buffers, bias circuits, and built-in self-test circuitry. Figure 5 shows the power breakdown at various data rates with the nominal supply voltage and output swing configuration. Owing to the CMOS-based circuit implementation, the power consumed by the data generation, data buffers, and clock distribution scales almost linearly with the data rate. On the other hand, the power consumption of the output SST driver and the regulators does not scale because they belong to the signaling and static power categories in Figure 1. However, this can be improved by lowering HVDD, as the regulator headroom requirement is relaxed at a lower data rate. Figure 6 shows the measured eye diagrams at an 8 Gb/s data rate with a low-loss channel for different three-tap FIR configurations, which verifies the proper operation of the three-tap FIR equalization.
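The energy efficiency figure follows directly from the measured power and data rate quoted above. The short sketch below only restates that arithmetic; the function name is made up for this example.

```python
def energy_per_bit_pj(power_mw, rate_gbps):
    """Energy per bit in pJ: mW divided by Gb/s gives pJ/bit."""
    return power_mw / rate_gbps


if __name__ == "__main__":
    # 30.8 mW at 16 Gb/s, as reported for the prototype
    print(f"{energy_per_bit_pj(30.8, 16.0):.3f} pJ/bit")
```

The result is 1.925 pJ/bit, which is reported as 1.93 pJ/bit after rounding.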

Because the supply regulators provide background calibration of the output impedance, the transmitter provides a constant output impedance regardless of the FIR configuration; hence, the measured return loss of the prototype transmitter chip is less than −10 dB across all the frequencies of interest and passes the return loss specification of PCIe Gen4, as shown in Figure 7. In particular, the measured return loss at low frequency is less than −32 dB, which is equivalent to an impedance mismatch of less than 2.5 Ω, validating the impedance calibration method based on the supply regulators. In addition, as mentioned in Section 2, the relaxed ESD requirement due to the stacked output driver improves the high-frequency return loss.
A practical lossy channel is used to fully verify the effect of the three-tap FIR equalization. Figures 8 and 9 show the measured insertion loss (S21) of the lossy channel and the eye diagrams at 16 Gb/s with the lossy channel, respectively. With 9.6 dB of channel loss at the Nyquist frequency (8 GHz), no residual ISI is found in the eye diagram with three-tap FIR equalization, which verifies that the proposed transmitter provides more than the 9.5 dB of compensation at the Nyquist frequency required by the PCIe Gen4 specification. In addition, the three-tap FIR equalizer gives a better eye opening than the two-tap configuration, which justifies the implementation of the three-tap FIR. Figure 10 shows the output voltage swing measured with a Tektronix P7313 differential SMA probe module. The transmitter achieves an output swing of 800 mV peak-to-peak differential.
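The link between the low-frequency return loss and the impedance mismatch quoted above can be checked with the standard reflection-coefficient relation. The sketch assumes a 50 Ω single-ended reference and a purely resistive mismatch; both are assumptions for this example, as are the function names.

```python
def reflection_magnitude(s11_db):
    """Magnitude of the reflection coefficient for an S11 value given in dB."""
    return 10 ** (s11_db / 20.0)


def impedance_bounds(s11_db, z0=50.0):
    """Resistive terminations (below and above z0) that produce the given S11."""
    gamma = reflection_magnitude(s11_db)
    z_low = z0 * (1 - gamma) / (1 + gamma)
    z_high = z0 * (1 + gamma) / (1 - gamma)
    return z_low, z_high


if __name__ == "__main__":
    z_low, z_high = impedance_bounds(-32.0)
    print(f"-32 dB corresponds to {z_low:.2f} ohm or {z_high:.2f} ohm around 50 ohm")
```

Under these assumptions, −32 dB corresponds to a deviation of roughly 2.5 Ω from 50 Ω on either side, in line with the mismatch stated in the text.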

(a) Without EQ (b) Pre-emphasis (c) 3-tap EQ
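The effect of assigning driver segments to the pre-cursor, main, and post-cursor taps can be sketched numerically. The example assumes that each tap weight is the fraction of segments assigned to it, that the pre- and post-cursor taps are subtracted from the main tap, and that the equalization strength is the ratio of the full-swing (alternating data) level to the steady-state (constant data) level. The segment counts and function names are hypothetical; the paper does not state how many segments the driver has.

```python
import math


def tap_weights(n_pre, n_main, n_post):
    """Tap magnitudes as fractions of the total number of driver segments."""
    total = n_pre + n_main + n_post
    return n_pre / total, n_main / total, n_post / total


def boost_db(n_pre, n_main, n_post):
    """High-frequency boost: full-swing level over the steady-state level, in dB."""
    c_pre, c_main, c_post = tap_weights(n_pre, n_main, n_post)
    steady = c_main - c_pre - c_post
    if steady <= 0:
        raise ValueError("main tap must outweigh the pre- and post-cursor taps")
    return 20.0 * math.log10((c_pre + c_main + c_post) / steady)


def fir_levels(bits, n_pre, n_main, n_post):
    """Normalized output levels of the three-tap FIR for a bit sequence."""
    c_pre, c_main, c_post = tap_weights(n_pre, n_main, n_post)
    sym = [1 if b else -1 for b in bits]
    return [
        -c_pre * sym[n + 1] + c_main * sym[n] - c_post * sym[n - 1]
        for n in range(1, len(sym) - 1)
    ]


if __name__ == "__main__":
    print(f"pre 3 / main 21 / post 6  -> {boost_db(3, 21, 6):.2f} dB")
    print(f"pre 0 / main 20 / post 10 -> {boost_db(0, 20, 10):.2f} dB")
```

With 30 hypothetical segments, assigning one third of them to the post-cursor tap gives about 9.5 dB of boost, which is the equalization level referred to in the text.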
The performance of the fabricated transmitter is summarized and compared to other supply-scalable, wide-range transmitters in Table 1. Compared to the reference designs, the proposed transmitter achieves the best energy efficiency except for [13]. However, on-chip supply scaling is missing in [13]; consequently, it does not reflect the power overhead of on-chip supply scaling, such as the voltage dropout by a linear regulator or finite efficiency and switching noise of a switching regulator. As a result, considering that a substantial portion of the power is dissipated by the dropout of regulators, we can conclude that the proposed transmitter achieves the best efficiency among the supply-scalable transmitters.

Conclusions
A PCIe Gen4 PHY transmitter fabricated in 28 nm CMOS was presented. The transmitter adopts supply voltage scaling, using the on-chip supply regulators, to improve energy efficiency across a wide range of data rates. The fabricated chip satisfies all the specifications of PCIe Gen4, including data rates, output swing, equalization, and return loss. The transmitter consumes 30.8 mW at 16 Gb/s and occupies an active area of 0.023 mm².