A Multichannel, High-Bandwidth Wireline Receiver for D2D Interconnects

Abstract: This paper proposes a multichannel, high-bandwidth (BW) receiver for standard-packaging die-to-die (D2D) interconnects. The receiver adopts the forward clock (FCK) architecture of the high-density transmission standard, with 16 high-speed data paths and a pair of low-speed differential clocks, for a BW of 512 Gbps. To reduce chip area and power consumption, a common minimum phase-locked loop (MINI-PLL) and a clock and data alignment (CDA) circuit replace the clock and data recovery (CDR) circuit of a traditional receiver. A delay-matching circuit is adopted to combat PVT variation and lane skew. In addition, a high-linearity phase interpolator (PI) is used in the MINI-PLL to adjust the clock phase and improve clock jitter performance. In 28 nm CMOS technology, the overall link power consumption is 1.56 pJ/b. The bit error rate (BER) is less than 10^−15 with real S-parameters and a channel loss of 10 dB at 16 GHz.


Introduction
As data exchange rates rise, the volume of transmitted data has grown exponentially, a trend that is critical for hyperscale data centers, high-performance central processing units (CPUs), graphics processing units (GPUs), and artificial intelligence (AI). System-on-chip (SoC) development therefore faces unprecedented challenges. With the slowdown of Moore's Law, further improvements in chip performance and power consumption are increasingly uneconomical. Heterogeneous integration (chiplet) technology offers a new design approach to this problem [1,2]. Multiple smaller chips in a multichip module (MCM) are linked by D2D interconnects, placed at the edge of each chip, that combine extremely low power consumption with high BW. In high-performance computing (HPC) and AI applications, a large SoC is divided into two or more homogeneous chips. In networking applications, the I/O and network cores are divided into separate chips, as shown in Figure 1. D2D interconnects in such SoCs must not limit overall system performance [3], so their design focuses on higher bandwidth density (bandwidth per area/layer), lower energy per bit (power consumption per data rate), and lower BER [4].
Traditional D2D interconnects often adopt single-ended, low-speed parallel transmission [5,6] to increase bandwidth density, but with the rapid growth in the amount of exchanged information, this approach can no longer meet the demands placed on D2D interconnects.
To overcome the limited bandwidth, pulse amplitude modulation (PAM) has been used in D2D interconnects, but it increases area and power consumption [7]. In addition, PAM4 has a lower signal-to-noise ratio (SNR) than NRZ, so its BER is higher than that of systems using NRZ. With the application of SerDes between chips, D2D interconnects have gradually turned to single-ended, high-speed transmission to further increase bandwidth density. However, as the speed continues to increase, signal margins shrink and the signals become more susceptible to external disturbances. Single-ended signaling (SES) therefore depends on advanced packaging technology [8] and is not well suited to standard packages. For standard packaging and higher data rates, differential signaling (DS) is often used. Because of the circuit characteristics of DS (two wires per signal), however, the theoretical power consumption of a DS D2D interconnect is twice that of SES transmission, and its bandwidth density is half. It is therefore necessary to improve the bandwidth density and reduce the power consumption of DS transmission to make it suitable for standard packaging at high speed. This paper proposes a solution to these problems of DS transmission for D2D interconnects. A MINI-PLL and a CDA circuit replace the CDR of a traditional receiver, and a delay-matching circuit is adopted to combat PVT variation and lane skew, which reduces power consumption and area.

Receiver Design
To improve clock quality and reduce BER, a forward clock (FCK) is often used in D2D interconnects. Figure 2 shows an FCK architecture [9,10]. In FCK, the clock and data frequencies are well matched, so little jitter is introduced during data transmission. The sampling clock can be adjusted by a phase interpolator (PI), so a length mismatch between the data lines and the clock line is tolerated. An advantage of this mode is that the transmitting and receiving ends are relatively independent, which is convenient for modular design and reduces receiver complexity, so it is well suited to D2D interconnects [6,7,11]. The additional clock path occupies some interconnect channels and costs additional silicon area and power. However, this overhead is amortized across the individual data paths as the number of channels increases. In D2D interconnects, each channel is often given its own clock and data recovery circuit (CDR), because the clock skew differs from channel to channel. In this design, a CDA and a PLL are used instead of per-channel CDRs, and a delay-matching block δ aligns the clock on each channel. The advantage of this method is that no CDR is needed on each channel, so chip area and power consumption are greatly reduced in multichannel transmission, which increases the bandwidth density.
Figure 3 shows the overall structure of the receiver. The PLL comprises a PFD, a charge pump (CP), a filter, a VCO, a divide-by-4 circuit (DIV4), and a PI [12], and its output provides eight equally spaced clock phases. The MINI-PLL adjusts the phase of the PI so that the clock generated by the VCO samples the data at the best position. The CDA processes the early/late phase information and sends it to the PI [4,13].

CDA
As shown in the data path and CDA digital design of Figure 4, the sampler samples the 32 Gbps data and generates four data streams and four edge streams, each at 8 Gbps. The DMUX module then produces 16 aligned 2 Gbps data streams and 16 aligned edge streams (D[0:15] + E[0:15]) from lane L0. The phase detector (PD) module generates early/late/hold signals, the voter module processes them by voting, and the final early/late/hold signals are sent to the digital filter. Weight coding converts the eight-bit filter output into 35 control bits: P<0:2> is a Gray code that selects the quadrant, and Bit[0:31] is a thermometer code that controls the PI rotation. The MINI-PLL adjusts the phase of the PI so that the clock generated by the VCO samples the data at the best position. After the L0 channel is locked, the delay calibration applies a different delay to the MINI-PLL clock for each of the other channels so that every channel is matched.
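The rate conversions in this data path can be written out as a short worked calculation. It only restates the figures given in the text and assumes that every one of the 16 lanes runs at the same 32 Gb/s rate, which is consistent with the 512 Gbps aggregate quoted for the receiver.

```latex
\begin{aligned}
\text{sampler output (quarter rate):}\quad & \frac{32\ \text{Gb/s}}{4} = 8\ \text{Gb/s per stream},\\
\text{DMUX output (16 streams):}\quad & \frac{32\ \text{Gb/s}}{16} = 2\ \text{Gb/s per stream},\\
\text{aggregate over 16 lanes:}\quad & 16 \times 32\ \text{Gb/s} = 512\ \text{Gb/s}.
\end{aligned}
```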

Delay Matching δ
Differences in channel length and in chip PVT cause the signals to arrive at the receiving end at different times, so each channel needs its own matching delay. As shown in Figure 5, the MINI-PLL locks through the L0 channel and samples its data at the best position, and L1 to L15 each delay this clock with a δ block. Taking L1 as an example, while the transmitter sends PRBS31 for training, the PI in δ is swept through the control table (from 000000 to 111111), and the received data yield a bathtub curve. The control code X1X2X3X4X5X6 at the best position (Time = 0 ps) is looked up in the control table and applied to the PI. L1 to L15 use the same method, giving 15 control codes, one for the δ block of each lane.
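The training sweep described above can be sketched in a few lines. This is a minimal illustration, not the authors' calibration logic: the function name `best_delay_code`, the use of per-code error counts as the bathtub measurement, and the choice of the centre of the widest error-free window as the "best position" are all assumptions, and the sweep data below is synthetic.

```python
def best_delay_code(error_counts):
    """Return the centre code of the widest error-free window in a code sweep.

    error_counts[c] is the number of bit errors seen at PI control code c
    (hypothetical measurement; the paper only states that a bathtub curve
    is obtained). The window is assumed not to wrap around the code range.
    """
    best_start, best_len = None, 0
    start = None
    for code, errors in enumerate(error_counts):
        if errors == 0:
            if start is None:
                start = code          # a new error-free run begins here
            length = code - start + 1
            if length > best_len:
                best_start, best_len = start, length
        else:
            start = None              # the run is broken by an error
    if best_start is None:
        raise ValueError("no error-free code found in the sweep")
    return best_start + (best_len - 1) // 2


# Synthetic bathtub over the 64 six-bit codes: clean only for codes 20..40.
sweep = [1000 if (c < 20 or c > 40) else 0 for c in range(64)]
code = best_delay_code(sweep)
print(format(code, "06b"))  # six-bit control code X1..X6 for this lane
```

Running the same search once per lane from L1 to L15 would give the 15 per-lane control codes mentioned in the text.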

DMUX1:2
Figure 6a shows the structure of the DMUX. The DMUX input data rate is below 20 GHz and the clock frequency is below 10 GHz [14]. After the data pass through the 1/4 sampler, they become 10 GHz data. The clock frequency (5 GHz) is half the data rate. In the D1 path, Latch1 and Latch2 form a rising-edge-triggered flip-flop, and Latch2 and Latch3 form a falling-edge-triggered flip-flop, so the data are sampled on the rising edge of the clock and output on the falling edge. In the Dout2 path, Latch4 and Latch5 form a falling-edge-triggered flip-flop, so the data are both sampled and output on the falling edge of the clock. Figure 6b shows the timing diagram of the output.

Digital Model
The CDA adopts an all-digital structure, shown in full in Figure 7. The 2 Gbps data D[0:15] and edge information E[0:15] are divided into four groups, each containing four bang-bang phase detectors (BBPDs). After the phase detection results of each group pass through the voter, a signed digital signal (−1 to +1) is formed: "+" represents a lead, "−" represents a lag, and "0" is a hold. An adder combines the four group signals into a new signed digital signal (−4 to +4). The bandwidth controller (BW) then selects the appropriate bandwidth, amplifying the signal or passing it unchanged, and generates the unsigned eight-bit lead/lag information. Weight coding converts the filter output into a thermometer code that controls the PI rotation.
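The detector, voter, and adder chain can be sketched behaviourally as follows. This is a hedged illustration only: the Alexander-style decision rule in `bbpd`, the extra look-back data bit (17 data samples for 16 detectors), the sign convention of the update, and the names `bbpd`, `vote`, `phase_error`, and `update_code` are assumptions that do not come from the paper.

```python
def bbpd(d_prev, edge, d_next):
    """Bang-bang decision: +1 = clock leads, -1 = clock lags, 0 = hold."""
    if d_prev == d_next:
        return 0                       # no data transition, no information
    # With a transition present, the edge sample equals one neighbour.
    return 1 if edge == d_prev else -1


def vote(decisions):
    """Majority vote of one group of four detectors, as a value in -1..+1."""
    total = sum(decisions)
    return (total > 0) - (total < 0)


def phase_error(data, edges):
    """Sum of the four group votes, a signed value in -4..+4.

    data holds 17 bits (assumed: the last bit of the previous word followed
    by the 16 current bits); edges[i] is sampled between data[i] and data[i+1].
    """
    if len(data) != 17 or len(edges) != 16:
        raise ValueError("expected 17 data bits and 16 edge bits")
    decisions = [bbpd(data[i], edges[i], data[i + 1]) for i in range(16)]
    groups = [decisions[i:i + 4] for i in range(0, 16, 4)]
    return sum(vote(group) for group in groups)


def update_code(code, error, gain=1):
    """Accumulate the error into an unsigned eight-bit code (wraps mod 256)."""
    return (code + gain * error) % 256


toggling = [i % 2 for i in range(17)]          # a transition on every bit
print(phase_error(toggling, toggling[:16]))    # every edge equals the earlier bit
print(phase_error(toggling, toggling[1:]))     # every edge equals the later bit
```

The `gain` argument stands in for the bandwidth controller: a larger gain moves the code faster per update, which corresponds to a wider loop bandwidth.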

Weight Coding
The eight-bit early/late information output by the digital filter is sent to the weight coding block, which forms a phase control code (from the upper three bits) and a thermometer control code (from the lower five bits).
In the initial state of the circuit, the control signal is 00000000, so its lower five bits are 00000 (indicating that the clock lags), and the corresponding output clock phase is taken to be 2π. As the value of the control signal increases, the clock lags further, so the output clock phase must be reduced for the clock to track the data. The output phase therefore rotates clockwise from the fourth quadrant, with the trend shown in Figure 8. When the phase crosses from one quadrant to the next, the five-bit signal changes abruptly. The lower five bits must therefore be processed before they are decoded; the trend after processing is shown in Figure 9. In the coding process, the phase control code and the thermometer code are designed separately. The upper three bits are converted into a Gray code, which is free of race hazards; the overall logic is shown in Table 1. The new five-bit signal A4A3A2A1A0 is then decoded, which avoids the race and hazard behavior of plain binary codes. The conversion relationship is shown in Table 2.
LC oscillators use passive components such as inductors and capacitors, which occupy a large area, complicate the process, and are difficult to redesign. Packaging and EMI issues also require further consideration. Ring oscillators achieve a high level of integration and good compatibility without additional processes or packaging. In addition, ring oscillators can provide multiphase clocks.
The RX-PLL in Figure 10 adopts a divide-by-four structure. The ring oscillator (number of stages N = 4) outputs eight phases to improve the rotation accuracy of the PI. The clock and data alignment (CDA) circuit detects eight-bit edge information, and weight coding generates three phase-code bits and 32 thermometer-code bits. When the clock generated by the phase-locked loop samples the data at the best position, the loop is locked and the output clock jitter is 2.4 ps.
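One way to picture the weight coding is the sketch below. It is an assumption-laden illustration rather than the logic of Tables 1 and 2, which are not reproduced here: the function name `weight_code` is hypothetical, and the "processing" of the lower five bits is assumed to be a fold that mirrors them (31 minus the value) in every odd sector, so that the thermometer code never jumps when the upper bits change.

```python
def weight_code(code):
    """Split an eight-bit filter output into (gray_sector, thermometer_bits).

    Assumed mapping: upper three bits -> Gray-coded sector select,
    lower five bits -> 32-bit thermometer code, mirrored in odd sectors.
    """
    if not 0 <= code <= 255:
        raise ValueError("code must fit in eight bits")
    sector = code >> 5                 # upper three bits
    fine = code & 0x1F                 # lower five bits
    gray = sector ^ (sector >> 1)      # standard binary-to-Gray conversion
    if sector & 1:
        fine = 31 - fine               # assumed fold, removes the abrupt jump
    thermometer = [1 if i < fine else 0 for i in range(32)]
    return gray, thermometer


gray, thermometer = weight_code(0b010_00011)
print(format(gray, "03b"), sum(thermometer))
```

With this fold, stepping the eight-bit code by one changes either a single thermometer bit or a single Gray bit, never several bits at once, which is the property the text attributes to the processed code.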

PFD
The structure of the phase-frequency detector (PFD) used in this paper is shown in Figure 11a. It consists mainly of two resettable D flip-flops, a delay module, and an AND gate. The reference clock f_ref and the feedback signal f_div pass through inverter buffers and drive the CLK inputs of the D flip-flops. When the two input signals differ in frequency or phase, QA_P and QB_P output high-level pulses and the charge pump operates. When the two input signals are synchronized, both QA_P and QB_P are low. QA_N and QB_N are the inverted outputs of QA_P and QB_P, respectively. Figure 11b shows the timing diagram of the PFD. When the reference clock f_ref leads, QA_P goes high before QB_P. Once both are high, the AND of QA_P and QB_P drives the Reset signal high; after a certain delay, Reset reaches the two D flip-flops and returns QA_P and QB_P to low. When the feedback signal f_div leads, QB_P goes high before QA_P and is finally returned to low by the Reset signal.
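The timing behaviour described above can be reproduced with a small event-driven model. It is a simplified sketch under stated assumptions: ideal edges, a fixed reset delay that is shorter than the spacing between input edges, and a hypothetical function name `pfd_pulses`; it is not derived from the circuit in Figure 11a.

```python
def pfd_pulses(ref_edges, div_edges, reset_delay=0.0):
    """Return (qa_high_time, qb_high_time) for lists of rising-edge times.

    QA_P is set by a reference edge and QB_P by a feedback edge; when both
    are high, both are cleared reset_delay later (assumed short compared
    with the edge spacing).
    """
    events = sorted([(t, "ref") for t in ref_edges] +
                    [(t, "div") for t in div_edges])
    qa = qb = False
    qa_rise = qb_rise = 0.0
    qa_time = qb_time = 0.0
    for t, source in events:
        if source == "ref" and not qa:
            qa, qa_rise = True, t
        elif source == "div" and not qb:
            qb, qb_rise = True, t
        if qa and qb:                      # AND gate fires the reset
            t_reset = t + reset_delay
            qa_time += t_reset - qa_rise
            qb_time += t_reset - qb_rise
            qa = qb = False
    return qa_time, qb_time


# Reference leads the feedback clock by 2 time units in each of 3 cycles.
up, down = pfd_pulses([0.0, 10.0, 20.0], [2.0, 12.0, 22.0], reset_delay=0.5)
print(up, down, up - down)
```

The difference between the two pulse widths is what the charge pump integrates; the common part, set by the reset delay, cancels when the inputs are aligned.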

Charge Pump
The charge pump uses differential inputs and differential outputs, which suppresses common-mode noise and prevents clock feedthrough. Its circuit structure is shown in Figure 12a. QA_P, QA_N, QB_P, and QB_N are the two pairs of phase-difference signals output by the PFD. M15, M16, M17, M18, M4, M5, and M6 form a current mirror. M7, M8, M9, and M10 are the four switch transistors; all are NMOS, which eliminates the mismatch caused by mixing PMOS and NMOS switch transistors and reduces current mismatch. M11, M12, M13, and M14 are connected between the switch transistors and the output terminals to isolate clock feedthrough. M1, M2, and M3 are kept on under the control of the common-mode voltage V_CM and the common-mode points VCP and VCN fed back from the filter. Figure 12b shows the differential signal (VDS = VCP − VCN) over time; at about 0.42 µs, the PLL locks and the clock samples the data at the best position.

VCO
The VCO block includes the control voltage regulation, the digital ring oscillator, and the level conversion circuit; I1 and I2 are digitally controlled current sources, as shown in Figure 13. The CML-to-CMOS module uses negative feedback to convert the ring oscillator output into a clock signal whose swing and duty cycle meet the requirements. Figure 14 shows the delay element of the ring oscillator used in this design. The fewer the ring oscillator stages (minimum two), the smaller the power consumption, delay, and area. The delay of each stage is the sum of the inverter and latch delays, and α is the ratio of the latch and inverter parameters; α is set to 0.7 to ensure that oscillation can start [15]. M1 and M2 form the inverter, and M5 and M6 form the latch. Adding M3 and M4 effectively increases the oscillation frequency of the ring oscillator.
The load changes the center frequency of the ring oscillator (6 GHz to 11 GHz) and the VCO gain (4 GHz/V to 8 GHz/V). Under different I1 and I2 settings, the desired VCO output frequency ranges are obtained, as shown in Figure 15: Figure 15a covers the 9.5 GHz to 11 GHz clock, Figure 15b the 7.5 GHz to 8.5 GHz clock, and Figure 15c the 6 GHz to 7 GHz clock.

PI
(1) Method
The PI is the key module in this design, and the linearity of its rotation directly affects the overall performance of the receiver. Monotonicity and linearity are the main concerns in the PI design. Ideally, the output phase is a proportional function of the control code:
ϕ_out = k_PI · n (1)
where k_PI is the gain of the PI and n is the control code. Equation (1) shows that, as n increases from 0 to N, the output phase increases from 0 to 2π. If k_PI is constant, the relationship between ϕ_out and n is monotonic and linear. However, the output of an actual PI is a nonlinear function of its sinusoidal inputs, Equation (2), and the phase and amplitude of the output are determined by A_1, A_2, and ϕ_d in Equation (3).
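For orientation, the textbook phasor-sum model of a two-input interpolator is given below. It is offered as a standard form under the assumption of ideal sinusoidal inputs of amplitudes A_1 and A_2 separated by ϕ_d, and it is not claimed to match the paper's own Equations (2) and (3) term by term; the symbols ω and A are introduced here for the angular frequency and the output amplitude.

```latex
\begin{aligned}
v_{out}(t) &= A_1 \sin(\omega t) + A_2 \sin(\omega t + \phi_d) = A \sin(\omega t + \phi_{out}),\\
A &= \sqrt{A_1^2 + A_2^2 + 2 A_1 A_2 \cos\phi_d},\\
\phi_{out} &= \arctan\!\left(\frac{A_2 \sin\phi_d}{A_1 + A_2 \cos\phi_d}\right).
\end{aligned}
```

Because ϕ_out depends on the weights through an arctangent rather than linearly, the deviation from Equation (1) grows with the spacing ϕ_d between the two input phases.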
In a traditional PI, ϕ_d is 90° and the range 0 to 2π is divided into four quadrants. Each quadrant is interpolated separately, which degrades the linearity of the phase interpolator. In this work, because the MINI-PLL generates eight equally spaced clock phases, the PI interpolates between 0° and 45°, which greatly improves the linearity. With ϕ_d = 45°, Equations (4) and (5) are obtained.
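The linearity benefit of the narrower interpolation range can be checked numerically with an idealized model. The sketch below assumes ideal sinusoidal inputs and linearly weighted amplitudes (A_2 = k/32, A_1 = 1 − k/32 for thermometer step k); the helper names `pi_phase_deg` and `worst_error_deg` are hypothetical, and the model ignores all circuit-level effects.

```python
import math


def pi_phase_deg(k, phi_d_deg, steps=32):
    """Output phase in degrees at thermometer step k for input spacing phi_d."""
    a2 = k / steps                     # weight of the later input phase
    a1 = 1.0 - a2                      # weight of the earlier input phase
    phi_d = math.radians(phi_d_deg)
    return math.degrees(math.atan2(a2 * math.sin(phi_d),
                                   a1 + a2 * math.cos(phi_d)))


def worst_error_deg(phi_d_deg, steps=32):
    """Largest deviation from the ideal straight line k * phi_d / steps."""
    return max(abs(pi_phase_deg(k, phi_d_deg, steps) - k * phi_d_deg / steps)
               for k in range(steps + 1))


err_45 = worst_error_deg(45.0)
err_90 = worst_error_deg(90.0)
print(f"worst-case deviation: {err_45:.2f} deg (45 deg) vs {err_90:.2f} deg (90 deg)")
```

In this idealized model the worst-case deviation stays well under 1° for ϕ_d = 45° and is several degrees for ϕ_d = 90°, and the curve crosses the ideal line at the midpoint of the range, which is the behaviour the text attributes to Figure 16.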
Because of the symmetry of the quadrants, the model is simulated for linearity in the first quadrant and compared with a traditional PI. Figure 16 shows the linearity analysis and comparison. Figure 16a plots the output phase against the thermometer code for ϕ_d = 45°. This model has an inflection point at 22.5°, and the curve closely follows the ideal straight line. The traditional phase interpolator (ϕ_d = 90°) has an inflection point at 45°, as shown in Figure 16b. The fitted curves in Figure 16c show that the phase interpolator with ϕ_d = 45° is closer to the ideal value. Therefore, the method of eight equal phases is much better than the traditional PI.
(2) Circuit
Figure 17a shows the structure of a traditional equivalent current source PI. The input transistors M1, M2, M3, and M4 are all the same size. The loads R1 and R2 are both equal to R. The inputs are two pairs of quadrature differential signals, v_IP and v_QP, and v_IN and v_QN (0°, 90°, 180°, and 270°). The PI interpolates between the phases of these two pairs of clocks to obtain a recovered clock whose phase lies between them. The phase of the recovered clock is adjusted by changing the tail currents of the two differential pairs.
Figure 18 shows the overall circuit design of the PI. The tail of each differential pair consists of 32 controllable current sources, and R1 and R2 serve as load resistors. The phase control input is P<0:2>, which comes from the weight coding in the CDA module described above. BIT0 to BIT31 are the thermometer code bits that control the PI within a 45° range. Together, the phase control code and the thermometer code divide 0° to 360° into 256 points, so the resolution of the PI is about 1.4°. The control bits SWITCH_0, SWITCH_45, SWITCH_90, SWITCH_135, SWITCH_180, SWITCH_225, SWITCH_270, and SWITCH_315, which steer the current-source current paths, are generated as shown in Figure 19. Selecting different branches outputs phases in different quadrants. For example, when the SWITCH_0 and SWITCH_45 branches are active, the phase interpolator works from 0° to 45°. During operation, only 32 switches are turned on at any time, so the total current of the PI is the same in every state. Figure 20 shows the phase of the circuit output under different control codes and corresponds to Figure 16. When ϕ_d = 90° (Figure 20a), the output phase steps are clearly unequal, whereas the improved PI (ϕ_d = 45°) produces equal steps, as shown in Figure 20b.
(3) DNL and INL
The output linearity determines the extra jitter introduced by the phase interpolator and is an important figure of merit of the PI. It is measured mainly by the differential nonlinearity (DNL) and the integral nonlinearity (INL). Equation (6) gives the minimum resolution, Equation (7) the DNL, and Equation (8) the INL [16].
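The usual way of computing these two metrics from a list of measured output phases is sketched below. It uses the common converter-style definitions, with one LSB equal to the full range divided by the number of codes, and is not claimed to reproduce the paper's Equations (6) to (8) exactly; the function name `dnl_inl` and the example phase lists are illustrative assumptions.

```python
def dnl_inl(phases_deg, full_scale_deg=360.0):
    """Return (lsb, dnl, inl), with DNL and INL expressed in LSB.

    phases_deg[n] is the output phase at control code n. The assumed
    definitions are DNL[n] = (phase[n+1] - phase[n] - LSB) / LSB and
    INL[n] = (phase[n] - phase[0] - n * LSB) / LSB.
    """
    count = len(phases_deg)
    if count < 2:
        raise ValueError("need at least two phase points")
    lsb = full_scale_deg / count
    dnl = [(phases_deg[n + 1] - phases_deg[n] - lsb) / lsb
           for n in range(count - 1)]
    inl = [(phases_deg[n] - phases_deg[0] - n * lsb) / lsb
           for n in range(count)]
    return lsb, dnl, inl


# Ideal 256-point interpolator, then the same with code 1 shifted by 0.5 LSB.
ideal = [n * 360.0 / 256 for n in range(256)]
lsb, dnl, inl = dnl_inl(ideal)
skewed = list(ideal)
skewed[1] += 0.5 * lsb
_, dnl_skewed, inl_skewed = dnl_inl(skewed)
print(lsb, max(abs(x) for x in dnl), dnl_skewed[0], inl_skewed[1])
```

A single displaced code shows up as a pair of opposite DNL errors on either side of it and as one isolated INL error, which is why INL is the more direct indicator of the phase error seen by the sampler.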
Figure 21 shows the DNL and INL curves as the control code N goes from 0 to 256 (0° to 360°) at the typical process corner and 27 °C. Table 3 compares three types of PI. Compared with [17,18], the INL is significantly improved: the power consumption increases, but both the accuracy and the linearity are greatly improved. Therefore, the improved PI is used in the MINI-PLL to obtain a high-quality clock signal, while the δ block of each channel still uses the traditional PI to keep the power consumption low, because the amount of phase rotation there is determined by δ. In this way, high performance and low power consumption are better balanced. The power consumption of the PI in this work can be amortized across the channels. When the receiver clock samples the data at the best position, the two stabilized PI output clocks in the MINI-PLL are as shown in Figure 22. The traditional PI (Figure 22a) has a jitter of 6 ps, whereas the improved PI (Figure 22b) has a jitter of only 2.4 ps. The improvement in accuracy and linearity therefore improves the clock performance.

Results
The transceiver supports a D2D interconnect with a channel length of 50 mm. Figure 23 shows the circuit layout of the transceiver system, which operates at 512 Gbps. The central part, which carries the clock path, contains the MINI-PLL, all-digital CDA, and PI layouts designed in this work. The 16 pairs of differential data paths are placed on the upper and lower sides of the overall layout. The transceiver system uses a nine-metal-layer (9ML) 28 nm CMOS technology, and the receiver occupies a silicon area of 0.9 mm × 2 mm. The power breakdown of the receiver is shown in Figure 23b. In the receiver, the analog part (CTLE, 1/4 sampler circuit, PI, and clock) consumes 137 mW (61%). The digital part (21%), mainly the 4:16 DMUX and the 16:64 MUX, consumes 45 mW in total. The phase-locked loop of the receiver uses a forward clock structure, and its power consumption is only 36 mW (8%). The transceivers share a bias circuit that consumes 14 mW, so half of this is attributed to the receiver (3%). Other circuitry used for the clock consumes 16 mW (7%).
At 125 °C and the tt corner, the power consumption of the receiver is measured to be 0.44 pJ/b (the overall system power consumption is 1.56 pJ/b). Figure 24 shows that the VCO outputs eight equally spaced clock phases, with a phase spacing of 15.6 ps and a jitter of 2.4 ps. Table 4 compares this work with previous work. Compared with designs that use the same DS signaling [19,20], the power consumption is reduced. Compared with SES transmission [8,21], this work stands out with a bidirectional bandwidth density of up to 284 Gb/s/mm. Because the BER is less than 10^−15 at a channel loss of 10 dB, the receiver can support standard-packaging D2D interconnects at 512 Gbps.
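The energy-per-bit and phase-spacing figures quoted in this section can be cross-checked with simple arithmetic. The sketch below is only a consistency check on the stated numbers; it assumes 16 lanes at 32 Gb/s each and a quarter-rate sampling clock (an inference from the 1/4 sampler, not a value stated in the text), and the variable names are illustrative.

```python
LANES = 16
LANE_RATE_BPS = 32e9                      # per-lane data rate from the CDA section
aggregate_bps = LANES * LANE_RATE_BPS     # 512 Gb/s aggregate

# Receiver power implied by the quoted 0.44 pJ/b figure.
rx_power_w = 0.44e-12 * aggregate_bps

# Receiver power implied by the analog share (137 mW stated as 61%).
rx_power_from_share_w = 0.137 / 0.61

# Spacing of eight equal phases of an assumed quarter-rate clock.
clock_hz = LANE_RATE_BPS / 4
phase_step_ps = 1e12 / clock_hz / 8

print(f"{rx_power_w * 1e3:.0f} mW, {rx_power_from_share_w * 1e3:.0f} mW, "
      f"{phase_step_ps:.3f} ps")
```

Both power estimates land near 225 mW, and the computed phase spacing rounds to the 15.6 ps reported for Figure 24.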

Conclusions
A 16-lane receiver with a bandwidth of 512 Gbps is designed in a 28 nm CMOS process, providing a solution to the bandwidth density and power consumption problems of standard-packaging D2D interconnects. The overall link power consumption of this solution is 1.56 pJ/b. The BER is less than 10^−15 with real S-parameters, a channel loss of 10 dB at 16 GHz, and a channel length of 50 mm.

Conflicts of Interest:
The authors declare no conflict of interest.