A 16 Gbps, Full-Duplex Transceiver over Lossy On-Chip Interconnects in 28 nm CMOS Technology

Arash Ebrahimi Jarihani 1,2,*, Sahar Sarafi 1, Michael Koeberle 1, Johannes Sturm 1 and Andrea M. Tonello 2
1 Department of Engineering & IT, Carinthia University of Applied Sciences, 9524 Villach, Austria; s.sarafi@fh-kaernten.at (S.S.); m.koeberle@fh-kaernten.at (M.K.); j.sturm@fh-kaernten.at (J.S.)
2 Institute of Networked and Embedded Systems, Klagenfurt University, 9020 Klagenfurt, Austria; andrea.tonello@aau.at
* Correspondence: arasheb@edu.aau.at


Introduction
With recent CMOS technologies, device sizes have been scaled down while the computational speed of VLSI systems has increased. However, the length of the global on-chip interconnects has remained almost the same. Moreover, because scaling reduces the width of global interconnects, their electrical resistance has increased. In other words, on-chip interconnects are RC dominated and very lossy, which decreases the overall bandwidth and the energy efficiency of the circuitry. Therefore, on-chip interconnects are becoming a power, speed and reliability bottleneck in sub-micron technologies [1] and significantly affect the performance of a network on chip (NoC).
Additional complexity is introduced in on-chip signaling when simultaneous bidirectional, multipoint-to-multipoint and parallel communication is needed. Such interconnects are very common, for example, in on-chip buses that connect different parts of a multi-core processor chip or system on a chip (SoC), and in global address or data lines for memories. The resistance and delay of interconnects can be reduced by using wider cross-sectional dimensions; however, the area occupied by the interconnect and its capacitance then increase, which leads to lower data rates and higher energy consumption [2,3].
[...] design to avoid significant self-interference components. In this paper, we propose an alternative full-duplex transceiver (FDT) architecture that is capable of reaching 16 Gbps transmission over an on-chip interconnect.
In more detail, the major contributions reported in this paper are the following. A high-speed, power-efficient transceiver is proposed for full-duplex signaling over on-chip interconnects. The link utilization is increased by a factor of two, thus halving the link area with respect to unidirectional transmission. To achieve full-duplex transmission, a transistor is used as a hybrid device that separates the inbound and outbound signals and performs echo cancellation with the help of a main and an auxiliary driver. Moreover, the hybrid device provides impedance matching at both ends of the interconnect, which eliminates the need for a passive termination to achieve higher bandwidth and therefore higher data rates.
Finally, the performance of the designed FDT is evaluated through post-layout simulations in the TSMC 28 nm standard process over a global on-chip interconnect with a typical length of 5 mm [34].
The rest of the paper is organized as follows. Section 2 presents the proposed FDT architecture. Section 3 discusses the circuit design. Section 4 presents the post-layout simulation results followed by the conclusion in Section 5.

Proposed Full-Duplex Transceiver Architecture
Full-duplex transceivers use bidirectional signaling to double the data rate. This requires a combination of transmitting and receiving units at both ends of the interconnect. Full-duplex transmission generates self-interference because the transmitted and received signals are superimposed. To extract the received data from the superimposed signal, a hybrid structure or echo-cancellation circuitry is required to separate the inbound ($V_{ib}$) and the outbound ($V_{ob}$) signals from each other. Therefore, a new topology is proposed, together with a detailed analysis of how it performs echo cancellation and simultaneous bidirectional signaling. A conceptual block diagram of the proposed FDT is shown in Figure 1. The proposed FDT consists of main and auxiliary drivers, a hybrid transistor ($M_{hybrid}$), and a trans-impedance amplifier (TIA). The hybrid transistor plays an important role in separating the inbound signal from the superimposed signal at the interconnect end.
By virtue of its transconductance ($g_m$), $M_{hybrid}$ converts changes in its source-gate voltage ($V_{sg}$) into a small-signal drain current [35]. Therefore, if identical signals are applied to the gate and the source of $M_{hybrid}$ ($V_g = V_{ob}$), the transmitted signal produces no variation in $V_{sg}$.
Consequently, the drain current of the hybrid device is mainly a function of the received signal ($V_{ib}$) from the second transmitter on the other side of the interconnect, which is sensed on $V_{sg}$. The sensed inbound voltage signal ($V_{ib}$) is then converted to the current $i_{rx}$ by the $g_m$ of the hybrid device.
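The sensing mechanism described above can be summarized in an idealized small-signal sketch. It assumes that the source of the hybrid device carries the plain superposition of outbound and inbound signals and that the gate replica is perfect; $v_s$ is a symbol introduced here only for illustration, and sign conventions and interconnect loss are ignored.

$$
\begin{aligned}
v_s &= V_{ob} + V_{ib}, \qquad V_g = V_{ob},\\
V_{sg} &= v_s - V_g = V_{ib},\\
i_{rx} &= g_m V_{sg} = g_m V_{ib}.
\end{aligned}
$$

Under these assumptions the outbound component drops out of $V_{sg}$ entirely; any mismatch between $V_g$ and the outbound part of $v_s$ appears as an echo term added to $i_{rx}$.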
The gate signal ($V_g$) of $M_{hybrid}$ is generated by the main driver. However, $M_{hybrid}$ operates as a source follower (SF) in the transmitting function of the FDT. Therefore, the signal generated by the main driver at the source of $M_{hybrid}$ is $V_{ob,M} = \alpha V_g$, where $\alpha = A_{v(SF)} < 1$. To perform echo cancellation, the source and the gate voltages of $M_{hybrid}$ must be in phase and have equal amplitude. For this purpose, an auxiliary driver is employed. It generates the signal $V_{ob,A} = \beta V_{ob}$ as a function of its steering current and the impedance seen by its output. Mathematically, $V_{ob}$ at both ends of the interconnect is the superposition of the two contributions:

$$V_{ob} = V_{ob,M} + V_{ob,A} = \alpha V_g + \beta V_{ob}$$

Ideally, $\alpha + \beta$ should be equal to 1, so that $V_{ob} = V_g$ and the echo is zero. In this way, the main and the auxiliary drivers together generate the outbound signal $V_{ob}$. The received signal may still contain an unwanted outbound component $V_{echo}$ as uncancelled echo, which is basically due to mismatches between the $V_{ob}$ and $V_g$ signals. Finally, the TIA amplifies the received current signal ($i_{rx}$) and converts it to a voltage to be used by the comparator and data recovery circuitry.
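The cancellation condition can be checked numerically. The following is a minimal sketch that only uses the relations stated above ($V_{ob,M} = \alpha V_g$ and $V_{ob,A} = \beta V_{ob}$); the function names and the example values of $\alpha$ and $\beta$ are hypothetical and not taken from the design.

```python
def outbound_gain(alpha: float, beta: float) -> float:
    """Return V_ob / V_g obtained by solving V_ob = alpha*V_g + beta*V_ob."""
    if beta >= 1.0:
        raise ValueError("beta must be smaller than 1")
    return alpha / (1.0 - beta)


def residual_echo(alpha: float, beta: float, v_g: float = 1.0) -> float:
    """Outbound component left on V_sg = V_ob - V_g (zero means perfect cancellation)."""
    return (outbound_gain(alpha, beta) - 1.0) * v_g


# Hypothetical values: a source-follower gain of 0.8 needs beta = 0.2.
ideal = residual_echo(0.8, 0.2)
# An auxiliary driver that is slightly too weak leaves a residual echo.
mismatched = residual_echo(0.8, 0.15)
print(f"ideal: {ideal:.4f} V, mismatched: {mismatched:.4f} V per volt of V_g")
```

With $\alpha + \beta = 1$ the residual vanishes; otherwise it scales with $\alpha/(1-\beta) - 1$, which is the quantity that tuning the auxiliary driver is meant to drive toward zero.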

Analysis and Circuit Design of the Full-Duplex Transceiver
A transistor-level implementation of the transceiver is shown in Figure 2. For simplicity, a single-ended schematic is shown. For analysis purposes, the FDT is divided into three parts: the hybrid, the main driver and the auxiliary driver.
The major part of the transmitted signal is generated by the main driver. This driver is designed together with the hybrid part so that both experience similar process variations. The portion of the transmitted signal generated by the main driver ($V_{ob,M}$) can be calculated by the following equations: [...]
where $V_{g,B}$ is the gate voltage generated at node B by the main driver, $i_{tx,M}$ is the current steered by the main driver, and $R_{int}$ is the resistance of the interconnect.
The auxiliary driver is a simple differential pair comprising the devices $M_3$, $M_4$ and $M_5$. Transistors $M_4$ and $M_5$ have two operating phases: depending on the pseudo-random binary sequence (PRBS) data applied to them, they are either ON or OFF. In the ON state, a transistor operates in the saturation region, which gives the driver a higher output impedance. Moreover, $M_3$ is used to adjust the amplitude of the transmitted signal and minimize the uncancelled echo. The minor portion of the transmitted signal, which is generated by the auxiliary driver, can be expressed as [...]
The hybrid part includes the devices $M_1$ and $M_2$ and a resistor. Transistor $M_1$ is the hybrid device, while $M_2$ and the resistor ($R$) are used to bias it.
The characteristics of global on-chip interconnects are different from those of off-chip transmission lines. For on-chip interconnects, impedance matching to the characteristic impedance ($Z_o$) of the interconnect is not necessary. However, the bandwidth of bidirectional signaling can be improved by deploying a low-impedance termination at both ends of the interconnect [12]. Therefore, impedance matching is realized by the impedance seen looking into the source of the hybrid device ($\approx 1/g_m$).
Matching the signals at nodes A and B for optimum echo cancellation: The on-chip interconnects are RC dominated. The dominant low-frequency pole at node A can be written as [...]
The capacitance $C_{int}$ is approximately 1.05 pF (for the 5 mm long on-chip interconnect used [34]), and the total junction capacitance of $M_1$, $M_2$ and $M_5$ is approximately 60 fF. Therefore, the influence of the junction capacitances can be neglected at low frequencies.
Likewise, there are two poles in the main driver, at nodes C and B, which are related to the internal capacitances of the transistors connected to these nodes. These are high-frequency poles ($f_{-3dB,C} > 40$ GHz). However, to have similar signals at nodes A and B and hence a lower $V_{echo}$, the capacitive effect of the interconnect should be replicated at node B. This can be done by fulfilling the condition $\omega_{p,A} = \omega_{p,B}$. Therefore, a compensation capacitance ($C_c$) is added to node B to create a dominant low-frequency pole. Thus, the dominant pole at node B is

$$\omega_{p,B} = \frac{1}{R_B C_B},$$

where $R_B = r_{o1} \| R$ and $C_B = C_c$.
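The pole-matching condition can be illustrated with a lumped first-order sketch. Only $C_{int} \approx 1.05$ pF and the 60 fF junction capacitance come from the text; the resistances `R_A` and `R_B` below are hypothetical placeholders, and treating node A as a single RC pole is a simplifying assumption for a distributed interconnect.

```python
import math


def pole_frequency_hz(r_ohm: float, c_farad: float) -> float:
    """Frequency of a first-order RC pole, f = 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def compensation_capacitance(r_a: float, c_a: float, r_b: float) -> float:
    """C_c that makes 1/(R_B*C_c) equal to 1/(R_A*C_A)."""
    return r_a * c_a / r_b


C_INT = 1.05e-12      # interconnect capacitance from the text (F)
C_JUNCTION = 60e-15   # total junction capacitance from the text (F)
R_A = 100.0           # hypothetical resistance seen at node A (ohm)
R_B = 500.0           # hypothetical r_o1 || R at node B (ohm)

# Junction capacitance relative to the interconnect capacitance (about 5.7 %).
junction_ratio = C_JUNCTION / C_INT

c_c = compensation_capacitance(R_A, C_INT, R_B)
f_a = pole_frequency_hz(R_A, C_INT)
f_b = pole_frequency_hz(R_B, c_c)
print(f"junction/interconnect capacitance: {junction_ratio:.3f}")
print(f"C_c = {c_c * 1e15:.0f} fF, f_pA = {f_a / 1e9:.2f} GHz, f_pB = {f_b / 1e9:.2f} GHz")
```

In the paper the final value of $C_c$ is optimized by simulation for both magnitude and phase, so a hand calculation of this kind would only provide a starting point.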
The magnitude and phase responses at nodes A and B have been obtained by simulation and are reported in Figure 3. Figure 3a,b show the responses before and after adding $C_c$ to the main driver output, respectively. The capacitance $C_c$ is added to match not only the magnitude response but also the phase response at nodes A and B, since a phase mismatch can be critical for time-domain echo cancellation. The compensation capacitance is optimized for minimum magnitude and phase differences up to 5 GHz. Thus, the structure is able to support a unidirectional data rate of 8 Gbps and, in turn, 16 Gbps full-duplex signaling. Figure 4 shows that after adding $C_c$, the signal at node A follows the signal at node B with less error during transitions (similar rise/fall times).
However, some spikes remain during the transition (switching) time, which we refer to as uncancelled echo. Because of the propagation delay of the transmission line, these spikes appear in the middle of the data received from the other side. To minimize the peak magnitude of the spikes, the capacitance $C$ is used to realize a filter at the output of the hybrid (TIA input). Figure 5 shows the received current and the effect of the capacitances $C_c$ and $C$.
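The link between the 5 GHz matching band and the quoted data rates can be made explicit with a short calculation. It assumes binary NRZ signaling, which the text does not state explicitly; $R_b$, $f_N$, $f_{match}$ and $R_{FD}$ are symbols introduced here for illustration.

$$
\begin{aligned}
f_N &= \frac{R_b}{2} = \frac{8\ \text{Gbps}}{2} = 4\ \text{GHz} \le f_{match} = 5\ \text{GHz},\\
R_{FD} &= 2 R_b = 2 \times 8\ \text{Gbps} = 16\ \text{Gbps}.
\end{aligned}
$$

Under this assumption the matched band covers the Nyquist frequency of each direction with some margin, and the aggregate rate follows from both directions sharing the same interconnect.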
Circuit simulations have been performed with the compensation capacitance varied over a range of ±30%. They show that the variations in the vertical and horizontal eye openings are less than ±10% and ±5%, respectively.
After the inbound signal has been sensed and converted to a current by the hybrid device, a TIA is used to amplify the received current signal. The input impedance and bandwidth of the TIA change due to the capacitance $C$. Simulations have been performed to verify the effect of different values of $C$, and the results are plotted in Figure 6. The peak value of the TIA input impedance and its bandwidth are reduced. However, the variation of the input impedance is negligible, and even in the worst case the bandwidth ($f_{-3dB}$) is higher than the required Nyquist frequency. Therefore, the capacitance does not change the operation of the transceiver at frequencies below 6 GHz and thus does not affect the achievable data rates.

Post-Layout Simulation Performance
In this section, the most important performance indicators of the proposed on-chip full-duplex transceiver (e.g., eye opening, BER, power consumption, and robustness against PVT variations and process corners) are analyzed. For this purpose, the circuit of the proposed system has been designed and laid out. Figure 7 shows the layout of the FDT operating at a data rate of 16 Gbps, which occupies an area of 51 µm × 31 µm. To verify the performance of the proposed FDT, post-layout simulations have been performed in the TSMC 28 nm CMOS process. The functionality of the FDT is validated over an on-chip interconnect with a length of 5 mm [34]. The link has a width of 0.6 µm and a spacing of 1 µm to the adjacent shield layers.

For this purpose, pseudo-random bit patterns of length $2^7 - 1$ have been generated using an on-chip PRBS generator [36] and sent to the drivers. Full-duplex operation is realized by transmitting data streams from the FDTs at both ends of the interconnect. As a result, a maximum data rate of 16 Gbps is achieved (8 Gbps from each side, with a bit period of 125 ps). The differential voltage eye diagrams of the received data are shown in Figure 8 for the half-duplex and full-duplex signaling modes.
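For readers who want to reproduce a comparable stimulus in a behavioral model, a $2^7 - 1$ pattern can be generated with a 7-bit linear-feedback shift register. The sketch below assumes the commonly used polynomial $x^7 + x^6 + 1$; the polynomial and seed of the on-chip generator in [36] are not given in the text, so both are assumptions, and `prbs7` is a hypothetical helper name.

```python
def prbs7(seed: int = 0x7F, nbits: int = 127) -> list[int]:
    """Generate nbits of a PRBS7 sequence (assumed polynomial x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero, otherwise the LFSR locks up")
    bits = []
    for _ in range(nbits):
        # Feedback is the XOR of the two most significant register bits.
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        bits.append(new_bit)
        # Shift left and insert the feedback bit, keeping 7 bits of state.
        state = ((state << 1) | new_bit) & 0x7F
    return bits


one_period = prbs7()
two_periods = prbs7(nbits=254)
print(len(one_period), sum(one_period))
print(two_periods[:127] == two_periods[127:])
```

A maximal-length 7-bit sequence repeats every 127 bits and contains 64 ones per period. Independent seeds for the two ends of the link would keep the inbound and outbound streams uncorrelated in such a model.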
The plotted eye (full-duplex) has vertical and horizontal openings of 168 mV and 87 ps, respectively. The peak-to-peak and RMS jitter of the eyes for both transceivers are approximately 38.6 and 16.4 ps, respectively. The computed power consumption for each transceiver at 0.9 V supply voltage is 8.827 mW. Therefore, the FDT has a power efficiency of 0.11 pJ/b/mm (0.16 pJ/b/mm including TIA).
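The quoted efficiency can be reproduced from the numbers above. The sketch assumes the normalization implied by the figures, namely the per-transceiver power divided by the aggregate 16 Gbps rate and the 5 mm link length; the function name is hypothetical, and the power including the TIA is back-calculated here rather than stated in the text.

```python
def energy_per_bit_per_mm(power_w: float, data_rate_bps: float, length_mm: float) -> float:
    """Energy efficiency in pJ/b/mm."""
    return power_w / data_rate_bps / length_mm * 1e12


POWER_W = 8.827e-3    # power per transceiver from the text (W)
DATA_RATE_BPS = 16e9  # aggregate full-duplex data rate (b/s)
LENGTH_MM = 5.0       # interconnect length (mm)

efficiency = energy_per_bit_per_mm(POWER_W, DATA_RATE_BPS, LENGTH_MM)
# Power implied by the rounded 0.16 pJ/b/mm figure that includes the TIA (approximate).
implied_power_with_tia_mw = 0.16e-12 * DATA_RATE_BPS * LENGTH_MM * 1e3
print(f"{efficiency:.4f} pJ/b/mm")                     # 0.1103 pJ/b/mm
print(f"about {implied_power_with_tia_mw:.1f} mW including the TIA")
```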
In addition, the bit error rate (BER) performance has been evaluated. Figure 9 shows the BER bathtub curves (in log scale) for 16 Gbps full-duplex operation. The corresponding timing bathtub curve has an opening of 0.34 UI at a BER of $10^{-12}$.
The unwanted echo signal can increase because of mismatches between the main and the auxiliary drivers caused by process-voltage-temperature (PVT) variations. To assess this effect, PVT variations are applied to the FDT in 2 × 8 Gbps full-duplex operation. The variation of both the vertical and the horizontal eye opening is observed while changing the supply voltage from 0.8 to 1 V. As can be seen from Figure 10, the vertical eye opening decreases with decreasing supply voltage and vice versa. Lowering the supply voltage to 0.8 V decreases the horizontal eye opening by only approximately 11%. Moreover, the performance of the FDT is validated under temperature variations. Increasing the temperature from room temperature to 100 °C decreases the eye opening by 22%, while lowering it to −20 °C increases the eye opening by 9%. However, the horizontal opening remains almost constant over the whole range from −20 °C to 100 °C.
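The timing figures quoted for the eye diagram and the bathtub curve can be put on a common scale by converting between picoseconds and unit intervals. This is a minimal sketch that assumes one unit interval equals the 125 ps bit period of a single 8 Gbps direction; the function name is hypothetical.

```python
def unit_interval_ps(rate_gbps: float) -> float:
    """Bit period in picoseconds for a given per-direction data rate in Gbps."""
    return 1000.0 / rate_gbps


ui_ps = unit_interval_ps(8.0)     # 125 ps per bit at 8 Gbps
eye_width_ui = 87.0 / ui_ps       # horizontal eye opening of 87 ps, about 0.70 UI
bathtub_ps = 0.34 * ui_ps         # 0.34 UI at BER 1e-12, about 42.5 ps
print(f"UI = {ui_ps:.0f} ps, eye width = {eye_width_ui:.3f} UI, bathtub opening = {bathtub_ps:.1f} ps")
```

The bathtub opening is smaller than the visible eye width because it is evaluated at a BER of $10^{-12}$, which accounts for jitter tails that a short eye-diagram capture does not show.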
The variations of the vertical and horizontal eye openings for different process corners are plotted in Figure 11. The worst-case performance occurs in the slow-slow (SS) and fast-fast (FF) corners. Nevertheless, the eye in these corners is open enough to allow reliable data detection. The mismatch between the voltage swings at nodes A and B caused by process variation at the SS and FF corners can be compensated by tuning the current of the auxiliary driver, which minimizes the echo voltage and maximizes the eye opening.
Finally, a comparison of the proposed system with the on-chip full-duplex transceivers reported in the literature is summarized in Table 1. Some details about these solutions have been given in the introduction. The comparison is made in terms of technology node, supply voltage, interconnect length, offered data rate, energy efficiency, and occupied area. The results show that the proposed solution offers the highest data rate among the compared designs, 16 Gbps, with competitive power consumption. The solution described in [32] has better energy efficiency; however, its data rate is lower, at 10 Gbps. For the design reported in [29], the achieved data rate is expected to degrade if longer on-chip interconnects are used. Overall, the proposed FDT solution offers better overall performance than the state of the art.

Conclusions
In this paper, a high-speed full-duplex transceiver for simultaneously transmitting and receiving data across an on-chip interconnect is presented. A hybrid device is used to separate the inbound and the outbound signals from each other and to perform echo cancellation in combination with the main and the auxiliary drivers. Moreover, the hybrid transistor serves as an active low-impedance termination, which helps to improve the overall bandwidth of the transceiver. The proposed FDT architecture has been designed in a 28 nm, 0.9 V CMOS process. The transceiver has a power efficiency of 0.16 pJ/b/mm at a data rate of 16 Gbps while performing simultaneous bidirectional transmission. The transceiver performance has been assessed through post-layout simulations over a 5 mm long on-chip interconnect, under process, voltage and temperature variations. The echo cancellation is robust to the introduced PVT variations.