Supply-Scalable High-Speed I/O Interfaces

Abstract: Improving the energy efficiency of computer communication is becoming increasingly important as the world creates massive amounts of data, while the interface remains a bottleneck because of the finite bandwidth of electrical wires. Introducing supply voltage scalability is expected to significantly improve the energy efficiency of communication input/output (I/O) interfaces and to let the I/Os adapt efficiently to actual utilization. However, many challenges must be addressed before truly supply-scalable I/O can be realized. This paper reviews the motivations, background theories, design considerations, and challenges of scalable I/Os from the viewpoint of computer architecture down to the transistor level. Thereafter, a survey of state-of-the-art fabricated designs is discussed.


Introduction
Nowadays, there is a huge demand for a smarter world that improves human convenience and happiness (e.g., manufacturing, smart cities, autonomous vehicles, and security). Realizing it inevitably requires creating, replicating, and processing a tremendous amount of data [1]. For example, ref. [2] forecasts that the amount of data will increase by more than 3× in five years. Handling such explosive data growth places a large burden on computer communications in the current computer architecture [3]. As a result, high-speed input/output (I/O) standards used for computer communications are evolving very rapidly, and new I/O standards are also being introduced to address various needs. In accordance with those trends, multi-standard I/O transceivers (transmitter and receiver) are receiving considerable attention from industry because they give IC products considerable flexibility [4][5][6][7][8][9][10][11]. To support multiple standards with a single design, a transceiver needs to be flexible across a variety of specifications, especially across a wide range of operating data rates. In addition, even when only a single I/O standard is targeted, many standards (for example, PCI Express, USB, and serial ATA) require backward compatibility with their own legacy generations; thus, wide-range operation is also very important [12,13].
However, implementing such wide-range capability places a large burden on the I/O design. Before examining that burden, we briefly review the overall I/O transceiver architecture and how it works. As shown in the overall block diagram in Figure 1, the I/O transceiver is composed of three main parts: clock generation, a transmitter (TX), and a receiver (RX) [14]. In the clock generation, a phase-locked loop (PLL) is generally used to generate a high-frequency clock for the high-speed I/O from a low-frequency reference clock. These days, a half-rate clock (e.g., 14 GHz for a 28 Gb/s NRZ datastream, also referred to as a double-data rate (DDR) clock) or a quarter-rate clock (e.g., 7 GHz for 28 Gb/s NRZ, also referred to as a quarter-data rate (QDR) clock) is generally used. The clock generation circuitry can be shared across multiple transmitter and receiver lanes to amortize the area and power consumption [15][16][17]. The high-frequency clock is distributed to the individual transmitters and receivers with a clock buffer chain. Depending on the macro floorplan, the distribution distance can be a few hundred µm or even longer, so a long chain of buffers is needed and the distribution circuit consumes a significant amount of power. In addition, the high-speed clock may experience duty-cycle distortion and phase skew across the chain because of inherent device mismatch and duty-cycle error amplification [18,19]. As a result, a duty-cycle correction (DCC) circuit or a quadrature error correction (QEC) circuit is frequently used at the transmit side to correct such distortions for a better eye opening at the transmitter output. Based on the timing information from the corrected clock, the serializer converts parallel input data into a high-speed bitstream.
The serialized data stream is transmitted through the transmission line driven by the TX output driver, whose output impedance should match the characteristic impedance of the transmission line for signal integrity. To compensate for the inter-symbol interference (ISI) caused by the channel loss at high frequency, a feed-forward equalizer (FFE), which combines multiple adjacent bits to cancel out the ISI effect, is frequently implemented at the transmit side. To adapt to various channel conditions and receiver sensitivities, the output voltage swing is configurable. On the receiver side, the received data is usually highly distorted by the ISI due to the limited channel bandwidth. Therefore, a continuous-time linear equalizer that boosts high-frequency gain is generally placed at the front-end of the receiver. RX slicers decide whether the received analog signal is '1' or '0' by sampling the signal at the instant when the signal-to-noise ratio (SNR) is maximized. A variable-gain amplifier (VGA) can be placed before the slicers to amplify the input signal for better sensitivity. Since it is important to sample the received signal when the SNR is maximized, finding the optimum sampling timing is one of the most important tasks of the receiver. The clock and data recovery (CDR) circuit collects such timing information from edge slicers (bang-bang CDR) or error slicers (baud-rate CDR), and based on the collected information, it adjusts the sampling timing by using a phase interpolator (PI) [20]. A multi-phase generation circuit such as a delay-locked loop (DLL) or an injection-locked oscillator (ILO) is widely used to provide a sufficient number of phases for the PI. In addition, a decision-feedback equalizer (DFE) is used to compensate the residual ISI. The basic concept of the DFE is to use the previously received bits to cancel the ISI they impose on the currently received bit before the decision is made.
Unlike the FFE or the continuous-time linear equalizer (CTLE), the DFE does not degrade the SNR, so it has become an essential building block of the receiver. Finally, the recovered data is de-serialized into a parallel stream.
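The DFE concept described above can be illustrated with a minimal, noise-free sketch. The two-tap channel, its tap values (`h0`, `h1`), the test pattern, and all function names are assumptions chosen for illustration and are not taken from any of the surveyed designs. The post-cursor tap is deliberately made larger than the main cursor so that a plain slicer fails on every data transition.

```python
def channel(symbols, h0=1.0, h1=1.2):
    """Hypothetical two-tap channel: main cursor h0 plus one post-cursor tap h1."""
    received, previous = [], 0.0
    for s in symbols:
        received.append(h0 * s + h1 * previous)  # ISI from the previous symbol
        previous = s
    return received


def slicer(sample):
    """Decide +1 or -1 from the sign of the sampled voltage."""
    return 1 if sample >= 0 else -1


def receive_plain(samples):
    """Slice every sample directly, with no equalization."""
    return [slicer(x) for x in samples]


def receive_dfe(samples, h1_estimate=1.2):
    """1-tap DFE: subtract the ISI predicted from the previous decision."""
    decisions, previous_decision = [], 0
    for x in samples:
        d = slicer(x - h1_estimate * previous_decision)
        decisions.append(d)
        previous_decision = d
    return decisions


tx = [1, -1, 1, 1, -1, -1, 1, -1]  # arbitrary NRZ pattern (+1 / -1)
rx = channel(tx)
plain_errors = sum(a != b for a, b in zip(receive_plain(rx), tx))
dfe_errors = sum(a != b for a, b in zip(receive_dfe(rx), tx))
print("bit errors without DFE:", plain_errors)
print("bit errors with 1-tap DFE:", dfe_errors)
```

Because the feedback uses already-decided bits rather than the noisy analog waveform, the correction adds no noise of its own in this sketch, which is the property the text attributes to the DFE. A practical DFE would also have to close the feedback loop within one bit period, which is not modeled here.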
Electronics 2020, 9, x FOR PEER REVIEW 2 of 21
As mentioned before, introducing wide-range operation adds significant design complexity and penalizes energy and area efficiency. For example, because it is not easy to cover the entire frequency range with a single PLL or a single voltage-controlled oscillator (VCO), multiple PLLs (or VCOs) are frequently implemented together and selected according to the operating frequency. In addition, most of the building blocks need to be designed to work properly at the highest rate, which significantly degrades the energy efficiency at lower data rates. For better understanding, Figure 2 shows a simple example of a CMOS inverter.
The delay ($\tau_d$) and the rise/fall time (approximately $2\tau_d$) of a CMOS inverter are proportional to its fan-out, which is the ratio of the capacitance at the gate output to the gate input capacitance. To retain a full rail-to-rail CMOS swing and to limit jitter generation, the rise/fall time must be fast enough for the given operating frequency. As a rule of thumb, $\tau_d$ should be no larger than one-fourth of the bit period ($T_{bit}$) [21,22]. This implies that the fan-out has to be low enough to sustain proper operation at the highest frequency. On the other hand, the dynamic switching power consumption of a CMOS inverter chain whose final output load is $C_L$ is given as

$$P = C_L V_{DD}^2 f_{clk}\left(1 + \frac{1}{FO} + \frac{1}{FO^2} + \cdots\right) \approx \frac{FO}{FO-1}\, C_L V_{DD}^2 f_{clk} \tag{1}$$

where $FO$, $V_{DD}$, and $f_{clk}$ represent the fan-out of the chain, the supply voltage, and the switching frequency, respectively [23]. Consequently, a fan-out chosen for optimum operation at the highest frequency is overkill at a lower frequency and causes unnecessary power consumption there. Conversely, circuits added to make the design work at a low frequency (e.g., capacitors) are overhead that worsens the efficiency at high frequency. These observations are supported by the survey of wide-range and supply-scalable transceivers shown in Figure 3, where the energy efficiencies of the transmitters, receivers, and transceivers in [15,16,22,24–35] are marked in a scatter plot with respect to how wide the operating range is (highest data rate/lowest data rate). Note that the energy efficiencies depicted in Figure 3 are measured at the highest data rate of each design. The energy efficiency becomes worse as the range widens, which reflects the overhead of wide-range operation.
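The trade-off between fan-out and switching power can be made concrete with a short numerical sketch. It assumes that the switched capacitance of a long tapered chain is the geometric series $C_L(1 + 1/FO + 1/FO^2 + \cdots) = C_L \cdot FO/(FO-1)$ and that the stage delay is simply proportional to the fan-out; the load capacitance, the per-unit-fan-out delay `TAU_UNIT`, and the function names are hypothetical values chosen only for illustration.

```python
def chain_switching_power(c_load, vdd, f_clk, fan_out):
    """Dynamic power of a long tapered inverter chain driving c_load.

    Assumes each earlier stage switches a capacitance smaller by a factor of
    fan_out, so the total is c_load * fan_out / (fan_out - 1).
    """
    if fan_out <= 1.0:
        raise ValueError("fan_out must be greater than 1")
    total_cap = c_load * fan_out / (fan_out - 1.0)
    return total_cap * vdd ** 2 * f_clk


def max_fan_out(bit_rate, tau_unit):
    """Largest fan-out meeting tau_d <= T_bit / 4, assuming tau_d = fan_out * tau_unit."""
    t_bit = 1.0 / bit_rate
    return t_bit / (4.0 * tau_unit)


C_LOAD = 100e-15   # hypothetical 100 fF final load
VDD = 1.0          # supply voltage in volts
TAU_UNIT = 3e-12   # hypothetical delay per unit of fan-out, in seconds

fo_fast = max_fan_out(28e9, TAU_UNIT)  # chain sized for 28 Gb/s
fo_slow = max_fan_out(7e9, TAU_UNIT)   # chain sized for 7 Gb/s only

# Run both chains at 7 Gb/s with a half-rate (3.5 GHz) clock.
p_sized_for_fast = chain_switching_power(C_LOAD, VDD, 3.5e9, fo_fast)
p_sized_for_slow = chain_switching_power(C_LOAD, VDD, 3.5e9, fo_slow)

print(f"fan-out limit at 28 Gb/s: {fo_fast:.2f}, at 7 Gb/s: {fo_slow:.2f}")
print(f"power at 7 Gb/s, chain sized for 28 Gb/s: {p_sized_for_fast * 1e3:.3f} mW")
print(f"power at 7 Gb/s, chain sized for 7 Gb/s:  {p_sized_for_slow * 1e3:.3f} mW")
```

Under these assumptions, the chain sized for the highest rate switches noticeably more capacitance than a chain sized for the low rate would need, which is the fixed-supply overhead described above.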
To recover the efficiency lost to wide-range operation, it has been proposed to introduce supply scalability, whose impact has already been well demonstrated in microprocessors and Internet of Things (IoT) applications [36], to the I/O transceiver. The basic concept is to lower the supply voltage when the I/O works at a low speed, where the relaxed bandwidth does not require the full voltage, in order to reduce the power consumption; this is similar to dynamic voltage and frequency scaling (DVFS) in microprocessors [24,37–40]. More details on supply-scalable I/O interfaces are discussed in the following section.

Basic Concept of Supply-Scalable I/O
In addition to the demand for multi-standard or backward-compatible I/Os discussed in the previous section, the fact that the links in computer communications are rarely fully utilized is another important motivation for supply-scalable I/O [37][38][39][40]. For example, in [39], Google disclosed that server utilization is mostly between 10% and 50% of the maximum utilization level, a region where servers exhibit poor efficiency. Similarly, ref. [40] shows that if a system (servers and network) is 15% utilized while the servers are fully energy-proportional, the network then consumes nearly 50% of the overall power, which is a huge waste. Therefore, if the communication network can adapt its power consumption by scaling the supply voltage according to the communication bandwidth that the utilization requires, a significant energy saving is expected. In addition, recent FinFET technology offers better performance at a lower supply voltage than conventional planar MOSFETs, which amplifies the benefit of supply scaling [41]. However, conventional I/O designs typically assume fixed-voltage operation. Even in some wide-range designs, the power consumption after fabrication is largely determined by the supply voltage, so frequency scaling alone rarely saves much power while it hurts performance [42].
To see in more detail how supply scaling improves the energy efficiency, we review the power consumption of I/O interface circuits, which can be divided into three groups [43,44]. The first is CMOS dynamic switching power, which is proportional to $CV^2f$. In the block diagram of Figure 1, the serializer, deserializer, digital circuits, slicers, and some of the clocking circuits fall into this category. The second is the power consumed by static current, which covers analog circuits relying on current biases, for example, amplifiers and current-mode logic (CML) buffers. The last is the signaling power, which includes the equalization circuits and TX drivers as well as the power dissipated at the 50-Ω terminations. These three types of power consumption behave very differently when the operating speed changes, as illustrated on the left side of Figure 4. As expected from $CV^2f$, the dynamic switching power scales linearly with the data rate. The other two, by contrast, do not scale down with the data rate. In fact, at high frequency, the signaling power increases non-linearly with the data rate, because extensive equalization and a higher transmit swing are required to overcome the channel loss at such high frequency [25,45]. As a result, as shown in the leftmost plot of Figure 4, the total power consumption of the I/O transceiver increases linearly with the data rate but becomes non-linear in the high-frequency region. It also consumes considerable static power even at very low speed. The corresponding efficiency curve is illustrated in the inset plot, where the efficiency is severely degraded in the low-speed region. For example, ref. [40] shows that a link configured for 2.5 Gb/s consumes as much as 42% of the power it consumes at 40 Gb/s, whereas the ideal, rate-proportional figure would be only 6.25%.
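The gap between the reported and the ideal figure can be restated as an energy-per-bit penalty. The sketch below only reworks the numbers quoted from [40] (2.5 Gb/s, 40 Gb/s, 42%); the function names are hypothetical.

```python
def ideal_power_fraction(rate_gbps, max_rate_gbps):
    """Power fraction of a perfectly rate-proportional link."""
    return rate_gbps / max_rate_gbps


def energy_per_bit_ratio(power_fraction, rate_gbps, max_rate_gbps):
    """Energy per bit at the reduced rate relative to energy per bit at full rate."""
    return power_fraction / (rate_gbps / max_rate_gbps)


ideal = ideal_power_fraction(2.5, 40.0)
penalty = energy_per_bit_ratio(0.42, 2.5, 40.0)

print(f"ideal power fraction at 2.5 Gb/s: {ideal:.4f}")
print(f"energy per bit relative to 40 Gb/s operation: {penalty:.2f}x")
```

In other words, a link that draws 42% of its full-rate power while delivering one-sixteenth of the bandwidth spends several times more energy on each bit than it does at full rate.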
If we reduce the supply voltage at a lower speed, how far can we reduce it? Basically, the supply voltage can be reduced until the circuits only marginally work at the given speed. The free-running frequency of a CMOS ring oscillator, which is commonly used to evaluate the speed of CMOS devices, is a good indicator of the relation between speed and voltage. It is expressed as

$$f_{osc} \approx \frac{\beta\,(V_{DD} - V_{TH})^{\alpha}}{2 N C\, V_{DD}} \tag{2}$$

where $N$, $\beta$, $V_{TH}$, and $C$ represent the number of stages, the transistor coefficient, the threshold voltage of the CMOS devices, and the nodal capacitance of the ring, respectively [19]. $\alpha$ is another coefficient that characterizes transistor performance; it equals two in conventional long-channel devices but becomes closer to one in recent short-channel devices.
Equation (2) is illustrated in Figure 5 for $\alpha$ = 2 and 1.5. In both cases, the relation is nearly linear as long as $V_{DD}$ is higher than $V_{TH}$, so we can assume that the supply voltage can be scaled linearly with the data rate [24]. The resulting power consumption and efficiency under linear supply scaling are depicted in the right-side plot and inset of Figure 4. The power consumed by the static current scales linearly with the frequency ($P = VI$), while the CMOS dynamic power scales cubically ($P = CV^2f$) [44]. On the other hand, the signaling power is assumed not to scale with the supply voltage because of the SNR constraint. The resulting efficiency curve is much flatter than that without supply scaling, implying that the power consumption is almost linearly proportional to the data rate, which is the ideal case assumed in [40]. It also shows that the optimum efficiency may lie at an intermediate speed because of the signaling power, which is observed in practice in many state-of-the-art engineering samples [15,27,32]. In addition, the signaling power overhead at a higher data rate can be confirmed from the survey in Figure 6, where the energy efficiencies of various scalable I/O designs at their highest rate are plotted against that highest data rate. Above a certain data rate (approximately 8 Gb/s), the efficiency becomes worse as the data rate increases, which reflects the energy overhead of driving such high-speed signaling and originates from heavy equalization circuits and a larger signal swing [44].
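The effect of supply scaling on the three power components can be sketched numerically. The model below is an illustration under stated assumptions, not the model behind Figure 4: the relative circuit speed is taken as $(V_{DD} - V_{TH})^{\alpha}/V_{DD}$ (an alpha-power-law form), the lowest usable supply for a given rate is found by bisection instead of assuming an exactly linear relation, the signaling power is held constant, and the power shares, threshold voltage, and function names are all hypothetical.

```python
def relative_speed(vdd, vth=0.3, alpha=1.5):
    """Relative circuit speed, assumed proportional to (VDD - VTH)^alpha / VDD."""
    if vdd <= vth:
        return 0.0
    return (vdd - vth) ** alpha / vdd


def min_vdd_for_rate(rate_fraction, vdd_max=1.0, vth=0.3, alpha=1.5):
    """Lowest VDD whose relative speed supports rate_fraction of the maximum rate."""
    target = rate_fraction * relative_speed(vdd_max, vth, alpha)
    low, high = vth, vdd_max
    for _ in range(60):  # bisection; relative_speed is monotonic for alpha >= 1
        mid = 0.5 * (low + high)
        if relative_speed(mid, vth, alpha) < target:
            low = mid
        else:
            high = mid
    return high


def link_power(rate_fraction, scale_supply, shares=(0.5, 0.3, 0.2), vdd_max=1.0):
    """Normalized link power; shares are hypothetical full-rate power fractions."""
    p_dynamic, p_static, p_signaling = shares
    v = min_vdd_for_rate(rate_fraction, vdd_max) / vdd_max if scale_supply else 1.0
    dynamic = p_dynamic * rate_fraction * v ** 2  # C * V^2 * f
    static = p_static * v                         # V * I with a fixed bias current
    signaling = p_signaling                       # held constant (SNR constraint)
    return dynamic + static + signaling


for rate in (1.0, 0.5, 0.25):
    fixed = link_power(rate, scale_supply=False)
    scaled = link_power(rate, scale_supply=True)
    print(f"rate={rate:.2f}  VDD={min_vdd_for_rate(rate):.2f} V  "
          f"energy/bit fixed={fixed / rate:.2f}  scaled={scaled / rate:.2f}")
```

With these assumed shares, the energy per bit at reduced rates grows far less when the supply is scaled than when it is fixed, and the constant signaling term is what keeps the scaled curve from being perfectly flat.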

Base Circuit Topology
Figure 7 shows the two major circuit topologies in CMOS technology, CMOS and CML circuits, which are the most frequently used in modern analog and mixed-signal IC designs. Although current-mode logic (CML) circuits have been the dominant base circuit topology for high-speed I/O interfaces owing to their high-speed capability [46], using CMOS circuits as much as possible is preferred for supply-scalable I/O in order to maximize the frequency range and the benefit of supply scaling [44,47]. This is mainly because the power consumption of CML circuits follows the blue line in Figure 4 (static current consumption), whereas CMOS circuits exhibit cubic power scaling. Besides the power scaling, the CMOS circuit dissipates less power as long as the fan-out is no less than 2 [23], and it also provides a higher swing than the CML circuit. On the other hand, the speed of the CML circuit is much more controllable than that of its CMOS counterpart, whose speed is largely determined by the given process technology and supply voltage. The time constant of a CML circuit is the product of the load resistance ($R$) and the load capacitance ($C_L$), so tweaking $R$ is a powerful knob for controlling the circuit speed in the design phase. The CMOS circuit can be accelerated by reducing the fan-out; however, a too-small fan-out (i.e., less than 2) is not feasible in a practical circuit implementation and also results in a non-linear increase in power consumption (see (1)). As a result, the CML circuit is better suited to high-speed operation, which is the dominant issue in achieving the maximum speed of the scalable I/O.
This has been one of the main reasons that high-speed interfaces rely heavily on CML circuits, although their scalability is not as good as that of CMOS circuits. To resolve this issue, ref. [25] proposed adjusting the current bias of CML circuits in accordance with the supply scaling. For example, at the minimum data rate and supply voltage (0.65 V), the current bias is reduced to half of the nominal value, whereas it is raised to 1.5× at the maximum data rate and supply voltage (1.05 V). Instead of a fixed resistance as the CML load, [25] employs a symmetric load whose resistance automatically adapts to the bias current, which scales the circuit bandwidth while maintaining the voltage swing. As an alternative approach, a CMOS inverter with resistive feedback can be used to reach a high bandwidth so that CMOS circuits remain usable even above the CMOS limit, because the resistive feedback extends the bandwidth of the CMOS circuit at the cost of increased power consumption [48].
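A quick calculation shows why scaling the bias current together with the supply improves the scalability of a CML stage. The sketch assumes that the static power is simply the supply voltage times the bias current and normalizes the nominal current to 1; the supply voltages and bias ratios are the ones quoted above from [25], while the fixed-bias comparison case and the names are assumptions added for illustration.

```python
def static_power(vdd, bias_current):
    """Static power of a current-biased stage, assumed to be VDD * I_bias."""
    return vdd * bias_current


I_NOMINAL = 1.0  # normalized nominal bias current (hypothetical unit)

# Bias scaled with the supply, using the ratios quoted from [25].
p_min_adaptive = static_power(0.65, 0.5 * I_NOMINAL)
p_max_adaptive = static_power(1.05, 1.5 * I_NOMINAL)

# Comparison case (an assumption): bias held at its nominal value.
p_min_fixed = static_power(0.65, I_NOMINAL)
p_max_fixed = static_power(1.05, I_NOMINAL)

adaptive_range = p_max_adaptive / p_min_adaptive
fixed_range = p_max_fixed / p_min_fixed

print(f"max/min static power with scaled bias: {adaptive_range:.2f}x")
print(f"max/min static power with fixed bias:  {fixed_range:.2f}x")
```

Under this simple model, the scaled bias widens the span between the highest and lowest static power considerably compared with a fixed bias, which moves the CML stage closer to the power scaling of CMOS circuits.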

On-Chip Supply Control
From the surveys shown in Figures 3 and 6, we observe that many works rely on off-chip supply control rather than on-chip control, and that those works generally exhibit better efficiency than the ones with on-chip control. This is mainly because their reported power does not include any overhead of supply control (either on-chip or off-chip), so it is important to understand what that overhead is. In addition, the supply control circuit must ultimately be integrated with the core circuits. Supply control schemes can be divided into two types: a switching regulator (DC-DC converter) and a linear regulator (specifically, a low-dropout (LDO) regulator), which are depicted in Figure 8. Generally, switching regulators exhibit higher efficiency than linear regulators, because there is a non-negligible voltage dropout in linear regulators [24]. However, because of their switching behavior, switching regulators produce a periodic ripple on the output voltage, whereas linear regulators offer power supply noise rejection (PSR). Since most of the building blocks of high-speed I/O are very sensitive to supply noise [49,50], linear regulators can be more suitable for high-performance applications [12,31]. For example, [31] reveals that the link bit-error rate (BER) degrades when a switching regulator is used. In addition, the supply sensitivity increases at a lower supply voltage, so it is more critical in supply-scalable I/O than in fixed-supply I/O. For example, the transition delay of a CMOS inverter is approximately expressed as

$$\tau_d \propto \frac{V_{DD}}{(V_{DD} - V_{TH})^{\alpha}}, \tag{3}$$

where $\tau_d$ is the transition delay of the CMOS inverter and $V_{TH}$ is the threshold voltage [19]. By differentiating (3) with respect to $V_{DD}$, we obtain the supply sensitivity of the transition delay, which represents the jitter induced by a supply noise of $\Delta V_{DD}$, as

$$\Delta\tau_d \approx \frac{\partial \tau_d}{\partial V_{DD}}\,\Delta V_{DD} \propto -\frac{(\alpha - 1)V_{DD} + V_{TH}}{(V_{DD} - V_{TH})^{\alpha + 1}}\,\Delta V_{DD}, \tag{4}$$

where $\alpha$ equals two in long-channel devices but becomes smaller in short-channel devices, as mentioned in Section 2.
Equation (4) is illustrated in Figure 9 for $\alpha$ = 2.0 and 1.5, where we find that the supply sensitivity increases dramatically as the supply voltage scales down [19,51]. As a result, when a switching regulator is used, the switching frequency and the CDR bandwidth should be carefully chosen to minimize the impact of supply ripple on the link BER [22,33].
Figure 8b shows one of the design considerations of a linear regulator regarding PSR. The linear regulator is based on a negative feedback loop with two poles: one at the output of the amplifier ($\omega_1$, the gate of the pass transistor) and the other at the regulator output ($\omega_2$, the regulated supply voltage). Depending on the pole locations, the regulator exhibits different PSR characteristics. If $\omega_1$ is the dominant pole ($\omega_2 > \omega_1$), the PSR exhibits peaking, whereas the opposite case exhibits a plain low-pass filter characteristic [19,52]. However, making $\omega_2$ dominant requires a huge capacitance, because the circuit load makes the output resistance at the $V_{DD}$ node low. This costs a large silicon area, frequently requires an off-chip capacitor, and severely constrains the supply transition time. To summarize, there are many design trade-offs (e.g., PSR versus efficiency, PSR versus silicon area), which need to be carefully examined to find the best topology and design parameters for a given application.
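The trend described for Figure 9 can be reproduced numerically. The sketch below assumes the alpha-power-law form of the inverter delay, τd ∝ VDD/(VDD − VTH)^α, and evaluates the normalized sensitivity (1/τd)·∂τd/∂VDD. The threshold voltage of 0.35 V and the function names are illustrative assumptions, not values taken from the surveyed designs.

```python
def inverter_delay(vdd, vth=0.35, alpha=2.0):
    """Relative inverter delay (arbitrary units) from the alpha-power law.

    vth = 0.35 V is a hypothetical threshold voltage used only for illustration.
    """
    if vdd <= vth:
        raise ValueError("vdd must be larger than vth")
    return vdd / (vdd - vth) ** alpha


def normalized_sensitivity(vdd, vth=0.35, alpha=2.0):
    """(1/tau_d) * d(tau_d)/d(VDD) in 1/V.

    Multiplying by a supply noise dV gives the fractional delay change.
    The value is negative: a higher supply makes the inverter faster.
    """
    if vdd <= vth:
        raise ValueError("vdd must be larger than vth")
    return 1.0 / vdd - alpha / (vdd - vth)


if __name__ == "__main__":
    for alpha in (2.0, 1.5):
        for vdd in (1.0, 0.8, 0.6):
            s = normalized_sensitivity(vdd, alpha=alpha)
            print(f"alpha={alpha}, VDD={vdd:.1f} V: {100 * s * 0.01:.2f} % delay change per 10 mV")
```

Under these assumptions, the magnitude of the sensitivity roughly triples when the supply is lowered from 1.0 V to 0.6 V for α = 2, which is consistent with the qualitative statement that supply noise matters more in a scaled-supply link.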

Clock Generation
As mentioned briefly in the introduction, multiple PLLs (or multiple oscillators in a single PLL) are frequently implemented in wide-range transceivers to cover the entire frequency range [4,8,10,11,53]. This is mainly because LC oscillators are popular for high-speed I/O applications owing to their superior phase noise performance and high-speed capability compared to ring oscillators, and because they are also less sensitive to supply voltage variation. However, LC oscillators occupy a much larger area, so using multiple LC oscillators is an expensive solution. The survey results given in Figures 10 and 11 validate this trade-off between LC-based and ring-based PLLs. Figure 10 shows a scatter plot of the PLL figure-of-merit (FoM) versus operating frequency for the PLL designs presented at the IEEE International Solid-State Circuits Conference (ISSCC) from 2010 to 2019 [19]. The FoM of the PLL is defined as

$$\mathrm{FoM}_{PLL} = 10\log_{10}\left[\left(\frac{\sigma_J}{1\ \mathrm{s}}\right)^{2}\cdot\frac{P_{PLL}}{1\ \mathrm{mW}}\right],$$

where $\sigma_J$ and $P_{PLL}$ denote the absolute jitter and the power consumption of a PLL, respectively. This FoM is widely used to evaluate PLL jitter performance at equalized power consumption [54]. The trend shown in Figure 10 implies that LC-based PLLs generally exhibit better jitter performance and higher operating frequency than their ring-based counterparts. On the other hand, Figure 11, where the FoM is plotted versus silicon area, shows that ring-based PLLs occupy much less silicon area. In addition, the tuning range of a ring oscillator is much wider than that of an LC oscillator [19]. As a result, the design trade-off between LC- and ring-based PLLs in wide-range I/O can be summarized as follows: the LC PLL offers better jitter performance, but its range is limited, so at least two LC PLLs are required to cover a wide range, which results in significant area consumption. On the other hand, the ring PLL achieves a wide range with a small area, but its higher phase noise and supply sensitivity result in worse performance.
Conventionally, using multiple PLLs has been the predominant option for wide-range I/O transceivers [8,10,11,55,56]. Recently, some ring-based works have introduced circuit techniques to overcome the limits of ring PLLs, for example, a two-stage ring oscillator [23], a clock-multiplying delay-locked loop (MDLL) [31], and multi-phase calibration [57]. For specific applications where only discrete rates are used, a single PLL with a configurable division ratio can also be used [16,58].
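As a quick numerical illustration of the PLL FoM, the sketch below evaluates the commonly used jitter-power form, 10·log10[(σJ/1 s)²·(PPLL/1 mW)], in dB. The function name and the example jitter and power values are hypothetical and are not taken from the surveyed designs.

```python
import math


def pll_fom_db(sigma_j_s, power_w):
    """Jitter-power FoM in dB: jitter normalized to 1 s, power normalized to 1 mW.

    A lower (more negative) value indicates a better PLL.
    """
    if sigma_j_s <= 0 or power_w <= 0:
        raise ValueError("jitter and power must be positive")
    return 10.0 * math.log10((sigma_j_s / 1.0) ** 2 * (power_w / 1e-3))


if __name__ == "__main__":
    # Hypothetical design: 1 ps rms jitter at 10 mW.
    print(f"{pll_fom_db(1e-12, 10e-3):.1f} dB")
    # Halving the jitter at the same power improves the FoM by about 6 dB.
    print(f"{pll_fom_db(0.5e-12, 10e-3):.1f} dB")
```

Because the jitter is squared, the FoM treats a 2× jitter reduction as equivalent to a 4× power increase, which is what makes it a power-normalized jitter metric.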

TX Driver Topology
TX output driver topology is another important factor that impacts the efficiency of scalable I/O, since the TX output driver is mainly responsible for the signaling power discussed above, although a separate supply voltage is usually dedicated to the TX driver to decouple the TX swing from the supply scaling [12,25]. The requirements for the TX driver can be summarized as proper output impedance for signal integrity, enough output swing to guarantee sufficient SNR, and optional FFE for a high-loss channel. Figure 12 shows three popular TX driver topologies: a current-mode driver, a P-over-N voltage-mode driver, and an N-over-N voltage-mode driver [25,59-62]. In the current-mode driver, the transistors operate in the saturation region, so the output impedance of the driver is generally dominated by the passive load resistance $R$. As a result, it provides good impedance matching across a wide range of output swing. Compared to a CML circuit, which has an output swing of $I_{BIAS}R$, the current-mode driver has a halved single-ended swing of $I_{BIAS}R/2$, because $I_{BIAS}$ splits between the load resistance and the RX termination. For example, with 50-Ω termination, a current of 20 mA is needed for a 500-mV swing (= swing/25 Ω). On the other hand, the voltage-mode drivers consume less current than the current-mode driver, because there are no split current paths. The difference originates from where the termination is placed relative to the signal path: it is in parallel with the signal path in the current-mode driver but in series with the signal path in the voltage-mode driver. For that reason, the voltage-mode driver is also referred to as the source-series terminated (SST) driver. In a P-over-N voltage-mode driver, an inverter-like PMOS-over-NMOS structure drives the channel and the RX termination.
In an N-over-N voltage-mode driver, NMOS transistors are used for both pull-up and pull-down, but they are driven by opposite polarities of the input. In both voltage-mode drivers, the pull-up and pull-down paths are never turned on simultaneously, so the current flows solely from $V_{DD,PU}$ (or $V_{swing}$) to $V_{CM}$, or from $V_{CM}$ to ground. Assuming matched impedance for both paths, both the output swing and the common-mode voltage ($V_{CM}$) are $V_{DD,PU}/2$. For example, a 500-mV output swing requires a current of only 5 mA, which is one quarter of that of the current-mode driver. On the other hand, impedance matching is more complex in the voltage-mode drivers than in the current-mode driver, because they rely on active devices rather than a passive resistor. The transistors must behave as resistors to provide a low output impedance, so they must operate in the linear region, which requires the gate-overdrive voltage ($V_{OV}$) to be higher than the drain-source voltage ($V_{DS}$). As a result, the P-over-N driver is more appropriate for high output swing, whereas the N-over-N driver fits better for low output swing [12]. A passive resistor is usually placed in series with the driver to reduce $V_{DS}$ for better linearity [60,63].
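The 20-mA and 5-mA figures quoted above can be checked with a short calculation. The sketch below assumes a differential link with a characteristic impedance Z0 = 50 Ω per side, a current-mode driver whose bias current sees the load resistor in parallel with the RX termination, and a voltage-mode driver whose current path is pull-up (Z0), differential RX termination (2·Z0), and pull-down (Z0) in series. The function names and this simplified resistor model are assumptions made for illustration.

```python
def current_mode_current(swing_v, z0=50.0):
    """Bias current of a current-mode driver for a given single-ended swing.

    The bias current splits between the load resistor (z0) and the RX
    termination (z0), so the effective load is z0 / 2.
    """
    return swing_v / (z0 / 2.0)


def voltage_mode_current(swing_v, z0=50.0):
    """Supply current of a matched voltage-mode (SST) driver.

    With matched impedances the swing equals VDD_PU / 2, and the current
    flows through pull-up (z0), termination (2 * z0) and pull-down (z0).
    """
    vdd_pu = 2.0 * swing_v
    return vdd_pu / (4.0 * z0)


if __name__ == "__main__":
    swing = 0.5  # 500-mV output swing
    i_cm = current_mode_current(swing)
    i_vm = voltage_mode_current(swing)
    print(f"current-mode: {i_cm * 1e3:.1f} mA")
    print(f"voltage-mode: {i_vm * 1e3:.1f} mA")
    print(f"ratio: {i_vm / i_cm:.2f}")
```

In this model the 4× advantage of the voltage-mode driver is independent of the swing and of Z0, since both currents scale in the same way with those two quantities.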
As shown in Figure 12b,c, the voltage-mode drivers are driven by CMOS inverters, whereas the current-mode driver needs to be driven by a CML pre-driver, so the voltage-mode drivers are more appropriate for supply-scalable design, as discussed in Section 3.1. However, there are constraints on the supply voltage of these inverters, which limit the scalability. Since the output impedance is set by the transistors and is dominated by their $V_{GS}$, the supply voltage can only be adjusted within the range where the output impedance can still be matched using the configurable range of the driver segmentation. In the P-over-N driver, the $V_{GS}$ of the PMOS and NMOS devices are $V_{DD,PU}$ and $V_{DD,PD}$, respectively. In the N-over-N driver, the $V_{GS}$ of the pull-up and pull-down NMOS devices are $V_{DD} - V_{CM}$ and $V_{DD}$, respectively; therefore, the pull-up device is typically much larger to achieve the same impedance. In both cases, the output impedance is a function of the supply voltages of the pre-driver inverters, so the range over which those supplies can be scaled is constrained by the tuning range of the output driver strength. As for the supply voltage of the output driver itself, the voltage-mode driver is fully constrained by the output swing, whereas the current-mode driver can scale its supply until the differential pair and the current source face a voltage headroom issue.
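The link between pre-driver supply and driver segmentation can be illustrated with a first-order model. The sketch below assumes that each driver slice behaves as a deep-triode resistor, R ≈ 1/(k·(VGS − VTH)), and that identical slices are switched on in parallel to reach a 50-Ω target. The transconductance parameter k, the threshold voltage, and the slice limit are hypothetical values chosen only to show the trend.

```python
def slice_resistance(vgs, vth=0.35, k=1e-3):
    """On-resistance of one driver slice in deep triode (first-order model).

    k (A/V^2) and vth (V) are hypothetical device parameters.
    """
    if vgs <= vth:
        raise ValueError("vgs must exceed vth for the slice to conduct")
    return 1.0 / (k * (vgs - vth))


def slices_needed(vgs, target_ohm=50.0, vth=0.35, k=1e-3):
    """Number of parallel slices that brings the output impedance closest to target."""
    return max(1, round(slice_resistance(vgs, vth, k) / target_ohm))


def is_matchable(vgs, max_slices, target_ohm=50.0, vth=0.35, k=1e-3):
    """True if the available segmentation can still reach the target impedance."""
    return slices_needed(vgs, target_ohm, vth, k) <= max_slices


if __name__ == "__main__":
    for vgs in (1.0, 0.85, 0.7):
        n = slices_needed(vgs)
        print(f"VGS={vgs:.2f} V: {n} slices, matchable with 48 slices: {is_matchable(vgs, 48)}")
```

With these illustrative numbers, a driver that has 48 slices available can be matched at a 1.0-V gate drive but not at 0.7 V, which is one way to picture why the pre-driver supply cannot be scaled arbitrarily low.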

Clocking Architecture
Clocking architecture is one of the most important design choices that impacts the performance and characteristics of a generic I/O link. Sometimes it is not a free choice, because it is pre-defined in the specification; nevertheless, it is still very important to understand the various clocking architectures and their pros and cons. Figure 13 shows three popular clocking architectures: plesiochronous, source-synchronous, and mesochronous. In a plesiochronous link, the TX and the RX have their own PLLs, which are synchronized to different reference clock sources. As a result, there is an intrinsic frequency offset between the TX and the RX. At the receiver side, the frequency offset is tracked by rotating the sampling clock phase, which is controlled by a CDR loop. On the other hand, in a source-synchronous link, the TX forwards its clock to the RX through a dedicated clock channel, so there is no frequency offset. In addition, it is assumed that the latencies of all data channels are synchronized; hence, no per-lane phase tracking circuit is used, and the forwarded clock is simply distributed to each data lane. The delay mismatch between the data and clock paths is corrected by introducing additional delay at either the TX or the RX, to maximize jitter tolerance as well as eliminate static skew [64]. Compared to the plesiochronous link, the source-synchronous link has much simpler circuitry at the cost of the additional forwarded clock channel. As a result, it significantly relaxes the design complexity introduced by supply scalability. However, it becomes impossible to eliminate the skew across the data and clock lanes as the data rate increases; thus, the source-synchronous architecture is not a viable solution for today's high-speed interfaces. The mesochronous link resolves this issue of the source-synchronous link with per-lane de-skew circuits. Since there is no frequency offset, the de-skew circuit does not have to rotate the phase continuously, so the circuit implementation is still simpler than that of the plesiochronous RX.
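To give a feel for why a plesiochronous receiver must rotate its sampling phase continuously while a mesochronous one does not, the sketch below converts a TX/RX frequency offset into a phase-drift rate. The 200-ppm offset, the 10-Gb/s rate, and the 64-step phase resolution are hypothetical example values, and the function names are illustrative.

```python
import math


def bits_per_ui_slip(offset_ppm):
    """Number of bits after which the sampling phase has drifted by one full UI."""
    if offset_ppm == 0:
        return math.inf  # mesochronous or source-synchronous: no drift
    return 1.0 / (abs(offset_ppm) * 1e-6)


def phase_steps_per_second(data_rate_bps, offset_ppm, steps_per_ui=64):
    """Average rate at which the CDR must step its phase rotator to follow the offset."""
    return data_rate_bps * abs(offset_ppm) * 1e-6 * steps_per_ui


if __name__ == "__main__":
    print(f"plesiochronous, 200 ppm: one UI of drift every {bits_per_ui_slip(200):.0f} bits")
    print(f"phase updates at 10 Gb/s: {phase_steps_per_second(10e9, 200):.3g} steps/s")
    print(f"mesochronous: {phase_steps_per_second(10e9, 0):.0f} steps/s")
```

With zero frequency offset the required update rate is zero, so a de-skew circuit only has to find a static phase once (and occasionally track slow drift), which is the simplification described above.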
Parallelized architecture with a reduced clock rate but with a multi-phase clock (i.e., half-rate or quarter-rate clocking) has been one of the major innovations enabling today's ultra-high-speed I/O with reasonable power consumption [22]. It plays a critical role in supply-scalable I/O, because it relaxes the required circuit bandwidth at reduced supply voltages. As a result, many supply-scalable I/O designs have adopted a highly parallelized architecture (e.g., quarter rate), from early works [22] to recent works [16,32,34,57].
However, the downside of a parallelized architecture with a scalable supply is that the effect of mismatch worsens as the supply voltage scales down [65]. As a result, duty-cycle correction (for half-rate clocking) or a multi-phase alignment technique becomes essential for supply-scalable I/O design. Specifically, for quarter-rate clocking, which is the most popular choice for scalable I/O, quadrature error correction (QEC) schemes are widely employed, such as phase calibration with a quadrature phase detector [47,66], phase calibration with asynchronous sampling [15,16,67], or time-division phase calibration [57,68].
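The two effects discussed here, the lower clock frequency and the sensitivity to phase spacing errors, can be sketched in a few lines. The example below assumes that each clock phase launches or samples one bit per clock period, so the width of each unit interval (UI) is set by the spacing between adjacent phases. The 32-Gb/s rate and the 2-degree quadrature error are hypothetical example values.

```python
def clock_frequency_hz(data_rate_bps, bits_per_clock):
    """Clock frequency for a parallelized link (2 = half rate, 4 = quarter rate)."""
    return data_rate_bps / bits_per_clock


def ui_widths(phases_deg):
    """Width of each UI, normalized to the nominal UI, from the clock phases in degrees.

    Ideal quarter-rate phases [0, 90, 180, 270] give [1, 1, 1, 1].
    """
    n = len(phases_deg)
    widths = []
    for i in range(n):
        spacing = (phases_deg[(i + 1) % n] - phases_deg[i]) % 360.0
        widths.append(spacing / 360.0 * n)
    return widths


if __name__ == "__main__":
    print(f"quarter-rate clock at 32 Gb/s: {clock_frequency_hz(32e9, 4) / 1e9:.0f} GHz")
    # Hypothetical 2-degree error on the quadrature (Q) phases.
    print([round(w, 3) for w in ui_widths([0.0, 92.0, 180.0, 272.0])])
```

A phase error therefore trades eye width between neighboring bits: in the example, every other UI shrinks by about 2.2%, and the narrowest UI is what limits the timing margin unless the phases are calibrated.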

Survey on State-Of-The-Art Supply-Scalable I/O
In this section, we compare previously published supply-scalable I/O designs in terms of the aspects discussed in the previous sections. Tables 1-3 present a comprehensive review of supply-scalable transmitters, receivers, and transceivers, respectively. Note that the entries are sorted by publication date, oldest first, and that the clock generation (PLL) power is included in the transmitter power.
Throughout the surveys, we can find several meaningful trends. First, the data rate tends to increase continuously. Second, the gap in figure-of-merit (FoM, energy efficiency) between the minimum and maximum data rates has been narrowing as the process technology scales down. This can be interpreted as follows: dynamic switching power dominated the I/O power in older technologies (e.g., [22,24]), but it has been significantly reduced by technology scaling, whereas the other components (the static current and the signaling power) have not [41,43]. Third, quarter-rate clocking has become the major clocking architecture and tends to exhibit better energy efficiency. Lastly, as discussed with Figures 3 and 6, the works relying on off-chip sources for supply scaling do not account for the overhead of the supply scaling circuitry, such as its non-zero energy loss, so they tend to exhibit better efficiency. Only five works include on-chip supply scaling, for which regulator efficiencies of around 80-90% have been reported.
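The energy-efficiency FoM used throughout the survey is simply power divided by data rate, and the effect of a regulator can be folded in by dividing by its conversion efficiency. The sketch below illustrates this with a hypothetical 20-mW link at 10 Gb/s and the 80-90% efficiency range mentioned above. The function names and the example link are assumptions, not data from Tables 1-3.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy efficiency in pJ/bit; mW divided by Gb/s gives pJ/bit directly."""
    if data_rate_gbps <= 0:
        raise ValueError("data rate must be positive")
    return power_mw / data_rate_gbps


def with_regulator_overhead(core_pj_per_bit, regulator_efficiency):
    """Energy per bit drawn from the external supply when a regulator feeds the core."""
    if not 0.0 < regulator_efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return core_pj_per_bit / regulator_efficiency


if __name__ == "__main__":
    core = energy_per_bit_pj(20.0, 10.0)  # hypothetical link: 20 mW at 10 Gb/s
    print(f"core only:          {core:.2f} pJ/bit")
    for eta in (0.9, 0.8):
        print(f"regulator at {eta:.0%}:   {with_regulator_overhead(core, eta):.2f} pJ/bit")
```

In this example, an 80% efficient regulator raises the reported figure from 2.0 to 2.5 pJ/bit, which shows how much of the apparent advantage of off-chip supply control can come from leaving the regulator loss out of the measurement.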
Specifically, looking at Table 1, we can observe that the voltage-mode driver has become the mainstream transmitter topology for the reasons discussed in Section 3.4. On the other hand, more than half of the transmitter works rely on an external clock source to mitigate the design complexity caused by the supply sensitivity of the PLL. Refs. [12,16,22,31] manage to cover the entire range with a single PLL; however, [16] uses an LC PLL followed by a programmable divider and provides only a few quantized frequencies. Refs. [12,22,31] use a ring PLL to cover the entire range; however, the PLL supply is separated from that of the other building blocks, which suggests that it is difficult to scale the clocking circuit together with the other building blocks because of its sensitivity to the supply voltage. The energy efficiency achieved at the highest data rate of each design ranges from 0.44 pJ/bit [27] to 43.3 pJ/bit [22], but we have to note that [27] does not include an equalizer, a PLL, or on-chip supply scaling. If we narrow down the scope to the transmitters having at least two of them, the best energy efficiency becomes 1.97 pJ/bit at 32 Gb/s [16].
Looking into Table 2 where the survey of supply-scalable I/O receivers is presented, we can observe that mesochronous clocking has been a mainstream architecture [15,[24][25][26][27], as we discuss in Section 3.5. The receiver designs that rely on external CDR/deskew calibration tend to exhibit better energy efficiency, which implies the hardware overhead of robust operation with the on-chip CDR/deskew. For example, [27] achieves 0.22 pJ/bit at 8 Gb/s from the 0.75-V supply, but [16] exhibits 1.06 pJ/bit at 8 Gb/s from the 0.72-V supply. Note that it does not mean the overhead is almost 80% of a complete receiver, because there are many differences in the level of completeness between [16] and [27]; for instance, [16] has much more extensive equalization and wider eye opening. Therefore, we have to be sure of whether a design includes such on-chip circuitry or not, while evaluating the performance of a receiver. Table 3 shows the performance survey of the scalable I/O designs where complete transceivers have been presented. The energy efficiency ranges from 0.51 to 15 pJ/bit for the lowest data rates and from 0.66 to 76 pJ/bit for highest data rates. A few notable works are as follows. The pioneer works of supply-scalable parallel [24] and serial I/Os from Stanford University [22], utilize DC-DC converters to scale the on-chip supply voltages. They take the control voltages of analog delay-locked loop [24] and analog ring-PLL [22], which can be a benchmark of operating frequency because they intrinsically track the operating frequency, as already discussed in the Section 2, as a reference voltage of the DC-DC converters. As a result, the supply voltages can scale automatically with no off-chip control. 
In [25], Intel presents a voltage calibration scheme using a calibration VCO that replicates the critical path in the transmitter. It achieves an energy efficiency of 2.7-5.0 pJ/bit in 65-nm CMOS technology, which is approximately a 10× improvement over [22] and [24], although without on-chip supply scaling. Intel also presents [16], which achieves 32 Gb/s with a complete set of equalizers, an on-chip PLL, and a CDR, with an energy efficiency of 3.25-6.41 pJ/bit. The design in [31] combines a DC-DC converter with LDO regulators to obtain the high conversion efficiency of the DC-DC converter while protecting supply-noise-sensitive circuits from its noisy output, and it achieves 3.6-7.24 pJ/bit over a 3-10 Gb/s range with supply scaling. In addition, [31] combines the scalable supply technique with burst-mode operation, lowering the minimum effective data rate to 16 Mb/s at 34 pJ/bit.
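When comparing these figures, it can help to convert energy per bit back into average power, since power equals energy per bit times data rate (1 pJ/bit × 1 Gb/s = 1 mW). The sketch below applies this to the burst-mode operating point quoted for [31]. The resulting power is derived here by arithmetic only and is not a value reported in [31]; the helper name is a hypothetical one chosen for illustration.

```python
def power_mw(energy_pj_per_bit: float, rate_gbps: float) -> float:
    """Average power in mW: (pJ/bit) * (Gb/s) = 1e-12 J * 1e9 /s = 1e-3 W."""
    return energy_pj_per_bit * rate_gbps


# Burst-mode point quoted in the text for [31]: 34 pJ/bit at 16 Mb/s.
burst_power = power_mw(34.0, 0.016)  # 16 Mb/s expressed in Gb/s
print(f"{burst_power:.3f} mW")  # 0.544 mW
```

Under this conversion, a comparatively high energy per bit at a very low effective data rate still corresponds to a small absolute power, which is the regime that burst-mode operation targets.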

Summary and Conclusions
This paper presents an overview of supply-scalable I/O interfaces. First, the motivations for supply-scalable I/O are discussed in Sections 1 and 2, from the computer architecture level down to the transistor level. The basic concepts and expected behavior of supply-scalable I/O are introduced in Section 2. Section 3 reviews the circuit techniques and critical building blocks that enable supply-scalable I/O. Section 4 presents a comprehensive survey of supply-scalable I/O designs whose functionality has been verified through fabrication, and discusses the trends and the current state of the field. From the survey, we find that there have been many notable efforts to enable supply-scalable I/O for energy-efficient computing; however, a truly complete supply-scalable high-speed I/O, one that includes on-chip supply scaling, an on-chip PLL, per-lane CDR/deskew, and equalization, has not yet been presented. In addition, the energy efficiency of those supply-scalable I/Os (3-7 pJ/bit) has not reached that of non-scalable I/Os (<3 pJ/bit) [69]. The aim of this paper is not only to introduce supply-scalable I/O technology but also to encourage prospective researchers to work on this topic.