A Robust Ultra-Low Voltage CPU Utilizing Timing-Error Prevention

To minimize the energy consumption of a digital circuit, logic can be operated at sub- or near-threshold voltage. Operation in this region is challenging due to device and environment variations, and the resulting performance may not be adequate for all applications. This article presents two variants of a 32-bit RISC CPU targeted for near-threshold voltage. Both CPUs are placed on the same die and manufactured in a 28 nm CMOS process. They employ timing-error prevention with clock stretching to enable operation with minimal safety margins while maximizing performance and energy efficiency at a given operating point. Measurements show a minimum energy of 3.15 pJ/cyc at 400 mV, which corresponds to a 39% energy saving compared to operation based on static signoff timing.


Introduction
With constantly tightening power budgets and incentives to add new features, energy consumption has become one of the most important aspects of portable electronics. The minimum energy consumption of digital logic is traditionally achieved by utilizing a supply voltage which is below the transistor threshold voltage [1]. Operation in this sub-threshold region has several practical limitations such as radically reduced performance and exponentially increased variability.
Recent manufacturing processes have pushed the minimum energy point (MEP) towards the threshold voltage, which mitigates performance and variance concerns and makes near-threshold operation much more attractive and practical. Near-threshold systems may still need to satisfy temporary high throughput requirements, which can be met by utilizing dynamic voltage and frequency scaling (DVFS). In energy-constrained systems, the supply voltage also varies slowly with time (battery, solar cell, etc.), creating additional challenges for ensuring efficient and reliable operation with a dynamic performance target. Despite the challenges, near-threshold computing has recently gained an increasing amount of attention due to its suitability for IoT and other extremely energy-limited applications that nevertheless have moderate performance requirements.
The goal of this research is to study the applicability of timing-error prevention (TEP) to an ultra-low voltage CPU. We present two variants of a 32-bit RISC microprocessor which can operate reliably at near-threshold voltages with minimal safety margins. The CPU cores are optimized for design compatibility and Ultra-Dynamic Voltage Scaling (UDVS) [2], respectively. They are both placed on the same die, which is manufactured in 28 nm CMOS technology. Section 2 of this article describes operation at ultra-low voltages in more detail, with background theory and examples, and presents the concepts of adaptive timing. The design process of the CPU cores is described in Section 3, and Section 4 presents the measurement results and analysis. Finally, Section 5 concludes the article.

Ultra-Low Voltage Operation
The goal of this section is to make the reader familiar with the principles and incentives of ultra-low voltage operation. This is followed by a description of the most important challenges, and two possible solutions to overcome them.

Motivation and Challenges
For digital static CMOS logic, when the performance constraints allow, the straightforward solution to energy minimization is to lower the operating voltage all the way to the MEP. This point has been shown to exist around 0.2-0.4 V depending on various factors [3]. With older process nodes, this is in the subthreshold operation region, and for newer process nodes, in the near-threshold region. Figure 1 shows the MEP for 65 nm and 28 nm low-power (LP) bulk processes with different activity factors. On the 28 nm process, the MEP resides near the threshold voltage. With lower leakage and higher V_t, the 65 nm process has its MEP at an approximately 100 mV lower voltage, which resides in the subthreshold region.
Figure 1. The minimum-energy points for 65 nm and 28 nm low-power (LP) bulk processes using low-V_t (LVT) transistors, plotted for α = 10% and α = 30% on each process. α denotes the activity factor. The test circuit here is a 13-stage minimum-size negative-AND (NAND) gate ring oscillator (based on a circuit presented in [3]).
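Why a minimum-energy point exists at all can be illustrated with a first-order toy model. The sketch below is not the simulation behind Figure 1. It assumes that dynamic energy scales as α·V², that gate delay follows an exponential subthreshold on-current, and that leakage energy is leakage power integrated over one cycle, which gives a leakage term proportional to V²·exp(−V/(n·V_th)). The names `energy_per_cycle` and `minimum_energy_point_mv`, the lumped constant `leak_weight`, and the values n = 1.5 and 26 mV are illustrative assumptions, and the exponential model overestimates drive current above the threshold voltage.

```python
import math

def energy_per_cycle(vdd, activity, leak_weight=1000.0, n=1.5, v_thermal=0.026):
    """Normalized energy per cycle of a toy logic block (arbitrary units).

    dynamic ~ activity * vdd^2
    leakage ~ leak_weight * vdd^2 * exp(-vdd / (n * v_thermal))
    leak_weight lumps logic depth and the number of idle, leaking gates
    (a hypothetical constant chosen only for illustration).
    """
    dynamic = activity * vdd ** 2
    leakage = leak_weight * vdd ** 2 * math.exp(-vdd / (n * v_thermal))
    return dynamic + leakage

def minimum_energy_point_mv(activity, v_min_mv=150, v_max_mv=1000):
    """Brute-force search for the supply voltage (in mV) with the lowest energy."""
    return min(range(v_min_mv, v_max_mv + 1),
               key=lambda mv: energy_per_cycle(mv / 1000.0, activity))

mep_low_activity = minimum_energy_point_mv(0.10)   # alpha = 10%
mep_high_activity = minimum_energy_point_mv(0.30)  # alpha = 30%
print(mep_low_activity, mep_high_activity)
```

In this toy model the minimum lies strictly inside the searched range, and the higher activity factor moves it to a lower voltage, because a larger dynamic share makes further voltage reduction worthwhile before leakage starts to dominate.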
However, the advantages of operating at the MEP are easily lost due to higher device variability. In the subthreshold region, process, supply voltage, temperature, and aging (PVTA) induced variance is amplified due to the exponential dependency of the subthreshold current [3]:
$$I_{sub} = I_0 \exp\left(\frac{V_{gs} - V_t}{n V_{th}}\right) \qquad (1)$$
In Equation (1), I_0 is the drain current when V_gs = V_t; V_t is the threshold voltage; n is the subthreshold slope factor; and V_th is the thermal voltage. While near-threshold operation mitigates the exponential dependency, increased variance still remains when compared to nominal operating voltages.
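To give a feel for the exponential dependency, the following sketch evaluates the usual exponential subthreshold model, I = I_0·exp((V_gs − V_t)/(n·V_th)), which matches the symbol definitions above. The slope factor n = 1.5, the thermal voltage of 26 mV, the bias point, and the 30 mV threshold shift are assumed example values, not figures from the measured process.

```python
import math

def subthreshold_current(v_gs, v_t, i_0=1.0, n=1.5, v_thermal=0.026):
    """Subthreshold drain current, normalized so that I = i_0 when v_gs == v_t."""
    return i_0 * math.exp((v_gs - v_t) / (n * v_thermal))

# Hypothetical example: the same gate bias on a nominal device and on a
# device whose threshold voltage is 30 mV higher due to local variation.
nominal = subthreshold_current(v_gs=0.30, v_t=0.40)
slow = subthreshold_current(v_gs=0.30, v_t=0.43)
ratio = slow / nominal
print(f"current ratio for a +30 mV V_t shift: {ratio:.3f}")
```

Under these assumptions a 30 mV threshold shift reduces the drive current to less than half of its nominal value, and since gate delay is roughly inversely proportional to drive current, the delay more than doubles.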
The theory is verified with a simulation of an example timing path from a test circuit in 28 nm CMOS. The path is simulated with 100 Monte Carlo runs, and Table 1 shows resulting nominal, minimum and maximum delays for near-threshold voltages. The results demonstrate the design challenge of near-threshold logic, as local variation can cause up to 30% delay deviation from nominal. Even without the exponential dependence on transistor parameters of subthreshold, robust operation clearly demands overly large design margins or individual post-fabrication measurements of the components; the former negating the minimum energy operation, and the latter increasing production costs considerably.

Adaptive Timing Methods
Here, we present two alternative solutions to the problem of large timing margins. Several research groups have successfully applied these methods to a number of test chips.

Timing-Error Detection
Timing-Error Detection (TED)-based systems have been shown to be effective in largely removing the variation-incurred timing margins [4][5][6]. The lower margins have in turn enabled energy savings (i.e., lower V_DD [4]) or a higher yield [5]. The TED methodology is based on having the system operate at a voltage and frequency point at which the timing of critical paths fails intermittently. These timing failures are detected and handled. The overhead of detection and handling has to be lower than the energy savings resulting from the lower V_DD or higher frequency. A classic example of a TED system is instruction replay, where an instruction with a failed timing path is replayed [4].
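The break-even condition can be made concrete with a small back-of-the-envelope model. The sketch below is only an illustration: it assumes dynamic energy scales with the square of the supply voltage, that detection circuitry adds a fixed fractional overhead, and that every detected error costs a fixed number of replayed cycles. The function name `ted_energy_ratio` and all numbers are hypothetical.

```python
def ted_energy_ratio(vdd_ratio, detection_overhead, error_rate, replay_cycles):
    """Energy per useful instruction with TED, relative to a margined baseline.

    vdd_ratio          -- scaled supply voltage divided by the baseline voltage
    detection_overhead -- fractional energy cost of the detection circuits
    error_rate         -- fraction of instructions that fail timing
    replay_cycles      -- extra cycles spent per detected error
    """
    voltage_gain = vdd_ratio ** 2                      # dynamic energy ~ V^2
    detection_cost = 1.0 + detection_overhead
    replay_cost = 1.0 + error_rate * replay_cycles
    return voltage_gain * detection_cost * replay_cost

# Rare errors: the voltage reduction outweighs the overhead (ratio below 1).
rare = ted_energy_ratio(0.9, 0.05, 0.01, 10)
# Frequent errors: replay cost dominates and TED loses (ratio above 1).
frequent = ted_energy_ratio(0.9, 0.05, 0.05, 10)
print(rare, frequent)
```

A ratio below 1 means TED saves energy at that operating point. The same voltage reduction becomes a net loss once the error rate grows, which is why the operating point has to be chosen so that failures stay intermittent.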

Timing-Error Prevention
Timing-error prevention (TEP) is a variant of TED which utilizes time borrowing (TB). When a system is timed for zero TB in normal operation, fractionally late signals can be tolerated without errors occurring in the system. However, TB sets timing requirements on the subsequent pipeline stages: the cumulative TB of consecutive stages must never exceed the TB window, and therefore careful design-time planning is required. Combining TED with TB into TEP yields a system which can tolerate late-arriving signals, but which does not require special arrangements with regard to stage lengths. The method works as follows: When a late signal arrives, time borrowing occurs normally. Time borrow events (TBE) are detected with latches or special time borrow flip-flops. A recovery is necessary to prevent borrowed time from accumulating and generating a timing error. This can be done by moving the clock phase [7] or gating the clock on a per-stage basis [8].
Since late signals are allowed with both the TED and TEP techniques, setup-timing margins are eliminated, thereby allowing energy reduction. When TEP is integrated into a dual-phase latch pipeline, the resulting system can not only tolerate late signals, but also does not require additional hold buffers on fast paths [8]. This is a large advantage compared to a traditional TED system. As shown in Table 1, minimum delay uncertainty is also clearly worsened at near-threshold voltages, which would significantly increase the demand for extra hold buffers on fast paths. A dual-phase latch pipeline can also be driven by two non-overlapping clocks, which increases skew tolerance and decreases the general requirement for hold buffers.
With a balanced pipeline, a TEP system has margins only when they might be required (when time borrowing is detected). Importantly, the system is a zero-margin device at design time and an adaptive-margin device at runtime.
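The difference between plain time borrowing and TEP can be sketched with a small behavioral model. This is a simplification under stated assumptions, not the circuit of any cited design: each stage has the same time budget `period`, lateness carries over to the next stage, a timing error occurs when the accumulated lateness exceeds the borrowing window, and a detected time borrow event stretches the following stage by `stretch` time units. The function name, the parameters, and the delay values are hypothetical.

```python
def run_pipeline(stage_delays, period, window, stretch, tep_enabled):
    """Propagate one data item through consecutive latch stages.

    Returns (timing_errors, clock_stretches). All times are integers in
    arbitrary units.
    """
    borrowed = 0      # lateness carried into the current stage
    errors = 0
    stretches = 0
    for delay in stage_delays:
        if tep_enabled and borrowed > 0:
            # A time borrow event was detected at the previous latch:
            # stretch the clock so the borrowed time is paid back.
            borrowed = max(0, borrowed - stretch)
            stretches += 1
        borrowed = max(0, borrowed + delay - period)
        if borrowed > window:
            errors += 1       # data arrived after the latch closed
            borrowed = 0
    return errors, stretches

# Two pairs of consecutive stages that are each 15% too slow.
DELAYS = [1000, 1150, 1150, 1000, 1150, 1150]
without_tep = run_pipeline(DELAYS, period=1000, window=200, stretch=200,
                           tep_enabled=False)
with_tep = run_pipeline(DELAYS, period=1000, window=200, stretch=200,
                        tep_enabled=True)
print(without_tep, with_tep)
```

Without recovery, two consecutive slow stages accumulate 300 units of lateness and overrun the 200-unit window, giving two errors. With recovery enabled, the same sequence completes without errors at the cost of three stretched clock phases. In this model any stretch at least as long as the window clears the debt.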

Design of a 32-bit RISC CPU with TEP
The CPU cores of our system are modified versions of a freely available, open-core LatticeMico32 CPU [9]. LM32 is a configurable medium-scale RISC microprocessor with 6 pipeline stages, full GCC toolchain support and sufficient performance (1.14 DMIPS/MHz, 1.83 Coremark/MHz) even for demanding sensor network applications.
As described in [10,11], we have enhanced the CPUs with critical path monitoring combined with timing-error prevention. Our TEP system enables time borrowing on all paths. Time borrow events are detected with time borrow detector (TBD) circuits, which are integrated into critical path latches as illustrated in Figure 2a. The circuit indicates when TB occurs, and TBEs are combined and propagated to a clock control circuit, which is responsible for error prevention. The clock control circuit is a state machine consisting of a small number of flip-flops and logic gates (Figure 2b). It is able to shift the global clock phase after a TBE as illustrated in Figure 2c. Therefore, the TEP system prevents the stacking of TB, which could otherwise lead to a timing error. In addition to per-cycle timing error prevention, the real-time feedback from critical paths (combined TBEs) is output from the chip, allowing it to be used in tuning the microprocessor voltage/frequency in order to minimize die characterization effort.
Since the unmodified LM32 is a standard edge-sensitive pipeline, additional design steps are required to make it suitable for TEP. Out of the compatible sequential cells, pulse-latches and TB flip-flops would be the most straightforward to integrate into an edge-sensitive design, but they would add minimum-delay constraints to fast paths and raise robustness concerns at near-threshold voltages. Standard latches are readily available in foundry design kits and have no robustness issues, but an edge-sensitive design needs to be fully transformed into a dual-phase latch pipeline to avoid a half-cycle minimum delay requirement for critical paths. We created two CPU variants (CPU1 and CPU2) with dual-phase latch pipelines to study the applicability and effectiveness of different implementation strategies. As shown in Table 2, different design techniques were applied to the two CPUs.
The fully custom TBD cell has 3.2× the area of a minimum-sized latch, and the total TBD area overhead specified in the table is the percentage of combined TBD area out of standard cell area. The area overheads of the OR-tree and the clock control block are insignificant in comparison and are not listed.

CPU1
CPU1 uses automatic latch transformation and EDA-tool backed fixed-phase retiming [12] during synthesis, which is illustrated (for a single flip-flop) in Figure 3. The flip-flops are split into two latches, and the whole design is retimed in order to balance the paths. As the method is mostly automatic, it has been demonstrated to be applicable in larger designs [8]. However, there are some problems in practice. The balancing performed by the EDA tools is not guaranteed to be optimal: a carefully optimized flip-flop pipeline may turn into a pipeline where the path between the original master and slave latch becomes consistently shorter than the paths between the latch pairs, reducing overall efficiency. Also, some of the tools are not able to retime designs with multiple voltage domains. The method is also prone to increasing clock network complexity, as the number of registers in the design increases.
As the target of CPU1 is to demonstrate the applicability of TEP in a typical RISC processor, the core is supplemented with standard on-chip SRAM memory. This creates a new challenge, since it adds a hard edge-sensitive block inside an otherwise level-sensitive design. As with [8], our design employs an SRAM wrapper, which moves reads and writes to the falling clock edge to ensure that SRAM input signals are always valid. However, this reduces the allowable time on the paths into subsequent latches. It also prevents the usage of TBD latches at the end of these paths, since the data always transitions during their transparent phase, which could generate a false positive. However, the vendor-provided SRAM in our design operates in a separate power domain of higher voltage due to retention requirements, making it significantly faster than the core logic and thereby eliminating the issue of stricter timing requirements. Based on simulations, a core voltage up to 0.5 V was shown to cause no SRAM performance bottleneck in the system.

CPU2
A better starting point for a TEP implementation would be a CPU design which has a two-phase latch pipeline to start with, and a more flexible memory interface. In order to study the benefits of such a base design, we created another variant of the core. CPU2 is manually transformed at register-transfer level (RTL) into a dual-phase system before synthesis, and it uses a small latch array memory instead of SRAM. The former removes EDA tool limitations, as the registers inside the design are synthesized directly to latches. The latter removes the need for wrapper logic and memory path length constraining. Moreover, due to the lack of SRAM retention issues, the system is implemented in a single voltage domain. This relaxes voltage scaling limitations, and thus allows operation over an ultra-wide voltage range.
Figure 4 illustrates the RTL rewrite process through a simple 3-stage pipeline example. The edge-sensitive processes are changed into level-sensitive ones, and consecutive stages are assigned to alternating clock phases. Any path starting and terminating on a stage with the same polarity, including the case where the source and target register is the same (pc_f in the example), must be supplemented by an additional synchronizing latch stage. Despite the additional latches, the total clock network load will be smaller than in the original flip-flop design, because a latch is smaller and simpler than a flip-flop. Changing a pipeline stage to operate on a single phase halves the achievable clock frequency, but also cuts the pipeline latency in half. It should be noted that conditional constructs in a level-sensitive block will explicitly synthesize into clock-gated latches. This is in contrast to an edge-sensitive block, where a similar construct would synthesize to a circular feedback path by default, unless clock-gating was enabled in the synthesis tool.
Figure 4. RTL rewrite process for a simple 3-stage pipeline example. The original edge-sensitive processes synthesize into flip-flops.
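The phase-assignment rule described above can be sketched as a small check over a list of register-to-register paths. This is a minimal illustration and not the authors' design flow: the phase names `phi1` and `phi2`, the helper functions, and the branch feedback path are hypothetical, and only the self-loop on the fetch-stage register corresponds to the pc_f case mentioned in the text.

```python
def assign_phases(num_stages):
    """Assign consecutive pipeline stages to alternating clock phases."""
    return ["phi1" if index % 2 == 0 else "phi2" for index in range(num_stages)]

def paths_needing_sync_latch(paths, phases):
    """Return the paths whose source and target stages share a clock phase.

    paths  -- list of (source_stage, target_stage) index pairs
    phases -- list produced by assign_phases()
    """
    return [(src, dst) for src, dst in paths if phases[src] == phases[dst]]

# Hypothetical 3-stage pipeline: 0 = fetch, 1 = decode, 2 = execute.
phases = assign_phases(3)
paths = [
    (0, 1),  # fetch -> decode
    (1, 2),  # decode -> execute
    (0, 0),  # program counter feeding itself (the pc_f case)
    (2, 0),  # assumed branch-target feedback from execute to fetch
]
extra_latches = paths_needing_sync_latch(paths, phases)
print(phases, extra_latches)
```

Forward paths between neighboring stages cross from one phase to the other and need nothing extra. The self-loop and the feedback path start and end on the same phase, so each would require an additional synchronizing latch on the opposite phase.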

Measurement Results and Discussion
Based on the simulations and estimated average activity, the CPUs were synthesized, placed and routed at the estimated MEP of 400 mV. With the exception of the TBD and level shifters (in CPU1), all gates were from the vendor standard cell library, which was re-characterized at the target operation point. As shown in Table 3 and explained in Section 3.2, the frequency of CPU2 is lower due to its pipeline structure, but its area is also smaller, as shown in the chip microphotograph in Figure 5.
Silicon measurements of 6 test chips verified correct operation for both CPUs at their signoff frequencies, but due to safety margins and inaccuracies in timing libraries, optimal energy or performance was not achieved. This was evidenced by studying timing feedback from critical paths, which showed that TEP was not activated. Further testing verified that the CPUs could be run at significantly higher frequencies until time borrowing started to occur. Since the clock stretching halves the effective frequency when activated, the optimal frequency was set so that time borrow events only occurred rarely.
Figure 5. Chip microphotograph showing the SRAM, CPU1, and CPU2 (dimension label: 0.6 mm).
Voltage was scaled next to study the energy/performance tradeoff at various operation points (Figure 6a,b). The lower bound (250 mV) was based on the simulated functional failure rate as described in [10], while the upper bound of CPU1 was limited by fixed SRAM performance. The voltage of CPU2 could be scaled up to 750 mV, after which further performance improvements were not possible as the externally generated clock became the limiting factor. If an on-chip PLL were used for clock generation, we estimate that CPU2 would operate at approximately 300 MHz with a nominal voltage of 1.0 V.
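The tradeoff behind choosing a frequency at which time borrow events stay rare can be illustrated with a simple throughput model. The sketch assumes that every time borrow event costs exactly one extra clock period, which is consistent with the effective frequency halving when stretching is active on every cycle. The function names and the example rates are hypothetical.

```python
def effective_frequency(f_clk_hz, tbe_rate, stretch_cycles=1.0):
    """Average rate of useful cycles when some cycles trigger a clock stretch.

    tbe_rate       -- average time borrow events per useful cycle (0.0 to 1.0)
    stretch_cycles -- clock periods lost per event (assumed to be 1)
    """
    if not 0.0 <= tbe_rate <= 1.0:
        raise ValueError("tbe_rate must be between 0 and 1")
    return f_clk_hz / (1.0 + tbe_rate * stretch_cycles)

def net_speedup(overclock_factor, tbe_rate):
    """Throughput gain of running faster than signoff, after stretch losses."""
    return overclock_factor / (1.0 + tbe_rate)

print(effective_frequency(100e6, 1.0))   # stretching on every cycle
print(net_speedup(1.2, 0.05))            # rare events: still a net gain
print(net_speedup(1.2, 0.25))            # frequent events: a net loss
```

Under this model, raising the clock by 20% over the signoff frequency pays off only while fewer than 20% of cycles trigger a stretch, which is why the clock is raised only up to the point where time borrow events remain rare.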
Figures 6a-c illustrate the power distribution, performance and energy of the CPUs running a stress test at the measured voltage/frequency ranges. The dynamic energy consumed by logic is similar for both CPUs, as both share the same architecture, but the more complicated clock network of CPU1 results in higher total dynamic energy. CPU1 also has higher leakage power, but when integrated over a clock period, the resulting leakage energy is slightly smaller because of the higher clock frequency. Thus, when leakage power is not dominant (from 350 mV upwards), CPU2 is over 30% more energy-efficient than CPU1. The frequency of CPU1 ranges from 110 kHz to 16.5 MHz, and its minimum energy point (4.48 pJ/cyc) is at around 350 mV. The ultra-wide voltage range of CPU2 allows it to operate at up to 135 MHz, making it suitable for a wide range of usage scenarios. Minimum energy consumption (3.15 pJ/cyc) is achieved at a near-threshold voltage of 400 mV. Compared to operation based on static signoff timing, TEP-assisted DVFS allows reducing the energy consumption by 27% and 39% for CPU1 and CPU2, respectively.
The results demonstrate that TEP can be utilized successfully in a typical RISC processor with a few automated extra steps during the design flow (CPU1). However, the largest benefits are achieved with a processor that is designed with TEP in mind (CPU2). Such a processor design would ideally have a dual-phase level-sensitive pipeline and no hard edge-sensitive blocks. The measured energy consumption is compared with the state of the art in Table 3, which shows that CPU2 achieves over 2× lower effective energy than the competition with a smaller area.
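The relation between power, frequency and energy per cycle used in this comparison is simple arithmetic, sketched below. The leakage powers and frequencies are made-up values chosen only to show how a core with higher leakage power can still have lower leakage energy per cycle. The last calculation derives the signoff-based energy implied by the reported 3.15 pJ/cyc and 39% saving for CPU2, a number that is computed here and not reported in the measurements.

```python
def energy_per_cycle_pj(power_w, freq_hz):
    """Energy consumed during one clock cycle, in picojoules."""
    return power_w / freq_hz * 1e12

def signoff_energy_pj(tep_energy_pj, saving):
    """Energy per cycle before a fractional saving was applied."""
    return tep_energy_pj / (1.0 - saving)

# Hypothetical cores: A leaks more power but runs at twice the frequency.
leak_a = energy_per_cycle_pj(power_w=2.0e-6, freq_hz=10e6)
leak_b = energy_per_cycle_pj(power_w=1.5e-6, freq_hz=5e6)

# Implied signoff-timing energy for CPU2 from the reported figures.
cpu2_signoff = signoff_energy_pj(3.15, 0.39)
print(leak_a, leak_b, round(cpu2_signoff, 2))
```

With these example numbers, core A spends less leakage energy per cycle than core B despite its higher leakage power, because each of its cycles is shorter.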

Conclusions
Ultra-low voltage CPUs are not widely adopted in commercial systems due to complex design and reliability concerns. This article has presented a simple adaptive technique, timing-error prevention, which helps the two presented CPU variants to operate reliably with minimal safety margins in order to ensure optimal performance and energy consumption. The first presented CPU is converted from a standard RISC design to a TEP-compatible system with a minimal amount of manual intervention, demonstrating the technique's applicability to contemporary designs. The second CPU variant is specifically designed with ultra-low voltage operation and TEP in mind, enabling higher energy savings and ultra-wide voltage operation from 250 mV to 750 mV. Compared to operation at a static signoff-based frequency, the TEP system reduces the energy consumption of the presented CPUs by 27% and 39%.