1. Introduction
A conventional way to increase chip performance is to raise its operating frequency. However, the power consumption of a chip scales linearly with its operating frequency, which forces designers to search for other ways to increase performance without a corresponding rise in power consumption. Aggressive technology scaling in the deep nanometer regime enables the fabrication of billions of transistors on a chip [1]. This has led to the design of chip multi-processors (CMP), or multi-core architectures, with high performance and low power consumption [2].
CMPs typically use an on-chip bus for communication among two or four cores. The on-chip bus lacks scalability and does not support simultaneous communication among multiple cores. As the number of cores increases, communication among the cores becomes more critical. The on-chip bus cannot meet these stringent communication requirements, which creates a bottleneck in performance growth. Consequently, the trend has shifted from computation-centric design to communication-centric design, leading to the evolution of the on-chip interconnection network, or network-on-chip (NoC) [3,4]. The NoC architecture is a packet-based interconnection network that separates communication from computation. Unlike the shared bus, it facilitates customization in terms of bandwidth, buffer size, and topology, and it offers scalability without using long global wires. The NoC infrastructure consists of routers and links that deliver packets via a layered protocol [5].
The continuous improvement in process technology yields higher transistor density [1], but it also makes transistors and links more vulnerable to different fault mechanisms [6,7]. A fault in a transistor may lead to erroneous computation, and open or short circuits in links result in data corruption. Even a single fault can paralyze the whole chip. Faults are primarily classified into permanent and transient categories [8]. Permanent faults, also known as hard faults, persist in the chip for its lifetime after they occur and permanently affect its functionality. They are traditionally caused by open/short circuits in links, time-dependent dielectric breakdown (TDDB) [9], electro-migration (EM) [10], stress migration, negative-bias temperature instability (NBTI) [11], hot carrier injection [12], and thermal cycling [13]. Transient faults, also known as soft errors, last only one or two clock cycles. They are traditionally caused by electrical noise, electrostatic discharge, electrical power drops, overheating, mechanical shocks, process variation [14], and external radiation such as alpha particles [15,16] and cosmic rays [17].
Each node in the NoC connects a processing element (PE) or memory unit to the network through a router. If the router fails, the healthy core connected to it becomes inaccessible. Faults in the router architecture lead to misrouting, deadlock, traffic hot spots, packet loss, and increased latency. Thus, a reliable router architecture is necessary to avoid these undesirable fault scenarios [18,19]. We focus only on tolerating permanent faults in the router micro-architecture. Many routing techniques have previously been proposed for permanent fault tolerance, but they ignore healthy cores and make them inoperable [20,21]. We propose a permanent fault tolerance mechanism for each pipeline stage of the router, which ensures connectivity of the healthy core associated with a faulty router. The proposed methodology exploits idle cycles of existing resources, a default winner strategy for failed arbiters, and bypassing of faulty resources through multiple bypass paths. The central idea of this work is to strike the right balance between redundancy techniques to achieve higher reliability at low latency overhead.
The rest of the paper is organized as follows.
Section 2 presents the overview of existing reliable router architectures and fault detection mechanisms.
Section 3 presents the fundamentals of a generic NoC router.
Section 4 presents the effects of faults on router pipeline.
Section 5 presents the proposed NoCGuard reliable router architecture.
Section 6 presents the performance analysis.
Section 7 presents the reliability analysis.
Section 8 presents the latency analysis.
Section 9 concludes the paper.
2. Related Work
The BulletProof router [22] exploits N-modular redundancy (NMR), which requires multiple copies of the hardware for fault tolerance. It uses redundant components at every stage and therefore incurs high area and power overhead. The RoCo router [23] splits the router into two independent modules, one for rows and one for columns. Each module deals with one dimension, i.e., X or Y. A fault in one module does not affect the functionality of the other module. Thus, it tolerates faults with degraded performance.
The Vicis router [24] exploits input port swapping and an adaptive routing algorithm to tolerate faults in inter-router links. To tolerate faults in the data path of a router, i.e., intra-router links, the crossbar (XB), and virtual channel (VC) buffers, it uses error correcting codes (ECC) and a crossbar bypass bus. The encoder and decoder are placed at the input port. ECC can tolerate only one fault in the data path. Therefore, the input port swapper and bypass bus use reconfiguration to avoid additional faults or move them to different data paths. The Repair router [25] improves port swapping by using additional buffers. However, it saves only one out of two faulty ports and hence requires costly rerouting. Both the Vicis and Repair routers incur a high area overhead of 40–50%.
The PVS router [26] exploits resource sharing for fault tolerance in the input ports and RC units. It shares buffers among different input channels. However, if the resource sharing component, i.e., the demultiplexer, develops a fault, all input ports associated with it become inoperable. The DRS router [27] uses decoupled resource sharing to resolve this issue. It shares the multiplexers, demultiplexers, and all VC buffers of 2 or 3 adjacent input ports along with the decoupled resource sharing (DRS) unit of each input port. Thus, a fault in the DRS unit of one input port does not affect the functionality of the adjacent input port.
The SHIELD router [28] tolerates both permanent and transient faults. To tolerate permanent faults, it exploits a different strategy at each pipeline stage. At the routing computation (RC) stage it uses spatial redundancy, at the virtual channel allocation (VA) stage it uses resource borrowing, at the switch allocation (SA) stage it uses a default winner strategy, and at the crossbar (XB) stage it provides multiple paths to reach an output port. It does not consider faults at the input ports. To tolerate transient faults, it exploits selective hardening of gates, i.e., it increases the size of the critical gates to make them immune to transient faults. Overall, it offers high reliability at the cost of noticeable latency overhead.
The HPR router [29] tolerates permanent faults at each pipeline stage of the router. At the input port, it uses ECC for the detection and correction of single-bit errors. To further increase tolerance at the input port, it uses a virtual channel (VC) closing strategy. At the RC stage it uses a double routing strategy, at the VA stage a default winner strategy, at the SA stage a runtime arbiter selection mechanism, and at the XB stage a double bypass bus strategy. It achieves high reliability at low latency overhead.
In [30], the authors consider faults in the state field of the VC. Their approach divides the state field into two groups and provides spare registers in each group for fault tolerance. It uses a built-in self-test (BIST) for fault detection in the state fields of a status register. It marks the VC as faulty when more than one fault occurs in a group. The VC closing strategy then notifies the upstream router so that it avoids sending flits to that downstream VC. The approach only tackles faults in the input port VC, leaving other router components vulnerable.
In [31], the author proposes three different fault-tolerant architectures, namely single spare, turn priority, and stay alive. In the single spare architecture, an additional unit is available to replace faulty units at each level. Priority-based selection is used if two units at the same level become faulty. Exploiting the regularity of the router, the turn priority architecture uses the available resources by assembling as many input and output channels as possible. Different priorities are assigned to input and output channels to ensure router operation even in the worst case. The stay alive architecture combines the two previous architectures to enhance tolerance. These architectures incur high area overhead: single spare incurs 42%, turn priority incurs 77%, and stay alive incurs 115%.
In [32], the author uses channel slicing along with on-demand triple modular redundancy (TMR) for fault tolerance. Each node in the network consists of three identical router slices that share internal paths. It supports three modes of operation. The first is the parallel mode, in which router slices share their control logic with each other in case of a fault in the control logic of a slice. The second is the separate mode, in which router slices share internal resources such as buffers and XBs in case of a fault in the buffer or XB of a slice. The third is the TMR mode, in which all three slices work in parallel on the same data and control signals. At the output, it uses majority voting to select the correct result.
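The majority vote used in TMR mode can be illustrated with a bitwise voter. This is only a minimal sketch, not the circuit from [32]; the function name `majority_vote` and the integer representation of a slice output are assumptions made for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over the outputs of three slices.

    Each argument models the output word of one router slice
    (hypothetical representation). A bit is 1 in the result
    when at least two of the three inputs have that bit set.
    """
    return (a & b) | (a & c) | (b & c)


# One slice (the third) disagrees; the vote masks its faulty bits.
voted = majority_vote(0b1010, 0b1010, 0b0110)
```

A single faulty slice is outvoted bit by bit, which is why TMR tolerates a fault in any one of the three slices but not simultaneous faults in two.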
EsyTest [33] is a built-in self-test (BIST) strategy for the data path and control path of the NoC architecture. It exploits free time slots of data path components for testing. For data path testing, it uses idle cycles of inter-router links. For control path testing, it isolates the components with the help of test wrappers. In test mode, it statically connects the local port with the east or west port to ensure a secure connection between the core and the network. To support this setting, it uses an adaptive routing algorithm. This testing approach requires complex control and routing mechanisms. The area overhead of this fault detection strategy is 9.88% and the power overhead is 4.63%.
NoCAlert [34] is an online, real-time fault detection mechanism for the NoC router micro-architecture. It consists of lightweight micro-checkers that work on the concept of invariance checking. It continuously monitors the system in real time for functionally illegal outputs that result from faults. These checkers instantly detect 97% of faults. The area overhead of this fault detection mechanism is 3% and the power overhead is 0.7%.
The proposed reliable NoC router architecture, “NoCGuard”, can detect and tolerate multiple permanent faults and ensure continuous network operation. For fault detection, it relies on the lightweight NoCAlert checkers. In contrast to existing reliable NoC architectures, NoCGuard tolerates faults in all pipeline components and maintains gracefully degraded performance under heavy network traffic.
4. Effects of Fault on Router Pipeline
4.1. RC Stage Fault Scenario
If a fault occurs in the RC unit of an input port, it results in the computation of a wrong output port. Because of the lookahead routing protocol, this incorrect computation does not cause misrouting at the current router. Misrouting occurs at the downstream router, and only if the incorrectly computed port is still a valid output port direction. In that case, the next downstream router directs the packet toward its correct destination, so the packet incurs extra latency. If the incorrectly computed port is not a valid output port direction, the packet gets stuck in the downstream router and ends up causing deadlock. Alternatively, the downstream router may drop the packet, which results in retransmission at a later stage.
4.2. VA Stage Fault Scenario
A fault in an arbiter of the VA stage results in misallocation or no allocation of the downstream VC. In the case of misallocation, if the downstream VC is already occupied, the later packet overwrites the earlier one, which results in data corruption. In the case of no allocation, the packet does not proceed and remains in the buffer. This results in severe performance degradation and may end in deadlock.
4.3. SA Stage Fault Scenario
A fault in an arbiter of the first stage of SA blocks the packets at the associated input port, so the packets in every VC of that port cannot proceed. A fault in an arbiter of the second stage of SA makes the output port inaccessible. In both cases, the whole input port is permanently isolated, which causes severe performance degradation and may lead to deadlock.
4.4. XB Stage Fault Scenario
A fault in a multiplexer of the XB results in an inaccessible output port and severely disrupts the operation of the router. Adaptive routing or bypassing the faulty multiplexer helps the router to operate in a degraded manner.
5. NoCGuard Router Micro-Architecture
Each pipeline stage performs a specific function and has a distinct role in the operation of the router. This work proposes a fault-tolerant design for each pipeline stage. For fault detection, NoCGuard relies on the NoCAlert fault detection mechanism [34].
5.1. RC Stage Fault-Tolerant Design
In the generic router, each input port has its own RC unit, so there are five RC units per router. They operate on the head flit and remain idle for the body and tail flits. Based on this observation, we propose to borrow the RC unit of an adjacent port when the RC unit of the current port becomes faulty. To facilitate the borrowing process, we propose four new state fields for each input port.
Figure 3 shows the modified input port. It consists of the following new state fields: ‘DR’ (Destination Router), ‘RR’ (RC Result), ‘RS’ (RC Status) and ‘BP’ (Borrower Port).
Suppose a fault occurs in the RC unit of input port 1 (IP1). First, IP1 changes the status of its RC unit to faulty in its ‘RS’ field register. Then it searches for an idle RC unit by checking the ‘RS’ field status of all other input ports. Suppose it finds the RC unit of input port 2 (IP2) idle. IP1 then initiates the borrowing process by changing the status of the ‘RS’ field of IP2 to active globally. Next, IP1 stores the destination router identification of its packet in the ‘DR’ field and its own input port identification in the ‘BP’ field register of IP2’s RC unit. After the route computation is complete, the IP2 RC unit places the route result in its ‘RR’ field register and changes the status of its ‘RS’ field back to idle. Finally, IP1 reads the ‘RR’ field register of IP2’s RC unit to get the RC result. The ‘RS’ state field register takes four states: idle, active locally (active for the current port), active globally (active for another port), and faulty. The ‘BP’ state field register takes an input port identification: east, west, north, south, or local.
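The borrowing sequence described above can be sketched in software as follows. This is a behavioural illustration only, under stated assumptions: the class `InputPort`, the function `borrow_rc`, and the `compute_route` callback are hypothetical names, and the cycle-level timing of the hardware is not modelled. The four fields mirror the ‘DR’, ‘RR’, ‘RS’, and ‘BP’ state fields.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class RS(Enum):
    """The four states of the 'RS' (RC Status) field."""
    IDLE = "idle"
    ACTIVE_LOCAL = "active locally"
    ACTIVE_GLOBAL = "active globally"
    FAULTY = "faulty"


@dataclass
class InputPort:
    name: str                     # east, west, north, south or local
    rs: RS = RS.IDLE              # 'RS': status of this port's RC unit
    dr: Optional[int] = None      # 'DR': destination router id
    rr: Optional[str] = None      # 'RR': route computation result
    bp: Optional[str] = None      # 'BP': name of the borrowing port


def borrow_rc(borrower: InputPort,
              ports: Dict[str, InputPort],
              dest: int,
              compute_route: Callable[[int], str]) -> Optional[str]:
    """Compute a route on another port's idle RC unit.

    Returns the route, or None when no RC unit is idle right now.
    """
    borrower.rs = RS.FAULTY                    # mark own RC unit faulty
    for lender in ports.values():
        if lender is borrower or lender.rs is not RS.IDLE:
            continue
        lender.rs = RS.ACTIVE_GLOBAL           # busy for another port
        lender.dr = dest                       # hand over the destination
        lender.bp = borrower.name              # remember who is borrowing
        lender.rr = compute_route(dest)        # lender's RC unit computes
        lender.rs = RS.IDLE                    # release the RC unit
        return lender.rr                       # borrower reads 'RR'
    return None


ports = {n: InputPort(n) for n in ("east", "west", "north", "south", "local")}
ports["west"].rs = RS.ACTIVE_LOCAL             # west is routing its own head flit
route = borrow_rc(ports["east"], ports, dest=42, compute_route=lambda d: "north")
```

In this run the east port skips the busy west port and borrows the RC unit of the north port, which returns to idle once the result has been placed in its ‘RR’ field.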
We combine the double routing strategy of [23] with the RC borrowing strategy. The generic router uses a lookahead routing mechanism, in which the RC unit of the upstream router computes the route for the next downstream router, embeds this pre-computed route in the packet, and delivers it to the downstream router. The current router therefore only computes the route for the next downstream router. We thus propose to skip RC borrowing if an idle RC unit is not found immediately, because failing to find an idle RC unit in the current cycle means waiting for two or more cycles. Instead, we forward the packet to the downstream router based on the pre-computed route already present in the packet at the current router. The downstream router then computes both its own route and the route for the next downstream router using the double routing strategy, which consumes one extra cycle. If an idle RC unit is found immediately, borrowing also consumes one extra cycle. Thus, using both strategies gives higher reliability than using only the double routing strategy, as in the RoCo [23] and HPR [29] routers.
To detect faults in the RC unit, we use the NoCAlert [34] invariance checkers. Figure 4 shows the fault detection circuitry. The Error1 signal asserts on an illegal turn; forbidding such turns is necessary to prevent deadlock in the network. The Error2 signal asserts when an invalid output port direction is computed. The Error3 signal asserts when the computed output port direction does not take the flit one step closer to its destination.
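The three RC invariants can be illustrated in software. The sketch below is not the NoCAlert circuitry; it assumes a 2D mesh with XY dimension-order routing (so a turn from the Y dimension back to the X dimension is treated as illegal), ignores lookahead, and uses hypothetical names (`check_rc`, `STEP`, `travel_dir`).

```python
# Coordinate change produced by leaving through each output port (assumed layout).
STEP = {"east": (1, 0), "west": (-1, 0),
        "north": (0, 1), "south": (0, -1), "local": (0, 0)}


def manhattan(a, b):
    """Hop distance between two (x, y) mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def check_rc(cur, dest, travel_dir, out_port):
    """Return (error1, error2, error3) for one route computation.

    cur, dest   : (x, y) coordinates of the current and destination routers
    travel_dir  : direction the flit was moving when it arrived
    out_port    : output port produced by the RC unit under test
    """
    error2 = out_port not in STEP              # invalid output direction
    if error2:
        return (False, True, False)            # other checks not applicable
    # Assumed XY turn model: Y -> X turns are forbidden.
    error1 = travel_dir in ("north", "south") and out_port in ("east", "west")
    if out_port == "local":
        error3 = cur != dest                   # ejecting at the wrong router
    else:
        dx, dy = STEP[out_port]
        nxt = (cur[0] + dx, cur[1] + dy)
        error3 = manhattan(nxt, dest) >= manhattan(cur, dest)
    return (error1, False, error3)
```

For example, a flit at (1, 1) heading east toward (3, 1) passes all checks when the RC result is east, while a result of west raises only the third error because it moves the flit away from its destination.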
5.2. VA Stage Fault-Tolerant Design
In the generic router, VA happens in two stages. We tolerate faults in the first stage at the second stage by using the default winner strategy. The second stage consists of twenty 20:1 arbiters, each of which associates with one downstream VC. The allocation requests to each 20:1 arbiter come from a set of arbiters in the first stage associated with each input port. Four requests come from an input port to a 20:1 arbiter. If faults occur in all corresponding four arbiters, we propose to provide one default winner request. In this way, VA proceeds even if all the arbiters in the first stage become faulty.
To facilitate fault tolerance at the VA stage, we propose to add two registers per 20:1 arbiter, as shown in Figure 5. The first register, known as IDP (Identification of the Port), is 5 bits wide. Each bit represents the fault status of the first stage arbiter set of one input port. The corresponding bit is asserted if all four first stage arbiters of an input port become faulty. The second register, known as IDVC (Identification of the Virtual Channel), is 10 bits wide and holds the identification of the default winner VC. Each pair of bits identifies the default winner VC for the first stage arbiter set of the corresponding input port. When faults occur, the corresponding fault status bit asserts, and the 20:1 arbiter takes its input from the default winner register instead of the first stage arbiters. In this way, even if all first stage arbiters are faulty, VA proceeds by using the default winner strategy. This strategy consumes no extra cycle, as it has no timing dependency on any other computation.
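The role of the two registers can be sketched with simple bit operations. The register widths (5-bit IDP, 10-bit IDVC with two bits per port) come from the text; the bit ordering of the ports, the function name `second_stage_requests`, and the use of a 4-bit request mask per port are assumptions made for illustration.

```python
# Assumed bit order: bit 0 of IDP and bits 1:0 of IDVC belong to "east", and so on.
PORTS = ("east", "west", "north", "south", "local")


def second_stage_requests(idp: int, idvc: int, first_stage: dict) -> dict:
    """Build the request lines that one 20:1 arbiter sees.

    idp         : 5-bit fault status, one bit per input port
    idvc        : 10-bit register, one 2-bit default winner VC id per port
    first_stage : port -> 4-bit mask of VCs requesting this downstream VC
    Returns port -> 4-bit request mask after default winner substitution.
    """
    requests = {}
    for i, port in enumerate(PORTS):
        if (idp >> i) & 1:
            # All four first stage arbiters of this port are faulty:
            # substitute a single request from the default winner VC.
            vc = (idvc >> (2 * i)) & 0b11
            requests[port] = 1 << vc
        else:
            requests[port] = first_stage.get(port, 0) & 0b1111
    return requests


# North (bit 2) is faulty and its default winner is VC 3.
reqs = second_stage_requests(
    idp=0b00100,
    idvc=0b11 << 4,
    first_stage={"east": 0b0101, "north": 0b0001},
)
```

Here the stale first stage output of the north port is ignored and replaced by a one-hot request for VC 3, while the healthy east port passes its requests through unchanged.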
The success of the default winner scheme depends heavily on the methodology used to select the default winner, since a poor selection methodology results in starvation. Arbiters in the generic router use a round-robin selection methodology, so the fairness of the selection should be close to that of a round-robin arbiter. We propose a selection based on the R and C fields of the VC status register. In addition, the default winner rotates in a cyclic manner to provide fairness. This avoids static allocation, which can cause starvation, and gives the packet in every VC a chance to proceed.
Both the VA and SA stages consist of arbiters. Figure 6 shows the NoCAlert [34] invariance checkers for arbiters. The Error1 signal asserts if the arbiter issues a grant without receiving any request. The Error2 signal asserts if the arbiter does not issue a grant despite receiving requests. The Error3 signal asserts if more than one bit of the grant vector is logically high, i.e., the arbiter issues multiple grants at a time.
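The three arbiter invariants translate directly into checks on the request and grant bit vectors. The following is a minimal software sketch, not the hardware checker of Figure 6; the function name `check_arbiter` and the integer encoding of the vectors are assumptions.

```python
def check_arbiter(requests: int, grants: int):
    """Return (error1, error2, error3) for one arbitration result.

    requests, grants : bit vectors, one bit per requester (assumed encoding)
    """
    error1 = requests == 0 and grants != 0     # grant without any request
    error2 = requests != 0 and grants == 0     # requests but no grant
    error3 = bin(grants).count("1") > 1        # more than one grant at a time
    return (error1, error2, error3)


ok = check_arbiter(0b0110, 0b0010)             # one requester granted
spurious = check_arbiter(0b0000, 0b0001)       # grant with no request
```

Note that these three checks, as listed in the text, do not by themselves catch a single grant given to a port that did not request; that case would need an additional comparison of `grants` against `requests`.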
5.3. SA Stage Fault-Tolerant Design
The generic router consists of two identical sets of SA units. One unit handles the speculative requests, and the second unit handles the non-speculative requests. On the arrival of the flit, it directly requests for speculative SA. Upon allocation, if the flit has already won the VA stage, it goes for the XB stage. Otherwise, in the next cycle, it requests for non-speculative SA. We exploit this redundancy to achieve fault tolerance. Non-speculative SA unit is more critical than the speculative one because if a fault occurs in it, an input port will not win a non-speculative grant and SA will not proceed.
Figure 7 shows the proposed fault-tolerant SA design. It modifies the runtime arbiter selection strategy for fault tolerance proposed in [29]. By exploiting the inherent redundancy provided by speculation, [29] proposes to select the arbiters of both allocators at runtime. This ensures that non-speculative requests are never blocked and SA always proceeds. To do so, it shifts the non-speculative requests to the speculative arbiter at runtime if the corresponding non-speculative arbiter becomes faulty. It achieves this by adding a few 2:1 multiplexers to select between the appropriate arbiters. Speculative requests arriving at that time shift to the non-speculative arbiter. As the non-speculative arbiter is already faulty, the flit uses the speculative arbiter in the next cycle. This strategy works for both the first and second stages of SA. The arbiter fault detection mechanism generates the necessary control signals for the runtime arbiter selection multiplexers. In case of a fault, this strategy consumes an extra cycle.
Now the speculative SA of the first stage becomes critical, since it handles both types of requests. To protect it, we propose a default winner strategy. When the arbiter of the speculative SA of the first stage becomes faulty, the input VC whose identification is stored in a register is taken as the winner without arbitration. The choice of the default winner is critical. If a specific input VC is always the default winner, the allocation is static and results in starvation. To avoid this, we propose dynamic allocation, in which the VC identification in the register rotates. The identification of the first non-empty VC found in an input port is stored in the register. Once all the flits in that VC have successfully traversed the XB, the register is updated with the next non-empty VC. Rotating the default winner among the VCs in this way avoids starvation. To further enhance reliability, this strategy is also used for the first stage of the non-speculative SA. The default winner strategy does not consume an extra cycle because it has no timing dependency on any other component. It takes effect when the corresponding arbiters in both the speculative and non-speculative SA become faulty. It not only saves an extra cycle but also allows SA to proceed. Thus, the use of both strategies gives higher reliability than the HPR router [29].
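The rotation of the default winner can be sketched as a cyclic search for the next non-empty VC. This is an illustrative model under assumptions: the function name `next_default_winner`, the per-VC flit counts, and the cyclic search order are not specified in the text, which only states that the register moves to the next non-empty VC once the current one has drained.

```python
def next_default_winner(current: int, vc_occupancy: list) -> int:
    """Pick the VC id to store in the default winner register.

    current      : VC id currently held in the register
    vc_occupancy : number of buffered flits in each VC of the input port
    Searches cyclically, starting after `current`, for a non-empty VC.
    Keeps `current` if every VC is empty.
    """
    n = len(vc_occupancy)
    for offset in range(1, n + 1):
        candidate = (current + offset) % n
        if vc_occupancy[candidate] > 0:
            return candidate
    return current


# VC 1 has just drained; VC 2 is empty, so the register moves on to VC 3.
winner = next_default_winner(1, [2, 0, 0, 3])
```

Because the search starts after the current VC rather than at VC 0, a busy low-numbered VC cannot monopolise the register, which is the starvation-avoidance property the rotation is meant to provide.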
5.4. XB Stage Fault-Tolerant Design
In the multiplexer-based XB, each output port is associated with a multiplexer. Exactly one primary path exists to access an output port, and the flit traverses the multiplexer to reach its output port. A fault in the multiplexer blocks all the flits trying to access that output port. We modify the XB to add multiple secondary paths that bypass the faulty multiplexer, so that flits can use the secondary bypass paths to access an output port. The secondary bypass paths consist of small multiplexers.
Figure 8 shows the XB fault-tolerant architecture.
For example, consider output port out1 in Figure 8. Multiplexers (M1, N1, O1) form the primary path to access out1. When a fault occurs in the primary multiplexer M1, out1 becomes inaccessible. In this case, the flit uses a secondary bypass path to reach out1. The following sets of multiplexer combinations form secondary bypass paths to reach out1: (M2, N1, O1), (M3, N3, O1), (M4, N3, O1). The SA provides the necessary control signals for the additional multiplexers. The secondary bypass paths are used only in case of a fault. In the absence of faults, the secondary bypass paths are inactive, and the architecture behaves like a generic 5 × 5 XB.
To use a secondary bypass path, the flit uses the primary multiplexer of another output port. The input port VC containing that flit must arbitrate for this output port in its SA stage. In the above example, when M1 is faulty, the flit must go through M2, M3, or M4 to reach out1. To facilitate this, we add two new state fields to the status register of each input VC (shown in Figure 3). The first is the secondary route field ‘SR’, which holds the output port associated with the primary multiplexer of the secondary bypass path. The second is the secondary route flag field ‘SRF’, which indicates the fault status of the primary multiplexer.
In case of a fault, the ‘SR’ field is updated with the appropriate output port after route computation. In the above example, the ‘SR’ field is updated with one of the output ports associated with multiplexers M2, M3, and M4. Then the secondary route flag ‘SRF’ is raised to indicate that there is a fault in the primary multiplexer and that the secondary bypass path is to be used. This strategy also provides fault tolerance for the second stage of SA: by arbitrating for another output port and using a secondary bypass path, faults in the arbiters of the second stage of SA are tolerated.
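The update of the ‘SR’ and ‘SRF’ fields can be sketched as a lookup over the available bypass paths. The bypass options for out1 (through M2, M3, or M4) come from the example above; the assumption that multiplexer Mi is the primary multiplexer of output port outi, the table `BYPASS`, and the function name `update_secondary_route` are hypothetical and used only for illustration.

```python
# Assumed mapping: primary multiplexer Mi belongs to output port "outi".
# Only the entry for out1 is taken from the example in the text.
BYPASS = {"out1": ["out2", "out3", "out4"]}


def update_secondary_route(route: str, faulty_primary: set, bypass: dict):
    """Return the (SR, SRF) pair for a flit whose computed route is `route`.

    route          : output port produced by route computation
    faulty_primary : output ports whose primary multiplexer is faulty
    bypass         : output port -> ports whose primary mux can reach it
    """
    if route not in faulty_primary:
        return (None, False)                   # primary path is healthy
    for alt in bypass.get(route, []):
        if alt not in faulty_primary:
            return (alt, True)                 # arbitrate for `alt` in SA
    return (None, True)                        # no healthy bypass path left


sr, srf = update_secondary_route("out1", {"out1"}, BYPASS)
```

With only M1 faulty, the flit is told to arbitrate for out2 and the flag is raised; if M2 were faulty as well, the same call would fall through to out3.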
8. Latency Analysis
In this section, we discuss the impact of the fault-tolerant enhancements on the latency of a generic router. The Gem5 [43] simulator is used for simulation. The generic router is simulated in Garnet [44] and then modified to implement the NoCGuard fault-tolerant design.
For software-based fault simulations, the ideal way is to inject faults based on the failure rates of the pipeline components. However, the failure rates estimated in the previous section are very small, and simulating with them would take a very long time. To reduce simulation time, a fault is injected after 1 million cycles. All simulations are done on an 8 × 8 mesh network and executed for 10 million cycles. An 8 × 8 mesh has 64 nodes, i.e., routers. We inject faults at 20 randomly chosen routers. Each router is injected with a fault in two of its pipeline stages, which are chosen randomly. Two experiments are conducted: the first uses synthetic traffic, and the second uses PARSEC [45] and SPLASH-2 [46] benchmark applications.
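The random fault placement described above can be reproduced with a short script. This is a sketch of the selection step only, not of the Gem5/Garnet fault injection itself; the function name `fault_plan`, the seed parameter, and the use of the four stage labels RC, VA, SA, and XB as injection targets are assumptions.

```python
import random

STAGES = ("RC", "VA", "SA", "XB")  # assumed set of injectable pipeline stages


def fault_plan(seed: int, rows: int = 8, cols: int = 8,
               n_routers: int = 20, stages_per_router: int = 2) -> dict:
    """Choose which routers and which pipeline stages receive a fault.

    Returns a dict mapping router id (0 .. rows*cols - 1) to a tuple of
    distinct pipeline stages to be marked faulty in that router.
    """
    rng = random.Random(seed)                  # seeded for repeatable runs
    routers = rng.sample(range(rows * cols), n_routers)
    return {r: tuple(rng.sample(STAGES, stages_per_router)) for r in routers}


plan = fault_plan(seed=0)
```

Seeding the generator keeps the faulty set identical across the generic, HPR, and NoCGuard runs, so that latency differences are not an artefact of different fault locations.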
The first experiment with synthetic traffic employs uniform random, tornado, shuffle and transpose traffic patterns. Injection rate varies from 0.01 to 0.1 packets/node/cycle in 5 steps.
Figure 11 shows the latency comparison of the NoCGuard and HPR [29] routers with the generic router under synthetic traffic patterns. In the absence of faults, NoCGuard incurs no extra latency. For the uniform random, tornado, shuffle, and transpose traffic patterns, the average increase in latency is 3.45%, 3.41%, 3.46%, and 3.32%, respectively. For all traffic patterns, this increase in average latency is smaller than that of the state-of-the-art HPR reliable router architecture.
The second experiment employs PARSEC and SPLASH-2 benchmark applications. In the 8 × 8 mesh, each core is associated with a cache and a directory. The NoC uses the MOESI_CMP_directory cache coherence protocol in this experiment.
Figure 12 and Figure 13 show the latency comparison of the NoCGuard and HPR [29] routers with the generic router under PARSEC and SPLASH-2 benchmark applications. NoCGuard incurs no extra latency in the absence of faults. For the PARSEC and SPLASH-2 benchmark applications, the average increase in latency is 12% and 15%, respectively. For the benchmark applications too, the increase in average latency is smaller for all application traces than that of the state-of-the-art HPR [29] reliable router architecture.
9. Conclusions
We propose NoCGuard, a reliable router architecture based on a generic router that degrades gracefully in performance. It uses different architectural modifications to tolerate faults in the router's RC, VA, SA, and XB pipeline stages. Simulation results show graceful degradation in latency even under heavy network traffic. Lifetime reliability evaluation using MTTF reveals a 5.53 times improvement compared to a state-of-the-art reliable router architecture. Most importantly, the mean number of faults to cause failure, the area, and the SPF estimation show that NoCGuard is more reliable than existing state-of-the-art reliable router architectures. Overall, NoCGuard achieves the right balance between performance degradation, reliability, and area overhead.