Maximizing the Inner Resilience of a Network-on-Chip through Router Controllers Design †

Reducing component size and increasing the operating frequency of integrated circuits make Systems-on-Chip (SoCs) more susceptible to faults. Faults can cause errors, and errors can propagate and lead to a system failure. SoCs employing many cores rely on a Network-on-Chip (NoC) as the interconnect architecture. In this context, this study explores alternatives for implementing the flow regulation, routing, and arbitration controllers of an NoC router, aiming at minimizing error propagation. For this purpose, a router with Finite-State Machine (FSM)-based controllers was developed targeting low use of logical resources and design flexibility for implementation in FPGA devices. We elaborated and compared the synthesis and simulation results of architectures that vary their controllers between Moore and Mealy FSMs, as well as the application of Triple Modular Redundancy (TMR) hardening. Experimental results showed that the routing controller was the most critical one and that migrating from a Moore to a Mealy controller offered a lower error propagation rate and higher performance than the application of TMR. We intend to use the proposed router architecture to integrate cores in a fault-tolerant NoC-based system for data processing in harsh environments, such as in space applications.


Introduction
Due to technological development and increasing integration, communication architectures used in computers for aerospace applications are composed of a growing number of processing cores. The increase in processing power is demanded by the increasing number of high-resolution sensors and the bandwidth requirements of satellite-ground links. Some approaches proposed in the literature do not fulfill the communication requirements of future on-board computers [1]. Networks-on-Chip (NoCs) represent an alternative to the interconnect bus for multi-core systems. They can be used in aerospace applications as the communication backbone for interconnecting processors, memories, and the controllers of actuators and smart sensors when these components are integrated on a single chip to reduce the dimensions of the primary computer systems. However, the costs of an NoC are not negligible, especially when fault tolerance is required.
According to [2], different types of Systems-on-Chip (SoCs) require fault-tolerant components depending on the target environment. A fault-tolerant NoC must be able to detect the occurrence of faults and mitigate their effects.

Background
The interconnection in an SoC with few cores is generally performed through shared buses because this architecture is reusable and reduces design costs and time. However, SoCs with dozens of cores require an interconnection structure with performance scaling adjusted to the size of the system. For this reason, in the early 2000s, several studies argued that NoCs would be the best means of solving this problem [4][5][6][7][8]. NoCs are derived from the interconnection networks used in parallel computers [9,10]; they are reusable, like the shared bus, and offer parallelism in communication and scalable performance. An NoC consists of routers, links, and adapters (or Network Interfaces) [8], as shown in Figure 1.
The router is the main component of an NoC system. It comprises registers, multiplexers, arbiters, routing, and flow control circuits, in addition to buffers for the temporary storage of packets [10]. It also has input and output links for communication with the local core and other routers in its neighborhood. The links are usually structured into two unidirectional point-to-point channels, which may be synchronous or asynchronous. Each channel is made up of data and flow control signals [8].
The Network Interface (NI) is the unit that transparently connects a processing core to a router. It is responsible for adapting the communication protocols used by the core and the NoC. NIs are classified according to the nature of the client, which might be a processor, a shared memory, or an external channel [10].

Network-on-Chip Communication
The usual means of communication between cores in an SoC is through an exchange of messages broken into packets. Each packet consists of a header (relative to the start of the packet), a payload (content), and a trailer (signaling the end of the packet).
The data flow in the network is outlined in [12]. Table 1 identifies the components of an SoC and relates them to the layers of the Open System Interconnection (OSI) model. Each layer can provide services, including fault tolerance. The System layer corresponds to the processing cores and the application. The Interface layer decouples the cores from the network and controls the process of sending and receiving packets. The Network layer is responsible for packet routing, while the Link layer addresses questions related to coding, synchronization, and reliability.

Network-on-Chip Features
An NoC can be characterized in terms of the following attributes [8]: topology, flow regulation, memorization, routing, arbitration, and switching.
The topology defines the routers and links and arranges them in the form of a graph. The most common topologies in SoCs are those of the planar type, like the 2D mesh. With the advent of 3D integration processes, NoCs for these systems usually have inter-layer communication through vertical links. These links are made available in reduced numbers and are known as Through-Silicon Via (TSV) [13]. The topology of an NoC defines the physical layout and the connections between the nodes and the channels. It is characterized by the number of ports in each router, the number of hops from the source node to the destination node, the channel bandwidth, and path diversity [14]. The architecture of an NoC is usually defined at design time or according to the requirements of the target application [15].
Flow control is responsible for allocating the resources necessary for a packet to travel through the network by regulating traffic in the channels. This regulation is required to prevent undesired loss of data between a sender and a receiver. In general, NoCs are networks that do not discard packets (i.e., they are lossless). Packets are generally divided into flow control units (flits), on which flow control takes place. Different flow control techniques are employed in NoCs, such as handshake, stop-and-go, credit-based, and virtual channels. Typically, a flit corresponds to a word of the physical channel, called a physical unit (phit).
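As an illustration of one of the techniques listed above, the following Python sketch models credit-based flow control on a single channel. The class and method names are hypothetical; a real NoC implements this logic in hardware.

```python
class CreditFlowControl:
    """Sketch of credit-based flow control for one NoC channel.

    The sender holds a credit counter initialized to the receiver's
    buffer depth (in flits); sending a flit consumes one credit, and
    the receiver returns a credit each time it drains a flit.
    """

    def __init__(self, buffer_depth):
        self.credits = buffer_depth      # free slots at the receiver
        self.receiver_buffer = []

    def try_send(self, flit):
        """Send a flit only if the receiver has room; return success."""
        if self.credits == 0:
            return False                 # back-pressure: sender must wait
        self.credits -= 1
        self.receiver_buffer.append(flit)
        return True

    def receiver_drain(self):
        """Receiver consumes one flit and returns a credit upstream."""
        if self.receiver_buffer:
            self.receiver_buffer.pop(0)
            self.credits += 1
```

Because the sender never transmits without a credit, the receiver buffer can never overflow, which is what makes the network lossless.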
Packets destined for channels that are already allocated must wait before being forwarded. This approach requires the implementation of a scheme that enables the blocked packets to be stored in queues within the router. Memorization (or buffering) can either be implemented at the input channels or the output channels. A shared centralized memory based approach can also be adopted.
Routing sets out the path that must be followed to forward a packet to its destination. There are many different routing strategies, which are usually classified according to the following criteria: the number of destinations (unicast or multicast), the location where the routing is performed (centralized, source, or distributed), the type of physical implementation (table-based or algorithmic), and its adaptiveness (deterministic, adaptive, or oblivious). An example of a widely used technique in NoCs is dimension-order routing, which is unicast, deterministic, and usually implemented as a hardware-supported algorithm executed in the routers. Routing algorithms in NoCs should prevent packets from being blocked in the network (causing a deadlock) and the problem of packets moving through the network without ever reaching their destination (livelock) [9].
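The dimension-order (XY) decision just described can be sketched as a small function. The port naming and axis orientation (larger Y mapped to North) are assumptions for illustration.

```python
def xy_route(cur_x, cur_y, dst_x, dst_y):
    """Deterministic XY (dimension-order) routing decision.

    The packet first travels along X; only when it reaches the
    destination column may it turn into the Y dimension. Returns the
    output port name, using the five-port naming of a 2D mesh router.
    """
    if dst_x > cur_x:
        return "East"
    if dst_x < cur_x:
        return "West"
    # Same column: route along Y.
    if dst_y > cur_y:
        return "North"
    if dst_y < cur_y:
        return "South"
    return "Local"   # packet has arrived at its destination router
```

Because every packet resolves X before Y, no cycle of channel dependencies can form in a 2D mesh, which is why XY routing is deadlock-free.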
A conflict that arises when two or more packets compete for the same channel can be settled by arbitration. Round-robin is the arbitration scheme that is most widely employed in NoCs because it provides a fair distribution of channel usage. In the case of NoCs with Quality of Service (QoS) provision, requiring flow differentiation, alternative schemes can be used to meet the temporal requirements.
Switching determines how a message is transferred from the input of a router to one of its output channels. The main types are circuit and packet switching. There are different types of packet switching techniques, such as Store-and-Forward (SAF), Virtual Cut-Through (VCT), and wormhole. The latter performs the switching in the flit-level, and it is the most commonly used in NoCs because it offers low latency at less cost.

Fault Tolerance and Networks-on-Chip
The relationship between fault, error, and failure was presented in [16]. A fault may manifest as an error, and an error may result in incorrect behavior, called a failure. A fault may also be masked and not lead to an error, just as an error may be masked and not lead to a failure. Masking occurs at the logical, architectural, or application level and for a variety of reasons. Faults and errors can be classified according to their duration as transient, intermittent, or permanent [17].
Studies about fault tolerance in NoCs mainly address both transient and permanent faults. For instance, the works in [18][19][20][21][22][23] examined Single Event Upset (SEU) in NoC components. The studies in [18][19][20] dealt with transient faults that were due to crosstalk. In [24][25][26], the authors investigated the problem of short and open circuit faults in the links of an NoC. In [27][28][29], the authors discussed the yield of vertical links in 3D NoCs. The studies that examined intermittent faults [24,30,31] treated them as permanent faults.
In the following, we discuss how fault tolerance techniques are deployed in NoCs; the methods for error detection, correction, and recovery, as well as diagnosis and repair; and the metrics commonly employed to evaluate these techniques and methods.

Network-on-Chip Layers
An NoC can be divided into four layers: System, Interface, Network, and Link (Table 1). The studies examined in the literature address the provision of fault tolerance in the structures of these layers, as follows.
The System layer is related to the processing cores and applications. Some works, like [32][33][34][35], sought to carry out the remapping of a task from a defective processor to a healthy unit. The Interface layer provides the communication services that are operated at the NIs. This layer was the focus of investigation in [36,37]. The solutions of these authors included the retransmission of corrupted packets [21,22] and giving support to communication primitives, such as OpenMP and Message Passing Interface (MPI) [38].
The Network layer essentially consists of the routers. Some works address methods for the protection [19] and verification [22,39] of their internal components. In [40], the authors focused on the switching logic, though most implementations involve routing algorithms that bypass defective routers. Many of them include adaptive routing algorithms in 2D [41][42][43][44][45] or 3D topologies [28,31,46]. One way to ensure packet delivery is by using trusted paths [47] or even redundant subnets [38,46]. Multiple-path techniques, like flooding, can also be adopted [48].
The Link layer comprises the data links responsible for connecting each router to its neighbors and the processing core. Some studies have employed techniques related to flow control, such as bus encoding [18], retransmission [20], and bandwidth allocation [49]. Others have adopted additional links, such as the duplication of wires and channels [24,50], spare links [41,51], link serialization [26], or by generating new links [25]. The main focus in 3D NoCs is on the TSVs and the problems related to the manufacturing process [27,29]. There are also studies about fault tolerance techniques for unstructured links, such as optical [52] and wireless [53].

Detection and Correction
The basic mechanism for providing fault tolerance in a system is redundancy [54]. The purpose of this mechanism is to detect and, in some cases, to fix errors in the components. Redundancy techniques can be classified as Spatial, Temporal, and Information and are applied in NoCs as follows.
Spatial redundancy involves the addition of circuits, with the replication of modules whose outputs are compared by a voter. It is often performed through the use of Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR). When spatial redundancy is implemented in NoCs, it often consists of the replication of links [20,27,28,30,34,38,40,47,50]. The replication of routers [34,35], their internal structures [42,51], the use of adaptive routing tables [55], and the inclusion of checkers and testers in hardware [39] are also classified in this category.
Temporal redundancy consists of re-executing an operation and comparing the results for validation. It is usually implemented by running an algorithm n times on the same hardware. In NoCs, some works employ temporal redundancy through the generation of multiple samples of a message [19,32,48] or through link sections [26].
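A behavioral sketch of this idea, assuming the same operation can be re-executed n times and that a majority of the runs are fault-free; the names are illustrative.

```python
from collections import Counter

def temporal_redundancy(operation, n=3):
    """Run an operation n times on the same hardware and vote.

    Illustrative only: a transient fault that corrupts one of the n
    executions is outvoted by the majority of fault-free runs.
    """
    results = [operation() for _ in range(n)]
    value, count = Counter(results).most_common(1)[0]
    if count <= n // 2:
        raise RuntimeError("no majority: fault could not be masked")
    return value
```

Unlike spatial redundancy, this trades execution time rather than silicon area for reliability, and it only masks transient faults: a permanent fault corrupts every re-execution identically.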
Information redundancy relies on additional bits for error detection and correction. Error-Detecting Code (EDC) techniques can detect the incidence of an error, while Error-Correcting Codes (ECCs) not only detect, but also correct an erroneous data word. Examples include parity [56] and Cyclic Redundancy Check (CRC) [21,36]. Moreover, the replication of the header flit can be adopted to ensure correct routing [23].

Recovery
There are two main types of technique for recovering from an error: Forward Error Recovery (FER) and Backward Error Recovery (BER). FER enables operational continuity in the presence of errors, without having to return to a previous state. The Hamming distance, a concept widely addressed in the literature, is the number of positions in which two words of equal length differ. In NoCs, some works use the Hamming code and its variations to correct one error and detect up to two errors in a single data word [19,30,37].
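A minimal Hamming(7,4) sketch illustrating single-error correction; detecting a second error additionally requires an overall parity bit (the SEC-DED extension cited above), which is omitted here for brevity.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a Hamming(7,4) codeword (positions 1..7).

    Parity bits sit at positions 1, 2, and 4 and each covers the
    positions whose binary index has the corresponding bit set; a
    single-bit error then yields a nonzero syndrome equal to the
    faulty position.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(code):
    """Recompute the syndrome and flip the indicated bit, if any."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1         # correct the single-bit error
    return c, syndrome
```

A single upset anywhere in the codeword, including in a parity bit, is located and repaired without retransmission, which is why FER schemes based on this code suit NoC links.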
BER techniques ensure the system can return to a previous state when it is considered to be healthy. They use checkpoints or logs and require additional memory elements to preserve these states. In NoCs, BER techniques rely on packet retransmission [20,21,32,49,51]. It is worth noting that FER is more widely used than BER because the latter requires more memory elements, which makes the system more susceptible to SEU faults.

Diagnosis and Repair
Detection and correction are approaches that are sufficient for the operational continuity of a system with transient errors. However, when a system has permanent errors, it is necessary to test and check its components to determine its correctness. Some works offer online test features [21,30,38,39], while others adopt Built-In Self-Test (BIST), a mechanism that enables a system to test itself. In BIST, specific hardware generates stimuli at the inputs of a circuit and compares the output with the correct expected values. Variations of this technique were employed in NoCs [24,26,34,41,45].
After discovering a defective component, it is desirable to disable it and implement a contingency plan, such as reconfiguration, before resuming the execution. In routers, it is possible to employ techniques that can avoid faulty components in the switching logic [21], arbiters [52], and links [26]. However, the focal point of the research is on the reconfiguration of the routing tables, as well as the algorithms needed to bypass faulty routers [28,[40][41][42][43]47,55]. The network interfaces are also used for diagnosis and repair [36].

Evaluation and Metrics
Several studies have adopted specific simulators and virtual platforms to evaluate the proposed techniques [26,33,34,39,44,45,49,51,53]. Some of them used dedicated simulators based on the targeted application [25,28,38]. Most of the works carried out a synthesis in Application-Specific Integrated Circuit (ASIC) to obtain a more accurate cost assessment than that provided by the synthesis of programmable logic devices, such as Field Programmable Gate Array (FPGA).

Router Architecture
In this study, we designed a parameterizable router architecture to evaluate the occupation of resources, propagation of errors, performance, and energy consumption of different combinations of the controllers responsible for data flow regulation, routing, and arbitration. We then implemented these controllers using Moore and Mealy FSMs. This router was partially evaluated in [11], in which we analyzed the different implementations of the routing controllers only. This article explores the design space for implementing the controllers of an NoC router for finding the best trade-off. It also provides a reference for evaluating the possible combinations of controllers when considering a specific environment or the constraints that are more stringent for the design of a given distributed system. As far as we know, no other work in the literature has investigated the internal mitigation of errors concerning the type of implementation chosen for the router controllers.
A Moore machine defines its output signals according to its current state, whereas the Mealy machine also takes into account its input signals and asserts these outputs during the transition from one state to another (or to the same state).
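The one-cycle difference between the two machine types can be illustrated with a toy request/grant controller in Python. The real controllers are described in VHDL, so this is only a behavioral sketch with hypothetical names.

```python
class MooreGrant:
    """Moore FSM: the output is a function of the current state only,
    so a request is granted one clock cycle after it is seen."""
    def __init__(self):
        self.state = "IDLE"
    def clock(self, request):
        out = 1 if self.state == "GRANT" else 0   # output from state alone
        self.state = "GRANT" if request else "IDLE"
        return out

class MealyGrant:
    """Mealy FSM: the output is a function of state and input, so the
    grant is asserted in the same cycle the request arrives."""
    def __init__(self):
        self.state = "IDLE"
    def clock(self, request):
        out = 1 if request else 0                 # depends on the input directly
        self.state = "GRANT" if request else "IDLE"
        return out
```

Driving both machines with the same request sequence shows the Mealy output leading the Moore output by exactly one cycle, which is the latency saving exploited in the controllers below.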
The architecture of the router was designed with a focus on regularity, flexibility, and low area overhead. To fulfill these requirements, we employed the wormhole switching technique and input buffers capable of storing n words. The router had five ports, named Local, North, East, South, and West. The Local port was the terminal to which a processing core was attached through a network interface, and the other ports were used to connect the router with its neighbors. Internally, each port was connected to a crossbar responsible for the interconnection among the input and output channels that composed the communication ports. Figure 2 shows the architecture of the router. The novelty of the proposed router architecture was its flexibility to combine different types of FSM in the implementation of the internal controllers. The router was intended to be used in 2D mesh topology networks.

Figure 3 shows an overview of the proposed packet structure. The general packet format covers all the layers shown in Table 1. The sideband bits were related to the Link layer and comprised the framing tags that defined the beginning and the end of the packet. The packet was then structured in one or more mandatory Network flits, optional Interface flits, and the payload flits related to the System layer.
To determine the latency of a flit to traverse the router, it is necessary to assess the number of cycles spent in each sequential logic structure, as presented in Equation (1). As the crossbar was implemented using combinational logic only, this component was not considered in the latency analysis.
Latency_flit = Cycles_flow + Cycles_buffer + Cycles_routing + Cycles_arbitration + Cycles_flow    (1)

In a scenario without router contention, the buffer latency consisted of a single cycle. The internal controllers of the router varied their latency according to their implementation. Each controller required one cycle when using a Moore FSM or none when using a Mealy FSM. Input and output flow controllers must be implemented using the same FSM approach to provide link compatibility. Equation (2) represents the best-case latency for a flit to traverse the router.
Using only Moore-based controllers, a flit needs at least five cycles to traverse a router. In a fully Mealy architecture, only the buffering cycle is required. The following subsections present the architecture of each Moore controller and its Mealy equivalent for performing the same function.
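The cycle counts above can be summarized in a small helper that follows Equation (1), assuming no contention; the function name is illustrative.

```python
def router_latency(flow_fsm, routing_fsm, arbitration_fsm):
    """Best-case flit latency per Equation (1): input flow control,
    buffer, routing, arbitration, and output flow control, in order.

    A Moore controller adds one cycle, a Mealy controller adds none,
    and the input buffer always costs one cycle. The flow-control
    cost appears twice because both the input and the output flow
    controllers (which must use the same FSM type) are traversed.
    """
    cost = {"moore": 1, "mealy": 0}
    flow = cost[flow_fsm]
    return flow + 1 + cost[routing_fsm] + cost[arbitration_fsm] + flow
```

The two extreme configurations reproduce the figures stated in the text: five cycles for an all-Moore router and a single buffering cycle for an all-Mealy one.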

Flow Regulation Controller
The flow regulation controller implemented a four-stage handshake protocol for receiving and sending packets through its input and output channels, respectively. The signals used in these controllers comprised the request and acknowledgment lines of each channel.

Routing Controller
The routing controller executed the XY routing algorithm to schedule an output channel. It was composed of a datapath and an FSM. The datapath had comparators that analyzed the destination address enclosed in the packet header. It compared this address with the coordinates of the router to define a set of signals identifying the relationship between these addresses. These signals were named x_eq, y_eq, x_gt, y_gt, x_lt, and y_lt. The XY algorithm started running upon the arrival of the tag (fra) that marked the beginning of a packet. The scheduling of an output channel then followed the well-known criteria of the XY algorithm: any packet must first travel along the X direction, and only when it reaches the column of the destination node can it take a path along the Y direction. Figure 8 presents the Moore implementation of this controller.
In the Moore approach, when a packet header was received, the machine took a branch to schedule an output channel, and each branch had two states. In the first state, after the packet header was forwarded, the FSM went to the second state, in which it waited for the packet trailer. When this trailer was forwarded, the FSM went back to the idle state (i.e., S0), and the request was de-asserted. On the other hand, in a Mealy implementation (Figure 9), it was possible to save one state for output for each channel scheduling.

Arbitration Controller
The arbitration controller consisted of a round-robin arbiter responsible for scheduling the use of the output channel by the packets of the requesting input channels. As there was no provision for loop-back communication in our implementation, each arbiter scheduled up to four requests (namely A, B, C, and D), depending on the router address. Similar to the previous controller, it could also be implemented using a Moore or a Mealy FSM. Figure 10 presents a simplified representation of the Moore-based implementation (some transitions are omitted for the sake of clarity). As in the design of the routing controller, the Mealy implementation of the arbitration controller (Figure 11) needed only one state for each scheduling branch, and a grant was only given when there was an active request.
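A behavioral sketch of the round-robin grant decision, assuming the scan resumes just after the last granted requester; the real arbiter is an FSM in hardware, and the indices 0 to 3 stand for the requests A to D.

```python
def round_robin_grant(requests, last_granted):
    """Grant one of up to four requesters in round-robin order.

    The scan starts just after the last granted channel so that
    every requester is served fairly, and a grant is only issued
    when there is an active request (as in the Mealy controller).
    Returns the granted index, or None if no request is active.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None   # no active request: no grant is given
```

Starting the scan after the previous winner is what gives round-robin its fairness: a requester that just used the channel becomes the lowest-priority candidate on the next arbitration.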

Controller Protection
The controllers described above were also implemented in a hardened version. The TMR technique was chosen to protect the FSM of each controller. It consisted of replicating the component in three units, all of them operating over the same input signals. Afterward, the output of each replica was compared by a single majority voter, which elected the most common output value, as illustrated in Figure 12.
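The majority voter can be sketched bitwise in Python, assuming the three replica outputs are integers of the same width.

```python
def tmr_vote(out_a, out_b, out_c):
    """Bitwise majority voter over three replicated controller outputs.

    Each bit of the voted result is 1 when at least two of the three
    replicas agree on 1, which masks a single faulty replica without
    interrupting operation.
    """
    return (out_a & out_b) | (out_a & out_c) | (out_b & out_c)
```

Note that the voter itself is not replicated here; as in the architecture of Figure 12, it remains a single point whose own faults TMR cannot mask.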
The TMR technique was selected because it is widely used in reliable systems and due to its ability to mask an error transparently [2]. This technique implies a high resource overhead if applied to complex structures. Meanwhile, in the case of components with few interface signals and few registers, such as those addressed in this work, it can represent a good trade-off.

Fault Injection
This section first describes the combinations of the router architecture submitted to verification. Then, it presents the fault injection method and the fault model used for the experiments.

Router Verification
In our study, a workload was generated for the evaluation of the router and the fault injection campaign. This workload was designed to inject packets continuously to a fixed set of non-concurrent paths of input and output channels. Thus, it enabled obtaining metrics for the router operating at its highest possible load. The following channel combinations were assigned to comply with the XY routing algorithm requirements: Local → East; East → West; West → South; South → North; North → Local. Figure 13 illustrates the connections within the crossbar of the router. The packet format used for verification is shown in Figure 14. It consisted of a single bit to perform flow control, a single flit as the header, two payload flits, and a trailer. The header flit was used solely to address the coordinates of the destination router. Both the header and the last payload flit (trailer) used "1" as the frame bit, while the regular payload flits used "0".
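The verification packet of Figure 14 can be sketched as follows. The packing of the destination coordinates inside the header flit is an assumption, as the exact field layout is not reproduced here.

```python
def build_packet(dest_x, dest_y, payload, flit_width=32):
    """Assemble a verification packet in the format of Figure 14:
    a header flit carrying the destination coordinates, payload
    flits, and a trailer (the last payload flit).

    The sideband frame bit is '1' on the header and the trailer and
    '0' on regular payload flits. The coordinate packing inside the
    header is illustrative.
    """
    header = (dest_x << 16) | dest_y          # assumed coordinate packing
    flits = [(1, header)]                     # (frame_bit, flit_data)
    for word in payload[:-1]:
        flits.append((0, word))
    flits.append((1, payload[-1]))            # last payload flit = trailer
    return flits
```

With two regular payload words plus the trailer, this yields the four-flit packet used by the workload, and the frame-bit pattern 1-0-0-1 lets the router's controllers delimit the packet without counting flits.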

Fault Injection Environment
There are several different fault injection strategies proposed in the literature. They can be classified into hardware-based injection, software-based injection, simulation-based injection, emulation-based injection, and hybrid injection [59].
The strategy proposed in [60] was adopted for this experiment. The solution was originally designed to inject SEU faults into the registers of a processor and was customized to operate on the proposed router. The technique consisted of a simulation-based fault injection that relied on the built-in commands of the ModelSim® simulator. In each experiment, a single fault was injected by inverting the logical value of the target signal. If the output of any external port differed from the golden run, it was assumed that the fault resulted in an error. For each router configuration, 1000 simulations running for 100 µs were performed. This approach was applied to obtain a more accurate measurement of the error propagation rate in all scenarios. Algorithm 1 presents a pseudo-code that summarizes the steps of the fault injection campaign: each run records the architecture outputs, and an ErrorPropagation function compares every run against the golden run, incrementing an error counter on each mismatch. An architecture that requires more memory elements (i.e., registers) is more susceptible to SEU faults, due to the increased exposure area. For this reason, the number of propagated errors was normalized for a fairer comparison. Equation (3) shows the normalization adopted for the error rate comparisons.
The number of simulations that propagated an error was divided by the total number of simulations (1000 in this case). This ratio was then multiplied by the ratio between the total number of registers in the architecture under simulation and the register count of a reference architecture, taken as the most costly one.
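Under our reading of Equation (3), the campaign and its normalization can be sketched as follows; `simulate`, `golden`, and the register counts are placeholders for the ModelSim-driven flow.

```python
import random

def fault_injection_campaign(simulate, golden, num_runs, num_registers,
                             ref_registers):
    """Sketch of the fault injection campaign described in the text.

    Each run flips one randomly chosen register (an SEU) and compares
    the outputs against the golden run. The resulting error rate is
    then scaled by the ratio of the architecture's register count to
    that of the reference (most costly) architecture, so designs with
    larger exposure areas are compared fairly.
    """
    errors = 0
    for _ in range(num_runs):
        target = random.randrange(num_registers)   # register to upset
        if simulate(target) != golden:             # mismatch vs. golden run
            errors += 1
    raw_rate = errors / num_runs
    return raw_rate * (num_registers / ref_registers)
```

The scaling term means that a small architecture whose faults always propagate can still rank better than a register-heavy one whose faults rarely do, reflecting the larger cross-section the latter exposes to SEUs.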

Results
We defined a set of configurations to evaluate the different approaches for implementing the controllers, and each controller was implemented using a Moore or Mealy machine, with or without TMR. For reference, the implementations without TMR were named STD (i.e., Standard). We then synthesized each implementation to obtain its silicon costs, power consumption, and performance metrics. Afterward, we applied the fault injection campaign to measure the error rate. It is worth noting that the router was configured to use handshake flow control, XY routing, round-robin arbiter, flits with 32-bit width, and a 128-bit input buffer.
The architectures were described in VHDL and synthesized using the Intel ® Quartus Prime, Version 18.1, targeting the 5CGTFD9E5F35C7 FPGA device of the Cyclone V family. For synthesis, all optimization flags were de-asserted to allow the inference of redundant circuits when applying TMR and also for a more accurate comparison.
The simulations were run using ModelSim-Intel ® FPGA Edition. The Quartus Power Analyzer tool was used for power and energy estimation, and it was configured to use a Value Change Dump (VCD) file from the ModelSim simulation as input stimuli.
The different architectures evaluated were identified as STD (Standard), when no fault tolerance technique was applied, and TMR, when the triple modular redundancy technique was used.
All the experiments were conducted on an IBM PC compatible laptop with an Intel i7-4510U processor and 8 GB of RAM, running the Ubuntu Linux 18.04 operating system. We executed 22 syntheses, one per router architecture, and each synthesis lasted about three minutes. The fault injection experiments comprised 1000 simulations per architecture, consuming about six hours to evaluate all the design space considered.
To evaluate the synthesis and resilience results, we employed the metrics most commonly adopted in the literature [54]. Silicon costs are expressed by the number of Look-Up Tables (LUTs) and Flip-Flops (FFs) occupied. Performance was given by the maximum operating frequency (F_max), execution time, and throughput (i.e., the rate of data delivered without errors). Energy costs are given by the total power dissipation and the energy consumed during the time used to deliver the data injected. Finally, the primary metric was the error propagation rate, from which we inferred the fault coverage: the lower the error rate, the higher the fault coverage. The following subsections present the results obtained.

Table 2 presents the synthesis results for the experiments that evaluated the Moore-based implementations. As expected, the architecture that did not use redundancy in any of its controllers required the fewest combinational and sequential elements. Due to a longer critical path, configurations that used redundancy in controllers showed some degradation of the maximum operating frequency. It was observed that the dissipated power practically did not change, even with the presence of a VCD input file. This behavior occurred because an isolated router consumes less than 1% of the resources available on the target FPGA, even in its most expensive configuration. However, these values were useful in obtaining energy consumption considering the simulation time.

The simulation results of the Moore-based configurations are presented in Table 3. Following the fault injection campaign, a minor change in the propagated error rate was observed. This behavior was mainly because the controllers did not account for the largest share of registers; that cost was due to the input buffers.

Moore STD versus Moore TMR
Considering the maximum operating frequency and given the simulation time, we can obtain the total execution time (t_exe). This information, combined with the number of flits transmitted by each of the communication channels, enabled measuring the total router throughput, considering only the valid accepted traffic (with no errors). Similarly, by multiplying the total execution time by the power obtained in synthesis, we obtained the energy consumed by each combination.
In this comparison, the standard version of the router had the highest throughput and the lowest energy consumption. These results indicated that considering the entire router, the slight increase in reliability with protected controllers came at a high price because the performance and energy efficiency were degraded by more than 20%.

Mealy STD versus Mealy TMR
The results obtained from the evaluation of Moore-based controllers showed that the application of redundancy produced a slight reduction in error propagation, but at the price of degrading performance and energy efficiency. Seeking better results, we developed a Mealy variation of each controller. Table 4 presents the synthesis results for the Mealy-based controllers. As observed in the previous section, the use of TMR led to an increase in the use of combinational and sequential logic and a consequent reduction in the maximum operating frequency. As can be seen from Table 5, the application of the hardening technique slightly increased the router reliability, showing a 2% error rate in a fully protected configuration. However, the highest throughput and the highest energy efficiency came from the unprotected version. As Mealy-based controllers had lower latency than their Moore equivalents, the average latency of packet propagation across the router was reduced. This effect occurred because reducing the time a packet stayed in the router also reduced its fault exposure.

Moore versus Mealy
Figures 15-17 present the simulation results of all possible combinations of FSM type (Moore or Mealy) and implementation (STD or TMR). For instance, the configuration STD-STD-STD defines that the controllers for flow regulation, routing, and arbitration, respectively, did not implement triple modular redundancy. For this configuration, as Figure 15 shows, the Moore-based controllers had an error propagation rate of 13.2%, while the Mealy-based controllers had an error rate of 2.8%. It is worth noting that all the controllers of each configuration used the same type of state machine implementation. Analyzing the results in this figure, we observed that, for the same type of FSM implementation, the error propagation rate did not decrease significantly as the controllers were protected. However, when comparing FSM types, the Mealy-based approach proved more resilient, propagating on average 10% fewer errors than the Moore-based implementation for the same combination of controllers.

Figure 16 presents the throughput results. They showed almost no variation between the Moore and Mealy implementations of each configuration. However, we noticed that increasing the reliability of the routing and arbitration controllers through redundancy degraded the throughput. On the other hand, adding TMR to the flow regulation controllers did not degrade communication performance; on the contrary, in some cases it resulted in a slight improvement in throughput.

Figure 17 compares the energy consumption of the different configurations. As can be noticed, the Mealy-based configurations consumed approximately 60% more energy than the Moore-based ones in all comparisons. From these results, we can state that the Moore-based implementations were the most energy efficient.
The results presented show that the use of Mealy machines, rather than the application of TMR, was a good alternative to decrease the error propagation rate and improve throughput. However, it must be noted that the Mealy-based controllers were less energy efficient than those based on the Moore machine.

Moore and Mealy Combined
Given the low effectiveness observed in the application of TMR, along with the high energy consumption incurred by the Mealy-based controllers, we looked for solutions offering a better trade-off. Because the router used in this work was developed with a focus on flexibility and has a well-defined component interface, it was possible to mix Moore and Mealy controllers within the router. Thus, we configured architectures combining the standard (non-protected) versions of the Moore and Mealy controllers. Table 6 presents the synthesis results for each of these combinations. The increase in LUTs for the Mealy-based controllers was expected, since their outputs are decoded from both the current state and the inputs. Because these machines also need to encode fewer states, they showed a decrease in the use of memory elements (i.e., Flip-Flops).
More combinational logic implies a longer critical path. As the critical path grew, the maximum operating frequency dropped; as a result, the fully Moore approach reached a maximum operating frequency 71% higher than that of a configuration using only Mealy controllers, for instance. As explained earlier, the power dissipation remained uniform across all combinations.
Regarding the metrics obtained from simulation (Table 7), we observed a large variation in the error propagation rate. This variation was due not only to the lower use of FFs, but mainly to the retention of data in the input buffers. Further investigation showed that error propagation was directly related to data retention in the buffers. The analysis of the previous architectures showed that the speed-up in data forwarding provided by the Mealy machines increased reliability. Looking at the simulation results of the combined FSMs (Figure 18), we can see that combinations other than the fully Mealy one also took advantage of faster data forwarding. Throughput was higher when using a Mealy machine in the flow regulation controller and a Moore machine in the routing controller, while the way arbitration was implemented caused no significant change in performance. The lowest energy consumption was observed when applying Moore machines to both the routing and arbitration controllers.
The application of TMR did not offer a good trade-off for applications with communication or energy restrictions. However, given a router whose controllers were originally implemented as Moore FSMs, simply migrating the flow regulation controller to a Mealy FSM was enough to increase throughput by 37%, with equivalent energy consumption. For systems requiring reliability, the combination adopting Mealy in the flow regulation and arbitration controllers and Moore in the routing controller seemed to be a valid alternative: its error propagation rate was 4%, its energy consumption was only 15% higher than that of the baseline version, and it was the configuration with the highest throughput among all the architectures evaluated.

Conclusions
Future reliable systems, such as those used in space applications, will incorporate multiple processing cores into an SoC. This processing power is demanded by applications that require high performance in space due to the increasing number of high resolution sensors and the bandwidth requirements of satellite-ground links. NoCs are the successors of shared-bus interconnects for multi-core systems. However, their silicon costs are not negligible, and it remains challenging to provide an architecture that meets reliability requirements at a low area overhead.
In this context, the main goal of this work was to provide a simplified and parameterizable router architecture and to analyze the behavior of its custom architectures under SEU fault injection. The results showed that using Mealy FSMs for the control structures resulted in a significant decrease in the number of propagated errors, at the price of lower energy efficiency. The lower error propagation of the Mealy-based approaches was mainly due to the lower occupation of the buffers. Each controller with a Mealy implementation saved one cycle, which sped up data forwarding and caused the packets to be stored for a shorter time.
The article showed the best trade-offs that can be achieved in the design of an NoC router by combining the different implementations of its controllers. It also provides a reference for evaluating the possible combinations of controllers when considering a specific environment or the most stringent constraints in the design of a given distributed system. To the best of our knowledge, no other work in the literature has investigated the inner mitigation of errors with respect to the type of implementation chosen for the router controllers.
As future work, we intend to evaluate the proposed router through fault injection in a particle accelerator and then use the router to integrate reliable multi-core systems such as those needed in satellites and other space applications. The source code of the router presented here is available in [61].