Quantitative Analysis of Multistage Switching Networks for Embedded Programmable Devices

This paper analyzes the properties of a class of congestion-free multistage switching networks (MSSNs) are butterfly-based and suitable for embedded programmable devices, which require sustaining static multicast connectivity. These MSSNs are fully synthesizable and enable the design of programmable IPs with typical size in the order of 1 KLUT, coupling flexibility with fast turn-around time. The non-blocking property for static connection of this class of MSSN is discussed. Our analysis shows pros and cons of adopting radix-2 or radix-4 MSSN structures, as well as the impact of bypass-paths to make the network fully hierarchical and locality-aware thanks also to a dedicated programming strategy. Implementation experiments carried out on STM CMOS 65 nm technology show the availability of various area-speed trade-offs, resulting in a range of '2× in frequency and a range of '4× in area. Depending on the specific application-field, an optimal interconnect definition is thus achieved without compromising the routability properties. In this respect, the paper proposes a simplified application-driven model for evaluation of the best MSSN, including bypass-adoption and radix selection.


Introduction
Nowadays, embedded programmable logic devices (ePLDs) can be applied in system-on-a-chip (SoC) designs to increase flexibility, widen their market, and curb soaring nonrecurring engineering (NRE) costs [1][2][3][4][5][6].Embedded Field Programmable Gate Arrays (eFPGAs) are a class of ePLD that can provide computational advantages in terms of Giga Operations Per Second/mm 2 (GOPS/mm 2 ) over processors and Digital Signal Processors (DSPs) [7], although several well-known drawbacks, mainly in terms of area, have historically restricted their use on the market of SoCs.In fact, the tight relationship between costs and silicon-area has historically driven the SoC market, especially for mid-to high-volume applications.For that reason, eFPGA cores must provide added value to the system but only take up a small percentage of the overall device area.To be effective we can estimate the target computational complexity around 1-10 KGates.For that, the typical size of such device can be evaluated in the order of 1 KLUT, eventually enhanced by hardwired DSP accelerators, hence the target application is not high-density computation but simple Finite State Machines (FSMs).An application domain of such a class of small-size eFPGA is the Smart-Power world that is demanding "intelligence" to improve efficiency [8].Power applications, e.g., motion-and lighting-control, are mostly based on a control-driven paradigm which leverages simple controllers-e.g., finite-state

Related Work
Embedded FPGAs, like the larger stand-alone FPGAs, are arrays of computational logic blocks (LB) connected using programmable routing [11].Overall efficiency of an eFPGA is deeply affected by routing, since it greatly impacts area and performances and is responsible for the efficiency of the connections [10,12].With the growth in complexity of the target applications, research work has been spent in overcoming the penalties related to bit-level programmability and in alleviating routing congestion issues.Leveraging on Rent's rules, hierarchical interconnects were thus proposed to diminish complexity issues, enabling reduction of the area overhead of programmable routing through local crossbars capable to connect clusters of LB [13][14][15].
Multistage switching networks (MSSN) are a kind of hierarchical interconnection network built using many small crossbars, typically called switch elements (SE) (Figure 1).Depending on its topology and/or its ability to perform connection between input and output ports, a MSSN can be: • Non-blocking, if it can connect each I/O pair regardless of the existing connections on the network.

•
Blocking, if it cannot connect all the requests.
Among the non-blocking networks, a network is strictly non-blocking (SNB) if any connection can be set up incrementally without the need to rearrange (i.e., reroute) any of the connections already in place, while it is rearrangeably non-blocking (RNB) if one or more existing connections may have to be rearranged to permit a new connection.All the previous definitions can be used for unicast (one-to-one), multicast (one-to-many), or broadcast (one-to-all) traffic.
MSSNs for eFPGAs must support multicast connectivity but, since circuits have static connectivity, the utilization of RNB networks is equivalent to SNB, allowing congestion-avoidance with fewer resources.In the eFPGA scenario, the application of MSSNs can provide some straightforward advantages: 1.
The modularity of the network potentially adapts to a soft-core approach.Each SE can be either implemented by standard-cells or optimized at circuit level as a single coarse-grained cell, without affecting synthesizability.

2.
MSSN routability properties improve the flexibility of eFPGA design, allowing architectural variations (e.g., size and I/Os number) without the need of specific skills in congestion analysis and prevention.

3.
The routability analysis is simplified, since the blocking properties of an MSSN are well-defined and predictable in terms of topology, as discussed in Section 4.
Some examples of application of hierarchical interconnects in eFPGA are found in the literature.M2000 (Abound Logic) proposed an MSSN with local crossbars based on a Clos network [16], while Leopard Logic proposed a butterfly-based hierarchical network [17].In [10,18] an MSSN based on butterfly topology is discussed with depopulation of the upper stages of the network and an isomorphic transformation to solve the radix-boundary problem [19], a limiting factor of MSSN in the field of FPGAs.Nevertheless, area saving is balanced by the fact that this network is no more proven to be RNB, although authors indicates the availability of enough bandwidth based on Rent's rule.Another example of multi-level interconnect dimentioning based on Rent's rule applied to logarithmic paths is provided by [20], which overcomes boundary-radix issue by adding shortcuts and staggering.
In [4] we proposed an eFPGA template core with MSSN-based interconnect: all the performance analysis and results refer to the overall eFPGA core, composed of both logic blocks and interconnect.For the proposed eFPGA the utilization of a MSSN to realize the interconnect structure represented a key enabler, but not the unique aspect; a detailed analysis and exploration of a class of MSSN suitable to realize the interconnect structure of synthesizable small size programmable IPs was missing in [4] and is the goal of this paper, which extends and formalizes preliminary results discussed in the M. Cuppini PhD Thesis [5].The RNB feature of the class of MSSN we propose is discussed in Section 4, whereas a dedicated configuration strategy allows to achieve a straight-as-possible policy in order to minimize the boundary-radix problem, as discussed in Section 5. Different architectural options, including values of radix and level of bypass exploitation are analyzed and compared showing area-frequency trade-offs in Sections 6 and 7 [14,15].

Multicast Multistage Switching Networks
The field of RNB multicast networks is quite wide [21], but for eFPGA application we consider architectures capable to guarantee a small area overhead for a number N × N of I/Os that can reach some thousands of points with reduced timing penalties.For that, among all the MSSNs applied in telecommunication fields [21][22][23][24], the starting point for PLD domain are the Banyan networks [25], and their topological equivalent versions [25,26] (e.g., Omega, Shuffle, Butterfly), since they are a class of low-latency multistage networks.
N × N Banyan networks, as shown in Figure 1 for N = 16, are built using 2 × 2 switch elements organized in O (log 2 N) stages [27], and for this reason they are also called log 2 N-networks [28].Thanks to low-latency and recursive architecture, since an N-input network can be recursively decomposed into two N/2 sub-networks, and so on (as shown in Figure 1), Banyan networks are very attractive MSSNs [25].The main drawback is that Banyan networks are blocking, also for unicast traffic.To obtain RNB features, Banyan networks need additional resources to the baseline structure: the resulting networks are called Multi-log 2 N networks [28].One such approach typically used in the literature to get an RNB network is the Horizontal Cascading (HC) of two baseline log 2 N-networks (Figure 2a): Benes networks [29,30] are an example of this technique, since they are built by cascading two baseline networks back-to-back.However, the HC approach implies complex routing algorithms (due to a large amount of reconfiguration activity) [28].As an alternative, the vertical stacking (VS) technique was introduced [27].According to this scheme (Figure 2b), in a Vertically-Stacked MSSN (VS-MSSN) various different copies (called planes) of a baseline log 2 N network are created and connected to I/Os through an input (output) stage of demultiplexers (multiplexers).As regards multicast, in [31] the condition on the number of planes p necessary to obtain RNB features in VS-MSSNs is expressed as: where n is the number of stages of the N × N baseline log 2 N network, such that: In [32] a class of non-blocking networks built with a combination of HC and VS was proposed.Thanks to this approach, obtained by adding x extra stages to each baseline plane, the condition for RNB unicast is proven: This solution can lead to a reduction of the total hardware complexity, since adding x (lower than n) horizontal stages allows one to save one (or more) whole planes.Reference [32] proves Equation (3) for one-to-one connection, while [33] generalized the properties defined by [28] for vertical stacking also for multicast connection.We have intensely tested the proposed MSSN in multicast applications, without finding any unresolved conditions.From a qualitative point of view, the equivalence between horizontal and vertical decomposition finds it foundation in the fact that they are different strategies to provide the same paths multiplicity in connecting a generic input-output pair under the condition defined by (3).Simultaneously, the combination of HC and VS open the way for exploring various trade-off between hardware complexity and latency.

MSSN Based on 2 × 2 Switches
The MSSN architecture we propose in [4] features a Benes-like topology with butterfly connectivity, and is shown in Figure 3a.Based on 2 × 2 switches, an N × N I/Os MSSN features n s stages, with N s = N switch blocks per stage, so that: Among these, one input stage (IN stage ) composed of N s (1 × 2) switches and one output stage (OUT stage ) formed of N s (2 × 1) switches are used as demultiplexing/multiplexing stages for the vertical stacking technique (see Section 3), while the remaining (2 log 2 N − 1) middle stages (MID stages ), each composed of N s (2 × 2) switch elements, are used to form the MSSN planes.Equation ( 4) becomes: This MSSN architecture (shown in Figure 3a) is topologically equivalent [34], i.e., with the same graph, to that in Figure 3b.Such a network has p = 2 planes p 1 and p 2 ), each made of an N × N Banyan network, and x = log 2 N − 1 extra stages.Since, according to Equation (2), the stages number of a N × N baseline log 2 N network is n = log 2 N, condition (3) becomes: and is satisfied for any value of N with p = 2.It must be specified that a lower number of extra stages x should suffice to satisfy condition (3), since with p = 2 and n = log 2 N, only x = log 2 N − 2 extra stages are strictly required; such an area-penalizing choice, a complexity increase proportional to N, was made with a view to obtaining a symmetry feature that opens the way to architectural enhancement of the MSSN described in the following paragraph.

MSSN with Bypass Enhancement
Adding extra stages to obtain RNB features makes latency a critical point of the network.For that reason, and to overcome the fact that butterfly topologies cannot exploit traffic locality, an alternative hierarchy-aware folded version of the MSSN was devised.Leveraging on its symmetry properties, the flat (Figure 3a) MSSN can hence be folded at the central stage, creating a bypassed architecture [4] in which, thanks to the addition of a set of dedicated U-turn connections to the basic switches, an intrinsic hierarchical structure can be exploited.As shown in Figure 4, the application of U-turn connections to each stage of a folded MSSN allows groups (H S ) of U-switches at stage S to be created, which define butterfly-based sub-networks: this potentially allows the upper levels of the hierarchy to be bypassed, thus exploiting local connectivity without any impact on congestion.As an example, in Figure 4 the sub-network obtained grouping the H 2 upper switches allows any connections between I/O from 0 to 3 to be performed bypassing the upper level H 3 of switches.The bypassed version of the MSSN inherits the RNB properties from the flat topology.In fact, as shown in the unfolded view of Figure 5, U-turn bypasses allows non-adjacent stages to be bridged thanks to the addition of dedicated logic, extra pins, and independent configuration bits to each basic SE.We obtain more routing paths and available connections, allowing each bypassed sub-network to keep the same RNB characteristics as flat architecture.With reference to Figure 4, each switch at stage S, for S from 1 to log 2 N − 1, is composed by source and target U-switches, while the switch in the central stage (S = log 2 N) is a simple basic switch.Regardingto the area penalties, these are balanced by performance improvements.Benefits will accrue when the area increase becomes an investment by computational density growth.This aspect will be analyzed in Section 7.

Radix-k MSSN
The previous analyses is extended to networks with k × k switch elements in the MID stages.Parameter k is the radix of the network.MSSN described in Figure 3 is hence a radix-2 flat MSSN (with N = 8), while Figure 6 shows a radix-4 flat network, with N = 16.With reference to Figure 2b, this network is equivalent to a MSSN with two planes, each made of a N × N Banyan network with x = log k N − 1 extra stages, in addition to input/output switches.
Focusing on butterfly-based N × N MSSNs, and generalizing condition (2) for a generic radix-k, the number of stages becomes: Adding x extra stages as before, namely x = n − 1 = log k N − 1, the total number of network stages n s for horizontal cascading strategy is: which takes the form of ( 5) when k = 2. Hwang and Lin [35] generalized the condition (3) for a generic radix-k for unicast connectivity, demonstrating that the number of vertical planes p necessary in order that the networks have the RNB feature is: From the previous definitions of number of stages n and extra stages x, condition (9) evolves into: Therefore, taking p = 2, (10) becomes always satisfied for any value of both number N of I/Os and radix-k.IN, OUT, and MID stages are all composed of N s SEs, with N S = N k • p.Since p = 2, N S = N in a radix-2 MSSN, while N S = N/2 in a radix-4 MSSN.Regarding the multicast connectivity, we have carried out numerous routing tests without finding any critical situation for the routing scheme.

MSSN Programming Strategy
In order to speed up architectural exploration and validation, we implemented a software tool capable to provide both the register-transfer-level (RTL) MSSN description and the configuration bitstream according to the user application.The target architecture is described specifying MSSN parameters, such as the number N of I/Os and number n s of stages.In addition, it allows one to specify the switch elements (SE) used (i.e., radix-k) at each stage and the intra-stage connectivity patterns (e.g., the two butterfly topologies in Figure 3a,b and stages with bypass enhancements as in Figure 5).
Following [4], the MSSN routing engine is an iteration-based path-finding algorithm.The nets are sorted and routed sequentially taking into account the available resources (i.e., the SE).Net sorting is done based on criticality, which tracks the historical difficulty to route each net and is updated after each routing iteration according to the success/fail current attempt.Iterations stop when all the nets are routed successfully.The algorithm splits multi-fanout nets into independently routed single-fanout nets that can share routing resources.
Path-finding strategy is based on Dijkstra's algorithm for weighted graphs [36].The SEs represent the graph nodes connected by weighted edges, which correspond to the wires used to connect MSSN stages.The algorithm tries to find the cheapest path among all available ones, whereas the cost of a path is the sum of the traversed wire's weights.Weights are a fundamental part of the architecture description: they drive routing policy toward preferred solutions and impact convergence speed.We found that assigning lower weights to bypasses and straight paths (zeroing weight for sharable resources) allows to achieve an as-straight-as-possible policy which resulted the best solution for both convergence and bypass utilization (Figure 7).Thanks to this approach, we found that, on networks with dimensions up to 4K I/Os, the number of iterations needed to solve the routing lies in average between 5 and 10.The utilization of a butterfly-based MSSN as interconnection infrastructure also changes the distance metric used on the placement algorithms [4].In fact, trying to minimize the net delays, traditional placement algorithms for FPGA or ASIC (Application-Specific Integrated Circuit) have the goal to minimize the distances among cells, whereas the distance is measured in a 2D space (be it the L1 Manhattan distance or L2 distance), tightly coupled with the Elmore delay associated to the connection.On a hierarchical butterfly MSSN, i.e., with bypass enhancement, the delay is associated to the number of stages to be crossed to connect two points.As visible in Figure 4, the distance among primary I/Os is not linear, but follow some steps which also depend on the radix of the network.With reference to Figure 4, only switches up to H 1 must be crossed to connect IN 2 to OUT 3 , while all the switches up to H 3 are crossed when the required connection is between IN 3 and OUT 4 .This aspect is known as the boundary-radix problem [10] and is a limiting factor for the use of such networks in large FPGA.In small size eFPGAs, as the ones we are tackling, this problem is mitigated by replacing 2D distance metric with a wire length defined as, for generic radix-k: where S min (between 1 and log k N) is the minimum stage S subtending all the points to be connected, as discussed in [4], and +1 is due to the bypass connection.This way, the placement tool tries to limit to S min the number of switches crossed to achieve a full exploitation of the hierarchical topology.
A different approach has been followed by [10] where the boundary-radix problem has been addressed in hardware by isomorphic transformations, which makes the MSSN less regular.

Flat MSSN Performance versus Radix-k
In this section, we show the implementation results of two flat MSSNs featuring radix-2 and radix-4, respectively.The purpose of this analysis is to demonstrate that the synthesizability allows one to exploit a significant design space to find the best radix for a given class of applications.The test-case we consider is a 1024 × 1024 I/Os flat MSSN.According to Equation ( 5), the radix-2 network features n s = 21 stages, including: For radix-4 MSSN, the number n s of stages, defined by ( 8) with k = 4, is n s = 11, and the RNB property is obtained with half the switches per stage (N s = N/2 = 512) compared to the radix-2, but with a more complex structure (Figure 6).The MSSN hence features: For both architectures a set of physical synthesis have been performed using Synopsys Design Compiler Graphical [37] targeting STMicroelectronics CMOS 65 nm LP 1.2 V standard cell libraries.Adopted physical synthesis flow was industrially qualified by the foundry to be well correlated with implementation and silicon.This notwithstanding, some place-and-routes were carried out to further verify feasibility of the approach (especially against routing congestion) on few relevant cases, resulting in a good correlation in line with Electronic Design Automation (EDA) vendor figures.Since target technology makes available standard cells with 3 different voltage thresholds (Low, Standard, and High Voltage Threshold-LVT, SVT, HVT), a couple of different standard cells mixes were analysed: a SVT-only for area-optimization and a mix of HVT+SVT+LVT to target high-speed optimization (HVT are included for leakage optimization on configuration structures).Time-to-fly between one input and one output is the critical path of the network and thus is constrained by the target implementation period, whereas all the configuration pins of the MSSN switches are quasi-static paths and were considered false-paths.
Post-synthesis results are provided in Table 1 for both MSSNs, targeting either area or speed optimization.The resulting design space is significant, with a range of 2× for frequency and a range of 4× in terms of area for both radix-2 and radix-4 MSSNs.It shows that the radix-4 solution radically increases area in min-area implementations (+35%), while in the max-speed case the area penalty is lower (+11%).This is due to both the synthesis optimizations criteria adopted and the fact that in the min-area case we considered two iso-frequency (200 MHz) implementations, while in the max-speed case we reached different frequencies (400 vs. 480 MHz).Each MSSN architecture was also implemented changing the target frequency between the min-area and the max-speed targets.The obtained results are reported for iso-frequency values in Figure 8a, showing similar trends for both architectures: radix-2 MSSN provides better area performance for any frequency value.With the purpose of comparing the two architectures in terms of frequency performance, Figure 8b reports the implementation results for iso-area values: again in this case, radix-2 MSSN guarantees better performance, with a wider gap for low area values.This area increase can be justified with a "handcrafted" analysis of the logic complexity of both architectures in terms of number of equivalent MUX2:1.Flat radix-2 MSSN requires n s stages as defined in (5); observing Figures 3a and 5, it can be noted that input stage does not require any MUX, output stage is composed by a single MUX2:1 and middle stages by two MUX2:1.By considering N s = N switches per stage, the total number of equivalent MUX2:1 (MUXeq radix-2 ) can therefore be expressed as: For N = 1024, radix-2 MSSN features therefore 40K equivalent MUX2:1.On the other hand, in a radix-4 MSSN the middle 4 × 4 switches (Figure 6) can be seen to be made of 4 MUX4:1, each corresponding to 3 MUX2:1, the 4×2 switches in output stage require 2 MUX4:1 while the input stage still does not require any MUX.From ( 8) and considering N s = N/2 switches per stage, the total number of equivalent MUX2:1 (MUXeq radix−4 ) can be therefore expressed as: For N = 1024 the total number of equivalent MUX2:1 proves to be 58K, a +45% increase with respect to the radix-2 solution.Despite of that, a class of applications where radix-4 structure provides better performance will be observed in Section 7.3.In addition, this overhead can be mitigated leveraging on optimized standard cells libraries capable to reduce the intrinsic MUX4:1/MUX2:1 area-ratio thanks pass-transistor exploitation.

Flat versus Hierarchical MSSN Performance
This section evaluates hierarchical MSSNs featuring U-turn bypasses, focusing on finding the best tradeoff between area overhead, due to the additional logic in the U-switches, and the improvements in frequency performances.This will help compare the radix-2 and radix-4 MSSN more effectively, like the flat MSSN versions.
The test case we consider is again a 1024 × 1024 I/Os MSSN, radix-2 and radix-4, with U-turn bypasses applied to realize either a fully-bypassed network (fully), with bypasses at each MSSN stage, or a half-bypassed one (half ), with bypasses only at odd stages.STM CMOS 65 nm LP is still the target technology, considering two different mixes of standard cells: SVT-only for area optimization and full mix with H+S+LVT for high-speed.
Post synthesis results for all three versions (flat, half, and fully) are shown in Table 2 for radix-2 and Table 3 for radix-4 MSSN, with two different implementations optimized by area or by speed: for both, the area overhead introduced by the bypass enhancement is straightforward.As for the flat version, the half-and the fully-bypassed architectures were implemented varying the target implementation frequency according to the values in Table 2.As previously defined, implementation frequency is the time-to-fly between MSSN input and output; the delay due to crossing switches of all MSSN stages is therefore considered.

MSSN Delay Model and Validation
In [4] we developed a post-synthesis application-aware analysis to evaluate the performances improvement thanks to the introduction of bypassed switching elements.Different benchmarks were mapped (through the custom eFPGA CAD flow) on both the flat architecture and the fully-bypassed radix-2 version of the eFPGA.This showed that introducing a bypassed MSSN allows increasing the working frequency by a factor, frequency gain, from 20% to 60% depending on the target application and therefore on the capability to exploit the bypass feature.Starting from these data, in [4] we then proposed a computational analysis that takes into account eFPGA critical path length (i.e., the longest combinational path composed of both LB and MSSN delays) based on the ability of bypass enhancement to guarantee frequency gains in a range between 20% and 60% with respect to the flat version.In that analysis one of the key assumptions is that the internal MSSN stages, both in the flat and in the bypassed version, feature homogeneous delays, so that the delay associated with a single stage in an MSSN with n s stages is approximated as: where T MSSN is the delay of the whole MSSN (i.e., the implementation period in Tables 2 and 3).As an example, for the radix-2 fully-bypassed implementation, T MSSN is 2.53 ns for max speed and 5.5 ns for min area.Hence, according to (12), in the evaluation of the eFPGA critical path the effective MSSN IN2OUT delays is considered to be proportional to the stage (S in 1 to log 2 N) in which the U-turn connection was exploited (e.g., the H S group of switches in Figure 4), so that: where the 2S factor is due to the fact that the network has to be crossed back and forth (corresponding to crossing the source and target switches in Figure 5) and the +2 take into consideration input and output stage.The case S = log 2 N corresponds to non-exploitation of the bypass: all MSSN stages are crossed and the IN2OUT delay reduces to the implementation period T MSSN .
In order to validate the model based on ( 12) and ( 13), and adopted in [4], in Figure 9 we show the difference between piecewise linear model ( 13) and a set of real (post synthesis) input-to-output delays for the fully-bypassed radix-2 implementation.Such delays were obtained, both in the max speed and min area implementations of the fully-bypassed 1024 × 1024 I/Os radix-2 MSSN of Table 2, by forcing the timing analysis tool to exploit the bypass connections at different MSSN stages (Sn, with n = 1, 2, . . ., 10, in the horizontal axis in Figure 9).Sn = 10 corresponds to non-bypass exploitation.Since for each input/output pair the 1024 × 1024 I/Os MSSN provides 1024 possible paths, the results are represented as clusters of points.Solid lines (one for the max speed and one for the min area case) represents the piecewise linear model obtained from (13).It can be observed that piecewise linear model (13) represents a sort of "worst-case" compared to the mean values, lines with circular markers, or to the ninetieth percentile values, triangular markers, of real delays.The only cases where the model under-estimate the network delay are those which consider a bypass exploitation at the lowest levels (S1, S2, and S3 in Figure 9) of the MSSN hierarchy in the max speed implementation.However, when considered in the evaluation of a completely programmable device, such a gap in the approximations would hardly affect computation of the critical path, since dominated by the contribution of the computational logic block (LB) delay.
Similar results were obtained when taking into account a radix-4 architecture and data from Table 3, where the stage delay is larger than that in a radix-2 architecture, due to the greater complexity of radix-4 switch elements (Figure 6).
The previous results allow us to validate model (13) and to apply it to the comparison between different MSSN architecture options, as will be done in Sections 7.2 and 7.3.

MSSN Effective Frequency Versus Bypass Exploitation
To evaluate the area/speed tradeoff of different MSSN architectures, in this paragraph we use model (13) to perform a computational density analysis considering interconnect structures with the same area but different number of stages with U-turn connections.Whereas for the flat MSSN (Section 6) we considered the implementation frequency as a figure of merit, since the network delay corresponds to the MSSN full latency, in the bypassed MSSN the actual working frequency has to be taken into account.A bypassed MSSN (Figure 4) provides shorter paths for logically near I/O pins, and thus the working frequency depends on the locality of the connections required by the applications.As explained in Section 5, locality is tackled by optimized routing strategy, which targets the best bypass utilization.As an example, Figure 7a,b show two paths exploiting (a) and non-exploiting (b) a bypass enhancement, respectively.According to (13), the effective delay of path (a) where the stage S in which the bypass is exploited is S = 2 would be lower than the one in (b) where S = log 2 N = 3 thanks to bypassing switch S 3 .
Figure 10 [5] reports the effective frequency of a flat MSSN together with what is achievable by a bypass-enhanced MSSN, assuming four ranges of frequency gain corresponding to different levels of bypass exploitation (0% non-exploitation of the bypasses, 20%, 40%, 60%).As reported in Table 2, the minimum area necessary to realize a flat version of MSSN is 0.2 mm 2 , 0.34 mm 2 for half-bypassed and 0.4 mm 2 for fully-bypassed.The post-synthesis results in Table 2 are referred to the full-latency of the MSSN (0% bypass exploitation).Hence, targeting small area devices (<0.34 mm 2 ) the flat MSSN is the unique solution among those listed.The half-bypassed architecture (Figure 10a) proves advantageous for area >0.44 mm 2 , since it allows to achieve higher frequencies even if the frequency gain is around 20%, while reaching time-to-fly frequency values beyond 700 MHz when the area is not a tight constraint.
The same idea is applied to the fully-bypassed MSSN (Figure 10b), where frequency >450 MHz can be reached with greater area budgets, while for area <0.8 mm 2 the flat MSSN can provide higher speeds even with frequency gain >40%.Compared to the half-bypassed solution, and below 650 MHz, fully-bypassed architecture enables lower frequencies, owing to increased design density.On the other hand, a fully-bypassed MSSN provides a higher probability of bypass exploitation.With a passing reference to this aspect, in this analysis we have decided not to take into account architectures with a lower application of bypasses (e.g., realizing a quarter-bypassed MSSN), since with a lower probability of bypass exploitation the frequency gain values might no longer be applicable.
Applying the same analysis to a radix-4 MSSN, we achieved similar results: advantages provided by introducing U-turn bypasses, either in half stages or in the full network, only become evident targeting high frequencies, with the drawback of additional area requirements.

Comparison between Different Radices for Hierarchical MSSN
The IN2OUT DELAY (S) validated in Section 7.1 is then used to perform a comparison in terms of area/speed tradeoffs for different radix in hierarchical MSSN.In Section 6 we showed that, in a flat architecture, an MSSN featuring radix-2 switches is able to guarantee better area and implementation frequency performances than its equivalent radix-4 version.Given this comparison, and since in the previous sections we proved that both the half-bypassed and the fully-bypassed architectures can improve effective working frequency with respect to flat architecture, an analysis of the tradeoffs between area occupancy and the opportunities presented by U-turn bypasses in both radix-2 and radix-4 MSSN is performed in this paragraph.
Variation of MSSN radix implies different exploitation of the locality of the connections.In fact, the number of stages to cross in order to connect two MSSN I/Os proves to be different.By way of example, as shown in Figure 11 [5], in a 16 × 16 I/Os radix-2 fully-bypassed MSSN the minimum common stage, S in (13), connecting IN0 to OUT7 is stage S = 3, while in the corresponding radix-4 topology only stage S = 1 has to be crossed.In the radix-2 architecture the number of points (inputs and outputs) subtended from a cone of height S is: while in the radix-4 architecture: where the 2 factor is due to the fact that each IN and OUT stage is connected to two I/Os (Figures 6 and 11b).Extending to a generic radix-k, the number of points turns out to be: for S from 1 to log k N. In the previous equation α (>0) is a factor that takes into account the number of inlets (outlets) connected to each IN/OUT stage α = N N S . From the preceding definitions of Section 4.3: To understand Equaiton ( 16), let us re-write it as: For a generic stage S, the number of points subtended from a cone of the previous stage (S − 1) is: Therefore, at the "current" stage S, the points subtended are those of the previous stage (S − 1) multiplied by k if S isn't the central stage of the network (S < log k N).In the case of stage S = log k N, the different planes (Figure 2) are merged in the central stage (Figure 4) hence, the total number of points is the sum of the points of the previous stage of each p plane (i.e., multiplied by p).Thanks to (18), ( 17) can be written as: which corresponds to (16).The half-bypassed and fully-bypassed version of a 1024 × 1024 I/Os MSSN, both radix-2 and radix-4, were compared.Assuming that the CAD tool discussed in Section 5 places MSSN I/Os minimizing their distance forcing S = S min , we analyzed the number of stages, hence the delay, required to connect two I/Os placed at different distances, according to wire-length definition (11).A set of implementations, including both max speed and iso-area, were considered: for each, the stage delay was calculated according to (12), and the IN2OUT DELAY computed with (13).Figures 12 and 13 report the results for different distances between two MSSN I/Os that are to be connected.The results obtained (HVT+SVT+LVT implementations) show that there is not a unique solution for the MSSN architecture.Design constraints and the capability of the CAD to exploit bypass affect the MSSN architectural choice.In the max speed implementations (Figure 12) the radix-4 MSSN (dashed lines) proves advantageous, i.e., less delay in the average, only when the distance between points is not too large.On the contrary, for iso-area implementations (Figure 13), radix-4 MSSN (dashed lines) allows lower interconnect delay for any distance between points in the half-bypassed structure and in the fully-bypassed version is still advantageous in the average for small distances between I/Os, i.e., when application fields exploit local connectivity.The analysis can be extended to larger values of radix-k.It is worth noting that the area occupation of k × k switch element increases with O(k 2 ) while the granularity of bypass exploitation decreases, hence reducing the probability of performance gain.

Conclusions
We presented an analysis of a class of MSSN suitable for the implementation of routing in ePLD, i.e., soft-core IPs with size in the order of 1 KLUTs, showing full-static multicast connectivity capabilities and the possibility to be synthesized.We described also a parametric programming tool which allows many architectural variations to be applied without affecting the overall routability properties.The implemented programming strategy allows to achieve a straight-as-possible policy in order to minimize the boundary-radix problem, which could affect the performance of hierarchical log-networks.
The MSSNs were implemented in STM CMOS 65 nm technology: the results reveal a significant design space in terms of area/speed tradeoffs, resulting in a range of 2× for frequency and a range of 4× in terms of area.Furthermore, the introduction of U-turn bypasses enabled the design space to be enlarged, providing a frequency performance improvement (up to +60%), at the expense of increased area occupancy.
The parametric MSSN structure was also exploited varying the radix: a comparison between a radix-2 and a radix-4 was then performed, for both bypassed and non-bypassed architectures.The results show that the area-frequency tradeoff is in favor of the radix-2 architecture for low area budgets, while the radix-4 solution is able to provide a better exploitation of the connection locality.

Figure 2 .
Figure 2. Example of construction of Multi-log 2 N networks: with horizontal cascading (a) and vertical stacking (b) technique.

Figure 7 .
Figure 7. Three different multi-fanout paths connecting IN 2 to OUT 0 , OUT 1 , and OUT 2 in order of cost: with bypass enhancement (a), with the as-straight-as-possible policy (b), and using diagonal wires (c).

Figure 9 .
Figure 9. 1024 × 1024 I/Os radix-2 MSSN in-to-out delays (with bypass exploitation at different stages) compared to model (13) varying the MSSN stage S in which the U-turn connection is exploited.

Figure 10 .
Figure 10.1024 × 1024 I/Os radix-2 MSSN effective frequency versus area for different frequency gain thanks to bypass exploitation: flat vs. half-bypassed (a) and fully-bypassed (b).