Edge-Based Heuristics for Optimizing Shortcut-Augmented Topologies for HPC Interconnects

: Interconnection network topology is critical for the overall performance of HPC systems. While many regular and irregular topologies have been proposed in the past, recent work has shown the promise of shortcut-augmented topologies that offer multi-fold reduction in network diameter and hop count over conventional topologies. However, the large number of possible shortcuts creates an enormous design space for this new type of topology, and existing approaches are extremely slow and do not ﬁnd shortcuts that are globally optimal. In this paper, we propose an efﬁcient heuristic approach, called EdgeCut , which generates high-quality shortcut-augmented topologies. EdgeCut can identify more globally useful shortcuts by making its considerations from the perspective of edges instead of vertices. An additional implementation is proposed that approximates the costly all-pair shortest paths calculation, thereby further speeding up the scheme. Quantitative comparisons over prior work show that the proposed approach achieves a 1982 × reduction in search time while generating better or equivalent topologies in 94.9% of the evaluated cases.


Introduction
High-performance computing (HPC) systems are essential in order to run a variety of large-scale applications in multiple domains such as bio-informatics, astronomical analysis, nuclear simulations, financial services, etc., as well as large machine learning models applied to numerous use cases. As the backbone of HPC systems, interconnection networks are responsible for connecting up to hundreds or thousands of compute nodes (a compute node may, in turn, consist of hundreds to thousands of processing cores) [1] by providing fast and low-cost communication in the order of microseconds [2]. A primary factor in dictating the performance of interconnection networks is topology, which specifies the structure that is used to connect compute nodes. To achieve high interconnected performance, it is critical to design network topologies that have small diameters and low hop counts.
Prior works have proposed a number of regular and irregular topologies including rings, meshes, tori, hypercubes, fat trees, Clos, Butterfly, and Dragonfly. Many of them have been deployed in practical HPC systems. Interestingly, despite the seemingly mature development of topologies, recent work has demonstrated the large potential of a very different class of topologies, which we refer to as shortcut-augmented topologies. Such a topology starts with a base topology (e.g., a ring) and adds a series of shortcuts on top of that. Surprisingly, with careful selection, shortcut-augmented topologies are able to reduce both network diameter and average shortest hop count by multiple folds, compared with existing widely used regular and irregular topologies [3].
While this is promising, a major roadblock of further improving this class of topologies is the enormous design space that is formed by the combinations of possible shortcuts.
As an example, for a 64-node network, there are around 2 × 10 3 possible edges. If 128 edges are selected, there are over 4 × 10 205 combinations! To make things worse, this design space grows super-exponentially as the network size increases, thus greatly exceeding the capabilities of exhaustive methods. Conventional methods of design space searching, such as simulated annealing (SA), would be extremely slow. This calls for novel, efficient heuristic approaches that exploit unique characteristics of the shortcut selection problem.
A straightforward heuristic is to add shortcuts randomly (subject to available free ports in the switches). With a large number of repetitions, a good shortcut-augmented topology may be found for a small network size. However, given the rapid increase in design space, this method quickly becomes insufficient for larger networks. Several papers have pointed out that adding random shortcuts can enhance the performance of their proposed network topologies [4][5][6][7], but these works do not directly propose approaches that can generate shortcuts more effectively. The state-of-the-art heuristic [3] is to consider the usefulness of shortcuts when adding shortcuts for a given vertex. Specifically, for each vertex v, the method examines a set of random nodes that are connected to v, and selects the top y (say 3) nodes that have the longest shortest paths from v. It then adds y shortcuts from v-one to each of the y nodes. We refer to this method as vertex-based random shortcut (VRS) approach. While this improves the quality of the generated topologies, our analysis reveals that VRS often adds shortcuts that are locally useful at a vertex but are not much use globally. This is because the y longest shortest paths at a vertex may not necessarily be considered as long paths in the entire network. Consequently, VRS requires more shortcuts to be added, which leads to additional costs of links and higher-degree switches.
To address this issue, this paper proposes EdgeCut, an effective heuristic approach for generating high-quality shortcut-augmented topologies. The main novelty is to identify more globally useful shortcuts by thinking from the perspective of edges, rather than from the perspective of vertices in prior work. We propose two variations of EdgeCut. EdgeCut-Full considers the performance impact on all node pairs after a shortcut is added. It produces the best topologies but needs to update all pairs' shortest paths after each addition, thus resulting in longer search time. EdgeCut-Lite approximates this function, leading to 11× reduction over EdgeCut-Full in search time while still generating satisfying topologies. Evaluation results show that, compared with simulated annealing, the proposed EdgeCut reduces the search time by 1982× and generating better or equivalent diameter in 94.9% of the test network topologies and sizes. Compared with VRS, EdgeCut reduces the network diameter by 55.1% while being slightly faster. These results highlight the effectiveness of the proposed approach.
The rest of the paper is organized as follows. Section 2 provides more background on HPC interconnection networks and shortcut-augmented topologies. Section 3 describes details of the proposed EdgeCut approach for identifying high-quality shortcuts. Section 4 presents evaluation methodology and results, and Section 5 includes further discussions. Finally, Section 6 concludes the paper.

HPC Topologies
The topology of HPC interconnection networks is a very active research direction. While many topologies have been proposed in the past, new topologies are continually being proposed due to new challenges in latency, costs, scalability, reliability, etc., that are associated with ever-growing HPC systems. Only limited research has been conducted on shortcut-augmented topologies, leaving many opportunities for further improvement.
In direct topologies such as tori, meshes, and hypercubes, every switch is connected to a compute node, whereas in indirect topologies such as fat trees, Clos, and Butterfly, only the input and output switches at the edge of an network are associated with computer nodes, and packets sent from computer nodes are forwarded indirectly through middle-stage switches before reaching their destination [8]. Both direct and indirect topologies have been used in practice, e.g., 3D and 5D torus networks are used in Cray Gemini [9] and IBM BlueGene/Q, respectively [10], and Dragonfly networks [11] with virtual routers are used in Cray Cascades [12]. In a sense, various direct and indirect topologies differ in how they trade-off among degree, diameter, and hop count [3].
Variations of regular topologies that result in irregular or ad hoc designs have also been proposed. For example, the Jellyfish topology [6] utilizes random graphs to develop high-capacity networks that allow incremental expansion on a daily basis to large-scale data centers, at a higher of cost of cabling. Slim Fly [4,5] approximates the optimal diameter, which results in lower latency, cost, and energy consumption while sustaining high bisection bandwidth. However, it is not suitable for gradual size expansion due to limited flexibility in the small design space. Distributed Loop Networks (DLN) add chordal edges or shortcuts to a simple ring topology to reduce diameter while maintaining low degree distribution. By adding shortcuts in a less regular manner than being evenly spaced, DLN can achieve more efficient designs, e.g., the diameter of a 36-vertex ring can be reduced from 18 to 9 by adding only five shortcuts [13,14].
Another related line of topology research stems from the famous small-world phenomenon, first proposed by [15], that demonstrates that people living in a country constitute societies of a network with only short path lengths. Later, Wattz and Strogatz characterized the small-world phenomenon into the Wattz-Strogatz (WS) model [16], which allows networks to be generated with short average distance and large clustering coefficient [17,18]. The model uses a few additional long edges to reduce the diameter in random graphs for social networks and Internet topologies [3,18]. Since then, researchers have been exploring the small-world phenomenon in computer networks [19][20][21]. In particular, to exploit the small-world effect in HPC, [3] proposes several methods that add random shortcuts to a base topology, the best of which is the vertex-based random shortcut (VRS) method that is mentioned in Section 1. Although shortcuts are selected optimally at each local vertex, they are not necessarily the most useful shortcuts to add globally. This deficiency is addressed by our proposed approach.

Design Space Exploration
The design space of shortcut-augmented topologies is enormous. This is not only because of the large number of possible shortcut candidates (especially for networks with high-radix switches), but also because of the huge number of combinations of the shortcut candidates. There are three typical ways of exploring design space. The first one is exhaustive search, which is impractical in this problem. The second one is generalpurpose search algorithms, such as ant colony algorithm and simulated annealing [22]. These algorithms may be able to find the optimal or near-optimal solutions but usually require extremely long search time for large design space. The third one is heuristic approaches that leverage problem-specific characteristics to enable approximate but fast searches. In this paper, we aim to demonstrate that, for the problem of shortcut-augmented topologies, it is possible to design heuristic approaches that can find comparable solutions of general-purpose search algorithms but take only a tiny fraction of their time.

Proposed Approach
In this section, we describe the details of the proposed EdgeCut approach (implementation and instructions on running the proposed approach are available at https: //github.com/OSU-STARLAB/EdgeCut (accessed on 1 September 2022)). We start by introducing some notations and definitions, and then present two versions of EdgeCut, namely EdgeCut-Full and EdgeCut-Lite, that offer different trade-offs between efficiency and effectiveness.

Definitions, Notations, and Assumptions
An interconnection network topology for N compute nodes can be abstracted as a graph with N vertices. The hop count between two nodes is the length of the path between the corresponding two vertices in the graph. Without additional information, a typical way to estimate the hop count is to use the shortest path length (SPL) between two vertices. From this, we can define an N × N hop count matrix, where each entry (i, j) corresponds to the SPL of two nodes i and j. As illustrated in the example in Figure 1, the shortest path between node 0 and node 1 is one hop, and the shortest path between node 0 and node 2 is two hops. As a result, entry (0,1) in the hop count matrix has a value of one and entry (0,2) has a value of two. Entries in the hop-count matrix are initialized to +∞. Every switch has a degree cap of d. We assume that d is large enough to enable at least one valid topology that fully connects the N nodes.

EdgeCut-Full (ECF)
The problem with vertex-based random shortcut (VRS) is that it strictly requires every vertex to add a fixed number of shortcuts. Consequently, even if the shortcuts are optimal locally at a vertex, they may not be optimal globally. This also indicates that VRS uses more shortcuts to achieve a better topology than is necessary. To address this issue fundamentally, the proposed EdgeCut considers the pool of all possible shortcut candidates directly, and allows flexibility at a specific node as long as the degree cap is not violated.
The first instantiation, EdgeCut-Full (ECF), evaluates the impact of adding a potential shortcut that is more comprehensive by calculating the SPL between all node pairs if that shortcut is added. The rationale behind all-pair shortest path calculation is that adding a single shortcut may potentially affect all nodes. Algorithm 1 shows the pseudo code of ECF. Similar to other shortcut-augmented topologies, ECF starts with a base topology (such as a ring, mesh, torus, or any existing topology) and gradually adds random shortcuts based on their usefulness on reducing the network for low latency. From the base topology node connections, the hop count matrix and degree of each nodes are initialized, and the current all-pair average shortest path length (ASPL) is calculated. Then, the algorithm generates t random edges from the pool of shortcut candidates. For each edge, it re-calculates ASPL assuming that the edge is added, and updates the current best edge if ASPL is reduced.
Algorithm 1 EdgeCut-Full (ECF) for adding shortcuts 1: Initialize hop_count, degree, aspl from base network topology 2: while (added_edge_count < cap_added_edge_count) do 3: Randomly generate t edges (under degree constraint) 4: min_aspl = aspl 5: for each edge(x, y) in t edges do 6 Once all the t edges are considered, the best edge with the lowest ASPL is added to the network. This shortcut addition is continued until the cap of allowed shortcuts is reached.
In terms of computation complexity, the while loop runs for c iterations assuming that up to c shortcuts can be added. To add an edge, the algorithm searches for t random edges, and each requires the recalculation of ASPL (Line 6). Since only one edge is added, the update can be done by enumerating any two points u and v with the update distance [u, v] Similarly, updating hop_count on Line 12 also takes O(N 2 ). Putting things together, the time complexity of ECF is O(c(tN 2 + N 2 )) which is O(ctN 2 ). Note that the hop count matrix of the base topology is known beforehand, so Line 1 includes only the time to copy the hop count matrix locally, not recalculating all-pair shortest paths for the base topology.

EdgeCut-Lite (ECL)
EdgeCut-Full estimates the impact of adding a candidate shortcut accurately but is also computationally expensive. The dominating component comes from updating ASPL in the inner loop, which takes O(N 2 ) each time. To alleviate the complexity issue of ECF, we propose EdgeCut-Lite (ECL) that simplifies this estimation. To select a shortcut to add, we still generate and evaluate among t random edges. For each random edge (x, y), we focus on how much hop count can be reduced between node x and node y if the edge is added. Since the edge directly connects x and y to have a hop count of one, the reduced hop count is equivalent to the previous hop count between x and y (minus one). This hop count value can be retrieved from the hop count matrix with just O(1) time. The random edge with the greatest hop count reduction is selected. The pseudo code for ECL is given in Algorithm 2. Lines 4-10 correspond to the above process. As an example, consider three candidate random edges for node pairs (2, 6), (3,8), and (5, 10) where the corresponding nodes have an SPL of 4, 8, and 6 respectively. In this case, ECL chooses the edge for (3,8) as it has the largest SPL; an additional edge between node 3 and 8 would reduce their shortest distance to 1, resulting in a saving of 7 hops. At the end of the f or loop, the hop count matrix is updated with the selected edge.  + N 2 )).

Methodology
The proposed EdgeCut-Full (ECF) and EdgeCut-Lite (ECL) are evaluated quantitatively against simulated annealing (SA) and virtex-based random shortcut (VRS) algorithms. Multiple topologies including ring, mesh, and torus are used as the base topology to assess the efficacy of the algorithms in adding shortcuts. All the algorithms are implemented in Python 3 and Numpy and executed on the same CPU to ensure fair comparison. In our ex-periments, in addition to N that denotes the number of nodes in the network, we sometimes also use n = √ N to better reveal scalability trends. Different network sizes are evaluated including N = 16, 25, 36, ..., 256 nodes, which corresponds to n of 4, 5, 6, ..., 16. We use a switch degree cap of 10 and the limit of 2n allowable shortcuts for the main evaluation in this section. More sensitivity studies on different degree cap values and shortcut limits are presented in the Discussion section.

Computation Time
Computation time (i.e., search time) required for SA, VRS, ECF, and ECL to reach a stable solution is measured by the running time on an Intel(R) Core(TM) i9 9900 CPU @3.1 GHz. Table 1 compares the search time for different network sizes under the four schemes. For clarity, only the results for using torus as the base topology are shown, as the ring/mesh have very similar numbers. As can be seen, SA takes the longest time for each size, and the search time increases rapidly as the network size increases. This is not scalable and can be costly to repeat for design space exploration if any of the network parameters change. VRS is much faster due to its simplicity in heuristic but has poor performance, as shown shortly. In comparison, the proposed ECF and ECL are 154× and 1982× faster than SA, respectively (and with better generated topologies as shown later). ECL is 11× faster than ECF and is also slighter faster VRS on average.

Reduction in Diameter
Diameter measures the longest shortest path in a network and is an important metric that relates to a number of considerations such as latency, cost, reliability [23]. Table 2 compares the impact on reducing diameter when shortcuts are added for the four schemes.
As can be seen from the table, the proposed ECL, while being the fastest among the four, also achieves very low diameter. Specifically, ECL produces better or equal diameter in 37 out of 39 test sizes on three base topologies (94.9%) compared with SA. This is significant given that SA is designed to achieve near-optimal diameter at the cost of long search time. The proposed ECF optimizes the network diameter further than SA (and with much faster search than SA). For example, with a ring base topology of 16 × 16 = 256 nodes, the diameter of 19 in ECF is better than the 22 in ECL, and is also lower than the 24 in SA. In contrast, the existing vertex-based heuristic VRS has the largest diameter, e.g., 49 in the above example, reducing only about 61% from the base diameter, whereas ECF reduces 85% from the base. Comparing between VRS and ECL, ECL reduces the diameter by 55.1% relatively. Across the three base topologies, improvement of different schemes depends on the ratio of the number of added shortcuts to the number of base edges. For the given 2n shortcuts, ring with N = n 2 base edges benefits more than mesh with 2n(n − 1), and torus with 2n 2 base edges benefits the least. Table 2. Diameter for simulated annealing (SA), vertex-based random shortcut (VRS), EdgeCut-Full (ECF), and EdgeCut-Lite (ECL). "S" denotes shortcut limit, and "Base" denotes base diameter.

Average Shortest Path Length
While diameter captures boundary cases, the average shortest path length (ASPL) provides insights on the average cases for network latency. Figure 2 plots the ASPL for different sizes of ring, mesh, and torus networks when shortcuts are added. The proposed EdgeCut-Full has the lowest ASPL in all the evaluated sizes and topologies. The proposed EdgeCut-Lite has slightly higher ASPL than EdgeCut-Full, but is still very close to SA. Meanwhile, VRS performs significantly worse, with a wider gap for larger network sizes, indicating that the heuristic in VRS does not explore the full potential of shortcuts.

Randomness
All the compared algorithms exhibit randomness, as part of the algorithms involves randomly generated edges. To examine this aspect in more detail, we plot the ASPL over 100 independent runs of the algorithms for adding shortcuts to a 64-node ring network. Figure 3 demonstrates that SA, ECF, and ECL not only generate lower ASPL than VRS but also produce stable results constantly. In comparison, VRS generates good results only 25% of the time.

Effect on Degree Limit
A per-switch degree cap of 10 is used in the above experiments. To gain more insight on how the four schemes may perform under different degree limits, we have evaluated a 64-node mesh network when the degree limit is varied. The results are plotted in Figure 4. Although the specific improvement varies, it is evident that the proposed ECF and ECL have lower ASPL than SA and VRS for all degree limits. ECF is slightly better than ECL, as expected, due to the more comprehensive consideration of the impact of adding shortcuts.

Effect of Shortcuts Limit
Adding shortcuts reduces latency but increases the cost of cables. Moreover, switches have finite ports to connect to compute nodes and other switches, which also limits the number of shortcuts that can be added. The evaluation in Section 4 assumes a shortcut limit of 2n where n = √ N. Here, we conduct additional experiment that varies the shortcut limit for a 64-node network with ring as the base topology. The results are presented in Figure 5. Note that the previous 2n corresponds to 16 here, and the number of added shortcuts ranges from 2 to 64. As expected, the ASPL reduces in all four schemes as more shortcuts are added. The proposed ECF and ECL consistently perform better than the other two schemes. Although the figure shows a trend of diminishing return with small differences when a large number of shortcuts are added, this is prohibitively expensive. Therefore, for practical systems, the proposed EdgeCut has substantial advantage.

Routing and Multi-Level Networks
Irregular networks including shortcut-augmented topologies usually requires topologyagnostic adaptive routing [3]. Research on this aspect is well established, and routing scheme much as [24] with up*/down* routing as the escape path is able to remove routing from the performance bottleneck, thus reflecting the benefits of the underlying topology.
Shortcut-augmented topologies, such as the ones that are generated from our proposed EdgeCut, can be employed in various networks. Large networks are often constructed in multiple levels, e.g., an intra-group network that connects compute nodes within a rack, and an inter-cabinet network that connects racks. EdgeCut can be used to generate topologies at each of those levels.

EdgeCut over Deterministic Shortcuts
In addition to comparing with VRS and SA that generates shortcuts randomly, we have also compared the proposed EdgeCut with two topology designs that add shortcuts deterministically.
The first topology is hierarchical rings that connect a set of rings hierarchically to achieve better scalability than regular ring networks [25]. Our evaluation shows that, for a 64-node network, hierarchical ring topology has an ASPL of 5.84 and a diameter of 12. In comparison, the proposed ECL with 2n shortcuts has an ASPL of 4.88 and a diameter of 10, and ECF has an ASPL of 4.67 and a diameter of 9. Therefore, EdgeCut is considerably better than hierarchical rings in both metrics.
The second topology is flattened butterfly (FB) that adds shortcuts between nodes on the same row/column in a mesh network [26]. For an 8 × 8 network, we follow the original paper to construct a flattened butter from four 4 × 4 sub-networks to keep the bisectional bandwidth reasonable. Evaluation shows that FB has an ASPL of 3.39 and a diameter of 7; whereas ECL and ECF have ASPL of 3.39 and 3.3, respectively, and diameter of 7 and 6, respectively, so again better than FB in both metrics. Furthermore, ECL and ECF use only 128 total edges in contrast to 204 edges required in FB, thus making EdgeCut a more cost-effective approach.

Additional Considerations
The evaluation in this paper focuses on average hop count, diameter, and number of edges, all of which are well-established metrics to assess topologies. For a given system, however, the choice of topologies also depends on several other considerations, such as traffic patterns, queuing effects, cost of links (e.g., electrical vs. optical), cost of switches, etc. Additional evaluation is needed to take these factors into account, such as simulating on a cycle-accurate interconnection network simulator.

Conclusions
Shortcut-augmented topologies have the potential to surpass conventional regular and irregular topologies but have been hindered by the challenge in searching their enormous design space. In this paper, we address this important issue by proposing an efficient and effective heuristic approach that aims to generate more globally useful shortcuts. The proposed EdgeCut-Full considers the performance impact of shortcut candidates more comprehensively but also incurs higher computation, whereas EdgeCut-Lite simplifies the search process while retaining the ability to find good shortcuts. Evaluation results show that the proposed approach is able to generate comparable high-quality topologies as simulated annealing and achieve faster search time than vertex-based method, essentially reaping the benefits of both worlds and offering a better trade-off.