As the communications between tasks in an application depend on the data being processed, it is not possible to efficiently schedule data transmissions offline; thus, a network manager must run online to assign routes to the data packets traveling through the communication infrastructure.
After a short review of the literature, we shall explain our two-step approach: an offline step that performs the PE placement and computes the resource sets and the cost matrix A, and an online message-routing process built on top of the HA.
2.2. Offline Processes
Each node of the application’s DAG to be implemented in a NoC must be mapped onto a PE such that data transmission is performed with the least possible energy consumption and data contention. For this, the edges of the DAG must correspond to the shortest possible communication paths. Therefore, neighboring nodes should be placed in neighboring PEs.
In this work, placement is performed either manually for the example case or automatically by traversing the graph from input nodes to output nodes, level by level, in a zigzag way. Each level is defined by the edge distance to the primary inputs. (It is possible to obtain optimal placements for DAGs whose properties have been profiled with a quadratic assignment [8] of DAG nodes to NoC PEs, but even in this case, optimality depends on the variability of communications over time.)
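To make the automatic placement concrete, the following Python sketch shows one possible implementation of a level-by-level, zigzag mapping onto a mesh. The function name zigzag_placement, the breadth-first level computation, and the boustrophedon ordering of PE positions are illustrative assumptions, not the exact procedure used in this work.

```python
from collections import deque

def zigzag_placement(dag, rows, cols):
    """Map DAG nodes onto a rows x cols mesh, level by level, in a zigzag way.

    dag: dict node -> list of successor nodes (assumed to contain every node as a key).
    Returns a dict node -> (row, col) PE position.  Assumes len(dag) <= rows * cols.
    """
    # Level of a node = edge distance from the primary inputs (nodes with no predecessors).
    n_preds = {n: 0 for n in dag}
    for succs in dag.values():
        for s in succs:
            n_preds[s] += 1
    level = {n: 0 for n, p in n_preds.items() if p == 0}
    queue = deque(level)
    while queue:
        n = queue.popleft()
        for s in dag[n]:
            level[s] = max(level.get(s, 0), level[n] + 1)
            n_preds[s] -= 1
            if n_preds[s] == 0:
                queue.append(s)

    # Visit the mesh PEs in a boustrophedon (zigzag) order so that consecutive
    # positions in the sequence are always physically adjacent.
    pe_order = [(r, c if r % 2 == 0 else cols - 1 - c)
                for r in range(rows) for c in range(cols)]

    # Assign the nodes, ordered by level, to consecutive PEs of that sequence.
    ordered = sorted(level, key=level.get)
    return {n: pe_order[i] for i, n in enumerate(ordered)}
```

Traversing the mesh in alternating directions keeps consecutive levels physically adjacent, which is what favors short communication paths.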
For a given placement, each edge in G can be covered by a set of paths in the NoC. The paths for an edge from node s to node d start at a vertical or horizontal bus port of the PE position where s is mapped and end at the PE position where d is mapped.
The generation of these paths is performed by Algorithm 1, which explores all the paths from the source position to the destination position for every edge in G. To do so, it explores, in a depth-first-search manner, the tree of subsequent neighbor positions. If the last position of the current path corresponds to the destination, the path is eventually stored in the path list P. Otherwise, the current path is augmented with the positions of neighbors that do not cause cycles. In this case, a path contains a cycle if the line where the new segment would sit intersects any previous position in the path.
A new path is stored only if it is not included in any of the paths already in P. Conversely, if some of the stored paths include the new one, they are removed from the list P and the new path is appended.
Algorithm 1 Path generation.
Input: G, M            ▹ G is the DAG and M the mapping of G into the NoC structure
Output: P              ▹ All paths in NoC mapping M for all edges in G
for all PE positions with some task node assigned do       ▹ source node s
    for all edges (s, d) leaving s do
        initialize the stack of temporary paths with the path starting at the position of s
        initialize the counter of paths from s to node d
        while the stack is not empty do
            take the last point of the path on top of the stack
            if it is the destination point then
                build the new path from node s to d
                if it is not included in the stored paths then
                    insert it into P                        ▹ …inserted into P if it was not there
                end if
            else                                            ▹ otherwise, the path must be augmented…
                for all neighbors of the last point do      ▹ …with its neighbors
                    if the neighbor does not create a cycle then
                        push the augmented path onto the stack
                    end if
                end for
            end if
        end while
    end for
end for
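The following Python sketch reproduces the depth-first enumeration of Algorithm 1 under simplifying assumptions: a 4-neighbor mesh, a cycle test reduced to never revisiting a position already on the path, and the inclusion-based pruning of stored paths omitted. The data shapes (a DAG as an adjacency dictionary and paths as tuples of (row, column) positions) are also assumptions for illustration; the enumeration is exponential and is only meant for small meshes.

```python
def generate_paths(dag, mapping, rows, cols):
    """Depth-first enumeration of NoC paths for every DAG edge (simplified sketch).

    dag     : dict node -> list of successor nodes.
    mapping : dict node -> (row, col) PE position on the rows x cols mesh.
    Returns a dict {(s, d): [path, ...]} where each path is a tuple of positions.
    """
    def neighbours(pos):
        r, c = pos
        cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return [(nr, nc) for nr, nc in cand if 0 <= nr < rows and 0 <= nc < cols]

    all_paths = {}
    for s, succs in dag.items():
        for d in succs:
            found = []
            stack = [(mapping[s],)]             # DFS stack of partial paths
            while stack:
                path = stack.pop()
                if path[-1] == mapping[d]:      # destination PE reached
                    found.append(path)
                else:                           # augment with cycle-free neighbours
                    for nxt in neighbours(path[-1]):
                        if nxt not in path:     # simplified cycle check
                            stack.append(path + (nxt,))
            all_paths[(s, d)] = found
    return all_paths
```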
The generation of paths between two positions can be improved by dynamic programming, since the set of paths between two NoC nodes depends only on their relative displacement: it is the same as that between any other pair of nodes separated by the same offset. The relative positions of the two points must be taken into account when translating one path from one set to another, and all the paths that would use segments outside the NoC must be removed from the final sets.
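As a minimal illustration of this reuse, assuming canonical path sets are enumerated once per relative displacement from the origin, a translated set could be obtained as follows (the helper name translate_paths and the tuple-of-offsets representation are assumptions):

```python
def translate_paths(canonical, origin, rows, cols):
    """Reuse a canonical path set computed for one relative displacement.

    canonical : paths enumerated once between (0, 0) and some displacement,
                each a tuple of (row, col) offsets.
    origin    : (row, col) of the actual source PE.
    Only translated paths that stay inside the rows x cols mesh are kept.
    """
    r0, c0 = origin
    shifted = [tuple((r + r0, c + c0) for r, c in path) for path in canonical]
    return [p for p in shifted
            if all(0 <= r < rows and 0 <= c < cols for r, c in p)]
```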
However, the generated paths have different lengths, and some of them can include sequences of nodes that match other, shorter paths for different source and destination points. In these cases, the paths are packed into a single resource: since they are mutually incompatible, they can share the same column of the cost matrix.
After the generation of the minimum paths, Algorithm 2 packs them into groups that form the resources. This algorithm starts by sorting the paths in descending order of the number of lines they use and then, proceeding in that order, creates a set S with the current path and any other path included in it. In the end, the new set is appended to the resource list R.
The lists C (for counter) and B (for boundary) store the number of resources to which each path belongs and the maximum number of appearances, respectively. Any time a path is inserted into a resource, the corresponding counter in C increases. If any two paths within the same resource are compatible, i.e., they share no communication lines, they must appear in another set; thus, their occurrence maximums in B increase.
Algorithm 2 Resource generation (path packing).
Input: P               ▹ P is the list of paths on node mapping M
Output: R, C, B        ▹ R is the list of resources, i.e., sets of paths
                       ▹ C counts appearances of paths in R
                       ▹ B is the boundary of appearances of paths in R
sort the paths in P in descending-length order
for all paths i in P do
    if path i must start a new resource then               ▹ path i added to new resource
        create a new set S containing path i
        for all other paths j do
            if path i is longer than path j and includes it then
                add path j to S
            end if
        end for
        for all paths in S do
            increase their counters in C
        end for
        for all pairs of paths in S do         ▹ check for compatibilities among included paths
            if the two paths have no lines in common then
                increase their boundaries in B
            end if
        end for
        append S to R
    end if
end for
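A simplified Python sketch of the packing step might look as follows. The guard that starts a new resource only from a path not yet covered, the contiguous-subsequence inclusion test, the assumed helper lines_of, and the omission of the boundary list B are assumptions made for brevity, not the exact bookkeeping of Algorithm 2.

```python
def pack_resources(paths, lines_of):
    """Pack paths into resources: each resource groups a path with the shorter paths it includes.

    paths    : list of paths (tuples of PE positions).
    lines_of : function path -> set of NoC lines the path uses (assumed helper).
    Returns (R, C): R is the list of resources (lists of paths) and C[k] counts
    how many resources path k belongs to.
    """
    def included(short, long_):
        # True if 'short' appears as a contiguous subsequence of 'long_'.
        n = len(short)
        return any(long_[i:i + n] == short for i in range(len(long_) - n + 1))

    # Sort path indices by decreasing number of lines used.
    order = sorted(range(len(paths)),
                   key=lambda k: len(lines_of(paths[k])), reverse=True)
    R, C = [], [0] * len(paths)
    for k in order:
        if C[k] == 0:                        # start a new resource from an uncovered path
            group = [k]
            for other in order:
                if other != k and included(paths[other], paths[k]):
                    group.append(other)      # shorter paths contained in paths[k]
            for idx in group:
                C[idx] += 1
            R.append([paths[idx] for idx in group])
    return R, C
```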
The calculation of the cost matrix A is quite straightforward. Each position (i, j) is set to the number of lines used by a path in resource j that serves the connection of edge i, or to the maximum number of lines plus one if resource j contains no path whose origin and end points correspond to those of edge i.
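Under the data shapes assumed in the previous sketches, the cost matrix could be built as follows. The helpers serves and lines, and the choice of the minimum length when several paths of a resource serve the same edge, are illustrative assumptions.

```python
def build_cost_matrix(edges, resources, serves, lines):
    """Build the cost matrix A for a set of DAG edges (tasks) and resources.

    edges     : list of DAG edges to be routed, indexed by i.
    resources : list of resources, indexed by j; each is a list of paths.
    serves(path, edge): True if the path's end points match the edge's mapped
                        source and destination PEs (assumed helper).
    lines(path)       : number of NoC lines used by the path (assumed helper).
    A[i][j] is the number of lines of a path in resource j serving edge i,
    or max_lines + 1 when resource j contains no such path.
    """
    max_lines = max(lines(p) for r in resources for p in r)
    A = []
    for e in edges:
        row = []
        for r in resources:
            costs = [lines(p) for p in r if serves(p, e)]
            row.append(min(costs) if costs else max_lines + 1)
        A.append(row)
    return A
```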
2.3. Online Process
The assignment of resources to tasks must be performed when new pending communication tasks require it. We assume that computing tasks send requests to the network manager in specific time frames or cycles.
In each cycle, some communication tasks, or edges of G, must be assigned to resources from R such that the sum of costs is minimal.
A straightforward approach is to pair each edge with the resource that serves it at the minimum cost. Once paired, the resources are blocked for further assignments.
This behavior is represented in Algorithm 3, with a nested for loop that looks for the assignment of each edge (agent i in the algorithm) with the minimum cost among the resources that have not previously been assigned, that is, those whose entry in P is 0, since vector P contains the row to which a resource is assigned or 0 if it is unassigned.
This greedy way of proceeding works fine when the number of resources exceeds, by a large amount, the number of edges to be assigned. Unfortunately, blocking some assignments might lead to suboptimal solutions; thus, it is much better to perform the assignment with the HA.
Algorithm 3 The greedy algorithm.
Input: A               ▹ the cost matrix of agents (rows) and resources (columns)
Output: B, c           ▹ B binds action to resource and c is the total cost
                       ▹ a value of 0 means unassigned
for all rows of A do                           ▹ for each row (agent) do…
    for all columns of A do                    ▹ for each column (resource) do…
        if the column is unassigned and its cost is the minimum found so far then
            remember it as the best column for this row
        end if
    end for
    if a best column was found then
        pair it with the row and block it for further assignments
    end if
end for
for all rows do
    if the row was paired then
        record the pairing in B and add its cost to c
    end if
end for
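A compact Python rendering of this greedy pairing, assuming a dense cost matrix given as a list of lists with at least as many columns as rows, could be:

```python
def greedy_assign(a):
    """Greedy pairing of n tasks (rows) to m resources (columns) on cost matrix a.

    Each task takes the cheapest resource that is still free; once taken, a
    resource is blocked for the remaining tasks.  Returns (match, cost), where
    match[i] is the column chosen for row i.
    """
    n, m = len(a), len(a[0])
    taken = set()
    match = [-1] * n
    cost = 0
    for i in range(n):
        # Cheapest column not previously assigned (assumes m >= n).
        best = min((j for j in range(m) if j not in taken), key=lambda j: a[i][j])
        match[i] = best
        taken.add(best)
        cost += a[i][best]
    return match, cost
```

On a matrix such as [[1, 2], [1, 100]], this greedy choice yields a total cost of 101 (rows take columns 0 and 1 in order), whereas the optimal assignment costs 3; this is the kind of situation where the HA pays off.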
In fact, both algorithms perform equally well whenever the sequential selection of minimum task-resource values leads to the final solution. In the HA (see Algorithm 4), there is a for loop over the edges to be assigned and another over the resources. The latter is inside a repeat loop that searches for the best possible assignment.
This version of the algorithm was extracted from a program by Andrey Lopatin [16], which is one of the most compact implementations of the HA in the literature, and was later implemented in Lua [17].
In contrast with other HA versions, this one does not modify the cost matrix A and uses auxiliary vectors to account for the row and column offsets (U and V, respectively), the indices of the rows with which columns are paired (P), the preceding elements in the current decision-taking step (W), the minimum costs per column (L), and a Boolean value indicating whether a column is already paired (T). The zero positions of some of these vectors are used as control values for the program, and B stores the indices of the columns paired to each row.
Pairing an edge of G with a resource implies that no other path within that resource can be assigned to another edge, i.e., there can be only one 1 per column in the pairing matrix B. Unfortunately, there is no guarantee that the corresponding path is compatible with another assigned path in a different column and row. Therefore, the solutions of the greedy and Hungarian algorithms must be checked for compatibility. If some assignments are incompatible because the corresponding paths in the affected resources share lines, the associated cost is set to the maximum for one of the conflicting task-resource pairs, and the assignment procedure is repeated. The loop continues until the result contains no incompatible assignments.
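The iterative conflict-solving loop can be sketched as follows. Modeling each resource by the set of lines it uses, treating maximum-cost pairings as waits that occupy no lines, and the generic solver parameter (either of the two assignment procedures) are simplifying assumptions for illustration.

```python
from copy import deepcopy

def route_requests(a, res_lines, solver, max_cost):
    """Re-run an assignment solver until the selected resources are compatible.

    a         : n x m cost matrix (tasks x resources); a private copy is modified.
    res_lines : res_lines[j] is the set of NoC lines used by resource j.
    solver    : function(cost matrix) -> (match, cost), e.g. greedy_assign or an HA.
    max_cost  : penalty value that marks a task-resource pair as forbidden (a wait).
    """
    a = deepcopy(a)
    while True:
        match, cost = solver(a)
        # Count, for every routed task, how many other routed tasks it conflicts with.
        conflicts = [0] * len(match)
        for i, ji in enumerate(match):
            if a[i][ji] >= max_cost:
                continue                  # a wait: this task is not routed in this cycle
            for k, jk in enumerate(match):
                if k != i and a[k][jk] < max_cost and res_lines[ji] & res_lines[jk]:
                    conflicts[i] += 1
        if not any(conflicts):
            return match, cost            # all routed paths are line-disjoint
        # Penalize the assignment with the highest conflict count and retry.
        worst = max(range(len(match)), key=conflicts.__getitem__)
        a[worst][match[worst]] = max_cost
```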
The complexity of the assignment problem for different NoC sizes is illustrated in Table 2, which shows the averages of several runs. In all of the cases, all the PEs of the NoC are assigned to some node of a random graph with an average node fanout of 2. Each row of the table corresponds to averages of at least 25 assignment cycles in 10 or more different random DAGs (i.e., averages of 250 runs or more). The number of paths and resources (column “no. of rsrcs.”) increases with the size of the NoC, although this growth is affected by the fact that local connections also have global effects, as they use all the communication bus lines.
Algorithm 4 The Hungarian algorithm.
Input: A               ▹ the cost matrix of agents (rows) and resources (columns)
Output: B, c           ▹ B binds action to resource and c is the total cost
for all rows of A do                           ▹ for each row (agent) do…
    repeat                                     ▹ repeat until the task is assigned
        for all columns of A do                ▹ for each column (resource) do…
            if the column is not yet paired (T) then       ▹ if not assigned then…
                if its offset cost improves the column minimum then
                    update L and the preceding element W
                end if
                if its column minimum is the smallest found then
                    select it as the next column to visit
                end if
            end if
        end for
        for all columns do
            if the column is paired then
                update the offsets U and V
            else
                update the column minimum in L
            end if
        end for
    until the selected column is free
    update the pairings in P and B by backtracking through W
end for
for all columns do
    if the column is paired then
        record the pairing in B and add its cost to c
    end if
end for
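The listing above follows the compact implementation attributed to Lopatin; a Python port of that well-known formulation, offered here only as a hedged reference sketch (1-based auxiliary vectors with index 0 used as a control position, as described above), could read:

```python
INF = float("inf")

def hungarian(a):
    """Minimum-cost assignment of n rows (tasks) to m >= n columns (resources).

    a: n x m cost matrix (list of lists).  Returns (match, cost) where match[i]
    is the 0-based column assigned to row i and cost is the total assignment cost.
    """
    n, m = len(a), len(a[0])
    u = [0] * (n + 1)            # row potentials (offsets U)
    v = [0] * (m + 1)            # column potentials (offsets V)
    p = [0] * (m + 1)            # p[j]: row matched to column j (0 = free)
    way = [0] * (m + 1)          # way[j]: previous column on the augmenting path (W)
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)   # minimal reduced cost per column (L)
        used = [False] * (m + 1) # whether a column is on the alternating tree (T)
        while True:              # grow the tree until a free column is reached
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = a[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match, -v[0]
```

For instance, hungarian([[1, 2], [1, 100]]) returns ([1, 0], 3), recovering the optimal pairing that the greedy sketch above misses.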
The probability of a communication task being requested at any given time is set to a low, fixed value; thus, the average number of requests per assignment cycle (column “no. of reqs.”) is relatively low, though it grows with the size of the NoC.
As expected, the greedy approach and the Hungarian algorithm perform almost equally well, with a slight advantage for the latter. It is worth noting that these are the results after solving conflicts among task-resource assignments. Because this conflict-solving procedure has exponential complexity, simpler strategies must be adopted.
We tried several options for selecting which conflicting assignment to penalize, including selecting the first one or the one with the least cost, but none is as effective as choosing the one with the highest conflict count. Even so, this strategy eventually generates suboptimal solutions. For the example in the table, the percentage of cases where the assignment procedure is repeated (column “%iter.”) grows with the size of the NoC, and so does the number of iterations (column “#iter.”) needed to reach a fully compatible assignment of resources, typically with some waits.
Note that there is a cumulative effect of waits (i.e., unassigned tasks that remain for the next cycle), which are not considered here. For the cases in the table, the percentages of unattended tasks go from 3.4% to 18.7%, again with the HA option being the best. However, in a real case, the sparsity of communication needs will probably give enough free cycles to absorb the pending transmissions of a set of requests.
The same cases are simulated with different probabilities of occurrence of the communication task requests. As expected, the differences between the two methods decrease for lower request probabilities and increase for higher ones. In fact, at the higher probabilities, the HA outperforms the greedy algorithm by percentages ranging from 5% to 25% in all cases and factors.
To see how much it can improve communication performance in real cases, the two algorithms are compared using the realistic traffic benchmark suite MCSL [18]. This benchmark contains the traces of the simulated execution of eight real cases on NoCs of various dimensions and topologies, namely fat tree, torus, and mesh; the mesh is the topology used here to simulate the assignments.
The recorded traffic pattern files contain data on the execution of tasks (location in the NoC, sequence number in the execution schedule, and execution time in number of cycles) and on the communications (source and destination tasks, memory addresses, and data size). From this information, we extract which data transmissions must be made in each execution of the network manager. Unlike the random application graphs, which occupy all the available PEs of the NoC, the mappings of the real applications concentrate the tasks in a subset of the PEs, which leaves more degrees of freedom for message routing, particularly in the bigger NoCs.
Table 3 shows the average assignment costs for each application and NoC size. In this case, we choose to discard neither the initial part of the executions, where the highest costs occur, nor the final part, where the costs are usually lower. As expected, the smaller the NoC, the more difficult it is to find routes for all simultaneous requests, and the Hungarian algorithm performs better than the greedy one. As the NoC dimension increases, there are more degrees of freedom, and the difference between the algorithms disappears; in fact, the number of zeros in the “% gain” columns increases with the NoC dimension. This percentage is calculated by taking the average cost of the greedy algorithm as 100% and expressing the HA cost relative to it; thus, the more negative, the better for the HA. The best case shown in the table is for the “RS-32_enc” benchmark (at one of the NoC sizes reported), where the average cost of the HA is only 76.81% of that of the greedy algorithm. (Note that costs increase with the NoC size because the cost assigned to waits grows with it, so the actual gains on transmissions are slightly better than shown.)
In summary, in cases where there are many degrees of freedom because, for example, not all the PEs of a NoC are used, the greedy algorithm works just as well as the Hungarian one and, in fact, would coincide with what a conventional network manager would do. In these cases, it is necessary to assess whether it is viable to run the offline processes, since they involve calculating all possible paths to cover all the arcs of the application graph. If it is, the proposal made in this work greatly reduces the complexity of allocating resources to tasks during the execution of applications, since the minimum paths are calculated beforehand.
In cases with few degrees of freedom, the network manager should use the iterative conflict-solving approach with HA to maximize the use of communication resources and minimize energy consumption and waits.