Application-Specific SoC Design Using Core Mapping to 3D Mesh NoCs with Nonlinear Area Optimization and Simulated Annealing

Abstract: Core mapping, in which a core graph is mapped to a network graph to minimize communication, is a common design problem for Systems-on-Chip interconnected by a Network-on-Chip. In conventional multiprocessors, this mapping is area-agnostic, as the cores in the core graph are uniform and therefore iso-area. This changes for Systems-on-Chip because tasks are mapped to specific blocks rather than general-purpose cores, so the areas of these blocks vary. This requires novel mapping methods. In this paper, we propose an area-aware cost function for simulated annealing; furthermore, we advocate the use of nonlinear models, as area is nonlinear: a semidefinite program (SDP) can be used, as it is sufficiently fast and yields 20% better area than conventional linear models. Our cost function allows for up to 16.4% better area, 2% better communication (bandwidth times hop distance), and 13.8% better total bandwidth in the network in comparison to the standard approach that accounts for network communication and also uses cores with varying areas.


Introduction
Core mapping is an important design-time optimization problem for chips interconnected by Networks-on-Chip (NoCs). The target of this mapping problem is a better distribution of work among the cores to improve data movement between them. Different objectives can be found in the literature, such as reducing power [1] or avoiding bandwidth limitations [2]. In this paper, the objective optimizes the area of the chip, which is not commonly found in the literature so far because of the tacit assumption of iso-area cores. However, this assumption is not valid for all systems, as specific blocks have different areas in contrast to general-purpose cores. We extend our work from [3], in which we proposed a nonlinear model to improve area, by incorporating this model into a simulated annealing cost function to solve area-aware core mapping.
The problem of core mapping is defined as follows: The application's data streams are modeled using a core graph, in which nodes represent cores and edges with their edge weights model the bandwidth of the data streams between the cores. This core graph is mapped to a chip, typically a multiprocessor interconnected by an NoC. The chip is represented by a network graph, in which nodes model tiles that reserve space for an NoC router and a core, and edges model links between tiles. The objective of this optimization is minimization of network latency, typically measured as the cumulative hop distance × bandwidth (e.g., [2]); maximization of throughput, typically measured by the maximum bandwidth transmitted through single links (e.g., [4]); minimization of energy consumption, measured in dynamic router activity (e.g., [1]); or minimization of execution time, measured by the hop distance along the critical path (e.g., [5]). Core mapping is a very common electronic design automation (EDA) task in NoC design, and many approaches, from exact analytical solutions, e.g., [5], to heuristics such as simulated annealing (SA), e.g., [2], have been proposed.
Recently, Systems-on-Chip (SoCs) have gained attention. There are two important differences to multicore processors: First, the function and thus the area of cores varies in SoCs. Thus, the underlying assumption of iso-area cores, which led to disregarding the core area during core mapping, is not valid anymore. Second, many SoCs, such as vision chips [6], are application specific, while multiprocessors are not. Thus, the application properties must be accounted for already during core placement to exploit additional optimization potential. Using conventional approaches, each tile would have to reserve area for the largest core, which is naturally inefficient. An example is depicted in Figure 1, in which cores of different sizes (orange) allocate less area than reserved (light gray). Therefore, novel approaches are required. Ref. [3] introduced nonlinear models and compared them against linear models to optimize area during core mapping. Here, we extend this work by proposing a cost function for simulated annealing to optimize the core mapping and the chip area; specifically, the nonlinear models are used to optimize area within the simulated annealing. Since these nonlinear models are exact for area, we propose a mixed exact-approximate method to minimize communication and area in SoCs during design-time core mapping.
The remainder of this work is structured as follows: A review of the state of the art is given in Section 2. Next, the area-aware simulated annealing is introduced in Section 3, which uses linear or nonlinear models to optimize area. Both of these models are introduced in Section 4; they are based on our preliminary paper [3] that this work extends. The results obtained with the simulated annealing are reported in Section 5. The work is concluded in Section 6.

Related Work
As already explained, many works on core mapping exist. In general, there are two classes of mapping methods: exact approaches, which use an analytical model such as a mixed-integer linear program (MILP) or an extensive search of the solution space, and heuristic approaches, which approximate the solution at better runtime, e.g., simulated annealing or particle swarm optimization. Within each class, the approaches can be further classified by their objective functions, which, e.g., minimize power or maximize performance.
In the first class of exact approaches, the vast majority of works use mixed-integer linear programming to solve core mapping: Ref. [7] allows for connecting multiple cores to a single router. By that, energy consumption is reduced by up to 81.2% compared to one-to-one connections. Ref. [8] optimizes mapping and topology selection, achieving bandwidth, area, and network component savings of at least 50% each in comparison to traditional design approaches. Ref. [9] maximizes the worst-case throughput and also accounts for multi-threaded processes, i.e., mapping of multiple cores to a single tile in the network graph. Ref. [10] minimizes communication energy using mapping; it targets integration into frameworks that find optimal network voltage and frequency. In particular, Ref. [2] is worth mentioning, as it is the standard work on core mapping in which cores have different areas. The work uses an MILP to synthesize an NoC topology from a core graph with area and power annotations. In contrast to this paper, area reduction is not the objective; rather, Ref. [2] reduces power. Different multimedia benchmarks [11] are used for evaluation. This paper is closely related to [2] because of the consideration of area. Thus, we compare against the same benchmarks for a fair comparison. As our objective function is different, we are able to achieve better area figures.
In the second class of heuristic approaches, many different algorithms have been explored, e.g., genetic algorithms (GA), in which solutions evolve; particle swarm optimization (PSO), in which agents collaboratively find a good solution; or simulated annealing (SA), in which cooling processes serve as inspiration to find optimal configurations. Ref. [12] uses a GA to minimize the overall execution time of the application. More recently, GAs are rarely used due to longer runtimes than PSO and SA. Ref. [13] optimizes mapping for partially vertically-connected 3D NoCs to make best use of through-silicon vias (TSVs). The authors propose both an MILP and a PSO to improve network congestion, but the MILP has too long a runtime for realistic use cases. SA is one of the most-used EDA methods. It can be used at many abstraction levels, from gate-level [14] to system-level optimization [15]. A reason for this lies in SA's compelling performance, as we will show with a comparison against PSO in this work. The performance of SA can be improved further by combination with different techniques: For example, Ref. [16] shows that SA core mapping combined with cluster analysis allows for up to 30% better runtime at the same quality of results in comparison to off-the-shelf SA. Ref. [17] shows that applying further knowledge about the structure of the objective function allows for up to 66% better average energy consumption in comparison to a blind search. Ref. [18] also focuses on reduction of energy consumption. In this approach, the router allocation is done prior to voltage islanding, thus saving up to 63% power and delay over Sunfloor [1]. Ref. [19] focuses on runtime reductions under thermal constraints, which are specifically challenging in 3D NoCs. In their work, the authors formulate a communication- and thermal-aware mapping problem and solve it using custom heuristics. They achieve up to 43% better runtime than related works.
To summarize, there are many approaches in the literature on core mapping in NoC-based multiprocessors. Only a small subset accounts for area, because most of the works assume homogeneous cores, which is not valid for heterogeneous scenarios. As area is intrinsically nonlinear, it is not possible to model it exactly through means of linear models. Thus, a novel approach is required, as proposed in this work.

Problem Definition
The problem of core mapping has been defined multiple times, e.g., [16,20]. The difference to our definition lies in the annotation of the core graph with area. Even more, this area annotation enables the definition of a new objective function including area. The problem of core mapping takes a core graph and a network graph as input. These are defined as follows:

Definition 1 (Core Graph): The core graph models the area of cores as well as the bandwidth requirements for communication between cores. It is a digraph CG = (C, E_C), in which the set C of vertices consists of all cores c_i, with i ∈ {1, …, |C|} the set of core indexes. The set of directed edges e_{i,j} ∈ E_C models the communication between cores c_i and c_j ∈ C. Cores are area-annotated by the function area : C → R⁺. The bandwidth between nodes is given by the capacity function bandwidth : E_C → R⁺.

Definition 2 (Network Graph): The network graph models the interconnection topology of the set of target SoC architectures. The network graph is an undirected graph NG = (T, E_T), in which the set T of vertices consists of tiles t_i, with i ∈ {1, …, |T|} the set of tile indexes, which implement one NoC router each and reserve space for the area of a mapped core. The set of edges e_{i,j} = e_{j,i} ∈ E_T models the connections between routers in tiles t_i and t_j ∈ T.
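For illustration, the two input graphs of Definitions 1 and 2 can be sketched as plain data structures. This is a hypothetical Python encoding with made-up cores and a 2×2 mesh, not the paper's MATLAB implementation:

```python
# Core graph CG = (C, E_C): area-annotated cores (area : C -> R+) and
# directed, bandwidth-annotated edges (bandwidth : E_C -> R+).
core_area = {"c1": 4.0, "c2": 1.0, "c3": 2.25}
core_bandwidth = {("c1", "c2"): 70, ("c2", "c3"): 30}

# Network graph NG = (T, E_T): tiles of a 2x2 mesh, identified by their
# grid coordinates, and undirected links between grid neighbors.
tiles = [(x, y) for x in range(2) for y in range(2)]
links = {frozenset({a, b}) for a in tiles for b in tiles
         if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1}

print(len(tiles), len(links))  # 4 tiles, 4 mesh links
```

Undirected links are encoded as frozensets so that e_{i,j} = e_{j,i} holds by construction.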
The aim of the core mapping is to find a mapping that minimizes an objective function.

Definition 3 (Mapping Function):
The mapping function assigns a core c ∈ C to a tile t ∈ T. It is defined as map : C → T. The mapping function is injective because each tile can only host one core.
We also define two auxiliary functions. First, the mapping of cores to tiles results in an area requirement for each tile:

Definition 4 (Network Area Function): The area requirement of each tile is given by the function F, defined as F(t_j) = area(map⁻¹(t_j)). Since the mapping function is injective, map⁻¹ is well-defined on the image set of map. Where map⁻¹(t_j) is not defined, we set F(t_j) = 0 instead.

Second, we model the flow of packets in the network graph, i.e., the paths of packets based on the routing algorithm, as a source-sink flow in the network digraph.

Definition 5:
We define the function f that gives the network flow for each pair of cores; the value of the c_i–c_j flow is denoted by value(f), following the convention used in [21]. Flows are a very powerful concept, as they give a natural approach to the conversion of core flows to network flows. The function f assigns a flow value to every edge in the network graph E_T by considering the flow induced by all edges in the core graph E_C. Hence, a flow for a specific pair of cores is assigned to all links in the network that will be passed by its packets. Consequently, both deterministic and adaptive routing algorithms can be modeled. As packets following deterministic routing algorithms have only one path through the network, the values of the flows will be binary, i.e., the set of links passed by packets has a flow value of 1, while all other links have a flow value of 0. In the case of adaptive routing algorithms, the value of the flow along each link will be in the interval [0,1]. This flow value represents the probability that a packet from the pair of cores will pass this very link when routed.
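As a concrete illustration of the binary-flow case, the following sketch (hypothetical names, assuming deterministic XY routing on a 2D mesh) computes the set of links that carry a flow value of 1 for one pair of mapped cores; all remaining links implicitly carry 0:

```python
def xy_route(src, dst):
    """Return the list of tiles visited from src to dst (X first, then Y)."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def link_flow(src, dst):
    """Links of E_T with flow value 1 for the stream between src and dst."""
    path = xy_route(src, dst)
    return {frozenset({a, b}) for a, b in zip(path, path[1:])}

flow = link_flow((0, 0), (2, 1))
print(len(flow))  # hop distance 3, so 3 links carry flow value 1
```

An adaptive routing algorithm would instead distribute fractional flow values over the alternative minimal paths.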
By that, the objective function of the area-aware core mapping can be defined as the weighted sum of three addends for area, latency, and bandwidth. The addends are defined as follows: The cost for area, O_area, is the total area of the chip, including whitespace, based on F for a given mapping. The calculation of area is strictly dependent on the network topology. Here, we use a 2D mesh, as shown in Figure 1. The mesh reduces the spatial freedom and requires that cores are located in a grid. Thus, the area of the chip is given by the width W and the height H of the floorplan for this mapping. As the area W·H is a product and difficult to calculate in linear models, we use the easy-to-linearize maximum function to approximate area; this results in a model favoring square chips. The objective for the chip area thus is O_area = max(W, H). The cost for latency is measured by the hop distance × bandwidth for all data streams in the core graph: O_latency = Σ_{e_{i,j} ∈ E_C} hops(map(c_i), map(c_j)) · bandwidth(e_{i,j}). The cost for bandwidth is the maximum bandwidth transmitted through any single link in the network graph for a given mapping: O_bandwidth = max_{e ∈ E_T} Σ_{e_{i,j} ∈ E_C} f(e_{i,j}, e) · bandwidth(e_{i,j}). This objective function defines the novel problem of area-aware core mapping.
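The latency and bandwidth addends can be computed directly from a candidate mapping; the area addend requires the models of Section 4. The following sketch uses hypothetical cores and bandwidths and assumes deterministic XY routing on a 2D mesh:

```python
streams = {("c1", "c2"): 70, ("c2", "c3"): 30}        # bandwidth : E_C -> R+
mapping = {"c1": (0, 0), "c2": (1, 0), "c3": (1, 1)}  # map : C -> T

def xy_path(a, b):
    """Tiles visited between a and b: X dimension first, then Y."""
    x, y = a
    path = [a]
    while x != b[0]:
        x += 1 if b[0] > x else -1
        path.append((x, y))
    while y != b[1]:
        y += 1 if b[1] > y else -1
        path.append((x, y))
    return path

# O_latency: cumulative hop distance x bandwidth over all data streams.
o_latency = sum(bw * (len(xy_path(mapping[s], mapping[d])) - 1)
                for (s, d), bw in streams.items())

# O_bandwidth: maximum bandwidth transmitted through any single link.
link_load = {}
for (s, d), bw in streams.items():
    p = xy_path(mapping[s], mapping[d])
    for u, v in zip(p, p[1:]):
        link = frozenset({u, v})
        link_load[link] = link_load.get(link, 0) + bw
o_bandwidth = max(link_load.values())

print(o_latency, o_bandwidth)  # 100 70
```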

Simulated Annealing
We solve the area-aware core mapping using simulated annealing. We also implemented an exact solution using an MILP; since the runtime of this MILP is very poor and only allows for solving input sets with up to seven components in reasonable time, we do not give a detailed definition. Therefore, a heuristic such as simulated annealing is required. The steps of the simulated annealing are shown in Figure 2. An initial mapping is calculated from a core graph CG and a network graph NG. As depicted, this mapping does not optimize area and therefore includes whitespace (here shown in gray), i.e., unused die area. The initial mapping can be either random or area-efficient, depending on the goals of the optimization. Next, the simulated annealing is executed. The algorithm is initialized with this given valid, but possibly inefficient, solution. The solution candidate is modified iteratively by executing the neighbor function, which slightly changes the mapping map. As a novel feature, we optimize the area analytically within the simulated annealing before calculating the objective function ("Minimize Area" in Figure 2). This allows for a precise area optimization beyond the limitations of the heuristic approaches possible with simulated annealing. We will explain the analytical optimization of area separately in Section 4. After this analytical step, the complete objective is calculated in that step of the simulated annealing. The algorithm might accept the new solution based on the value of the objective function. Naturally, the simulated annealing is iterated and stopped when the terminating conditions are met. The final solution returns a mapping function map that minimizes area and communication, i.e., it includes a floorplan for the given mapping. The initial solution and the neighbor function of the simulated annealing are defined as follows:

Definition 7 (Initial Solution): There are two ways to generate an initial solution:
1. Randomly generated: The function map is generated such that each core c_i is assigned to one random tile t_j.
2. Area-efficient: The floorplan will be packed area-efficiently, i.e., with minimal whitespace, if all tiles within a row and a column have a similar area. Such a good candidate can be found using a greedy strategy: The cores are sorted descending by area. The tiles are filled from the upper left corner. The cores are assigned to the next free tile in the current row or column, while row and column assignment alternate. If a row/column is full, tiles will be assigned to the adjacent one. Figuratively speaking, the tiles are filled from the upper left corner to the bottom right corner.
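A minimal sketch of the area-efficient initial solution, with hypothetical names and values. For brevity, this sketch fills tiles row-major from the upper left instead of alternating row and column assignment; it preserves the key property that the largest cores are placed first, starting at the upper left corner:

```python
def initial_mapping(core_area, rows, cols):
    """Greedy initial solution: sort cores descending by area and fill
    the mesh tiles starting from the upper left corner (0, 0)."""
    order = sorted(core_area, key=core_area.get, reverse=True)
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    return dict(zip(order, tiles))

areas = {"c1": 4.0, "c2": 9.0, "c3": 1.0, "c4": 6.25}
print(initial_mapping(areas, 2, 2))
# the largest core, c2, lands on tile (0, 0)
```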
Definition 8 (Neighbor Function): The neighbor function (or move function) modifies a given mapping function map: it takes the function map as input and returns a modified function map′. Specifically, the mapping is modified by selecting a core c_i and a tile t_j ≠ map(c_i) uniformly at random. The selected core is mapped to the selected tile, i.e., position. If a core is already present there, the two mapped cores are swapped. Thus, the modified mapping function map′ swaps the images of the affected cores and leaves all other cores unchanged. This concludes the definition of the simulated annealing. It remains to optimize the area for a given mapping map, as explained in the next section.
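The neighbor function and the surrounding annealing loop can be sketched as follows. This is hypothetical Python (the paper's implementation is in MATLAB), using a standard Metropolis acceptance criterion and the SA parameters reported in Section 5; the analytical "Minimize Area" step of Figure 2 would be invoked inside the cost callback:

```python
import math
import random

def neighbor(mapping, tiles):
    """Definition 8: pick a core and a tile other than its current one;
    move the core there, swapping with any occupant of that tile."""
    m = dict(mapping)
    core = random.choice(list(m))
    tile = random.choice([t for t in tiles if t != m[core]])
    occupant = next((c for c, t in m.items() if t == tile), None)
    if occupant is not None:
        m[occupant] = m[core]   # swap keeps the mapping injective
    m[core] = tile
    return m

def anneal(mapping, tiles, cost, t0=30.0, cooling=0.97, iters=1000):
    """Plain SA loop: accept worse candidates with probability
    exp(-delta / temperature), then cool the temperature."""
    current, temp = dict(mapping), t0
    for _ in range(iters):
        candidate = neighbor(current, tiles)
        delta = cost(candidate) - cost(current)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
        temp *= cooling
    return current
```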

Area Optimization for a Given Mapping Using Linear and Nonlinear Models
Conventionally, a mapping of tasks to (multiprocessor) cores would not require an optimization of area, since cores are assumed to be identical and thus equal in area. However, this assumption is not valid in SoCs, since each task implements a different IP with varying size. Hence, tasks must also reserve adequate area for their implementing IP. The optimization problem's constraints and variables are shown in Figure 3: Cores are mapped to a mesh of tiles. Each core has an area value, denoted by F_{i,j} for a core in a given row i and column j. The heights of all rows and the widths of all columns, denoted by r_i and c_j respectively, must be minimized. The product of width and height of each tile is constrained by the size of the mapped core. By that, the whitespace (gray in Figure 3) is reduced. More specifically, the area optimization problem during core mapping is formulated as follows [3]: Assume a given mapping of at most l·k cores to tiles in a mesh of l rows and k columns. Each core has the area F_{i,j}, for mapping to row i ∈ [l] := {1, …, l} and column j ∈ [k] := {1, …, k}. For all empty tiles without a mapped core, F_{i,j} will be zero. The height of rows is denoted by r_i ∈ R for all i ∈ [l]. The width of columns is denoted by c_j ∈ R for all j ∈ [k]. The area of each tile is constrained by its mapped core: r_i c_j ≥ F_{i,j} (Equation (4)). The objective function O_2 minimizes the side length of a square that encloses all tiles (i.e., W = Σ_j c_j and H = Σ_i r_i): O_2 = max(Σ_i r_i, Σ_j c_j). The objective function O_2 is not linear due to the use of the max function and hence must be linearized. This can be done using an auxiliary variable F ∈ R, which is constrained by the maximum of the summed height and width of the SoC: F ≥ Σ_i r_i and F ≥ Σ_j c_j. This relatively easy approach is possible because the linearized objective function minimizes F. The issue of modeling Equation (4) remains, which is not linear. We propose both a linear approximation in Section 4.1, which is fast to calculate but yields an approximation error, and a nonlinear model in Section 4.2, which is slower than the linear approximation but has no error.

Linear Model
Since the area of a rectangle F_{i,j} with edge lengths r_i and c_j cannot be calculated through means of a linear model, a linear approximation is required. The approach from Lacksonen et al. [22] for the factory layout problem can be applied here as well. Equation (4) is depicted in Figure 4: The iso-area hyperbola r_i c_j = F_{i,j} is shown in red in the space of row heights r_i and column widths c_j. Linearization of the iso-area hyperbola is possible by introducing an additional constraint for the aspect ratio of each tile. The aspect ratio η ∈ (0, 1) limits the height and width of tiles for all i ∈ [l] and j ∈ [k]: r_i ≥ c_j η (Equation (10)) and r_i ≤ c_j η⁻¹ (Equation (11)). The constrained aspect ratio is shown in Figure 4a in blue. Figure 4a also shows the solution space as the red-shaded area. Following Equation (4), the area r_i c_j of a tile i, j must be larger than its core with size F_{i,j}, i.e., F_{i,j} ≤ r_i c_j. The iso-area hyperbola is the lower left bound of the solution space. The maximum edge lengths of the tile further limit the solution space, given by the constraints c_j ≤ y_max and r_i ≤ x_max. Finally, the solution space is limited by the line equations for the aspect ratio η from Equations (10) and (11).
The iso-area hyperbola is approximated by a line equation given by the intersections between the lines for the aspect ratios and the maximum edge lengths. This line equation is shown in black in Figure 4a. The resulting linearization error is plotted in green in Figure 4a. In general, it is possible to reduce this error by using multiple equally-spaced knots, as shown in Figure 4b. Each linear equation connecting two adjacent knots on the iso-area hyperbola (r_i c_j = F_{i,j}) is called a 1-spline. While more 1-splines reduce the error, they also significantly increase the model complexity: Integer inequalities are required to determine in which spline a given solution is located, with at least three additional integer inequalities per supporting point. Naturally, this reduces runtime performance.
To summarize, the linear optimization minimizes F subject to the constraints above with aspect ratio η ∈ (0, 1). Equation (4) is approximated by Equation (12), r_i + c_j ≥ √(F_{i,j} η) + √(F_{i,j}/η), for one single 1-spline as in Figure 4a. It can easily be deduced from the intersections of the iso-area hyperbola with Equations (10) and (11). The required values √(F_{i,j} η) + √(F_{i,j}/η) can be precalculated before starting the optimization and thus are constants within the linear model.
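The single 1-spline can be checked numerically: within the aspect-ratio wedge of Equations (10) and (11), every point on the approximating line of Equation (12) also satisfies the original nonlinear constraint of Equation (4), with equality exactly at the two intersection points. A sketch with made-up values F_{i,j} = 6 and η = 0.4:

```python
import math

F, eta = 6.0, 0.4
# Precalculable right-hand side of the 1-spline constraint (Equation (12)).
k = math.sqrt(F * eta) + math.sqrt(F / eta)

ok = True
for step in range(101):
    # Sample points on the line r + c = k, restricted to the wedge
    # between the aspect-ratio boundaries r = c*eta and r = c/eta.
    c = k / (1 + eta) + step / 100 * (k / (1 + 1 / eta) - k / (1 + eta))
    r = k - c
    assert eta - 1e-9 <= r / c <= 1 / eta + 1e-9   # inside the wedge
    ok &= r * c >= F - 1e-9                        # Equation (4) holds

print(ok)  # True: the chord lies inside the region r*c >= F
```

This works because the region r·c ≥ F is convex, so the chord between two points on its boundary stays inside the region; outside the wedge, the line would no longer guarantee Equation (4), which is why the aspect-ratio constraints are required.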

Nonlinear Model
To remove the linearization error, SDPs can be used, because they can express the red iso-area hyperbola in Figure 4. We set k·l variables X_{k(i−1)+j} such that

X_{k(i−1)+j} = [ r_i, √(F_{i,j}) ; √(F_{i,j}), c_j ].

These matrices are premised to be positive semidefinite (i.e., "⪰ 0"); thus, each principal minor is greater than or equal to 0: det(X_{k(i−1)+j}) ≥ 0 ⇔ r_i c_j − F_{i,j} ≥ 0, which is exactly Equation (4). We formulate an SDP. The objective function minimizes the linearized variable x ≥ max{Σ_i r_i, Σ_j c_j} using Equations (6) and (7), subject to the following constraints. We assign the corresponding area values to each matrix using the Frobenius inner product. For each i ∈ [l], the upper left entry of the matrices X_{k(i−1)+j} has the same value for all j ∈ [k] (this models r_i). For each j ∈ [k], the lower right entry of the matrices X_{k(i−1)+j} has the same value for all i ∈ [l] (this models c_j). We model the maximum variable x for the objective function (this models x ≥ Σ_i r_i and x ≥ Σ_j c_j). Again, areas of tiles are constrained by an aspect ratio η. Note that this aspect ratio is not violated by the relation between r_i and c_j; rather, a component can find a rectangle inside the bounding box given by r_i c_j. This rectangle has the size of the core, and the aspect ratio of its edges is greater than η. We formulate corresponding constraints for all i ∈ [l] and for all j ∈ [k].
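The SDP building block can be illustrated numerically: for a 2×2 symmetric matrix [[r_i, s], [s, c_j]] with s = √(F_{i,j}), positive semidefiniteness is equivalent to nonnegative diagonal entries and a nonnegative determinant r_i·c_j − F_{i,j}, i.e., exactly the nonlinear area constraint of Equation (4). A sketch with made-up values:

```python
import math

def is_psd_2x2(a, b, c):
    """PSD test for the symmetric matrix [[a, b], [b, c]]: nonnegative
    diagonal and nonnegative determinant (the principal minors)."""
    return a >= 0 and c >= 0 and a * c - b * b >= 0

F_ij = 6.0
s = math.sqrt(F_ij)  # fixed off-diagonal entry, so det = r*c - F_ij

print(is_psd_2x2(3.0, s, 2.5))  # r*c = 7.5 >= 6, so True
print(is_psd_2x2(2.0, s, 2.5))  # r*c = 5.0 <  6, so False
```

In the actual SDP, an off-the-shelf solver such as Mosek [26] enforces X ⪰ 0 directly; this snippet only demonstrates why the PSD constraint reproduces r_i·c_j ≥ F_{i,j} without any linearization error.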

Simulated Annealing (SA) vs. Particle Swarm Optimization (PSO)
Our approach is compared against [13], which uses PSO to map an application on a partially vertically-connected 3D mesh NoC with cores of different sizes. It is one of the most recent works on mapping in NoCs at the time of writing this paper; it accounts for cores of different areas but does not optimize area. To compare against this work, we use our cost function with the SA to map the video object plane decoder (VOPD) benchmark to a 3D-connected 4×2×2 NoC and the double video object plane decoder (DVOPD) benchmark to a 4×4×2 NoC with a varying number of vertical connections. The benchmark application graphs are from [23]. The other benchmarks from [23] are smaller, and thus a comparison is not useful, because both the PSO and the proposed heuristic using simulated annealing will find the global minimum in such a small design space in a short time. We chose an arbitrary but identical initial mapping for both benchmarks and both algorithms. We use 20 reruns for both the PSO and the simulated annealing so that both algorithms have approximately the same computation time budget. The parameters of the PSO are given by [13] (k1 = 1, k2 = 0.04, k3 = 0.02). The parameters for the simulated annealing are: initial temperature 30, cooling factor 0.97, 1000 iterations. Both [13] and our approach use the same objective function that minimizes bandwidth times communication hop distance. We disregard area because it is not used in [13] and therefore would skew the comparison. We vary the TSV count in the NoCs to vary the mapping difficulty. The results are shown in Table 1 for VOPD and in Table 2 for DVOPD. The proposed heuristic allows for up to 15% improved performance, with 2.564–3.125% better performance on average.

Linear vs. Nonlinear Model
We compare our linear and nonlinear models by generating results for the same inputs with both the LP and the SDP. We implement our models in MATLAB R2018a; they are available on GitHub [24]. The LPs use IBM CPLEX 12.8.0 [25] as the optimization engine. The SDPs use Mosek 8.1 [26]. We generate three random input benchmarks as in [3]. Iso-area cores are used for a fair comparison against conventional approaches:
1. A 3D SoC with two layers and five tiles, of which three tiles are in layer 1 and two tiles are in layer 2.
2. A 3D SoC with four layers and 10 tiles per layer connected by a 2×5 mesh NoC.
3. A 3D SoC with four layers and 20 tiles per layer connected by a 4×5 mesh NoC.
Cores are set to be 10 mm² large. Routers with five ports require 1 mm². The router area is linearly proportional to the port count, which depends on the position of the router in the network. TSV arrays,

Figure 3. Variables and constraints of area optimization.
Figure 4b. Reduced error through multiple linear approximations.