Delay Bound Optimization in NoC Using a Discrete Firefly Algorithm

The delay bound in a system on chip (SoC) represents the worst-case traversal time of on-chip communication. In network-on-chip (NoC)-based SoCs, optimizing the delay bound is challenging for two reasons: (1) the delay bound is hard to obtain by traditional methods such as simulation; (2) the delay bound changes with the application mapping. In this paper, we propose a delay bound optimization method using a discrete firefly algorithm (DBFA). First, we present a formal analytical delay bound model based on network calculus for both unipath and multipath routing in NoCs. We then set every flow in the application as the target flow and calculate its delay bound using the proposed model. Finally, we adopt the firefly algorithm (FA) as the optimization method for minimizing the delay bound. We used industry patterns (video object plane decoder (VOPD), multiwindow display (MWD), etc.) to verify the effectiveness of the delay bound optimization method. Experiments show that the proposed method is both effective and reliable, with a maximum optimization of 42.86%.


Introduction
In modern system-on-chip (SoC) design, average [1][2][3] and worst-case [4][5][6][7][8][9][10][11][12][13][14][15] performance are two essential metrics for the communication architecture. Quality-of-service (QoS) in a network on chip (NoC) represents the worst-case traversal time of on-chip communications. In NoC-based SoCs, optimizing the delay bound is challenging. Formal approaches have been proposed for delay bound modeling [16][17][18][19][20][21]. Wang et al. [16] used signal flow charts and signal time measures to analyze the upper bound of the transmission delay. Ren et al. [17] analyzed the communication delay bound for individual flows in a NoC using an improved asymmetric multichannel router structure. Lu et al. [19,20] modeled a classic input-queuing virtual channel router and a unified platform in Simulink based on xMAS to improve the accuracy of NoC performance analysis.
As an effective tool, network calculus has been applied in NoC performance analysis [21,22]. How to improve the tightness of a NoC performance model based on network calculus has become one of the important research directions. Saggio et al. [23] validated the effectiveness of network calculus for delay bound modeling. To improve the accuracy of a NoC delay bound, Zhao et al. [24] used simulated annealing (SA) to automatically calculate the simulation parameters. They also extended the method from delay bound to a backlog bound tightness study [25], where SA was replaced by an adaptive simulated annealing (ASA) algorithm, with higher efficiency and precision.
The works above show that heuristic approaches can be effective tools for worst-case performance analysis, which motivated us to improve the delay bound tightness under different mappings. Rather than optimizing a specific configuration, we focused on optimizing the delay bound for a specific application mapping. We observed that different mapping schemes yield different delay bounds for NoC traffic flows: when the mapping scheme changes, the contention scenarios also change. However, delay bound optimization is challenging not only because such problems are hard, but also because the model must be both accurate and fast enough for solution exploration.
The firefly algorithm (FA) was inspired by the flashing behavior of fireflies and was first put forward by Xin-She Yang [26]. In recent years, FA has been widely used to optimize NoC design in design spaces [27][28][29]. In this paper, we propose a delay bound optimization method using firefly algorithms (DBFA). Instead of improving the simulation parameter accuracy and analyzing network contention, DBFA attempts to find the optimal mapping scheme directly. The experimental results show that the mapping scheme determined by FA is closer to the optimal mapping scheme than that of discrete particle swarm optimization (DPSO) [30]. Our contributions are summarized as follows:
• We modeled the worst-case delay bound in network on chip using network calculus. It is suitable for both unipath and multipath routing NoCs.
• We adopted FA to optimize the end-to-end delay bound. FA shows high efficiency in many other fields, such as scheduling. To the best of our knowledge, this is the first work using FA for delay bound optimization.
• We performed extensive experiments using both synthetic and industry patterns. We also integrated our delay bound model into DPSO, a state-of-the-art algorithm, for performance comparison.

Delay Bound Analysis Using Network Calculus
Simulation experiments [31] can easily obtain latency and communication costs, but they can hardly provide a delay bound. Network calculus fills this gap. In Figure 1, R(t) and R*(t) represent the actual arrival curve and service curve of the NoC, respectively. A linear function α(t) covers the maximum rate of R(t); β(t) represents the minimum service rate of R*(t). Both are defined as follows:

$$\alpha(t) = r t + b, \qquad \beta(t) = R\,(t - T)^{+}$$

where r is the sustainable arrival rate, b is the burstiness, R is the minimum service rate, T is the maximum processing latency, and $(x)^{+} = \max(x, 0)$. In particular, R is usually greater than r to ensure no packet dropping in the NoC. The delay bound $\bar{D}$ of a flow $f_i$ is calculated by finding the greatest horizontal distance h(α, β) between its arrival curve α(t) and the system equivalent service curve β(t). Hence the delay bound is derived as follows [9]:

$$\bar{D} = h(\alpha, \beta) = T + \frac{b}{R}$$

For clarity, all the symbols are listed in Table 1.

Table 1.
Symbol | Description
 | The equivalent service curve of flow f(i, j)
h(., .) | The function to compute the maximum horizontal distance between the arrival curve and the service curve
∈(., .) | The function to compute the equivalent service curve
inf{a, b} | The minimum of a and b
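As a minimal sketch, assuming the standard affine arrival curve α(t) = rt + b and rate-latency service curve β(t) = R·max(t − T, 0) that match the parameter definitions above, the horizontal distance has the closed form T + b/R whenever r ≤ R. The function names and the numeric values below are illustrative assumptions, not taken from the paper; the second function is only a brute-force cross-check of the closed form.

```python
def delay_bound(b: float, r: float, R: float, T: float) -> float:
    """Closed-form horizontal distance h(alpha, beta).

    Assumes alpha(t) = r*t + b and beta(t) = R*max(t - T, 0) with r <= R.
    """
    if R <= 0 or r > R:
        raise ValueError("a finite delay bound needs R > 0 and r <= R")
    return T + b / R


def horizontal_distance_numeric(b: float, r: float, R: float, T: float,
                                t_max: float = 50.0, steps: int = 5000) -> float:
    """Brute-force check: largest horizontal gap between the two curves."""
    worst = 0.0
    for k in range(steps + 1):
        t = t_max * k / steps
        # Smallest d with beta(t + d) = alpha(t):  R*(t + d - T) = r*t + b
        d = T + (r * t + b) / R - t
        worst = max(worst, d)
    return worst


if __name__ == "__main__":
    # Illustrative parameters (assumed): burst of 4 flits, rate 1, service rate 2, latency 3.
    print(delay_bound(b=4.0, r=1.0, R=2.0, T=3.0))
    print(horizontal_distance_numeric(b=4.0, r=1.0, R=2.0, T=3.0))
```

The worst case occurs at t = 0, when the whole burst b arrives at once and has to wait for the latency T before being served at rate R.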

Problem Formulation
We first present a general mapping process through the following definitions to optimize the total communication cost.
To map an application characteristic graph G(V, E) to a NoC topology graph P(U, F), the mapping function map() should satisfy the following constraints:

$$map: V \rightarrow U; \qquad \forall v_i \in V:\ map(v_i) \in U; \qquad \forall v_i \neq v_j \in V:\ map(v_i) \neq map(v_j); \qquad |V| \le |U|$$

The smaller the delay bound, the better the mapping result. In this paper, we assume the NoC topology is a mesh. The delay bound of a mapping is then calculated by the following equation:

$$DelayBound = \max\{D_1, D_2, \ldots, D_n\}$$

where $D_n$ is the delay bound of each target flow, whose calculation is shown in the rest of this section. Thus, the optimization target is min DelayBound.
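The formulation can be sketched in a few lines, assuming the usual one-to-one mapping constraints (every task is placed on an existing node and no two tasks share a node) and taking the delay bound of a mapping as the maximum over its per-flow bounds, as Step 4 of the delay bound model describes. The names `is_valid_mapping` and `mapping_delay_bound`, the task-to-node dictionary encoding, and the example numbers are assumptions made for illustration.

```python
from typing import Dict, Iterable


def is_valid_mapping(mapping: Dict[int, int], num_nodes: int) -> bool:
    """Check a task -> node mapping (0-based node ids, assumed encoding)."""
    nodes = list(mapping.values())
    in_range = all(0 <= n < num_nodes for n in nodes)   # map(v) is a node of P
    one_to_one = len(set(nodes)) == len(nodes)          # no node hosts two tasks
    return in_range and one_to_one and len(nodes) <= num_nodes


def mapping_delay_bound(flow_bounds: Iterable[float]) -> float:
    """Delay bound of a mapping: the worst per-flow delay bound."""
    return max(flow_bounds)


if __name__ == "__main__":
    mapping = {0: 5, 1: 15, 2: 0, 3: 11}      # hypothetical 4-task mapping on a 4 x 4 mesh
    print(is_valid_mapping(mapping, num_nodes=16))
    print(mapping_delay_bound([12.0, 30.5, 18.0]))  # hypothetical per-flow bounds
```

An optimizer then searches over valid mappings for the one whose `mapping_delay_bound` is smallest.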

Delay Bound Model
We present the delay bound calculation model in Algorithm 1. The key idea of the delay bound analysis is to obtain the end-to-end equivalent service curve (ESC) by considering all kinds of resource sharing, as shown in lines 3-13 and summarized in four steps.
Step 1: Use the function ClassifyConFlow() to classify the flow contentions into a unified representation $f_{[a,b]}$, where a and b are the traffic injection and output router nodes, respectively. These flows can be a single flow or an aggregate flow consisting of several contention flows; i.e., both unipath and multipath routing are supported by this model.
Step 2: Calculate the arrival curve of $f_{[a,b]}$ at node a.
Step 3: Calculate the equivalent service curve for the target flow and obtain the delay bound.
Step 4: Repeat Steps 1-3 to find the delay bound of every single flow and select the maximum value as the delay bound of the current mapping.
Algorithm 1 (lines 11-15):
11 if CrossContention
12     Cut all cross-contention flows // Treatment of the cross-contention situation.
13     ACSet = AC_CrossCon(f[a,b], ACSet) // Calculate the arrival curve of the cut flows and combine flows of the same type.
14 else
15 // Obtain the worst delay bound for one scheme.
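The idea of an end-to-end equivalent service curve can be illustrated with two textbook network calculus results: the leftover service a router offers a flow after a contending token-bucket flow is served (arbitrary multiplexing), and the concatenation of rate-latency curves along a path. The sketch below is not the paper's Algorithm 1. It assumes rate-latency routers, assumes the contending flows' arrival curves at each router are already known, and ignores the cross-contention cutting step; all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RateLatency:      # service curve beta(t) = R * max(t - T, 0)
    R: float
    T: float


@dataclass(frozen=True)
class TokenBucket:      # arrival curve alpha(t) = r * t + b
    r: float
    b: float


def leftover(beta: RateLatency, cross: TokenBucket) -> RateLatency:
    """[beta - alpha_cross]^+ is again rate-latency: rate R - r, latency (R*T + b)/(R - r)."""
    if cross.r >= beta.R:
        raise ValueError("contending flow saturates the router")
    rate = beta.R - cross.r
    return RateLatency(rate, (beta.R * beta.T + cross.b) / rate)


def concatenate(curves: List[RateLatency]) -> RateLatency:
    """Min-plus convolution of rate-latency curves: min rate, summed latency."""
    return RateLatency(min(c.R for c in curves), sum(c.T for c in curves))


def end_to_end_delay_bound(target: TokenBucket,
                           hops: List[Tuple[RateLatency, List[TokenBucket]]]) -> float:
    """hops: per router, its service curve and the flows contending with the target."""
    per_hop = []
    for beta, contenders in hops:
        for cross in contenders:
            beta = leftover(beta, cross)
        per_hop.append(beta)
    esc = concatenate(per_hop)               # equivalent service curve of the path
    if target.r > esc.R:
        raise ValueError("target flow exceeds the equivalent service rate")
    return esc.T + target.b / esc.R          # horizontal distance h(alpha, ESC)


if __name__ == "__main__":
    router = RateLatency(R=1.0, T=2.0)                       # assumed: 1 flit/cycle, 2-cycle latency
    path = [(router, [TokenBucket(r=0.25, b=2.0)]), (router, [])]
    print(end_to_end_delay_bound(TokenBucket(r=0.25, b=4.0), path))
```

Repeating this for every flow as the target flow and taking the maximum gives the delay bound of one mapping, in the spirit of Step 4.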

Distance Calculation
The distance is calculated as follows:

$$s_{mn} = \sum_{i=0}^{T_{num}-1} \left( \left| x_i^m - x_i^n \right| + \left| y_i^m - y_i^n \right| \right) \qquad (7)$$

where $s_{mn}$ is the Manhattan distance between fireflies $X_m$ and $X_n$, $(x_i^m, y_i^m)$ represents the coordinates of task i in the mapping scheme of firefly m, $(x_i^n, y_i^n)$ represents the coordinates of task i in the mapping scheme of firefly n, and $T_{num}$ is the total number of tasks.
To bring the distance and the absorption coefficient to the same magnitude, we apply min-max normalization to the distance between two fireflies:

$$S_{mn} = \frac{s_{mn} - s_{min}}{s_{max} - s_{min}}$$

where $s_{min}$ is the minimum distance between two fireflies and $s_{max}$ is the maximum distance for the target firefly. To make this clearer, take Figure 2 as an example. Figure 2a,b shows two fireflies, $X_m$ and $X_n$, respectively. For firefly $X_m$, tasks 1, 3, and 5 are mapped to (4, 4) (node F), (3, 4) (node B), and (2, 4) (node 7), respectively. For $X_n$, the positions of these three tasks are (2, 1) (node 4), (2, 2) (node 5), and (2, 3) (node 6), respectively. According to Equation (7), we can calculate

$$s_1 = |4-2| + |4-1| = 5, \qquad s_3 = |3-2| + |4-2| = 3, \qquad s_5 = |2-2| + |4-3| = 1$$

The positions of tasks 0, 2, and 4 do not change, so $s_0$, $s_2$, and $s_4$ are all 0. Finally, the distance between fireflies $X_m$ and $X_n$ is $s_{mn} = 0 + 5 + 0 + 3 + 0 + 1 = 9$. Figure 2c shows the theoretical maximum distance for firefly $X_m$, which does not occur in practice. The maximum distance is an ideal upper bound, defined by mapping each task to its theoretically farthest corner node; e.g., tasks 0 and 2 are mapped to node 15. In this example, $s_{max} = 6 + 5 + 5 + 5 + 5 + 6 = 32$. As a result, with $s_{min} = 0$, the normalized distance between fireflies $X_m$ and $X_n$ is $S_{mn} = 9/32 = 0.28125$.
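The worked example can be reproduced with a short sketch that sums per-task Manhattan distances and then applies min-max normalization. The coordinates of tasks 1, 3, and 5 come from the example; the coordinates used for tasks 0, 2, and 4 are placeholders (the text only states that they are identical in both fireflies), and $s_{max} = 32$ and $s_{min} = 0$ are taken as inputs rather than recomputed. Function names are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def manhattan_distance(x_m: List[Coord], x_n: List[Coord]) -> int:
    """Sum over tasks of |x_i^m - x_i^n| + |y_i^m - y_i^n|; index = task id."""
    return sum(abs(ax - bx) + abs(ay - by)
               for (ax, ay), (bx, by) in zip(x_m, x_n))


def normalize(s: float, s_min: float, s_max: float) -> float:
    """Min-max normalization of a firefly distance."""
    return (s - s_min) / (s_max - s_min)


# Tasks 0, 2, 4: placeholder coordinates (assumed), identical in both fireflies.
# Tasks 1, 3, 5: coordinates from the Figure 2 example.
X_M = [(1, 1), (4, 4), (1, 2), (3, 4), (1, 3), (2, 4)]
X_N = [(1, 1), (2, 1), (1, 2), (2, 2), (1, 3), (2, 3)]

if __name__ == "__main__":
    s_mn = manhattan_distance(X_M, X_N)
    print(s_mn, normalize(s_mn, s_min=0, s_max=32))
```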

Refreshing of Firefly Locations
The original firefly location refreshing formula [26] is as follows:

$$x_i(t+1) = x_i(t) + \beta_0 e^{-\gamma r_{ij}^2} \left( x_j(t) - x_i(t) \right) + \alpha \left( rand - \tfrac{1}{2} \right)$$

In our approach, we rewrite the firefly refreshing formula so that it consists of the following two parts:
1. β movement. Fireflies refresh because of the attractiveness between any two fireflies; it is related to the attractiveness β(r), so we call it β movement.
2. α movement. Fireflies refresh because of the random movement; it is related to the maximal random step α, so we call it α movement.
Each firefly moves towards brighter fireflies (mapping schemes with a smaller delay bound) through β movement. Afterwards, each firefly moves randomly through α movement to find a better mapping scheme. The α movement rule differs between non-optimal fireflies and the optimal firefly, so we call the former α1 movement and the latter α2 movement.
1. α1 movement. To carry out the α1 movement of firefly $X_i(t_i)$ after β movement, we define a set w that records the positions at which $X_i(t_i)$ and $X_j(t_j)$ hold different elements. We then choose α positions from w as exchange positions and exchange each pair of elements with probability p to finish the α1 movement.
2. α2 movement. The α2 movement prevents the optimal firefly from falling into a local optimum and increases the exploring capability of DBFA. We choose α positions in the local optimum mapping scheme and exchange each pair of elements with probability q to finish the α2 movement. The probability q is randomly generated from a uniform distribution.
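The two movements can be sketched on a permutation-style encoding. The sketch assumes that a firefly is a list indexed by NoC node holding a task id, with −1 marking an empty node, and that "exchanging each two elements" means swapping consecutive pairs of the chosen positions; both are assumptions made for illustration, as are the function names. The β movement swaps elements of one firefly so that, position by position, it agrees with the brighter firefly with probability β; the α movement swaps elements at randomly chosen positions with a given probability.

```python
import random
from typing import List

EMPTY = -1  # assumed marker for a node with no task


def beta_move(x_i: List[int], x_j: List[int], beta: float, rng: random.Random) -> List[int]:
    """Pull x_i towards the brighter x_j: fix each differing position with probability beta."""
    x = list(x_i)
    for k, wanted in enumerate(x_j):
        if wanted == EMPTY or x[k] == wanted:
            continue                      # empty nodes are ignored; equal positions need no move
        if rng.random() < beta:
            src = x.index(wanted)         # where x currently holds the wanted task
            x[k], x[src] = x[src], x[k]   # swap keeps the encoding a valid mapping
    return x


def differing_positions(x_i: List[int], x_j: List[int]) -> List[int]:
    """The set w used by the alpha-1 movement."""
    return [k for k in range(len(x_i)) if x_i[k] != x_j[k]]


def alpha_move(x: List[int], candidates: List[int], alpha: int,
               prob: float, rng: random.Random) -> List[int]:
    """Pick up to alpha positions and swap consecutive pairs, each with probability prob."""
    out = list(x)
    chosen = rng.sample(candidates, min(alpha, len(candidates)))
    for a, b in zip(chosen, chosen[1:]):
        if rng.random() < prob:
            out[a], out[b] = out[b], out[a]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    x_m = [3, 1, EMPTY, 0, 2, EMPTY]      # hypothetical fireflies on a 6-node NoC
    x_n = [1, 3, 0, EMPTY, EMPTY, 2]
    moved = beta_move(x_m, x_n, beta=0.97, rng=rng)
    # alpha-1: candidates are the positions that still differ from x_n
    print(alpha_move(moved, differing_positions(moved, x_n), alpha=3, prob=0.5, rng=rng))
    # alpha-2: candidates are all positions of the current best firefly
    print(alpha_move(x_n, list(range(len(x_n))), alpha=3, prob=rng.random(), rng=rng))
```

Because every update is a swap, each firefly remains a valid one-to-one mapping after any sequence of movements.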

Pseudo Code of Firefly Algorithm
The pseudo code of DBFA is shown in Algorithm 2. It covers all the steps described above; i.e., defining fireflies, calculating distances, refreshing locations, and optimizing the delay bound.
Regarding algorithm complexity, if there are n fireflies in the colony, executing Algorithm 2 yields one local optimum firefly in every generation. For each of the other fireflies, the calculation in lines 5-14 is carried out (n − 1) times. Thus, the number of iterations over all fireflies is $(n-1)^2$, and the complexity of the FA is $O((n-1)^2)$. Therefore, the complexity of the whole DBFA program is $O\left(m (n-1)^2 \left(\frac{j (j+1)}{2}\right)^{l}\right)$.
Algorithm 2 (lines 9-14):
9 for all i < Fnum
10 {
11     S = ComputeDistance(X_ij) // Compute the distance between X_i(t_i) and X_j(t_j).
12     if 1
13     {
14         Compute the attractiveness between X_i(t_i) and X_j(t_j).

Example of the Delay Bound Optimizing Process
To further illustrate the movement procedure, we again take Figure 2 as an example. We calculated the delay bound of firefly $X_m$ in Section 3 as $\bar{D}_m = 1139.833$. Using the same method, we obtain the delay bound of firefly $X_n$ as $\bar{D}_n = 366$. The distance between $X_m$ and $X_n$ is $S_{mn} = 0.28125$, and we assign γ = 0.3, which satisfies the moving condition

$$\frac{1}{1139.833} < \frac{1}{366} \times \exp(-0.3 \times 0.28125)$$

Therefore, firefly $X_m$ moves towards $X_n$. The attractiveness between fireflies is calculated with the following formula:

$$\beta(r) = \beta_0 e^{-\gamma r^2}$$

We define $\beta_0 = 1$, so the attractiveness between $X_m$ and $X_n$ is $\beta = 1 \times e^{-0.3 \times 0.28125^2} \approx 0.98$. As Figure 3 shows, compared with the mapping scheme of $X_n$, there are three different positions in $X_m$ (disregarding the positions with the value −1). The first different value is 1, so we look for the location of 1 in $X_m$. With probability β, we exchange the value at the last position with the value at the fifth position. We repeat this step until all the different positions hold the same values. This completes the β movement.
The firefly $X_m$, having gone through the β movement, then completes the α1 movement. This step ensures that the firefly moves towards a brighter firefly (one with a lower delay bound). We supposed firefly $X_n$ was the local optimum firefly, so we made it move according to the α2 movement rules. As Figure 4 shows, we randomly chose α positions in the local optimum firefly and randomly produced a change probability q. In this case, we supposed q = 0.57 and α = 3, with the three positions being <1,5,8>, and the values were changed in turn with probability 0.57. This completes the α2 movement.
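The two numeric checks in this example can be written down directly. The sketch assumes that brightness is the reciprocal of the delay bound and that the moving condition generalizes the numeric instance in the text to 1/D_i < (1/D_j)·exp(−γS); the attractiveness uses the standard FA form β0·exp(−γr²). The function names are hypothetical, and the distance S = 9/32 is the normalized distance from the Figure 2 distance example.

```python
import math


def should_move(d_i: float, d_j: float, gamma: float, distance: float) -> bool:
    """Firefly i moves towards j if i is dimmer than j's distance-attenuated brightness.

    Brightness is taken as 1 / delay bound (assumed from the worked example).
    """
    return 1.0 / d_i < (1.0 / d_j) * math.exp(-gamma * distance)


def attractiveness(beta0: float, gamma: float, distance: float) -> float:
    """Standard firefly attractiveness beta(r) = beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * distance ** 2)


if __name__ == "__main__":
    S_mn = 9 / 32                      # normalized distance from the Figure 2 example
    print(should_move(1139.833, 366.0, gamma=0.3, distance=S_mn))
    print(attractiveness(beta0=1.0, gamma=0.3, distance=S_mn))
```

Since the normalized distance lies in [0, 1], the attractiveness stays close to β0 for moderate γ, so most differing positions are exchanged during a β movement.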

Experiments and Results
We performed experiments for the following three purposes: (1) proving the effectiveness of DBFA in delay bound optimization, (2) comparing the results with state-of-the-art work DPSO, and (3) verifying the tightness of DBFA compared to a simulation.

Setting Up
We mapped several applications to a mesh NoC to test the reliability of our method. The characteristic graphs of industry patterns [30] such as picture-in-picture (PIP) [32], multiwindow display (MWD) [33], 263DEC MP3DEC [34], MP3ENC MP3DEC [34], MP3ENC MP3DEC [34], video object plane decoder (VOPD) [35], and DVOPD [36] are shown in Figures 5, 6a, and 7. The mesh NoC scale was 4 × 4, and the whole process was simulated in C++ and run on Ubuntu 12.04. The experimental parameters are shown in Table 2. Fnum represents the total number of fireflies in the firefly group and GMax represents the maximum number of iterations. γ and α represent the absorption coefficient and the maximal random step, respectively.

Experiment Results
VOPD has 21 flows and is the largest and most complex characteristic graph among the original patterns, so we take it as an example to demonstrate the optimization performance of DBFA. The mapping schemes obtained using the RAND, DPSO, and DBFA methods are shown in Figure 6b-d, where circles represent tasks and rectangles represent network nodes. The mapping schemes obtained by the three methods have delay bounds of 55, 46, and 42, respectively.
The minimum worst-case delay bound in every generation is shown in Figure 8. At the beginning, the worst-case delay bound was 55 cycles; after 400 iterations of optimization, the delay bound was 42 cycles, a reduction of 23.64%. Compared with DPSO, DBFA can avoid falling into a local optimum. DBFA is designed for NoC: by introducing α movement and β movement, it identifies and jumps out of local optimum traps. This is an important reason why DBFA is more efficient than DPSO. The delay bound of each flow is shown in Figure 9.
To strengthen the comparison, we added DVOPD to the original six industry patterns. The scale of DVOPD is much larger than that of any other pattern. The experimental results are shown in Figure 10. Although the scale of the NoC is greatly expanded, DBFA still outperforms DPSO.
For the other applications, the optimized results of every flow are shown in Figures 11-15, and the largest delay is the delay bound. One case needs special explanation: for the application PIP, DBFA achieves almost no optimization because PIP contains eight tasks and eight cores, seven of which are the same. The application is so simple that there is little room for optimization.

Scalability Analysis
The seven industry patterns in the experiments can be divided into small-scale, medium-scale, and large-scale applications. Specifically, the NoC scale for PIP is only 4 × 2, the smallest among all patterns, and the NoC scale for DVOPD is 8 × 4, the largest among all patterns. We calculated the delay bound of PIP in NoCs of different scales. The experimental results are shown in Figure 16. The data show that, although the scale of the NoC varies greatly, the delay bound is almost unchanged. This shows that DBFA and DPSO are highly stable in delay bound optimization.

Figure 16. Comparison of delay bounds of PIP in networks on chips (NoCs) of different scales.

Running Time of CPU
We also studied the feasibility of DBFA. We mapped these applications onto NoCs of varying scales and measured the total CPU (Intel i5-8400) running time, from initialization until the optimal mapping scheme was obtained. The results are shown in Table 3: as the number of flows increases, the CPU running time increases by a slight to moderate amount. The optimization time of DBFA is lower than that of DPSO, which shows that DBFA is more efficient at optimizing delay bounds.

The Comparison of Optimized and Simulation Results
To verify the validity of DBFA, we performed simulations and compared the optimized analytical results with the simulation results, using the application MWD. In the simulations, we used Verilog to design a 4 × 4 NoC; the global clock was 50 MHz and each router node handled one flit in two cycles, so the maximum delay of data in the network was four cycles. The experimental results are shown in Figure 17. This figure confirms the validity of DBFA for optimizing the delay bound. For several flows (flow 8 and flow 11), the difference between the theoretical results and the simulation results was minor, which also indicates the tightness of the analytical results. It is also important to point out that for some flows, such as flow 5 and flow 10, there was a large gap. This is partly because the simulation time and flow contention were not fully explored during the simulation.

Conclusions
Optimizing a delay bound in NoC is both important and hard. When the application mapping changes, the contentions between flows also change, which results in a different delay bound. In this paper, we first derived an analytical model for end-to-end flows in NoC, which automatically computes the delay bound of the target flow for a specified mapping. Then, we proposed a firefly algorithm for application mapping, with delay bound minimization as the optimization objective. We call this framework DBFA. Experiments showed that the proposed DBFA not only optimizes the delay bound for a specified application, with an optimization rate of up to 42.86%, but also has a short running time and tight analytical results.