A Machine Learning Mapping Algorithm for NoC Optimization

Network on chip (NoC) is a promising solution to the challenge of multi-core System-on-Chip (SoC) communication design. Application mapping is the first and most important step in the NoC synthesis flow and determines most of the NoC design performance. NoC mapping has been confirmed to be an NP-hard problem, which cannot be solved exactly in polynomial time. Various heuristic mapping algorithms have been applied to the mapping problem. However, heuristic algorithms easily fall into local optimal solutions, which causes performance loss. Additionally, regular NoC topologies, such as the ring and torus, may generate symmetric solutions in the mapping process, which increases the performance loss. Machine learning involves data-driven methods to analyze trends, find relationships, and develop models that make predictions based on datasets. In this paper, an NoC machine learning mapping algorithm is proposed to solve the mapping problem. A low-complexity, symmetry-free NoC mapping dataset is defined, and a data augmentation approach is proposed to build the dataset. With the dataset defined, a multi-label machine learning model is established. The simulation results confirm that the proposed machine learning mapping algorithm achieves at least 99.6% model accuracy and an average of 96.3% mapping accuracy.


Introduction
With the increase in integrated circuit (IC) power consumption over the past few decades, some ICs finally reached their fundamental thermal limit at the beginning of the past decade [1]. System on chip (SoC) design has therefore shifted from high-performance single-core designs to multi-core designs [2], which brings a communication challenge. Network on chip (NoC) is one of the mainstream solutions for multi-core SoC communication design [3,4]. NoC is an on-chip communication network based on packet switching [5], which consists of Resource Network Interfaces (RNIs), routers, and interconnecting links [6].
NoC brings more communication bandwidth but also increases communication cost. At the 100 nm technology node, the power consumed by communication between IP (Intellectual Property) cores exceeds 30% of the total power consumption, and this share grows with more advanced technology and higher die integration [3,4]. Communication delay and power consumption are therefore two important aspects of NoC design [7].
Application mapping is the first and most important step in the NoC design flow [7][8][9]. A good mapping solution leads to a low communication cost. NoC application mapping has been proven to be an NP-hard problem. Consequently, no exact algorithm is expected to solve the problem in polynomial time, and even small instances may require considerable computation time [7,10,11].
Heuristic methods are a natural and useful way to obtain high-quality local optimal solutions [12,13]. Various heuristic mapping algorithms have been proposed to solve the application mapping problem, such as NMAP (Near-optimal Mapping), PSO (Particle Swarm Optimization), SA (Simulated Annealing), GA (Genetic Algorithm), and ACO (Ant Colony Optimization) [14][15][16][17][18][19][20][21][22]. However, heuristic mapping techniques with limited computation resources easily fall into local optimal solutions, which causes performance loss [23][24][25]. Additionally, the symmetry of NoC mapping solutions may degrade the performance of heuristic NoC mapping algorithms, which increases the performance loss.
Machine learning involves data-driven methods to analyze trends, find relationships, and develop models that make predictions based on datasets [24,25]. Various studies applying machine learning to NoC have been reported [26][27][28][29][30]. Refs. [26,27] use machine learning to predict the performance of NoC mappings, and Refs. [28][29][30] use machine learning to explore the NoC design space. More and more research applies machine learning technology to network on chip.
In this paper, an NoC machine learning mapping algorithm is proposed to solve the mapping problem. The space complexity of the existing mapping solution description [5,26,31] is too large for a machine learning model to fit with few samples. A low-complexity NoC mapping dataset is therefore defined, and a data augmentation approach is proposed. Based on the characteristics of the defined dataset, a multi-label machine learning model is established.

The Methods of Establishing Machine Learning Mapping Algorithm
In this section, a machine learning mapping algorithm is presented. Firstly, an NoC mapping performance model is established. Secondly, the symmetry and space complexity of mapping solutions are analyzed. Thirdly, the process of constructing the mapping dataset is detailed. Finally, a multi-label machine learning model is established.

Problem Formulation
NoC mapping aims to assign application cores to NoC tiles so as to minimize energy consumption and communication delay. Referring to [15,32], the NoC mapping problem is formulated with the following definitions.

Characteristics Definition
"Definition 1: The Application Characteristic Graph (APCG). The application task is modeled as a directed graph G(C, A), where each vertex ci ∈ C represents an IP core, each edge aij ∈ A represents the communication between ci and cj, and the weight Vij of each edge indicates the communication volume on edge aij [32]". The APCG is shown in Figure 1a.
"Definition 2: The Architecture Characterization Graph (ARCG). The NoC architecture is modeled as a directed graph G(R, L), in which each node represents a router ri ∈ R and each edge lij ∈ L represents a physical link between routers [32]". The ARCG is shown in Figure 1b.
"Definition 3: The Channel Communication Graph (CHCG). The NoC architecture is modeled as a directed graph G(R, L), in which each node represents a router ri ∈ R and each edge weight Lij ∈ L represents the channel load on the link between routers ri and rj after application mapping; this load is calculated from the application mapping result and the routing algorithm, and is constrained by the physical link lij and its bandwidth Bwij [32]". The CHCG is shown in Figure 1c.

Energy Model
In order to formulate the NoC energy consumption, the one-bit energy model [25] is introduced, where ri denotes the i-th router. The energy of sending one bit of data from ri to rj is Eij, which is formulated by (1).
where ELbit represents the energy consumed by transmitting one bit of data through a link and ERbit represents the energy consumed by transmitting one bit of data through a router; ERbit includes the buffer and switch energy consumption. Hops is the Manhattan distance from the sending node (xi, yi) to the receiving node (xj, yj) and is given by (2).
Here, bij is the number of bits sent from ri to rj. Referring to [5], the energy consumption can be calculated with the bit energy values for the link, switch, read buffer, and write buffer taken as 0.449 pJ, 0.284 pJ, 1.056 pJ, and 2.831 pJ, respectively.
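Equations (1)-(3) are not reproduced in the extracted text. As an illustration only, the sketch below assumes the standard bit-energy model (a bit crossing `hops` links passes through `hops + 1` routers), consistent with the definitions of ELbit, ERbit, and Hops above; the exact form used in the paper may differ.

```python
# Hedged sketch of the one-bit energy model; Eqs. (1)-(3) are missing from
# the extracted text, so the (hops + 1) router term is an assumption.
E_LBIT = 0.449                   # pJ per bit per link (from the text)
E_RBIT = 0.284 + 1.056 + 2.831   # pJ per bit per router: switch + read/write buffers

def manhattan_hops(xi, yi, xj, yj):
    """Eq. (2): Manhattan distance between mesh tiles (xi, yi) and (xj, yj)."""
    return abs(xi - xj) + abs(yi - yj)

def one_bit_energy(xi, yi, xj, yj):
    """Assumed form of Eq. (1): energy to move one bit from r_i to r_j."""
    hops = manhattan_hops(xi, yi, xj, yj)
    return (hops + 1) * E_RBIT + hops * E_LBIT

def comm_energy(bits, xi, yi, xj, yj):
    """Assumed form of Eq. (3): total energy for b_ij bits from r_i to r_j."""
    return bits * one_bit_energy(xi, yi, xj, yj)
```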

Delay Model
The network uses wormhole switching. The communication latency of the NoC can be estimated by the average network delay (Tav) [27], which is introduced as (4).
where Tl and Tr are the delays from links and routers. Routers transfer information in data packets, each composed of several flits. λi,j represents the number of flits sent from ci to cj, n represents the number of Processing Elements (PEs), and ci denotes the i-th core in the NoC topology. At the transfer level, the latency of the global links can be calculated by (5).
where i and j are the indexes of the sending node and the receiving node, Ci,j is the minimum number of links required to send packets from the sending node to the receiving node, and ut is a unit of time.
The latency of an m-port queuing router has been established in [29] as (6).
where Taq is the average number of time steps spent by a flit in the queue. Taq has been modeled in [5], as shown in (7).
where Bav is the average queue size and N0 is the throughput (packets/time step), both of which can be calculated as in [5,26]. The total router delay is formulated as (8).

Optimization Model
Energy and delay are two important performance optimization targets. In order to meet the requirements of different designs, the performance cost function can be expressed as (9).
where α ∈ [0, 1] is a weight parameter, and NEc and NTl are the normalized energy and delay, respectively.
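Equation (9) is not reproduced in the extracted text; from the stated definitions of α, NEc, and NTl, the usual weighted-sum form would be:

```latex
\mathrm{Cost} = \alpha \cdot NE_{c} + (1-\alpha) \cdot NT_{l}, \qquad \alpha \in [0,1]
```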

Details of Machine Learning Mapping Algorithm
In this section, the space complexity and symmetry of NoC mapping solutions are analyzed. Based on the analysis, a low-complexity, symmetry-free NoC mapping dataset is proposed. Considering the difficulty of globally searching the NoC mapping problem, a data augmentation approach is proposed to generate samples from a globally searched sample.

Space Complexity and Symmetry
In existing studies, an NoC mapping solution is defined as a sequence. For example, Figure 2 shows the VOPD (Video Object Plane Decoder) NoC mapping problem, in which the VOPD application is mapped onto a 4 × 4 mesh NoC. One of its solutions is described as [1,5,10,11,4,6,9,12,2,3,7,8,16,15,14,13]. The space complexity of the VOPD mapping solution sequence is 16!, which means there are 16! mapping solutions. Mapping algorithms are designed to search this solution space and find the optimal one. The space complexity of an n-core mapping solution sequence is n!, which is too large for a machine learning model to fit with a few-sample dataset.

A Low-Complexity and No Symmetry NoC Mapping Dataset
NoC mapping is the problem of finding an appropriate mapping of an application onto an NoC topology. The inputs are the application task graph and the network topology information, and the output is the mapping sequence or the corresponding mapping solution. The NoC mapping performance can then be computed from the mapping sequence and the network information.
The space complexity of the mapping sequence is too large, which means the problem would require a very large machine learning dataset.
In order to address both the space complexity and the symmetry of the mapping sequence, the point-to-point task matrix is defined as the input data of the dataset, and the column vectors of the nearest neighbor matrix are defined as the output data of the dataset.
The Point-to-Point Task Matrix: The task graph is modeled as a matrix, where ci ∈ C represents intellectual property (IP) core i and each element Vij ∈ V represents the communication volume between ci and cj, as Figure 4 shows. The Nearest Neighbor Matrix: The mapping solution is modeled as a matrix, where ci ∈ C represents IP core i and each element dij ∈ D represents the connection relation between ci and cj: if ci is connected with cj, then dij equals 1; otherwise, dij equals 0. Figure 5 shows the nearest neighbor matrix of a ring NoC mapping solution.
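As an illustration of the nearest neighbor matrix, the following sketch builds D for a hypothetical 4-core ring; the adjacency list and the mapping are invented for the example (the paper itself uses 9-core networks).

```python
import numpy as np

def nearest_neighbor_matrix(mapping, topology_edges, n):
    """Build the n-by-n nearest neighbor matrix D from a mapping sequence.

    mapping[t] is the core placed on tile t (0-based); topology_edges lists
    the tile pairs that are physically adjacent. d_ij = 1 iff cores i and j
    sit on adjacent tiles.
    """
    D = np.zeros((n, n), dtype=int)
    for t_a, t_b in topology_edges:
        ca, cb = mapping[t_a], mapping[t_b]
        D[ca, cb] = D[cb, ca] = 1
    return D

# Hypothetical 4-core ring: tiles connected 0-1-2-3-0.
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
D = nearest_neighbor_matrix([2, 0, 3, 1], ring_edges, 4)
```

In a ring, every core ends up with exactly two neighbors, so each row and column of D sums to 2.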

Through the machine learning method, the column vector of each core is predicted. After combining the vectors of all cores, the mapping sequence can be calculated and the mapping performance obtained from the network information.
If the NoC has n cores, the space complexity of the sequence is n! and the space complexity of the nearest neighbor matrix is n · C(n, m), where m is the port count of each core in the NoC.
To further reduce the complexity, each core's column vector is used as the output data, so the space complexity per core is C(n, m): in the ring NoC it is C(n, 2), and in the torus topology it is C(n, 4). The space complexity of the nearest neighbor matrix is thus much smaller than that of the sequence.
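The gap between the two representations can be checked numerically for the n = 9 networks used later in the paper:

```python
# Solution-space sizes for n = 9 cores: full mapping-sequence space versus
# the per-core column-vector space of the nearest neighbor matrix
# (choose m neighbors out of n).
from math import comb, factorial

n = 9
seq_space = factorial(n)   # mapping sequences: n! = 362,880
ring_space = comb(n, 2)    # ring, m = 2 ports per core: C(9, 2) = 36
torus_space = comb(n, 4)   # torus, m = 4 ports per core: C(9, 4) = 126
```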

A Data Augmentation Approach
The NoC mapping problem is NP-hard, which means that the optimal mapping solution is hard to obtain. It is too expensive to directly generate datasets that meet the requirements of machine learning.
A data augmentation approach is proposed to augment datasets from a few globally searched samples. The approach is shown below (Algorithm 1):

Algorithm 1 Data Augmentation with Number Exchange
Input: task graph and optimal mapping solution sequence
Output: a group of task graphs and corresponding optimal mapping solution sequences
01: {B} is the optimal mapping solution sequence of task B, calculated with global search;
02: Fetch data in an integer space [1, n] without repetition to build an n-dimensional order sequence; n! sequences can be obtained.
03: Let the initial order sequence be {1, 2, 3, …, n}.
04: Loop begins, i = 2,
05: the ith order sequence is {i1, i2, i3, …, in}.
06: Loop begins, j = 1,
07: Swap the number Bi in task graph B and {B} with number Bij.
08: If j = n, end loop.
09: If i = n!, end loop.
10: n! task graphs and corresponding optimal mapping solution sequences have been obtained.
For example, the optimal solution of task graph G2 (C, A) can be obtained from G1 (C, A) and its optimal solution, as Figure 6 shows. By swapping the identifiers of C1 and C5, task graph G2 (C, A) is obtained from G1 (C, A); applying the same swap, the optimal mapping solution of G2 is obtained from that of G1. In this way, n! task graphs with corresponding optimal solutions can be obtained from G1 (C, A) and its optimal solution.
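The number-exchange step can be sketched as a permutation of core identifiers. The function `relabel`, the permutation convention `sigma`, and the toy 3-core graph below are illustrative, not taken from the paper:

```python
import numpy as np

def relabel(task_matrix, solution, sigma):
    """Number-exchange augmentation (a sketch of Algorithm 1's core step).

    sigma[new_id] = old_id. Returns the relabeled point-to-point task matrix
    and the relabeled mapping sequence; optimality is preserved because the
    relabeling is a pure renaming of cores.
    """
    sigma = np.asarray(sigma)
    inv = np.argsort(sigma)                       # inv[old_id] = new_id
    new_task = task_matrix[np.ix_(sigma, sigma)]  # permute rows and columns
    new_solution = inv[np.asarray(solution)]      # rewrite tile -> core ids
    return new_task, new_solution
```

For a 3-core graph with traffic 0↔1 of 5 and 1↔2 of 7, swapping identifiers 0 and 1 (sigma = [1, 0, 2]) moves the volume 7 onto the new pair 0↔2, and the mapping sequence is rewritten accordingly.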

Figure 7b shows the other application mapping and the corresponding augmentations produced with Algorithm 1. Suppose the dataset is composed of two applications and their augmentations. Because of the large number of generated samples, the batch size is smaller than the number of samples, so a single training batch may contain only one application and its augmentations. If the number of augmentations is several times the batch size, the network parameters will nearly converge to an unexpected local optimum, making convergence on the other application difficult.
Although this problem can be alleviated by adjusting the training parameters, it cannot be completely solved in that way.
In order to avoid this problem, the order of the dataset should be changed so that, within any batch, there are no two augmentations generated from the same mapping solution.
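Algorithm 2's steps are not reproduced in the extracted text; one simple way to realize this dataset disorder is a round-robin interleaving across base samples, sketched below. The function and its round-robin policy are an assumption, not the paper's exact procedure:

```python
import random

def interleave_dataset(groups, seed=0):
    """Sketch of the dataset-disorder idea.

    `groups` is a list of N lists, each holding the augmentations of one
    base sample. Each group is shuffled, then samples are drawn round-robin
    across groups, so consecutive samples (and hence any batch of size up
    to N) never come from the same base mapping solution.
    """
    rng = random.Random(seed)
    pools = [list(g) for g in groups]
    for g in pools:
        rng.shuffle(g)
    out = []
    while any(pools):
        for g in pools:
            if g:
                out.append(g.pop())
    return out
```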
A dataset disorder approach is proposed to eliminate the risk mentioned above. The approach is shown below (Algorithm 2):

Algorithm 2 Dataset Disorder
Input: dataset with augmentation, which includes N initial samples and N·n! generated samples. (An initial sample is written as A, and the corresponding generated sample is written as a.)
Output: the disordered dataset.
Finally, a multi-label machine learning model has been established to fit the proposed dataset, as Figure 8 shows.
The input of the multi-label machine learning model is the point-to-point task matrix. The output is the core column vector of the nearest neighbor matrix. After combining the core column vector of each core, the mapping sequence could be obtained.
The parameters and outputs for the multi-label machine learning model are as follows.
1. The input layer: the task graph matrix is 9 × 9.
2. The Conv1 layer performs a convolution on the input matrix with 32 convolution kernels of size 3 × 3, with a stride of 1 and no boundary padding.
3. The Conv2 layer uses 32 convolution kernels of size 5 × 5, with a stride of 1 and same boundary padding.
4. The Max Pooling layer is 2 × 2 with same boundary padding.
5. The Dropout layer's rate is set to 0.25.
6. The Conv3 layer uses 64 convolution kernels of size 4 × 4, with a stride of 1 and same boundary padding.
7. The Conv4 layer uses 64 convolution kernels of size 3 × 3, with a stride of 1 and no boundary padding.
8. The Flatten layer.
(The fully connected layer is written as the Fcn layer.)
9. The Fcn1 layer output dimension is 1024, with a 0.5 dropout rate.
10. The Fcn2 layer output dimension is 4096.
11. The Fcn3 layer output dimension is 9, with a sigmoid activation.
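The listed layers imply the following feature-map sizes; the sketch below is a hand check of the shape arithmetic (assuming Keras-style 'valid'/'same' padding semantics), not the training code:

```python
# Feature-map size trace for the described CNN.
# 'valid' (no padding): (size - k)//stride + 1; 'same': ceil(size/stride).
import math

def out_size(size, k, stride=1, same=False):
    return math.ceil(size / stride) if same else (size - k) // stride + 1

s = 9                                      # input: 9 x 9 task matrix
s = out_size(s, 3)                         # Conv1: 3x3, valid  -> 7 x 7
s = out_size(s, 5, same=True)              # Conv2: 5x5, same   -> 7 x 7
s = out_size(s, 2, stride=2, same=True)    # MaxPool 2x2, same  -> 4 x 4
s = out_size(s, 4, same=True)              # Conv3: 4x4, same   -> 4 x 4
s = out_size(s, 3)                         # Conv4: 3x3, valid  -> 2 x 2
flat = s * s * 64                          # Flatten -> 256, then Fcn 1024/4096/9
```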

Simulation and Results
Simulations are performed with Python, Matlab, and TGFF (Task Graphs For Free), where TGFF is used to generate tasks and Python is used to search for the global optimal solution and to establish the machine learning model. The NoC topologies are ring and torus, the NoC size is set to 9, and the routing algorithm is XY. Two optimization targets are set: one is minimum delay, and the other is minimum energy.
In order to verify the proposed model, two kinds of experiments are designed. One uses task graphs with a fixed number of connections, to verify the model on regular task graphs. The other uses task graphs with a random number of connections, to verify the universality of the proposed model. Both experiments are developed on the ring and torus topologies.
On the basis of the model validation, the mapping sequence accuracy and the performance distance between predicted samples and correct samples are simulated and calculated.
If the proposed machine learning model predicts correctly, the performance of the global search and of the proposed model is the same; if not, the predicted sequence cannot achieve any performance advantage. In order to compare the performance difference between the global search and the proposed model, 200 samples of the test dataset are selected for performance calculation. Tables 2-9 show the performance of the wrongly predicted task graph mapping samples; the rest are correct and have the same performance as the global search. Table 10 shows the test mapping sequence accuracy and the relative average performance distance (RAPD) between the wrong test samples and the corresponding correct samples. As Table 10 shows, the test mapping accuracy is lower than the model accuracy. A wrong sample loses its performance advantage and has a large performance gap to the correct mapping sequence, which causes a large system performance loss. The average NoC mapping accuracy is 96.3%.
The proposed machine learning mapping algorithm performs well in predicting mapping sequences under different optimization targets on ring and torus NoCs.

Conclusions
In this paper, an NoC machine learning mapping algorithm is established to solve the NoC mapping problem. Firstly, the space complexity and symmetry of NoC mapping are analyzed, and a labeling problem caused by mapping symmetry is identified. Secondly, in order to solve the symmetry problem, an NoC mapping dataset with low complexity and no symmetry is proposed for machine learning. The complexity of the dataset changes with the number of ports of a single core in the NoC; for a fully connected NoC, the proposed method does not reduce the complexity. Thirdly, a data augmentation approach, consisting of an augmentation step and a disorder step, is proposed to establish the machine learning dataset. Finally, a multi-label machine learning model is established and verified on the defined dataset. The simulation results confirm that the proposed machine learning mapping algorithm has at least 99.6% model accuracy and an average of 96.3% mapping accuracy in predicting optimal solutions for different optimization targets and topologies. Although the proposed model has high prediction accuracy, a wrongly predicted mapping sequence causes a large mapping performance loss. This differs from heuristic mapping algorithms, which fall into local optimal solutions and lose only some performance: heuristic mapping still performs better than random mapping, whereas a wrong mapping sequence obtained by the proposed model can be approximated as a random mapping sequence, which results in a greater performance loss. In future work, methods of identifying and dealing with wrongly predicted mapping solutions will be discussed.