DagTM : An Energy-Efficient Threads Grouping Mapping for Many-Core Systems Based on Data Affinity

Tao Ju 1,2,*, Xiaoshe Dong 1,*, Heng Chen 1 and Xingjun Zhang 1
1 School of Electronics and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; hengchen@mail.xjtu.edu.cn (H.C.); xjzhang@mail.xjtu.edu.cn (X.Z.)
2 School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730370, China
* Correspondence: jutao2011@stu.xjtu.edu.cn (T.J.); xsdong@xjtu.edu.cn (X.D.); Tel.: +86-158-0290-1805 (T.J.); +86-29-8266-3951 (X.D.)


Introduction
Improving computing performance and reducing energy consumption remain key problems in the high-performance computing domain [1]. The heterogeneous many-core system has emerged as a promising solution for energy-efficient computing. In emerging heterogeneous many-core systems composed of a host processor and a co-processor, the host processor handles complex logical control tasks (i.e., task scheduling, task synchronization, and data allocation), and the co-processor computes large-scale parallel tasks with high computing density and simple logical branches. The two processors cooperate to compute different portions of a program to improve its energy efficiency. The host processor generally adopts a chip multi-processor that contains a limited number of processor cores, and the co-processor generally adopts an emerging many-core processor, such as a graphics processing unit (GPU) or the Intel Many Integrated Core (MIC), which integrates many processing cores (generally tens or even hundreds) in a single chip. These processing cores are connected via an interconnection network and employ simultaneous multithreading

Energies 2016, 9, 754

(2) We propose data locality patterns that reflect the different data correlations, and use an affinity matrix to quantify the data affinity between threads.
(3) We design an affinity subtree spanning algorithm based on an affinity graph to implement thread grouping.
(4) We implement the thread grouping mapping strategy DagTM based on the data affinity on the Intel MIC heterogeneous many-core system.

Related Work
Many efforts have been made to map threads to processing cores based on data locality. Jiang et al. [5] introduced the concept of concurrent data reuse distance by extending the traditional data reuse distance, connected the concurrent reuse distance with the data locality of each individual thread by using a probabilistic model, and presented a solution to collect and apply the concurrent reuse distance on Chip Multi-Processor (CMP) platforms. Zhang et al. [6] designed a novel data locality optimization strategy for multicores, which is able to balance both inter- and intra-core reuse. The strategy is essentially an exhaustive comparison method, which needs to calculate the data reuse weights, construct a data dependence graph and a data sharing graph in advance, and trade off performance against overhead. In addition, the strategy mainly focuses on single-threaded cores without considering simultaneous multithreading. Drebes et al. [7] proposed a resource-aware approach that combines topology-aware work stealing, dependence-aware memory allocation, and work pushing. The approach can significantly improve the performance of some memory-bound applications. However, for computation-bound applications, it may not achieve the ideal performance because of additional run-time overhead. Lu et al. [8] proposed a software framework that partitions the cache at the data object level to reduce cache misses. Unlike our strategy, that work focuses mainly on reducing cache misses to improve performance without considering the data interference between different processing cores. Moreover, it needs to modify the Linux kernel to implement the proposed framework, which restricts its generality.

Muralidhara et al. [9] proposed a cache hierarchy-aware application grouping algorithm to find an application-to-core mapping. The work mainly analyzes the memory access relationship between different applications according to the sampled reuse distance distribution on a simulator, and groups the workload at the coarse-grained program level, which cannot fully reflect the data interaction characteristics between different threads in the same application. Diener et al. [10] proposed a mechanism to improve memory access locality by reducing accesses to remote caches and memories. During program execution, threads that share data are migrated to processing cores at the same memory level, and memory pages accessed by a thread are migrated to the node that runs the thread to improve memory access performance. However, the mechanism introduces additional runtime overhead. Ding et al. [11] proposed a cache hierarchy-aware code mapping and scheduling strategy for multicore architectures. During mapping, the loop iterations, data reuse, and processor memory hierarchy are abstracted as an iteration vector, a reuse vector, and a core vector, and loop iterations are mapped to processing cores by an algebraic method. The method calculates the data locality by comparing the vectors. For computation-intensive loop iterations, this method can achieve good computing performance, but for storage-intensive and communication-intensive applications, simply comparing vectors cannot effectively extract relatively complex data correlations between threads. Cruz et al. [12] proposed a mechanism to detect the communication patterns of shared memory applications by monitoring cache coherence protocols, and proposed an algorithm to dynamically migrate the threads. This mechanism can implement dynamic thread mapping to reduce communication overhead, but it depends on specific hardware support.

Tousimojarad et al. [13] proposed an extended lowest load technique that uses a heuristic to find the optimal target core for each thread. The work aims mainly to provide load balancing in a multithreaded multiprogramming environment. Marongiu et al. [14] considered a representative template of a modern multi-cluster embedded Multiprocessor System-on-Chip (MPSoC), and presented an extensive evaluation of the cost associated with supporting OpenMP. They adopted a hierarchical barrier algorithm to improve the performance of global synchronization, and introduced extensions for data distribution at the cluster level to implement data sharing in an effective manner. Poovey et al. [15] proposed four novel, hybrid hardware/software, pattern-based thread mapping predictors, which aim to provide load balancing to improve performance. Our work mainly utilizes the data locality between threads to map the threads to the appropriate processing cores to improve energy efficiency. These works do not conflict with ours; they can be combined with it to further improve energy efficiency.
The thread mapping strategies mentioned above either introduce extra runtime overhead or need customized support from compilers or special hardware, which limits their versatility and effectiveness. In contrast, DagTM simultaneously takes into account the memory access characteristics of application threads and the hardware architecture features of many-core processors, analyzes the data affinity of application threads, and then divides the threads into different groups. DagTM can reduce shared memory access conflicts and unnecessary data transmission, improve program computing performance, and reduce system energy consumption without requiring additional runtime overhead or special hardware support.

Threads Grouping Mapping Framework
The thread grouping mapping framework of DagTM is shown in Figure 1. First, the application program is divided into a number of application threads that corresponds to the maximum number of hardware threads supported by the many-core processor. The computing tasks are evenly allocated to the application threads. The data locality of computing tasks is not considered during task partitioning; it is taken into account in the subsequent thread grouping. We exploit parallelism mainly in the loop portions of the benchmark programs. Given that the major computing work of most programs is concentrated in loops, inserting the OpenMP directive #pragma omp parallel for in the loop parts of a benchmark program realizes the parallelization. Second, DagTM detects the data locality of each thread by computing the data reuse distance. The threads are merged into different pattern classes by data locality pattern classification. Third, DagTM analyzes the data affinity between different threads by calculating the number of shared data elements. The data affinity reflects the inherent data correlation of the program and is independent of the specific running platform. DagTM uses the affinity matrix to quantify the data affinity. Fourth, threads are categorized into different thread groups based on the data affinity matrix and the data affinity graph. Finally, thread groups are mapped to different processing cores for execution by considering the hardware architecture features of the many-core processor.

Detecting Thread Data Locality
After the computation tasks are partitioned into different application threads, DagTM detects the data locality of each thread by collecting the memory access data of the thread and calculating the data reuse distance.

Calculating Data Reuse Distance
Data reuse distance is an ideal metric for detecting program data locality [2,5]; it refers to the number of distinct data elements referenced between the current and the previous access to the same data element. The reuse distance is inherent to a program and is independent of the hardware configuration. A small reuse distance means that the accessed data has good data locality and is accessed at high frequency; conversely, a large reuse distance means poor data locality and a low access frequency. We design a Pin tool using the Intel Pin Application Program Interface (API) [16,17] to collect the memory access data and calculate the data reuse distance in parallel. After that, the average data reuse distance of each thread is calculated.
The traditional data reuse distance calculation is realized by using stack- or tree-based algorithms. Stack-based algorithms are inefficient because they need to sequentially traverse all of the data sequences from the top of the stack to the bottom. The tree-based algorithm has higher computing efficiency, because it can utilize properties of the tree to reduce redundant traversal. In order to speed up the traversal of data nodes in the tree, Niu et al. [18] used a hash table to assist the data reuse distance calculation. Because the hash table must be constructed before calculating the data reuse distance, it incurs additional overhead in both time and space.
In order to reduce the additional overhead in both time and space, we calculate the data reuse distance when inserting the memory access data into a balanced binary tree, and simultaneously record the data reuse distance in the corresponding data node. In this way, the calculation of the data reuse distance and the collection of thread access data are completed at the same time, without the support of any auxiliary data structure (e.g., a hash table). When the scan of the whole memory access data is finished, the balanced binary tree containing the memory access data and reuse distances has also been generated. The balanced binary tree node is used to record the memory access data. It is defined as a struct with the fields TS (time stamp), Element (the accessed data element), Frequency (access frequency), Weight (number of nodes in the subtree rooted at the node), and RD (data reuse distance).
The whole process of calculating the data reuse distance includes inserting nodes, deleting nodes, and traversing nodes in the balanced binary tree. The average data reuse distance of a whole thread is calculated by traversing the balanced binary tree, and it is used to quantify the data locality of the thread.
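As an illustration of the node record, the following minimal Python sketch uses the field names and initial values given in the collection procedure (TS, Element, Frequency, Weight, RD). The `left` and `right` child links and the Python types are assumptions made for this sketch; they are not taken from the original struct definition.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One balanced-binary-tree node; the tree is keyed by the time stamp."""
    ts: int                          # TS: time stamp of the most recent access
    element: str                     # Element: the accessed data element
    frequency: int = 1               # Frequency: number of accesses seen so far
    weight: int = 1                  # Weight: node count of the subtree rooted here
    rd: float = math.inf             # RD: reuse distance, infinite on first access
    left: Optional["Node"] = None    # assumed child link (not in the source text)
    right: Optional["Node"] = None   # assumed child link (not in the source text)


first_access = Node(ts=1, element="a")
```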

Collecting Thread Access Data
Collecting memory access data is implemented by inserting data nodes into the balanced binary tree. The data node insertion operation adopts the in-order inserting algorithm of a balanced binary tree with the time stamp as the primary key.
Before inserting a new data node, the algorithm first traverses the current binary tree and determines whether the new data has already been recorded in it. If the data has been recorded, the data reuse distance is calculated. The existing data node that contains the new data is deleted from the current binary tree, and the tree is adjusted to maintain its balance. Then the new data node is inserted into the current balanced binary tree. If the new data is not recorded in the current binary tree, the new data node is inserted directly. The algorithm iterates over the above steps until all the memory access data and the corresponding information are recorded. A complete algorithm for collecting access data consists of the following five procedures: scanning memory access data, inserting a data node, deleting the original data node, counting the data reuse frequency, and calculating the data reuse distance. The concrete realization process is as follows:
(1) Initialization: define the data structure of the new data node, Node (TS; Element; Frequency; Weight; RD), and an empty binary tree.
(2) Call the Pin tool to scan the variables accessed by the thread, and record the time stamp t_i and data element d_i.
(3) Assign initial values to the fields of the new data node: TS = t_i; Element = d_i; Frequency = 1; Weight = 1; RD = ∞.
(4) Determine whether the data element of the new data node is already included in the current binary tree by traversing the binary tree in order.
(5) If the node is contained in the current balanced binary tree, calculate the data reuse distance, count the reuse frequency, and delete the original data node, as described above. Then adjust the balanced binary tree: the conventional AVL algorithm (balanced binary tree algorithm) is used to rebalance the tree with the time stamp as the primary key. Finally, insert the new node into the current balanced binary tree.
(6) If the node is not contained in the current balanced binary tree, insert the new data node directly into the binary tree.
(7) Repeat steps (2)-(6) until all the access data of the thread have been scanned. This generates the balanced binary tree that contains all the memory access data, the corresponding data access frequencies, and the corresponding data reuse distances.
(8) Adjust the data reuse distances that equal ∞ in the data nodes. The value ∞ is replaced with M, the number of data nodes in the current binary tree. M is the Weight value of the root node, and it is also the maximum data reuse distance of the current thread.
(9) Finish collecting the access data of the thread.
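The collection steps above can be sketched compactly in Python. This is a simplified stand-in rather than the authors' Pin tool: a sorted Python list of time stamps replaces the AVL tree, time stamps are simply the positions in the trace, and the function name `collect_access_data` is hypothetical. It keeps one entry per distinct element, counts the entries with a larger time stamp on each repeated access, keeps the smaller of the old and new reuse distances, and finally replaces ∞ with M.

```python
import bisect
import math


def collect_access_data(trace):
    """Return {element: (frequency, reuse_distance)} for one thread's trace."""
    stamps = []    # sorted time stamps, one per distinct element (AVL stand-in)
    last_ts = {}   # element -> time stamp of its current entry
    freq = {}      # element -> access frequency
    rd = {}        # element -> smallest reuse distance seen so far

    for ts, elem in enumerate(trace, start=1):
        if elem in last_ts:
            # Entries newer than the previous access are exactly the distinct
            # elements referenced since that access.
            pos = bisect.bisect_right(stamps, last_ts[elem])
            rd[elem] = min(rd[elem], len(stamps) - pos)
            freq[elem] += 1
            stamps.pop(pos - 1)        # delete the original entry
        else:
            rd[elem] = math.inf        # first access
            freq[elem] = 1
        stamps.append(ts)              # ts only grows, so the list stays sorted
        last_ts[elem] = ts

    m = len(stamps)                    # number of distinct elements
    return {e: (freq[e], m if math.isinf(rd[e]) else rd[e]) for e in freq}


result = collect_access_data(list("abcabb"))
```

For the made-up trace a, b, c, a, b, b, element a is reused after two distinct elements, b is reused immediately on its last access, and c is never reused, so its distance becomes M = 3. Deleting from a Python list is linear in M; the balanced tree in the text avoids that cost.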

The Algorithm of Calculating Data Reuse Distance
The data reuse distance calculation is encapsulated into an isolated function, which is called directly while collecting thread access data. The algorithm is as follows:
(1) Search the current binary tree for the data node that contains the data to be inserted. If no such node is found, the data is being accessed for the first time, and its data reuse distance is set to ∞.
(2) If the data node is found in the current binary tree, the data reuse distance is calculated as follows (N_target denotes the found node and N_root the root node): if the time stamp of N_target is larger than the time stamp of N_root, N_target is a node in the right subtree of N_root, and the data reuse distance is the number of nodes in the right subtree of N_root whose time stamps are larger than the time stamp of N_target.
(3) Compare the data reuse distance of the current data with that of the original node in the binary tree, and set the smaller value as the final data reuse distance RD of the data node to be inserted, in order to ensure the validity of the data reuse distance.
(4) In the left or right subtree of N_root, the number of nodes whose time stamps are larger than the time stamp of N_target is calculated as follows (rd refers to the final number of nodes):
(d) If N_current is not the root node and its time stamp is larger than that of N_target, rd is increased by the weight value of N_current minus the weight value of its left child. Set the parent node of N_current as the new N_current, and go to step (c).
(e) If the time stamp of N_current is smaller than that of N_target, set the parent node of N_current as the new N_current and go to step (c).
(f) If N_current is N_root, the calculation is completed.
The algorithm details are shown in Algorithm 1.
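The role of the Weight field in this counting step can be illustrated with a short Python sketch. It is not Algorithm 1: the tree here is a plain, unbalanced binary search tree, and the count walks down from the root instead of up from N_target. It does use the same quantity as step (d), namely the weight of a node minus the weight of its left child. The names `TsNode`, `insert`, and `count_larger` are hypothetical.

```python
class TsNode:
    """Binary search tree node keyed by time stamp, with a subtree weight."""

    def __init__(self, ts):
        self.ts = ts
        self.weight = 1      # number of nodes in the subtree rooted here
        self.left = None
        self.right = None


def weight(node):
    return node.weight if node is not None else 0


def insert(root, ts):
    """Insert a time stamp and refresh weights along the path (no rebalancing)."""
    if root is None:
        return TsNode(ts)
    if ts < root.ts:
        root.left = insert(root.left, ts)
    else:
        root.right = insert(root.right, ts)
    root.weight = 1 + weight(root.left) + weight(root.right)
    return root


def count_larger(root, target_ts):
    """Count nodes whose time stamp is strictly larger than target_ts."""
    rd = 0
    cur = root
    while cur is not None:
        if target_ts < cur.ts:
            # cur and its entire right subtree are newer than the target.
            rd += weight(cur) - weight(cur.left)
            cur = cur.left
        else:
            cur = cur.right
    return rd


root = None
for stamp in [4, 2, 6, 1, 3, 5, 7]:
    root = insert(root, stamp)
```

Because each step discards one subtree using a stored weight, the count touches only one root-to-leaf path, which is what makes the tree-based calculation cheaper than scanning a stack.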

The Instance of Calculating Data Reuse Distance
The following concrete instance explains the process of collecting the memory access data and calculating the data reuse distance. Table 1 shows the data sequence accessed by the thread; the values of frequency and reuse distance are obtained by executing Algorithm 1.
Figure 2 shows the process of inserting the data elements of Table 1 into a balanced binary tree and calculating the data reuse distance and frequency. The process mainly includes searching, deleting, and inserting operations. Finally, it generates a balanced binary tree that contains the unique memory access data d, b, c, e, f, a, and their corresponding data reuse distance information. Two of the steps in Figure 2 are: (i) no data node contains the data f, so the node 9:f is inserted directly into the binary tree; (l) the data reuse distance is adjusted by replacing each ∞ with the number of nodes, 6.

The Average Data Reuse Distance of Thread
After the balanced binary tree is generated, the average data reuse distance of every thread is computed by traversing the corresponding binary tree. Let K be the total number of threads, RD_j (j = 1, 2, ..., K) the average data reuse distance of thread j, rd_i the data reuse distance of the i-th unique data element, and M the number of unique data elements. RD_j can be calculated as follows:

$$RD_j = \frac{1}{M} \sum_{i=1}^{M} rd_i$$

The average data reuse distance reflects the internal data locality of a thread. A greater average data reuse distance means a low data reuse rate and poor data locality; a smaller one means a high data reuse rate and better data locality.

The Algorithm Complexity Analysis
Let N be the total number of data accesses of a thread, and M the number of unique accessed data elements. The whole algorithm mainly includes the insertion, traversal, and deletion of nodes in the balanced binary tree. The main computation overhead is spent on searching for the data and calculating data reuse distances. The time complexity of searching for a target data node is O(M), and the time complexity of calculating a data reuse distance is O(log(M/2)), so the total time complexity of the whole algorithm is O(N(M + log(M/2))). The space complexity of the whole algorithm is O(N).

Determining Data Affinity
We merge all the threads into different data locality pattern classes based on the data reuse distance. We then analyze and quantify the data affinity between threads. First, we determine the data locality pattern according to the average data reuse distance of every thread. After that, the threads are merged into different pattern classes. Lastly, we analyze the data affinity between threads in every pattern class, and use the data affinity matrix to quantify the data affinity between different threads.

Definition of Data Locality Pattern
Data locality patterns are classified into three types: data sharing patterns, data dependency patterns, and data isolation patterns. The different data locality patterns are quantified by the data reuse distance. We set the data reuse distance threshold values as Dmin and Dmax, which reflect different data access characteristics, and divide the data reuse distances into three different ranges, each of which corresponds to one of the data locality patterns. Finally, we identify the data locality pattern of each thread by comparing its average data reuse distance with the threshold values Dmin and Dmax. The data locality pattern definitions are as follows:

Quantifying Data Affinity among Threads
We analyze the data affinity between threads in every pattern class by calculating the number of identical data elements accessed by different threads, and use the data affinity matrix to quantify the data affinity between different threads. The data affinity matrix reflects the data sharing characteristics between different threads. The row and column labels of the matrix represent thread identifiers (IDs). Every element of the matrix is the number of data elements shared between the threads identified by the corresponding row and column IDs. A greater element value means more data sharing between the corresponding threads, and therefore higher data affinity between them.
We calculate the number of identical data elements accessed by each pair of threads, and set this number as the corresponding element of the data affinity matrix. In order to improve the calculation speed, we compare in parallel the accessed data of threads that belong to the same data locality pattern class. By comparing the balanced binary trees of different threads, we compute the number of identical data nodes between the trees; this number is the sharing data volume of the corresponding two threads and is recorded in the corresponding element of the data affinity matrix. Finally, a complete data affinity matrix is constructed, which reflects the data affinity between threads. After that, the data affinity matrix is transformed into the data affinity graph, which intuitively reflects the data affinity between threads. The data affinity graph is an undirected weighted connected graph, whose vertexes are the thread IDs and whose edge weights are the data sharing volumes between threads.
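The construction of the affinity matrix and its conversion to an edge list can be sketched as follows. This is a minimal sequential Python illustration, assuming each thread's distinct accessed elements are already available as a set (in the text they come from the per-thread balanced binary trees, and the comparison is done in parallel). The function names are hypothetical, and dropping zero-weight edges is an assumption of this sketch.

```python
from itertools import combinations


def affinity_matrix(thread_data):
    """thread_data[i] is the set of distinct elements accessed by thread i."""
    n = len(thread_data)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = len(thread_data[i] & thread_data[j])   # sharing data volume
        matrix[i][j] = matrix[j][i] = shared            # the matrix is symmetric
    return matrix


def affinity_edges(matrix):
    """List the weighted edges (i, j, weight) of the affinity graph, i < j."""
    n = len(matrix)
    return [(i, j, matrix[i][j])
            for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] > 0]   # assumption: pairs sharing nothing get no edge


accessed = [{"a", "b", "c"}, {"b", "c", "d"}, {"x"}]
matrix = affinity_matrix(accessed)
edges = affinity_edges(matrix)
```

Only the upper triangle carries information, which matches the later remark that storing the upper part of the adjacency matrix is sufficient.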

Threads Grouping Mapping
DagTM maps threads to processing cores in two stages, based on the data affinity combined with the memory hierarchy features of the many-core system. The first stage is thread grouping; the second stage is assigning thread groups to different processing cores.

Threads Grouping
Threads are divided into K different thread groups based on the data affinity graph combined with the max number of hardware threads supported by a processing core. The thread grouping needs to ensure good data affinity between the threads in the same group. Dividing the threads into groups while ensuring better data sharing within each thread group is essentially a combinatorial optimization problem. In the process of thread grouping, the impact of the current thread on other threads should be considered, and so should the impact of other threads on the current thread. In this article, the thread grouping is abstracted as a graph decomposition problem. By designing an affinity sub-tree spanning algorithm, the data affinity graph is decomposed into K subtrees that meet the above requirements. The threads with high data sharing are merged into the same thread group, and the threads with strong memory access conflicts are placed in different thread groups.
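To make the decomposition idea concrete, the following Python sketch groups threads greedily. It is not the paper's Algorithm 2; it is an assumed Kruskal-style heuristic with a size cap. Edges are visited from the heaviest to the lightest, and two groups are merged only if the merged group still fits within the per-core hardware thread limit. The function name `group_threads` and the edge format `(i, j, weight)` are hypothetical.

```python
def group_threads(num_threads, edges, max_group_size):
    """Greedily merge threads along heavy affinity edges, capped per group."""
    parent = list(range(num_threads))   # union-find forest over thread IDs
    size = [1] * num_threads

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Heaviest edges first, so strongly sharing threads are grouped early.
    for i, j, _w in sorted(edges, key=lambda e: e[2], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj and size[ri] + size[rj] <= max_group_size:
            parent[rj] = ri
            size[ri] += size[rj]

    groups = {}
    for t in range(num_threads):
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


toy_edges = [(0, 1, 9), (2, 3, 8), (1, 2, 1), (0, 3, 1)]
pairs = group_threads(4, toy_edges, max_group_size=2)
single = group_threads(4, toy_edges, max_group_size=4)
```

A greedy pass like this does not guarantee the maximum total edge weight, and it may produce more than K groups when merges are blocked by the size cap; it is only meant to show how the affinity weights and the hardware thread limit interact.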

Affinity Subtree Spanning Algorithm
The detailed execution process of the affinity subtree spanning algorithm is as follows:
(1) G = (V, E) is a weighted undirected connected graph (i.e., the affinity graph). The vertex set V refers to the set of threads, and the edge set E refers to the set of data affinities between different threads. Each edge (T_i, T_j) ∈ E has a weight value ω(T_i, T_j), which refers to the sharing data volume between the corresponding threads (as shown in Figure 3a). The total number of vertexes of graph G is N_t, which is the total number of threads; N_p is the number of nodes of every subtree (i.e., the number of threads of every thread group), which corresponds to the max number of hardware threads supported by a processing core of the specific many-core processor; K is the number of finally generated subtrees.
(2) Generate the K subtrees from the weighted undirected connected graph G, where ST_k refers to the different subtrees. Each generated subtree contains at most N_p nodes, and the sum of the weight values of the subtree edges must be the largest. These two requirements are the constraint conditions of the decomposition.
(3) The sub-tree spanning algorithm is shown as Algorithm 2.

The Algorithm Complexity Analysis
The complexity of the affinity subtree spanning algorithm is related to the number of vertexes and edges of the affinity graph. The time complexity is mainly related to the number of edges of the data affinity graph. The initial comparison needs to compare the weight values of all edges. Subsequently, the number of comparisons is reduced gradually. If the number of edges is n, the numbers of comparisons are n, n − 1, n − 2, ..., 1, so the total time complexity of the algorithm is O(n^2/2 + n/2). In order to reduce the space complexity of the algorithm, the adjacency matrix is used to store the affinity graph, and only the upper triangular part of the adjacency matrix is stored. So, the space complexity of the algorithm is O(V^2/2).

Instance of Threads Grouping
A concrete instance of the affinity sub-tree generating procedure is shown in Figure 3. The graph contains eight threads, and the max number of hardware threads supported by a processing core is four. Finally, two sub-trees are generated, i.e., the eight threads are merged into two thread groups ST1 (T1, T7, T3, T6) and ST2 (T2, T4, T5, T8). Each generated thread group contains at most four threads, and the sum of the weight values of the sub-tree edges is the largest, which keeps the data correlation between threads within the same group as large as possible.

Mapping Rules
Combined with the memory hierarchy graph, the threads in the affinity sub-trees are mapped to the different processing cores of the many-core processor. Following the mapping algorithms in [9,10], we take the data affinity sub-trees and the memory hierarchy graph as input, and realize the thread mapping by statically binding threads to processing cores. The mapping rules are as follows:
Rule 1: The application threads in the same thread group should be assigned to different hardware threads in the same processing core as far as possible. If the hardware threads in the same processing core are all allocated, the application threads should be assigned to the adjacent processing cores. The aim is to reduce additional data replication and memory access latency, and to improve the utilization of the shared cache.
Rule 2: The application threads in different thread groups should be assigned to different processing cores. Isolated threads should be dispersed among different processing cores with isolated cache space. The aim is to avoid the high data transmission latency and shared cache contention caused by the large number of different data replications of application threads.
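The two rules can be illustrated with a small placement sketch in Python. It only computes a (core, hardware thread) slot for each application thread and does not perform any actual binding. The function name `map_groups_to_cores`, the zero-based core numbering, and the assumption that consecutive core indices are adjacent are all hypothetical; the default of four hardware threads per core and the two example groups follow the grouping instance above.

```python
def map_groups_to_cores(groups, hw_threads_per_core=4):
    """Return {thread: (core, hw_thread)} following Rule 1 and Rule 2."""
    mapping = {}
    next_core = 0
    for group in groups:
        for k, thread in enumerate(group):
            # Rule 1: fill the hardware threads of one core first, then spill
            # over to the next (assumed adjacent) core if the group is larger.
            core = next_core + k // hw_threads_per_core
            mapping[thread] = (core, k % hw_threads_per_core)
        # Rule 2: the next group starts on a fresh core.
        next_core += -(-len(group) // hw_threads_per_core)   # ceiling division
    return mapping


st1 = ["T1", "T7", "T3", "T6"]
st2 = ["T2", "T4", "T5", "T8"]
placement = map_groups_to_cores([st1, st2])
```

With these inputs every thread of ST1 lands on core 0 and every thread of ST2 on core 1, so threads that share data also share a core's cache while the two groups stay apart.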

DagTM Implementation
DagTM implementation includes detecting the data locality of threads, quantifying data affinity, grouping threads, and mapping threads to processing cores for execution. The concrete implementation is shown in Figure 4. We use an eight-thread parallel application to explain the complete DagTM mapping process on the Intel MIC processor. The task model and target platform model are described in our previous article [19]. The detailed mapping process is as follows:
(1) DagTM first computes the average data reuse distance of each thread, identifies the locality pattern of the different threads, and merges the threads into different pattern classes on the basis of the locality pattern [20][21][22]. After that, DagTM constructs the data affinity matrix of the threads (as shown in Figure 5) by counting the sharing data volume between threads in the different pattern classes, and transforms the data affinity matrix into the data affinity graph (as shown in Figure 6).
(2) The threads are categorized into different groups via the affinity subtree spanning algorithm. The presented example is based on the Intel MIC many-core system. For the specific Intel MIC heterogeneous system architecture, readers can consult reference [19]; the memory architecture is shown in Figure 7. The MIC processor supports four hardware threads in each processing core.
(3) After the thread grouping is completed, the thread groups are assigned to the different processing cores. We need to make sure that the threads assigned to the same processing core have better data locality, and that threads assigned to different processing cores have the smallest data affinity. The final mapping result is shown in Figure 8c.
Figure 8 compares the results of traditional OpenMP mapping and DagTM mapping for the same application threads on the same Intel MIC many-core system. The OpenMP Compact mapping mechanism mainly aims to make full use of every processing core and does not consider the data locality between threads. It can assign threads that share a large amount of data to different processing cores, which results in substantial additional memory accesses.
As shown in Figure 8a, the threads T1, T3, T6, and T7, which have high data affinity, are spread across processing cores 1 and 2. A copy of the same data must therefore be stored in the on-chip caches of both core 1 and core 2, which adds memory overhead. The OpenMP Scatter mapping in Figure 8b mainly considers load balance: it assigns threads evenly to the processing cores and likewise does not consider the data locality between threads. Therefore, the Scatter mapping also introduces many additional memory accesses; moreover, it cannot make full use of the processing core resources and causes high system energy consumption. In contrast, as shown in Figure 8c, the DagTM mapping considers the data locality between threads, divides the threads into groups according to the hardware architecture features of the processing core and the data affinity, and then maps the thread groups to specific processing cores of the many-core processor. It can thus exploit the data locality between threads to improve data sharing between hardware threads and to reduce additional data accesses and data transmission. In addition, it can make full use of the processing core resources to improve the utilization of every processing core and reduce the energy consumption of the whole system.
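The grouping and mapping steps described above can be illustrated with a minimal sketch. It is not the affinity subtree spanning algorithm itself: a simple greedy grouping is swapped in as a stand-in, and the names `build_affinity_matrix`, `group_threads`, and `map_groups_to_cores` are hypothetical. The only facts taken from the text are that affinity is measured by the volume of shared data and that each MIC core offers four hardware threads.

```python
def build_affinity_matrix(accessed):
    """accessed[i] is the set of data items touched by thread i (assumed input).
    The affinity of two threads is taken as the number of items they share."""
    n = len(accessed)
    return [[0 if i == j else len(accessed[i] & accessed[j]) for j in range(n)]
            for i in range(n)]


def group_threads(affinity, group_size=4):
    """Greedy stand-in for the grouping step (NOT the subtree spanning algorithm).
    group_size=4 mirrors the four hardware threads per MIC core."""
    unassigned = set(range(len(affinity)))
    groups = []
    while unassigned:
        # Seed a group with the thread that shares the most data with the rest.
        seed = max(sorted(unassigned),
                   key=lambda t: sum(affinity[t][u] for u in unassigned if u != t))
        group = [seed]
        unassigned.remove(seed)
        # Grow the group with the thread that shares the most data with it.
        while len(group) < group_size and unassigned:
            best = max(sorted(unassigned),
                       key=lambda t: sum(affinity[t][g] for g in group))
            group.append(best)
            unassigned.remove(best)
        groups.append(sorted(group))
    return groups


def map_groups_to_cores(groups):
    """Assign one thread group per processing core (static binding)."""
    return {core: group for core, group in enumerate(groups)}
```

Under these assumptions, threads that share much data end up in the same group and therefore on the same core, while weakly related threads are placed on different cores.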

Experimental Methodology
We used the PARSEC [4] benchmark suite to evaluate DagTM. The benchmark programs were executed with the native input size using the OpenMP APIs. The experiment was conducted on an Intel MIC heterogeneous many-core system that consists of two eight-core E5-2670 CPUs and one Xeon Phi 7110P MIC co-processor, with 64 GB memory and a 300 GB hard disk. The MIC co-processor contains 61 processing cores, and every processing core supports four hardware threads. The PCI-E x16 bus that connects the CPU and the MIC co-processor can transfer data at up to 16 GB/s. The OS is Red Hat Enterprise Linux Server release 6.3, and the performance metrics were obtained with the PAPI_5.4.1 performance measurement tool [16,17]. The software development environment is Intel parallel_studio_xe_2013_update3_intel64.
The DagTM, OpenMP Compact, OpenMP Scatter, Oracle (the ideal optimized thread mapping for the application, obtained by empirical observation), and Kernel Memory Affinity Framework (kMAF) [10] mapping mechanisms were each applied to the benchmark programs to compare them in three aspects: computing performance, energy consumption, and extra overhead.

Computing Performance
Figure 9 shows the relative improvement in computing performance of the different mapping mechanisms for the different benchmark programs. The normalized performance improvement ratio was computed as the relative reduction in application execution time of each mapping mechanism compared with the baseline, the OS default mapping mechanism (first-touch policy). As shown in Figure 9, the average computing performance was increased by 2%, 3%, 17%, 14%, and 12% over the baseline (OS) by Compact, Scatter, Oracle, DagTM, and kMAF, respectively. The average improvement of DagTM amounted to 82.4% of that of Oracle and was better than that of the other three mapping mechanisms. The computing performance of Compact and Scatter was lower than that of DagTM and kMAF. In some cases, their computing performance was even lower than that of the OS default mechanism (e.g., Blackscholes, x264, Dedup, and Facesim). The main reason is that Compact and Scatter do not fully consider the data locality between threads, which causes contention for shared resources and additional data transmission delay. Furthermore, the computing performance of DagTM was in some cases lower than that of kMAF (e.g., Streamcluster, Raytrace, Freqmine, and Dedup). The reason is that kMAF can dynamically adjust threads according to the runtime data affinity between threads. kMAF achieves better computing performance when the running behavior of an application changes significantly and the performance benefit obtained by dynamic adjustment is greater than the additional overhead. However, for applications whose running behavior does not change significantly, the additional runtime overhead introduced by kMAF offsets the performance benefit, so their computing performance under kMAF is lower than under DagTM.
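The normalization described above can be written out as a short worked formula. This is a sketch of one plausible reading of "relative reduction in execution time"; the symbols $T_{\mathrm{OS}}$, $T_m$, and $I_m$ are labels introduced here for illustration and do not appear in the article.

```latex
% T_OS: execution time under the OS default mapping (baseline)
% T_m : execution time under mapping mechanism m
\[
  I_m = \frac{T_{\mathrm{OS}} - T_m}{T_{\mathrm{OS}}},
  \qquad
  \frac{\bar{I}_{\mathrm{DagTM}}}{\bar{I}_{\mathrm{Oracle}}}
  = \frac{14\%}{17\%} \approx 82.4\%
\]
```

The second expression reproduces the reported 82.4% from the average improvements of DagTM (14%) and Oracle (17%).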
Figure 10 shows the reduction ratio of last-level cache misses normalized to the OS default mapping mechanism. The smaller the normalized value, the better. Overall, DagTM reduced L2 cache misses more than Scatter and Compact did. The cache miss reduction of kMAF was the closest to that of Oracle and superior to the others. The main reason is that kMAF can monitor the cache status in real time and dynamically adjust threads to reduce cache line replication. However, monitoring data affinity in real time introduces additional runtime overhead, which offsets part of the performance benefit obtained from the reduced cache misses and affects the overall computing performance. Consequently, as shown in Figure 9, the average computing performance of kMAF is not superior to that of DagTM.

Energy Consumption
Apart from performance, reducing energy consumption is another important goal of thread mapping. On the one hand, energy consumption can be reduced by hardware approaches (e.g., DVFS [23,24]); on the other hand, it can be reduced by efficiently exploiting the computing resources and reducing the execution time. DagTM reduces the static energy consumption of the whole system by making more efficient use of the computing resources and reducing the execution time. The system energy consumption was measured during the execution of each application by using PAPI components, which provide access to the energy and power values returned by the Intel RAPL interface [17].
As shown in Figure 11, the average energy consumption was reduced by 2.3%, 3.2%, 12.4%, 10.3%, and 8.5% compared to the baseline (OS) by Compact, Scatter, Oracle, DagTM, and kMAF, respectively. Because DagTM considers the data affinity before thread mapping, it reduces memory access contention, improves the utilization of shared resources, reduces data transmission overhead, and shortens the overall execution time, and thereby reduces the energy consumption of the whole system.

Extra Overhead
To approximate the extra overhead for the benchmark programs, we measured the time spent mostly on data locality detection. Figure 12 shows the extra overhead of DagTM, measured as the ratio of the execution time of data locality detection to the whole program execution time for the different benchmark programs. The average extra overhead introduced by DagTM is nearly 11%. The Compact and Scatter mappings do not consider the data locality of the program itself and map the threads directly to the processing cores, so they introduce no additional overhead before or during program execution. The Oracle mapping obtained the best performance by exhaustive comparison and introduced the largest additional overhead; it serves only as an ideal mapping standard, not as a practical mapping approach. kMAF is able to dynamically adjust the thread mapping according to the running status of a program, which introduces a certain additional runtime overhead and affects the computing performance of the program, so it has to trade off the performance benefits against the additional runtime overhead. In contrast, DagTM implements thread mapping directly after completing thread grouping. Because thread grouping is performed before program execution, it does not introduce additional runtime overhead. DagTM is able to realize thread grouping at the cost of a negligible preprocessing overhead. It makes up for the shortcomings of the Scatter and Compact approaches and obtains performance improvements similar to those of kMAF without introducing additional runtime overhead.

Conclusions and Future Work
In this article, we have investigated the problem of mapping threads to processing cores based on data affinity. The mapping of threads to the different processing cores of a many-core processor was implemented on the basis of the data affinity between threads, taking the features of the memory hierarchy into account. The ultimate purpose of this work is to improve the energy efficiency of the whole system by reducing shared memory access contention, increasing shared resource utilization, and reducing data transmission overhead. Specifically, the data locality is detected by computing the data reuse distance; the data affinity is quantified via an affinity matrix; and the threads are divided into thread groups via an affinity subtree spanning algorithm. Finally, the thread groups are assigned to processing cores by static binding. The evaluation results on the benchmark programs show that DagTM is effective in improving computing performance and reducing energy consumption. DagTM is able to map the threads reasonably to different processing cores according to the data affinity between threads, and it improves the energy efficiency of the whole system without introducing additional runtime overhead.
In future work, we will extend DagTM by combining it with dynamic detection of phase changes in the running program to realize a hybrid static and dynamic thread mapping based on data affinity. In addition, we will combine DagTM with other thread mapping strategies to adapt it to multithreaded multiprogramming environments and cluster architectures.
struct Node {
    int TS;         // time stamp that records the access order of the data
    float Element;  // records the accessed data
    int Frequency;  // records the number of times the data has been accessed
    int Weight;     // records the number of sub-nodes contained in the current node
    int RD;         // records the data reuse distance of the current node
};

(5) If the data element is already contained in a data node of the current binary tree:
(a) Call the data reuse distance calculation function (shown in Algorithm 1) to calculate the current data reuse distance RD.
(b) Count the data reuse frequency of the new data node: the data reuse frequency equals the frequency value of the current node plus one.
(c) Delete the current data node from the current binary tree.
(d) …

… If the time stamp of the target node (N_target) is smaller than the time stamp of the root node (N_root), N_target is a left subtree node of N_root. The data reuse distance then equals the number of right subtree nodes plus the number of nodes in the left subtree of N_root whose time stamps are larger than the time stamp of N_target.
… value of rd as 0.
(b) Assign the weight value of the right child node of N_target to rd, and set N_target as the current node N_current.
(c) Backtrack to the parent node of N_current.
(d) …

Search the data f, and calculate the corresponding data item information.
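The reuse distance idea behind these fragments can be illustrated with a minimal sketch. It assumes the common definition in which the reuse distance of an access is the number of distinct data elements touched since the previous access to the same element, which matches counting the nodes whose time stamps are larger than that of the target node. For brevity it uses a sorted list of time stamps instead of the weighted binary tree described in the text; the function names are hypothetical.

```python
from bisect import bisect_right


def reuse_distances(trace):
    """Return one reuse distance per access; None marks a first access."""
    last_seen = {}    # element -> time stamp of its most recent access
    live_stamps = []  # sorted most-recent time stamps, one per distinct element
    distances = []
    for now, element in enumerate(trace):
        prev = last_seen.get(element)
        if prev is None:
            distances.append(None)
        else:
            # Elements accessed after `prev` are those with a larger time stamp.
            idx = bisect_right(live_stamps, prev)
            distances.append(len(live_stamps) - idx)
            live_stamps.pop(idx - 1)  # drop the stale time stamp of this element
        live_stamps.append(now)       # `now` is the largest stamp, list stays sorted
        last_seen[element] = now
    return distances


def average_reuse_distance(trace):
    """Average over reused accesses only; None if nothing is ever reused."""
    reused = [d for d in reuse_distances(trace) if d is not None]
    return sum(reused) / len(reused) if reused else None
```

The list-based version costs linear time per deletion; a balanced tree with subtree weights, as in the `Node` structure, would reduce each lookup to logarithmic time.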

Figure 2. The instance of collecting access data.

Definition of Data Locality Pattern
Data locality patterns are classified into three types: data sharing patterns, data dependency patterns, and data isolation patterns. The different data locality patterns are quantified by the data reuse distance. We set the data reuse distance threshold values D_min and D_max, which reflect different data access characteristics, and divide the data reuse distances into three ranges, each of which corresponds to one of the data locality patterns. Finally, we identify the data locality pattern of each thread by comparing its average data reuse distance RD_j with the threshold values D_min and D_max. The data locality pattern definitions are as follows:
Definition 1: Data Sharing Pattern (DSP): RD_j < D_min. Under this pattern, the data accessed by the thread has strong temporal locality. Threads that belong to this pattern should be assigned to different hardware threads on the same processing core, and the data accessed by the thread should be allocated to the same memory location as the thread.
Definition 2: Data Isolation Pattern (DIP): RD_j > D_max. Under this pattern, the data accessed by the thread has poor temporal locality. Threads that belong to this pattern should be assigned to different hardware threads of different processing cores of different processors.
Definition 3: Data Dependency Pattern (DDP): D_min ≤ RD_j ≤ D_max. Under this pattern, the data accessed by the thread has partial temporal locality. Threads that belong to this pattern should be assigned to different hardware threads of different processing cores on the same processor.
The specific threshold values of D_min and D_max are obtained by empirical observation. By measuring the different benchmark programs, we calculate the data locality and analyze the relationship between data locality and data reuse distance. Among the data reuse distance values of programs with strong data sharing, the maximum is taken as D_min; among those of programs with isolated data access characteristics, the minimum is taken as D_max. In this article, the obtained values of D_min and D_max are 35% and 85% of the amount of data access in a certain interval, respectively.
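The three definitions amount to a simple threshold test, sketched below. The function names and the way the interval size is passed in are assumptions made for illustration; only the comparison rules and the 35% and 85% factors come from the text.

```python
def thresholds(accesses_in_interval, low=0.35, high=0.85):
    """Derive (D_min, D_max) as fractions of the data accesses in an interval.
    The 35% / 85% defaults are the empirically obtained values from the text."""
    return low * accesses_in_interval, high * accesses_in_interval


def classify_locality(avg_rd, d_min, d_max):
    """Map a thread's average reuse distance RD_j to its locality pattern."""
    if avg_rd < d_min:
        return "DSP"  # data sharing: same core, different hardware threads
    if avg_rd > d_max:
        return "DIP"  # data isolation: different cores on different processors
    return "DDP"      # data dependency: different cores on the same processor
```

For example, with 100 data accesses in the interval, an average reuse distance of 10 falls below D_min and is classified as DSP, while a value of 90 exceeds D_max and is classified as DIP.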

Figure 10.The reduction of the last level cache misses.

Figure 11.The reduction of the system energy consumption.

Table 1. The memory access data sequence.