Affinity-Based Task Scheduling on Heterogeneous Multicore Systems Using CBS and QBICTM

Abstract: This work presents the grouping of dependent tasks into clusters using a Bayesian analysis model to solve the affinity scheduling problem in heterogeneous multicore systems. Non-affinity scheduling of tasks has a negative impact because it increases the overall execution time of the tasks. Furthermore, non-affinity-based scheduling limits the potential for data reuse in the caches, so the same data must be brought into the caches multiple times. In heterogeneous multicore systems, it is essential to address the load balancing problem because the cores operate at varying frequencies. We propose two techniques to solve the load balancing issue: the first, designated the "chunk-based scheduler" (CBS), is applied to heterogeneous systems, while the second is "quantum-based intra-core task migration" (QBICTM), where each task is given a fair and equal chance to run on the fastest core. Results show a 30–55% improvement in the average execution time of the tasks when applying our CBS or QBICTM scheduler compared to the traditional schedulers of the same operating system.


Introduction
Multicore systems usually operate in a shared resource environment. They must cooperate in order to process sharable resources, especially the cache. Caches in multicore systems are private as well as shared. Task scheduling in heterogeneous multicore systems is mainly dependent on how the various tasks are distributed among all available cores [1]. The processing power of each core in a heterogeneous system is different, so distributing the workload according to each core's maximum capacity or efficiency is essential. Thus, the computing power of each core has to be considered. Shared caches play a vital role in increasing the throughput of the overall system, but dependent tasks need to be scheduled concurrently [2].
The prime objective of multicore systems is to execute many tasks in considerably less time. In such circumstances, the role of the cache memory is vital: if tasks are scheduled on a non-affinity basis, the cache miss rate increases significantly [3]. When a higher priority task requests use of the CPU, the system grants that task processing priority over the currently running lower priority task. In doing so, the higher priority task wipes the data of the previously running lower priority task from the cache, wasting time because the system must bring the same data back into the cache from off-chip memory when processing of the lower priority task resumes [4]. Affinity-based scheduling of tasks maximizes the cache hit rate, which reduces the overall execution time of the total volume of tasks [5,6]. The slower cores in a heterogeneous multicore system negatively affect the overall performance of the system. Because the cores in heterogeneous systems operate at different computing speeds, even after an equal distribution of the workload the slowest core might end up overloaded while, at the same time, the fastest core remains idle [7].
Heuristics have been proposed for affinity-based scheduling of tasks using task migration in NUMA (non-uniform memory access) multicore systems [8]. The identical or dependent tasks in a multicore system are scheduled on the same core, thereby reducing the cache thrashing while maximizing the cache hit rate [9]. In multicore systems, shared caches operate between the CPU and the off-chip memory, mainly random access memory (RAM). Gollapudi et al. [10] proposed a completely fair scheduler (CFS) that uses classification to allocate the tasks on the cores in a way that tasks scheduled on the same core use the core's shared cache without thrashing the data of other tasks in the cache. Several heuristics are proposed for the concurrent scheduling of tasks that share the same cache content and cooperate to improve cache performance [11,12]. The overhead of shared caches in multicore systems has been considered in the context of WCET (worst case execution time) analysis [13]. Several shared cache partitioning mechanisms have been proposed to reduce cache interference and thrashing [14,15].
Classifying dependent tasks together into the same cluster using different algorithmic approaches has been proposed [16,17]. Algorithms have been proposed to predict the possible dependency ratios among the tasks that are to be processed [18–20]. Some tasks have read or write dependencies on other tasks for certain memory locations; in such cases, a task waits for a certain memory location to be written, later reads it, and then carries on with its execution [21,22]. The total number of interactions between tasks indicates their level of dependency [23]. The affinity relations between multiple queries are checked and closely related queries are assigned to the same cache structure [24]. Affinity scheduling allows multiple tasks to execute more quickly due to locality of the data, cache utilization or interaction with other tasks [25]. Approaches have been proposed that define the conditions on the basis of which affinity-based tasks are grouped into a cluster, ensuring the grouped data have some kind of dependency [26–29]. To ensure load balancing in heterogeneous systems, the imbalance between the NUMA and SMT (simultaneous multithreading) NUMA scheduling domains is identified and load balancing is performed over the balancing interval of the scheduling domains [30]. Static and dynamic load balancing techniques have been proposed: in static load balancing, the load is distributed according to the computation speed of the processors; in dynamic load balancing, the load is distributed after the computation starts [31]. Tasks in the same cluster are allocated together to the processing cores, taking advantage of their dependencies [32]. Clustering dependent data together is beneficial because all the data can be processed using the same resources at once [33,34].
Several affinity techniques on a number of realistic applications are presented in [35]. Jia et al. [36] presented a task scheduling mechanism where dependent tasks are scheduled concurrently on the same core ensuring memory affinity, while reducing task switching overhead. Regueira et al. [37] presented a technique which enables efficient utilization of NUMA systems by assigning tasks to a set of cores taking into account the communication and synchronization times between tasks. The proposed scheme calculated the communication cost and the synchronization cost between the tasks. Johari et al. [38] performed load balancing and scheduling tasks on cores by maintaining a queue (sequential search, random search, random search with spatial locality, average method) and without maintaining the queue (sequential, random and random search with spatial locality). Muneeswari [39] proposed a technique in which critical and non-critical tasks are classified, where the critical tasks are scheduled, such that re-utilization of the cache is minimized while, for non-critical tasks, round robin scheduling is used. Holmbacka et al. and Gantel et al. [40,41] proposed a task replication mechanism in which tasks are migrated between different cores. When a task is created on one core, a replica of the same task is created on other available cores. This enables a task's replica to become active on a certain core when a task is migrated to that core. However, existing work does not clearly address the load balancing issue in heterogeneous systems, or, if it intended to do so, the way in which tasks are scheduled onto the cores where the computing power of each core is different is not clearly defined. Although all the cores in a multicore system are given a fairly equal number of jobs to process, still the best potential performance of the system cannot be exploited because some cores perform slowly by comparison with the fastest cores.
In this paper, we propose two task scheduling techniques, named CBS (chunk-based scheduler) and QBICTM (quantum-based intra-core task migration), based on even load balancing that considers the processing speed of all the cores. The main contributions of our work are as follows.

1.
We use the Bayesian data analysis model to cluster dependent tasks. All similar tasks, i.e., those that are communicating or have any kind of dependency, are grouped into a single cluster. Using the Bayesian analysis model, we find the dependencies between tasks given that they are communicating or synchronized.

2.
Clusters are then allocated to the cores in a heterogeneous multicore architecture ensuring fair load distribution. We propose a chunk-based scheduler (CBS), where the computing speed of each core is determined. All the tasks that are to be processed are divided into variable size chunks equal to the number of cores available. Each chunk contains a number of tasks depending on and commensurate with the processing frequencies of the respective cores. The largest chunk of tasks is allocated to the fastest core while the smallest chunk is given to the slowest core.

3.
The second scheduler we propose is quantum-based intra-core task migration (QBICTM). All the tasks are divided into equal-size chunks, one per available core. We introduce a time quantum at the fastest core, giving each task a fair and equal chance to execute there. The fastest core in a multicore architecture is the one that operates at a greater maximum frequency than the other available cores. As soon as the quantum expires, tasks migrate between the cores of the heterogeneous multicore architecture.
We used the PyCharm Community IDE to obtain the results, and the Python multiprocessing module to implement our proposed schedulers. All the results explained in detail later in this paper were obtained on a 7th-generation Intel Core i5 multicore system with 8 GB of DDR4 RAM. The system contains two physical cores and four logical cores. Each physical core operates at a maximum frequency of 2700 MHz, which can be boosted up to 3100 MHz. The maximum operating frequency of each physical core is changed to a different value, ensuring that the cores are heterogeneous in nature. All the cores in the multicore architecture are utilized to their full capacity, i.e., 100% utilization.
The rest of the paper is organized as follows. Section 2 introduces the methods used in this study. Sections 3 and 4 describe the experimental results and provide a discussion, respectively. Section 5 concludes our work. Figure 1 shows the structure of our proposed affinity-based CBS and QBICTM schedulers. Affinity-based task scheduling can significantly reduce cache thrashing, and for this we used a Bayesian data analysis model. The affinity-based scheduling of tasks on cores has already been briefly described in the latter part of the Introduction.

Bayesian Generative Model
The Bayesian data model is used in this research to group identical tasks. We chose the Bayesian model because of the advantages it has over other algorithms, including its use of the prior information available with the data. The Bayesian model analyzes the data, makes use of previously available information, and forms priors for future analysis. It works on the basis of assumption and estimation. Priors are the information the model contains before seeing the data [42]. We used a non-informative uniform prior for the tasks in this research: the Bayesian generative model contains no information about the tasks to be processed before seeing the data, and the prior dependency between tasks takes a uniform value between 1% and 100%. Figure 2 shows the working mechanism of the Bayesian generative model. All incoming tasks carry prior information, which is forwarded to the generative model as input; the information contained in each task is analyzed by the Bayesian model, and similarities between tasks are identified according to whether the tasks are communicating or synchronized. The generative model produces the posterior data, which are affinity-based, as output. Tasks that appear similar, given the communication or synchronization between them, are grouped together to form a single cluster. The total number of clusters made by the Bayesian data model equals the number of cores present in the heterogeneous multicore system.
The Bayes theorem for probability distributions is stated as

p(D|C) = p(C|D) p(D) / p(C),

where D is the dependency and C is the communication between tasks; p(D|C) is the posterior distribution for dependency given that the tasks are communicating; p(C|D) is the likelihood of communication given dependency; p(D) is the prior distribution for the parameters; and p(C) is the marginal probability, specifying that the tasks are communicating. Equivalently, p(D|C) = Q p(C|D) p(D), where Q = 1/p(C) acts as a normalizing constant.
We assumed different ratios of dependent and independent tasks and, using the prior information contained with the tasks, we input them into a generative model to find the existing dependencies between tasks on the basis of communication or synchronization. The detailed results are presented in Section 3.
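As a concrete illustration of this step, the posterior update under a uniform prior can be sketched in a few lines. The paper does not publish its model code, so the Beta–Binomial form, the 30/70 dependent/independent split, and every name below are illustrative assumptions, not the authors' implementation:

```python
# Minimal Bayesian sketch: a uniform prior Beta(1, 1) over the task-dependency
# ratio is updated with observed communication/synchronization events.
# The Beta-Binomial form and all names here are illustrative assumptions.

def posterior_dependency(n_pairs, n_communicating, prior_a=1.0, prior_b=1.0):
    """Return (a, b) of the Beta posterior over the dependency ratio."""
    a = prior_a + n_communicating               # successes: pairs seen communicating
    b = prior_b + (n_pairs - n_communicating)   # failures: silent pairs
    return a, b

def posterior_mean(a, b):
    """Expected dependency ratio under the Beta(a, b) posterior."""
    return a / (a + b)

# Example: 30 of 100 observed task pairs communicate (the 30/70 split
# assumed in Section 3).
a, b = posterior_dependency(100, 30)
print(posterior_mean(a, b))  # posterior mean 31/102, roughly 0.304
```

The uniform prior corresponds to the "any value between 1% and 100%" assumption above; after observing the data, the posterior concentrates around the observed dependency ratio.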

Chunk-Based Scheduler (CBS)
We performed load balancing on a heterogeneous multicore system by proportionately dividing the total load into multiple chunks of variable sizes according to the various processing speeds of all the respective cores. The Bayesian generative model gives the list of dependent and independent tasks in the form of posterior data as an output. All tasks, whether dependent or independent, are listed together without making clusters. Similar or dependent tasks are listed in a contiguous manner. All tasks that are to be processed are present in a single list that functions like a queue, where each task is requesting allocation to the core in order to be processed. Tasks are then divided into chunks that may differ in size according to the respective processing speeds of the cores.
The computing power or capacity of each core in a heterogeneous system is predetermined, with each core taking a job or task from its private queue. The largest chunk of tasks is allocated to the fastest core. All the tasks in a chunk are put into the core's private queue. In the same way, each core gets a chunk of tasks with respect to its computing capacity, the smallest chunk being given to the slowest core in the heterogeneous multicore system. This results in each core being proportionately loaded, the fastest core being more heavily loaded and the slowest core processing fewer jobs or tasks, each according to its capacity. Figure 3 shows how the CBS technique is applied in the system. Workloads consisting of multiple tasks are not distributed evenly across the cores, because that would result in some cores being overloaded; instead, owing to the heterogeneous nature of the system, the fastest core, which processes tasks at a much higher speed, receives more of the load. The proportionate allocation of more tasks to the faster cores and fewer to the slower cores means all cores are more fully utilized and jobs or tasks are more likely to be completed in the same time. Ideally, no core is overloaded and no core is left idle. The fastest core might become idle while other cores are still processing tasks, but by then it has already executed a larger portion of the tasks to be processed. Table 1 describes the terms used in this research.
T = {t_1, t_2, ..., t_t} — set of tasks, t_t being the total number of tasks to be processed.
C_k = {k_1, k_2, ..., k_c} — chunks of tasks, equal in number to the available cores.
f_1, ..., f_m — maximum computing frequency of each core.
f_n — total/maximum processing speed of the system.

The number of tasks in each chunk is computed as k_c = (f_m / f_n) × t_t, where f_m is the processing speed of core c_m; each chunk is therefore of variable size and contains a different number of tasks.
Algorithm 1 describes the working mechanism of CBS. All the tasks that are to be processed are divided into various-sized chunks. The next step is to calculate the processing speed of each of the cores in the multicore architecture. Once the processing speed of each of the cores is determined, the chunks are allocated to the cores by considering each core's speed, i.e., workload is allocated proportionate to each core's processing speed. The fastest core in a multicore architecture gets the largest chunk of tasks, the smallest chunk being allocated to the slowest core.

Algorithm 1 Scheduling tasks on the cores using the CBS algorithm
Inputs: List of tasks; mapping of chunks k_c onto the set of cores C
Outputs: Computation of the overall execution time of the set of tasks T
BEGIN
1. List the tasks in the output generated by the Bayesian generative model
2. Compute the processing speed of each core c_1–c_m in the set of cores C
3. Make chunks C_k of tasks T = number of available cores C
4. Chunk k_c = (c_m processing speed / total frequency) × number of tasks in the list
5. Create a child of the parent process
6. Queue (or pipe) the largest chunk k_c at the fastest core c_m and the smallest chunk at the slowest core
7. foreach core c_m do
8.     Process each task t_t from the core's queue until no task is left for processing in the queue
9. end for
END
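The chunk-sizing rule above can be sketched as a small executable example. The core frequencies, the unit-cost task model, and the function names below are assumptions for illustration, not the paper's implementation:

```python
# Sketch of the CBS chunk-sizing rule: each core's chunk is proportional to
# its share of the total frequency, with any rounding remainder handed to
# the fastest core. Frequencies here are illustrative, not measured values.

def cbs_assign(n_tasks, core_freqs):
    """Return per-core chunk sizes proportional to each core's frequency."""
    total_freq = sum(core_freqs)
    sizes = [int(n_tasks * f / total_freq) for f in core_freqs]
    sizes[core_freqs.index(max(core_freqs))] += n_tasks - sum(sizes)
    return sizes

def finish_times(sizes, core_freqs):
    """Estimated per-core completion time, assuming one unit of work per task."""
    return [s / f for s, f in zip(sizes, core_freqs)]

freqs = [3100, 2700, 2100, 1500]   # MHz (assumed heterogeneous cores)
sizes = cbs_assign(100, freqs)
print(sizes)                        # largest chunk goes to the fastest core
naive = finish_times([25, 25, 25, 25], freqs)
cbs = finish_times(sizes, freqs)
print(max(naive) / min(naive), max(cbs) / min(cbs))
```

Under the equal 25/25/25/25 split, the slowest core takes roughly twice as long as the fastest; under the proportional split, the per-core finish times are nearly equal, which is exactly the load balance CBS aims for.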

Quantum-Based Intra-Core Task Migration Load Balancing Scheduler
Fair and even load balancing is performed on a heterogeneous system using quantum-based intra-core migration of tasks. The working mechanism of this migration is presented in Figure 4.
The processing speed of each core in the multicore architecture is computed, and the cores are sorted in increasing order of their respective processing speeds. Each task to be processed on the multicore system is given a fair and equal chance to run on the fastest core. A time quantum is introduced at the fastest core, and tasks are processed on a first-come, first-served basis from each core's own private queue. Each core in the heterogeneous multicore system owns a private queue, which functions as that core's ready queue and from which tasks are allocated to the core. Maximum and minimum thresholds are defined for the total number of tasks in each core's queue; as soon as the number of tasks in any queue falls below the defined minimum threshold, an intra-core migration of tasks occurs. Every running task on each core is migrated to the ready queue of the next corresponding core, as shown in Figure 4. When a task exhausts its quantum while running on the fastest core, the CPU is preempted from the task and the task is migrated to the ready queue of the next-in-line processing core, i.e., the second-fastest core. At that point, the number of tasks in the queue of the fastest core falls below the minimum threshold, because the next task waiting in the queue is allocated to the fastest core. In that case, the task running on the slowest core is migrated to the fastest core's ready queue and the current state of the task is saved. The total number of tasks in each queue of the multicore system is computed as the total number of tasks divided by the number of available cores. Whenever the quantum of a task running on the fastest core expires, the number of tasks in the queues of all the cores of the heterogeneous system falls below the defined minimum threshold and intra-core migration of the tasks takes place. The described technique helps eliminate problems related to load balancing on heterogeneous systems.
The load balancing issue in heterogeneous systems, namely that the fastest core finishes its allocated tasks much more quickly than the slowest core, is addressed and minimized in our research. Our proposed technique ensures that the fastest core always processes a task as long as other cores are operating and tasks are available for processing. Thus, we mitigate the problem that one core may be under-loaded or idle while another is overloaded, even after a fair and equal distribution of load in a heterogeneous system.
Algorithm 2 describes the QBICTM scheduling of tasks. When the quantum of a task expires, the intra-core migration of tasks takes place and the next waiting task is allocated to the fastest core.

Algorithm 2 Load balancing using the QBICTM scheduler algorithm
Inputs: Allocation of each cluster cl_a to a core c_m
Outputs: Computation of the overall execution time of the set of tasks T
BEGIN
1. Make clusters of tasks = number of available cores
2. Tasks in each cluster cl_a = total number of tasks / total cores available
3. Compute the processing speed of each core c_m in the set of cores C
4. Allocate exactly one cluster cl_a to each core c_m, ensuring load balancing with each core getting a fair and equal number of tasks to process
5. Queue the tasks T in each cluster cl_a onto the core c_m
6. Introduce a time quantum at the fastest core, each task getting a fairly equal time for processing at the fastest core in one cycle
7. When the time quantum for a task expires, preempt core c_m from the task t_t and allocate the next task waiting in the queue onto the core
8. for each core c_m do
9.     Define a minimum threshold for each core's queue
10.    if total_number_of_tasks in core c_m's queue < threshold then
11.        Migrate the task t_t being processed on every core to the queue of the next corresponding core c_m
12.        Continue processing tasks T until there are no tasks left to process
13.    end if
14. end for
END
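The migration loop above can be sketched as a small simulation. This is a deliberately simplified model, not the paper's implementation: every core advances one quantum per round (real cores differ in speed), and the task costs and names below are illustrative assumptions:

```python
# Simplified QBICTM sketch: tasks start in equal-size per-core queues; each
# round, every core runs its head task for one quantum, and unfinished tasks
# migrate circularly to the next core's queue (core speeds are ignored here).
from collections import deque

def qbictm(work, n_cores, quantum):
    """work: remaining processing time per task. Returns task completion order."""
    queues = [deque() for _ in range(n_cores)]
    for i, w in enumerate(work):          # equal-size clusters, one per core
        queues[i % n_cores].append([i, w])
    done = []
    while any(queues):
        migrants = []
        for c in range(n_cores):
            if not queues[c]:
                continue
            task = queues[c].popleft()    # head of this core's private queue
            task[1] -= quantum            # run the task for one quantum
            if task[1] <= 0:
                done.append(task[0])      # task finished
            else:
                migrants.append((c, task))
        for c, task in migrants:          # circular intra-core migration
            queues[(c + 1) % n_cores].append(task)
    return done

print(qbictm([3, 1, 2, 5], n_cores=2, quantum=1))
```

Every unfinished task repeatedly passes through every core's queue, so each task gets its fair share of time on the fastest core, at the cost of the extra context switches noted in the Discussion.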

Results
In this section, we evaluate the performance of our proposed methodology. The chunk-based scheduling technique and the quantum-based intra-core task migration scheduler are applied to two programs: a factorial program and a real-life working example (the ingredient ratio for baking a cake). For the factorial program, we calculated the execution time for an individual number as well as for a process (a range of numbers). The real-life working example is a linear program for baking a cake with four ingredients: flour, butter, eggs and sugar. The scenario for the execution of the program is as follows:

• All the ingredients weigh exactly 700 g in total.
• The amount of butter was half that of sugar.
• The combined weight of flour and eggs was at most 450 g.
• The combined weight of eggs and butter was at most 300 g.
• The combined weight of eggs and butter was at most that of flour plus sugar.
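The constraint program above is small enough to solve by exhaustive search over integer gram values. The paper does not publish its solver, so this brute-force sketch (and the function name) is an illustrative stand-in for the linear program it describes:

```python
# Brute-force feasibility search for the cake-ingredient constraints, in
# integer grams. An illustrative stand-in for the paper's linear program.

def cake_weights():
    """Return (flour, butter, eggs, sugar) satisfying all five constraints."""
    for sugar in range(0, 701, 2):        # even sugar keeps butter integral
        butter = sugar // 2               # butter is half the sugar
        for eggs in range(0, 301):
            flour = 700 - sugar - butter - eggs   # total is exactly 700 g
            if flour < 0:
                continue
            if flour + eggs > 450:        # flour + eggs at most 450 g
                continue
            if eggs + butter > 300:       # eggs + butter at most 300 g
                continue
            if eggs + butter > flour + sugar:     # eggs + butter <= flour + sugar
                continue
            return flour, butter, eggs, sugar
    return None

print(cake_weights())
```

The search returns the first feasible combination in loop order; the combined constraint flour + eggs = 700 − 1.5 × sugar ≤ 450 forces sugar to be at least 168 g in this integer formulation.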
The program calculates the weight of each individual ingredient after processing multiple tasks, which have read-write or write-read dependencies. The characteristics of the system on which we performed the experiments are given in Table 2. We applied Bayesian data analysis to the factorial program, whose input values cover different ranges. We took different parameter values based on assumptions; for example, if there are 100 tasks to be processed, we assumed 30 of them are dependent while 70 have no dependencies. For those 30 dependent tasks, we find the probability of task dependency given that the tasks are communicating, synchronized, or both. The posterior data generated as output by the Bayesian model show different ranges, as shown in Figure 5. We set the non-informative prior for task dependency as uniform, where the dependency ratio of the tasks can be any value between 0 and 1: task dependency may be 100% or it may be 0%. The different parameters are given as input to the Bayesian generative model. A total of 100 iterations were run because we took each value in the range of 1 to 100 as an input parameter. The Bayesian generative model takes different parameter values as inputs and calculates the probability of task dependency for each parameter value. The dependent tasks, based on communication or synchronization, are listed together by the generative model and later grouped into single or multiple clusters.
Once the dependent tasks are listed separately, we make clusters of similar and independent tasks. The number of clusters equals the number of cores in the heterogeneous multicore architecture. Each output generated by the Bayesian analysis model for the different parameters is listed and the clusters are made one by one. Then, the mapping of the clusters onto the cores of the multicore architecture is initiated. Two different techniques are applied to ensure maximum load balancing on the heterogeneous system. The first technique puts the tasks from all clusters into a queue where similar or dependent tasks are listed in contiguous order. We applied the technique to the factorial program by allocating a larger chunk of tasks to the fastest core based on its processing power. Results show an improvement in the execution time for each individual task and for the process as a whole. The throughput of the overall system also increased, whereas the waiting time for each task decreased significantly. Table 4 shows the average execution time of the tasks for our chunk-based technique, which is used to find the factorials of given ranges of numbers for different input parameters from the Bayesian model. We took each output from the Bayesian model based on the input parameters and made clusters from identical tasks. We computed the execution time for the factorial of each individual number in the ranges listed in Table 4 and the overall execution time for the whole program. Figure 6 compares the average execution time of the factorial program for the different ranges of numbers based on the input parameter value. Results show an improvement in the overall average execution time when applying our chunk-based technique compared to the default operating system scheduler: the average execution time for the whole factorial program was reduced by 32–50%.
The CBS load balancing technique decreases the total execution time of the factorial program compared to the schedulers used by the operating system.
The results for the average execution time for the factorial program by scheduling the clusters on cores and applying the QBICTM scheduling mechanism by introducing a quantum at the fastest core are presented in Table 5.
We took different values for the time quantum and found the factorials for different ranges of numbers. The quantum-based intra-core migration technique is applied to each task generated as output by the Bayesian generative model based on the input parameter values. A time quantum is introduced at the fastest core and each task gets a fair and equal chance to be processed there. When the quantum expires, the intra-core migration of tasks takes place, as discussed in Section 2.3. Our load balancing technique outperformed the default OS scheduler in terms of execution time for the factorial program, processing the whole program in significantly less time; the reduction in overall execution time for the different ranges of numbers is between 30% and 55%. Smaller quantum values benefit tasks that need shorter processing times but increase the average waiting time of tasks that require more CPU time; as the quantum increases, larger tasks benefit and their waiting time is reduced. The quantum-based technique increases the context switching of tasks, and it therefore under-performed the chunk-based technique for some tasks when the quantum value was small, but its execution time slightly improved over the chunk-based scheduler when the quantum value was increased. As shown in Table 6, we executed different numbers of tasks for the real-life working example, namely the ingredient ratio for baking a cake, where there are multiple dependent and independent tasks. We first list the dependent tasks into a group using the Bayesian model so that, while making chunks, all the dependent tasks are listed serially in a contiguous manner. Results show that, using the chunk-based scheduling technique, we are able to reduce the overall execution time of the tasks by an average of 45–51% compared to the scheduler used by the operating system.
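The quantum trade-off described above can be reproduced with a tiny round-robin simulation on a single core. The burst times and function names below are illustrative assumptions, not the paper's workload:

```python
# Round-robin on one core: a smaller quantum means more context switches and
# longer waits for long tasks; a larger quantum reduces both. Bursts are toy
# values chosen to make the effect visible.
from collections import deque

def round_robin(bursts, quantum):
    """Return (average waiting time, number of context switches)."""
    remaining = list(bursts)
    queue = deque(range(len(bursts)))
    finish = [0] * len(bursts)
    clock = switches = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)       # quantum expired: requeue the task
            switches += 1
        else:
            finish[i] = clock     # task completed
    waits = [finish[i] - bursts[i] for i in range(len(bursts))]
    return sum(waits) / len(waits), switches

print(round_robin([2, 20, 4, 16], quantum=2))   # many context switches
print(round_robin([2, 20, 4, 16], quantum=8))   # fewer context switches
```

With the quantum raised from 2 to 8, the number of context switches drops sharply, mirroring the behavior reported for QBICTM when the quantum value is increased.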
Figure 7 shows the average execution time comparison between the CBS and OS schedulers. Quantum-based scheduling is applied to the real-life working example and the results are shown in Table 7. We computed the execution time for different numbers of tasks in a process. Results show an improvement of 44–51% in the overall execution time compared to the traditional operating system scheduler. Figure 8 shows the average improvement rate in the execution time when applying chunk-based and quantum-based scheduling by comparison with the traditional scheduler used by the operating system; the overall execution time of the tasks improved quite significantly. Table 8 compares the average improvement in throughput using different task scheduling techniques against the schedulers used by the running operating system. The results show the effectiveness of our proposed schedulers in comparison with other existing state-of-the-art schedulers. The average improvement in throughput could be increased further by applying the proposed technique on a multicore machine with a higher core count.

Discussion
The proposed schedulers in this study aim to distribute the workload according to each core's maximum capacity or efficiency. As the cores in the heterogeneous systems usually have different computing speeds, the equal division of the workload might not be as effective as in homogeneous systems. Most of the previous works focus on distributing equal and even workloads among cores but they do not clearly address the load balancing mechanism on heterogeneous systems.
The experimental results conducted in Section 3 prove the effectiveness of our proposed schedulers. We have presented two different scheduling techniques to ensure load balancing on a heterogeneous system. The load balancing mechanism is based on fair distribution of workload by keeping the processing power of multiple cores in consideration rather than just equal distribution of tasks.
By comparison with other proposed techniques, our affinity-based schedulers have several limitations, which direct our future work. Affinity-based scheduling of tasks maximizes the data reuse potential in shared caches; in future work, we will focus on improving the cache hit ratio and thus reducing the cache miss penalty. In the QBICTM scheduling technique, the context switching rate increases significantly whenever a quantum at the fastest core expires and tasks migrate between the cores. We will focus on minimizing the context switching of tasks, which will further improve the average execution time of an overall program.
Despite these limitations, our proposed schedulers apply load balancing fairly on the heterogeneous system. With the aim of achieving the maximum performance the system has to offer, our proposed schedulers focus on maximizing the utilization of all the processing cores in multicore architectures.

Conclusions
We proposed two techniques for scheduling tasks on heterogeneous multicore systems. All the dependent tasks are listed together using a Bayesian analysis model in which the non-informative uniform prior is used. All the input parameters are attached with the prior and passed onto the generative model as an input. The posterior data generated by the Bayesian model contains all the dependent tasks in a list in a contiguous manner and those tasks are passed on to the schedulers. Our CBS lists all the tasks together where dependent tasks are present in a contiguous manner. We then create a number of chunks of tasks of variable sizes, equal to the number of cores available. The largest chunk is assigned to the fastest core and the smallest chunk is allocated to the slowest core in the heterogeneous architecture.
QBICTM is the second load balancing scheduler proposed in our study. Here we introduce a time quantum at the fastest core, which gives each task a fair and equal chance of being processed there. The migration of tasks is performed by defining a minimum threshold for each core's queue: when the total number of tasks in a queue falls below the defined minimum threshold, the tasks currently being processed are migrated to the queue of the next corresponding core. The only drawback of the quantum-based intra-core migration of tasks is that it increases the number of context switches; however, it still significantly outperforms the traditional operating system scheduler in terms of execution and task waiting times.
Our load balancing schedulers outperformed the default OS scheduler in terms of execution time for the factorial program, processing the whole program in significantly less time. The reduction in overall execution time for a process containing different ranges of numbers is between 30% and 55%, while for the real-life example of baking a cake the execution time of an individual task is reduced by an average of 45–51% using our proposed affinity-based schedulers compared to the scheduler used by the operating system. In future work, the cache hit rate can be maximized, and the cache miss penalty reduced, by grouping tasks into clusters on the basis of the maximum interactions between them.