Log Analysis-Based Resource and Execution Time Improvement in HPC: A Case Study

High-performance computing (HPC) uses many distributed computing resources to solve large computational science problems through parallel computation. Such an approach can reduce overall job execution time and increase the capacity for solving large-scale and complex problems. In a supercomputer, the job scheduler, the HPC flagship tool, is responsible for distributing and managing the resources of large systems. In this paper, we analyze the execution log of the job scheduler over a certain period of time and propose an optimization approach to reduce the idle time of jobs. In our experiment, we found that the main root cause of delayed jobs is resource waiting: the execution time of the entire job set is affected and significantly delayed because idle resources accumulate while waiting for a large-scale job to become runnable. The backfilling algorithm can mitigate the inefficiency of these idle resources and help to reduce job execution time. Therefore, we propose a backfilling algorithm that can be applied to the supercomputer. The experimental results show that the overall execution time is reduced.


Introduction
A batch job scheduler periodically monitors the status of computing resources in a cluster and distributes jobs efficiently. Such a batch job scheduler is called, in general, the Distributed Resource Management System (DRMS) [1], Workload Management System (WMS) [2], or simply a job scheduler.
The basic role of the scheduler is to accurately reflect the resource status [3]. This includes various computational resources, such as licenses, CPU, and memory, as well as many-core acceleration systems, such as GPU and Intel PHI (many-core architecture) [4], which are of recent interest. The scheduler also meets requirements such as the fair-share policy [5] to ensure the fair distribution of resources, preemption support for high-priority jobs, resource scalability assurance, and support for various system environments. Most job scheduler software reflects the user job environment, from job submission to termination, as well as the inventory and system status of the entire managed object. It also stores various pieces of information related to job execution, such as job scripts, environment variables, libraries, and the waiting, starting, and ending times of a job. In order to configure the cluster environment and manage batch jobs, the scheduler is chosen to reflect the characteristics of the computation environment from both software and hardware perspectives [6].
This study analyzes the job execution logs produced by the batch scheduler of the Tachyon2 system, which has been actively operated by the National Supercomputing Center at the Korea Institute of Science and Technology Information (KISTI) [7]. The acquired supercomputer log contains rich information about submitted jobs, such as user information, execution time, resource size, job exit status, and so on [8]. When a large-scale job is submitted to the batch scheduler, it waits without using the returned compute resources until the requested resources are prepared. The log shows that idle resources go unused during this waiting time, causing inefficient resource waste. Therefore, the efficiency of resource allocation has to be evaluated against the scheduling algorithm [9], and it can be seen that optimization can be performed by applying backfilling. In this paper, we show that turnaround time can be reduced by optimizing this inefficiency of waiting resources.
The rest of this paper is organized as follows: In Section 2, we will introduce system configurations of Tachyon2, which is the supercomputer used for our analysis. In Section 3 and Section 4, we will introduce the backfilling algorithm and our experimental results, respectively. We will conclude this paper in Section 5.

System Overview
This study was carried out using the Tachyon2 supercomputer, which is currently in service. The system consists of various hardware and software stacks. On the hardware side, it is made of compute nodes, storage, backup archivers, and infrastructure nodes (login, data mover, scheduler, admin, etc.). In Tachyon2, each machine is connected by an InfiniBand interconnect [10]. On the software side, it works with a variety of software stacks, from the system OS to the application layer. As shown in Figure 1, KISTI's Tachyon2 supercomputer is a Linux-based cluster system with 3200 computing nodes. Each node is equipped with two quad-core Intel Xeon X5570 2.93 GHz (Nehalem) CPUs, for a total of 8 cores per node, and 24 GB of DDR3 memory installed across 3 channels. In this system, the Sun Grid Engine (SGE) is used as the batch job scheduler [11].

Batch Job Scheduler
The scheduler is a program that manages the order and scheduling of numerous submitted jobs while giving priority to fairness. There are various solutions, such as Sun Grid Engine (SGE), Slurm [12], LSF [13], Portable Batch System (PBS) [14], LoadLeveler [15], etc. Tachyon2 uses SGE to handle batch jobs. With such a scheduler, users can submit numerous jobs without being concerned about where and when the work will be assigned to computing resources and performed [16].
In order to execute an application on the supercomputer, the user writes a job specification script, as shown in Table 1, and submits the job script to the scheduler using the "qsub" command. When an available resource is acquired while waiting in the queue, the job is allocated to that resource and executed on that system. When assigning jobs, a job-assignment policy must be applied. In Tachyon2, the exclusive node assignment policy is configured and applied. This means that only one job is assigned to a node, which eliminates interference when multiple jobs are executed simultaneously. Figure 2 shows the execution process of a parallel application in the high-performance computing (HPC) batch-type environment. In general, all batch jobs are processed through the scheduler, the status of resources is continuously monitored, and abnormal resources are excluded from the available resources. In this system, a job is not preempted by another from the time it is submitted to the scheduler until it is executed and finished [17].
Appl. Sci. 2020, 10, 2634
The turnaround time of a job (Job_tat) is the time from when the user submits the job until the results are obtained; it can be optimized according to the scheduling algorithm [18].
The steps of each process related to job execution can be represented as follows:

Job_tat = T_ready + T_running + T_processing(pre, post)    (1)

• Job_tat: amount of time from submission to termination of the job;
• T_ready: waiting time from submission until the job starts;
• T_running: actual running time of the job;
• T_processing(pre, post): pre- and post-processing time.

Ultimately, Job_tat consists of the waiting time after submitting a job, the real job running time, and the processing time. Scheduling optimization can effectively reduce the latency of the job (T_ready). In this paper, we focus on and simulate Job_tat reduction by optimizing T_ready through the backfilling algorithm.
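As a minimal sketch, the decomposition in Equation (1) can be computed directly from timestamps of the kind stored in the log; the function and parameter names below are illustrative, not part of SGE:

```python
# Turnaround time decomposition from Equation (1):
#   Job_tat = T_ready + T_running + T_processing(pre, post)
# All values are in seconds; the field names are illustrative only.

def job_turnaround(submit_s, start_s, end_s, pre_s=0, post_s=0):
    """Return (T_ready, T_running, T_processing, Job_tat)."""
    t_ready = start_s - submit_s        # time spent waiting in the queue
    t_running = end_s - start_s         # actual execution time
    t_processing = pre_s + post_s       # pre- and post-processing overhead
    return t_ready, t_running, t_processing, t_ready + t_running + t_processing

# A job submitted at t = 0 that waits 2 hours and then runs for 6 hours:
t_ready, t_running, t_proc, job_tat = job_turnaround(0, 7200, 28800)
```

Reducing T_ready, as the backfilling simulation in this paper does, shrinks Job_tat one-for-one while T_running stays fixed.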

Job Execution Logs
Tachyon2 stores information about job execution at the start and endpoint. When a job is submitted, it is queued while available resources are ready. When available resources are obtained, the job is assigned to the compute resources. At this time, the user's job description script is backed up, and the library, compiler, and node information required to run the program are stored. At this time, the user's job scripts, libraries and compilers required to run the program, and the allocated resource information are stored [19].
When a job is completed on the Tachyon2 system, the scheduler (SGE) leaves the job execution information, then regenerates this log and saves it daily in the format shown in Table 2. This data is used for accounting, statistical extraction, user job tracking, and so on. The log contains the log recording time, the unique job ID, the job name, and date information. In addition, the log format contains several important pieces of date information related to the job. For example, Submit-Date, Start-Date, and End-Date represent the submission, starting, and ending dates in the format yyyymmddhhmmss. While such date information is provided from the perspective of jobs, there is also resource-utilization information, such as Wait-Time (T_ready) and Execution-Time (T_running). This time information indicates how long jobs were queued waiting for resources and how long they utilized the assigned computing resources. Finally, the log format also provides pieces of status information, such as the number of used CPUs, Exit-Code, Failed-Code, and the number of threads [6].
As mentioned earlier, Tachyon2 uses an exclusive policy that executes only one job per node. Even if a job uses only 4 cores (CPUS), accounting deducts the full 8 cores (E-CPU), the total number of cores installed on the node. Resource usage is calculated over all nodes on which the job has run: CPU USAGE (s), MEM USAGE (KB), and MAX VMEM (KB) are the sums of CPU time, memory usage, and virtual memory usage across all nodes where the job executed. Normally, jobs in HPC are executed using multiple resources, and the scheduler recognizes the events of each processor at the end of the job and logs the status. Most jobs complete normally, but a job can be terminated abnormally for reasons such as forced termination by the user, an error in the job submission script, exceeding the wall-time limit of the job scheduler, software and hardware problems, etc. At the end of the job, the scheduler leaves a signal code for each of these causes [20]. The code is logged in the EXIT-CODE (#19) and FAILED (#20) fields. For OpenMP (Open Multi-Processing) codes, the number of threads is stored in the OMP_NUM_THREADS (#21) field. This study confirms that idle resources can be used effectively when a large-scale job is submitted to Tachyon2. It extracts statistical data and analyzes the patterns of large-scale jobs using thousands of cores or more from the Tachyon2 job log stored in the form of Table 2.
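The exclusive node policy above implies that charged usage is rounded up to whole nodes. A small sketch of that accounting rule, assuming 8 cores per node as in the E-CPU description (the helper name is illustrative):

```python
import math

CORES_PER_NODE = 8  # Tachyon2: 8 cores installed per node

def charged_cores(requested_cores, cores_per_node=CORES_PER_NODE):
    """Under the exclusive node policy, a job is charged for every core of
    each node it occupies, even when it uses only part of a node."""
    nodes = math.ceil(requested_cores / cores_per_node)
    return nodes * cores_per_node

# A 4-core job (CPUS = 4) still consumes one whole node (E-CPU = 8).
four_core_charge = charged_cores(4)
```

This gap between CPUS and E-CPU is exactly the per-node fragmentation that the log analysis later aggregates across the whole system.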

Analysis of Job Execution Logs
This section describes a way to optimize resource utilization through the analysis of large-scale jobs. As mentioned, the Tachyon2 system is composed of 3200 compute nodes with 8 cores per node, for a total of 25,600 cores. The scheduler divides resource groups through queues. In this system, queue groups are largely divided into two types: public queues and exclusive queues [21]. A public queue is used by various researchers who share computational resources through competition, while an exclusive queue is assigned to groups for specific mission-oriented research. The size of the computational resources allocated to the public and exclusive queues can be changed depending on the purpose of the study.
This experiment focused on log data from the public queues used through resource competition and performed a statistical analysis of the data. The first approach extracted, on a monthly basis, the number of large-scale jobs using more than 2000 cores over a two-year period. As shown in Figure 3, the number of large-scale jobs gradually increased, and jobs using more than 4000 cores were concentrated between November 2016 and March 2017. Figure 4 shows the trend of available resources during the same period. All Available Resources (cores) shows the size of the resources allocated to the public queue. As discussed in Section 2.3, the exclusive policy is used in the Tachyon2 system; therefore, the difference between Idle Resources (cores) and Avail Resources (cores) represents the sum of the total number of unused resources and the number of cores of unused nodes. The usage statistics of the Tachyon2 system during the period are as follows.
Statistical results of resource usage (January 2016 to December 2017): as the summary shows, the average waiting time for the public queues reaches 9.3 hours, yet the actual overall node utilization is only about 85%. This means that even though about 15% of the nodes were available, resources could not be used efficiently because of the scheduling load. In other words, as shown in Figures 3 and 4, when a large-scale job is submitted, it waits until the required resources are available. In the meantime, even a small job that could be executed immediately must wait if its priority is low; a job that requires few resources and finishes quickly must still wait until the large-scale job is finished, even when resources are available. The algorithm adopted by default in the scheduler is the First Come First Serve (FCFS) method, in which the task that arrives first is executed first. FCFS is the best way to ensure fairness of job order, but fragmentation occurs as the size of the compute resources increases, limiting their efficient use [22].
All jobs have their priorities. Such priorities are given, in general, based on arrival time. In addition, all jobs have different resource requirements. Therefore, it should be emphasized that actual available resources are not used if a high priority job holds those resources. Moreover, it is also noted that such available resources cannot be fragmented and shared for lower priority jobs. The most extreme way to reduce resource fragmentation is to allocate the shortest job first, which is called the Shortest Job First (SJF) algorithm [23]. Such an algorithm prioritizes small jobs according to fragmented resources. This can improve resource utilization and improve overall performance, but it does not guarantee the fairness of job order. Therefore, it is worth noting that, in the worst case, the algorithm might result in the starvation problem for large-scale jobs. In that sense, the scheduling policy should be used in a way that combines FCFS and SJF considering resource size and job characteristics [24].
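The FCFS/SJF trade-off can be illustrated with a toy single-server model (an intentional simplification of a real cluster) in which all jobs arrive at once, so waiting time depends only on the chosen order; the job mix is illustrative:

```python
# Toy comparison of FCFS and SJF ordering on the same queue.
# Jobs are (name, runtime_hours); all arrive at t = 0 and run one at a
# time, so each job waits for everything scheduled ahead of it.

jobs = [("large", 12), ("small-a", 1), ("small-b", 2)]  # arrival order

def mean_wait(order):
    total, elapsed = 0, 0
    for _, runtime in order:
        total += elapsed          # wait = sum of runtimes ahead of this job
        elapsed += runtime
    return total / len(order)

fcfs = jobs                                  # first come, first served
sjf = sorted(jobs, key=lambda j: j[1])       # shortest job first
```

Under FCFS the large job makes both small jobs wait 12+ hours; under SJF the mean wait drops sharply, but a stream of small arrivals could starve the large job, which is why the combined FCFS-plus-backfilling approach is preferred.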
Figures 3 and 4 show the trend of increasing available resources as the number of large-scale jobs grows towards the end of 2016 and early 2017.
Most job scheduling policies simply use an FCFS method in which jobs submitted first to the queue are executed first. The backfilling algorithm is an excellent method for solving resource fragmentation, but only simple and passive techniques are applied in site-level service environments.
The Nurion system, the next-generation successor to the Tachyon2 system that is the background of this study, uses the Portable Batch System (PBS) as its scheduler [25].
PBS uses the backfill_depth parameter, the number of backfill targets, for the top jobs with the highest priority in the queued job list [26]. That is, PBS backfills only a limited number of lower-priority jobs (backfill_depth) around the highest-priority parent job in the list of held jobs.
In addition, when a specific event occurs, the scheduler updates the resource status and the job profile, which is called a scheduling cycle [27]. The events that trigger a scheduling cycle include the elapse of a certain period of time, job submission and termination, the start of the scheduling server, the application of a new scheduler configuration, and so on. In the batch scheduler, the job size consists of the number of processors and the execution time, where the execution time is estimated by the user, specified in the job script, and then submitted to the scheduler. Therefore, the actual job can often terminate earlier than the expected time. In this case, if a job ends earlier than expected, resource fragmentation occurs again and backfill target jobs (up to backfill_depth) fill the gap. If this situation repeats and small jobs continue to be backfilled, the top job will eventually never run, leading to starvation. In general, to avoid this starvation, production-level systems set the backfill_depth value very low to limit the number of backfill operations; in the case of the Nurion system, the value is set to 3 or less. Because of this restriction, a scheduler that limits the number of backfill jobs through a setting like backfill_depth cannot take full advantage of fragmented resources. Therefore, in this study, the simulation is performed by utilizing all fragmented resources without limiting the number of jobs that can be backfilled, as follows.
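The candidate-limiting effect of backfill_depth described above can be sketched as follows; the queue contents are illustrative, and the unlimited variant corresponds to this study's simulation setting:

```python
# Sketch of how a backfill_depth limit (as described for PBS above)
# restricts which queued jobs are even considered for backfilling.
# The job names are illustrative; real PBS applies this rule inside
# its scheduling cycle.

def backfill_candidates(queued_jobs, backfill_depth=None):
    """Jobs eligible for backfilling around the blocked top job.
    queued_jobs[0] is the highest-priority (blocked) job."""
    lower_priority = queued_jobs[1:]
    if backfill_depth is None:       # this study: no limit on backfills
        return lower_priority
    return lower_priority[:backfill_depth]

queue = ["J0-large", "J1", "J2", "J3", "J4", "J5"]
limited = backfill_candidates(queue, backfill_depth=3)   # Nurion-style cap
unlimited = backfill_candidates(queue)                   # simulated here
```

With the cap, jobs J4 and J5 are never examined even if they would fit into idle resources, which is exactly the lost opportunity the unlimited simulation measures.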

Backfilling Algorithm
Backfilling scheduling is a method of rearranging the job order when a small job cannot run because a relatively large predecessor job is waiting. It aims to improve performance by running small-scale jobs that can execute immediately, while maintaining fairness. To apply this algorithm, the user must specify the size and execution time of each job. In this work, we traced and analyzed the job logs executed by the scheduler. As a result, we identified idle resources that went unused for certain periods and performed an experiment to optimize resource utilization by applying a backfilling algorithm [28].
The conservative backfilling algorithm is the basic version: it adheres to the FCFS scheme, the basic principle of scheduling, while running a later job first when it fits into fragmented resources. In general, batch jobs submitted to the scheduler have attributes for the required resources and the execution time, and the backfilling algorithm uses exactly these two properties. The mechanism relies on two data structures: the first is a list that stores the queued jobs and their execution times, and the second is a profile of the processors (resources) to be used.
This algorithm adds no latency to prior jobs on account of subordinate jobs, but it cannot guarantee the planned sequence of jobs if a preceding job terminates earlier than expected. A more advanced algorithm, called EASY (the Extensible Argonne Scheduling sYstem) backfilling [28,29], addresses this. However, the experiment in this study simulates the performance improvement through backfilling at a fixed time for a large-scale job and therefore excludes the case where a job ends before its expected time. Figure 5 and Algorithm 1 show the conservative backfilling algorithm and how it works, respectively.
The 1st priority queued job does not have enough resources to run, so it is scheduled after the two running jobs (a) and (b) have ended. The 2nd priority queued job finds the available resource point (t1), as indicated by the dotted line in Figure 5. Then, from point t1 to the end of its execution, it checks whether resources remain free. However, at t2 the 1st priority queued job is waiting, and the 2nd priority queued job would delay it. The 2nd priority queued job could start once only one predecessor terminates, but its execution time would delay the 1st priority job, so a new start point must be found. In other words, this mechanism prevents future arrivals from delaying previously pending jobs. Finally, the 3rd priority queued job is backfilled because there is a large enough gap to run it.
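The mechanism above can be sketched as a compact simulation. The discrete time-step profile and the job mix are illustrative simplifications of the event-based profile a real scheduler keeps; because every job receives a reservation when it is scheduled, later jobs can only backfill into gaps that do not delay any earlier job:

```python
# Minimal conservative-backfilling sketch. Each job declares
# (name, requested processors, expected runtime). The resource profile is
# a dict {time_step: free_processors}; real schedulers use event lists.

TOTAL_PROCS = 8   # illustrative machine size
HORIZON = 20      # time steps to plan over

def schedule(jobs):
    """Give each job, in priority order, the earliest start at which its
    processors are free for its whole runtime, then reserve them."""
    free = {t: TOTAL_PROCS for t in range(HORIZON)}
    starts = {}
    for name, procs, runtime in jobs:
        for start in range(HORIZON - runtime + 1):
            if all(free[t] >= procs for t in range(start, start + runtime)):
                for t in range(start, start + runtime):
                    free[t] -= procs           # reserve the gap
                starts[name] = start
                break
    return starts

# Priority order echoing Figure 5: a wide job blocks, small jobs fill gaps.
jobs = [
    ("running-a", 4, 3),  # already running
    ("running-b", 4, 5),  # already running
    ("queued-1", 6, 4),   # wide: must wait for both running jobs to end
    ("queued-2", 2, 8),   # fits alongside queued-1's reservation
    ("queued-3", 2, 2),   # backfills into the gap before queued-1 starts
]
starts = schedule(jobs)
```

Because queued-1 is reserved before the smaller jobs are placed, neither queued-2 nor queued-3 can push back its start, which is the defining property of the conservative variant.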

Simulation and Results
In our experiment, we simulated certain large-scale jobs that were run on Tachyon2. This system has a scheduling formula based on job size weighting, but the backfilling algorithm is not applied. Figure 6 shows the scenario of the conservative backfilling algorithm applied to Tachyon2. As discussed, a queued job has two properties: its resource and time requirements [30]. Here, it is assumed that a smaller number indicates a higher priority. If large-scale job #0 is inserted into the queue list, it must wait until suitable resources are available. Conservative backfilling can execute lower-priority jobs (boxes #1~#6) first during the waiting time of job #0. Job #6 has a lower priority than #5, but it runs first because its resources become available first and it does not delay predecessor #5.
Figure 6. Applying backfilling to available resources before the large-scale job.
As shown in the graphs of Figures 3 and 4, the simulation was performed by selecting the job that was the main cause of the increase in the number of available nodes due to large-scale operation. This large-scale job required 2048 nodes (16,384 cores, 8 cores/node) and ran after waiting about 40 hours and 16 minutes. During the waiting time, compute nodes are emptied to match the requested resources of the job, as shown in Figure 7. As a result, the latency of the entire set of subordinate jobs increases (Job_tat). If a low-priority small job leapfrogs into the resources that are emptied during the waiting time, those resources can be used efficiently. When a job is submitted to the scheduler, it has a unique job ID, submission time, resource requirements, execution state, and queue status, as shown in Figure 8a. The queue status indicates whether the job is currently waiting (qw), running (r), held (h), or terminating (e).
This information can be checked with a command such as qstat provided by the scheduler. The simulation procedure is performed as follows: (1) extract the subordinate jobs that do not exceed 2048 nodes between submission time and start time (waiting time, T_ready), as shown in Figure 8a; (2) perform backfilling on the extracted jobs, as shown in Figure 8b.
Each timestamp specifies the available idle processors (total number of cores); (3) update the resource profile: scan the profile (requested resources) and find the first point with enough available processors, which is done for each job to be backfilled in step (1); starting from this point, continue scanning to confirm that the requested resources remain available until the job is expected to terminate. Figure 8 shows the updated resource profile. Conservative backfill does not delay the start time of predecessors; therefore, job 3029155 is excluded because it ends after the start time of the target job. Moreover, job 2019182 runs out of resources due to the preceding backfill operations.
The calculation process is shown in Appendix A.
Figure 8. (a) Extraction of backfilling-enabled jobs; (b) job profile (resource, time) update and exception handling of out-of-resource jobs.

Resource efficiency can be obtained from Equations (3)-(5). A backfilling job has two properties, resource usage (P) and runtime (T), as shown in Figure 5. The resource efficiency achieved by backfill scheduling is measured as follows:

Backfilling jobs: B_1(P_1, T_1), ..., B_n(P_n, T_n)

R = (128 cores × 12 hours) × 7 jobs + (4,096 cores × 12 hours) × 2 jobs + (512 cores × 12 hours) × 2 jobs = 72,192 seconds (reduced overall job execution time)

Backfill scheduling recognizes the status of jobs and resources at every scheduling cycle event covered in Section 2.4 and applies the scheduling policy. This is because, as mentioned above, the status of a job changes continuously depending on the user's intention or the state of the system. In this study, the simulation was performed while reflecting the job profile at each timestamp. The target job of the experiment was shortened by about 20 hours through backfill compared to FCFS. If this approach is applied to other similar large-scale jobs, the resource efficiency becomes much higher and the overall job execution time (Job_tat) can be greatly reduced.
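The P × T tally over the backfilled job groups can be checked with a few lines of arithmetic. This only tabulates the total backfilled work in core-hours from the expression above; the grouping into (cores, hours, job count) tuples is an assumption made for the example.

```python
# Backfilled job groups from the expression above: (cores, hours, count).
groups = [(128, 12, 7), (4096, 12, 2), (512, 12, 2)]

# Total backfilled work: sum of P_i * T_i over all jobs, in core-hours.
# This is otherwise-idle capacity reclaimed while the target job waits.
total_core_hours = sum(p * t * n for p, t, n in groups)
print(total_core_hours)  # 121344 core-hours
```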

Conclusions and Future Work
Parallel computing is a suitable solution for solving large-scale problems. The main goal of this study is to optimize the order of jobs executed in the Tachyon2 system so that the entire pool of available resources is used efficiently. This experiment analyzed the actual logs of job execution using the converted log of the Tachyon2 system. Note that the average utilization of the Tachyon2 system is about 85.6% and the average wait time is about 9.3 hours. This can be seen as fragmentation of resources that occurs during scheduling. Of course, it is not possible to use all resources perfectly with respect to both usage and execution time. However, it is possible to minimize the resource fragmentation occurring during job scheduling and to make resource utilization more efficient by analyzing the jobs performed on the computational resources. Such utilization can be improved further by studying and applying appropriate scheduling algorithms. One way is to analyze the statistics of large-scale jobs, which are gradually increasing in the Tachyon2 system, to grasp the users' success rate, execution time, and resource size. Based on the statistical results, the backfill scheduling algorithm was studied and simulated to reduce resource fragmentation in Tachyon2. This study effectively used available resources and reduced turnaround time for backfilled jobs. In the end, we improved the performance of the overall scheduler. This paper focuses on the simulation of a backfill scheduling algorithm that analyzes the supercomputer's job statistics and targets cases of inefficient resource use. Although it is useful to understand how to improve the utilization of an actively working system, as future work we need to apply various scheduling algorithms to the current system and analyze the job execution history continuously and repeatedly.
In that sense, research is needed to apply advanced algorithms that react when a job finishes earlier than expected or when available resources are suddenly added. Such research will optimize resource utilization and reduce overall user latency. In this experiment, our main target system was the Tachyon2 supercomputer. As of writing, KISTI's next-generation 5th supercomputer, Nurion, consists of 8400 compute nodes based on many-core architectures, and its production service was launched in December 2018. Therefore, it is worth evaluating newly delivered supercomputers with the same approach to find resource underutilization problems, ultimately providing more resources to scientists for better and faster scientific outcomes.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Jobs with a lower priority than the large job being simulated in the queue are listed in Figure 8a. These jobs are the targets for backfilling and must not delay the start time of the original job. This is the basic principle of the conservative algorithm. Each job has its own resource requirements, such as the number of processes and the execution time. The scheduler calculates resource usage at regular intervals (timestamps), and this calculation is performed at each scheduling cycle mentioned in the previous text. The Idle Resources column in Figure A1 shows the total resources available as resources are allocated or released by a job.
For example, when the first backfilling job (Job ID: 3028615) requests 128 cores, it is allocated resources from its start time to its end time and Idle Resources is updated accordingly. In the case of the backfilling job with Job ID 3029154, the job is submitted at 21:42:04 and requests 4096 cores, but, at that point, there are not enough resources. After about 2 hours, the available resources are released at 00:00, but this job cannot be backfilled because its 12-hour execution time would delay the original job.
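The admissibility check that rejects job 3029154 can be sketched as a single timestamp comparison. The clock times follow the Appendix A narrative, but the original job's reserved start time used below is a hypothetical value chosen for illustration; the function name is likewise an assumption.

```python
from datetime import datetime, timedelta

def can_backfill(earliest_start, runtime_hours, original_start):
    """Conservative rule: a backfill candidate is admissible only if it
    finishes no later than the original (large) job's reserved start."""
    return earliest_start + timedelta(hours=runtime_hours) <= original_start

# Job 3029154's 4096 cores become available at midnight, but its 12-hour
# runtime would push its finish past the original job's (hypothetical)
# reserved start, so it is rejected.
release_time = datetime(2020, 1, 2, 0, 0)
reserved_start = datetime(2020, 1, 2, 11, 0)  # assumed for illustration
print(can_backfill(release_time, 12, reserved_start))  # False
```

A shorter candidate released at the same moment (e.g. a 10-hour job) would pass the same check, which is exactly how the smaller jobs in Figure 8a fill the idle window.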
