An Efficient Approach to Consolidating Job Schedulers in Traditional Independent Scientific Workflows

The current research paradigm is data-driven. Researchers increasingly deploy computing facilities to produce and analyze large amounts of data, and as requirements for computing power grow, data processing on traditional workstations comes under constant pressure for efficient resource management. In such an environment, tremendous amounts of data are processed in parallel to obtain research results efficiently and effectively. HTCondor, for example, provides computing power for researchers' data analysis. Although such systems work well in a traditional computing cluster environment, an efficient methodology is needed to meet ever-increasing computing demands with limited resources. In this paper, we propose an approach to integrating clusters so that they share their computing power on the basis of a priority policy. Our approach makes it possible to share worker nodes while preserving the resources allocated to each group. In addition, we analyzed historical job data to examine the problems that occurred during job execution due to resource sharing, as well as the actual operating results. Our findings can provide a reasonable guideline for limited computing power shared by multiple scientific groups.


Introduction
Research methodology has shifted from traditional observational methods to a data-driven research paradigm, called the fourth-generation research paradigm [1]. Scientific research has progressed from the first-generation paradigm, which describes natural phenomena through observation, through the second generation, based on modeling and generalization, to the third-generation paradigm, which uses computer simulation [2]. Recently, a data-driven research paradigm has emerged that analyzes and makes discoveries using the vast amounts of data produced by large research equipment [3,4]. Efficiently analyzing this vast amount of data requires massive computing power [5,6]. In general, computer clustering technologies are used to provide computing resources for groups of various sizes, from small laboratories to data centers; this is called high-throughput computing (HTC) [7]. A job management program, such as HTCondor [8][9][10], PBSPro [11,12], Slurm [13,14], or Torque [15,16], is required to batch jobs and manage queues so that an HTC system can be used more efficiently. HTCondor is an open-source program that was first released in 1988 (under the name Condor [10]) and has an active open-source community.
We currently operate a data center to support researchers in basic science [17]. Not only do new researchers continue to ask for support, but the resource demand submitted annually by existing researchers also continues to grow [18]. Figure 1 shows that resource demand increases steadily by about 2000-3000 cores each year [19]; the total demand will grow from 3900 cores in 2019 to 13,700 cores within four years. This trend is not confined to domestic research groups but is also observed internationally [20]. However, because the amount of equipment that can be introduced on a limited budget is itself limited, it is impossible to meet the requirements of all researchers. Since users' requests reflect peak demand at specific points in time, and the utilization rate of each cluster is not always high, we integrated the dedicated clusters of the research groups into one in order to satisfy demand as far as possible. First, we chose to modify the existing system rather than deploy a new one for the management of the integrated resources [21]. Accordingly, we decided to share worker nodes using the existing HTCondor installations while limiting the quota of each group. In this case, it is difficult to keep the information on newly added or deleted users and groups consistent. To solve this problem, we separated the job submission nodes: user access and user group definitions are based on the submission node. We also considered how occasionally idle resources can be used effectively while the minimum resources allocated to each group are maintained [22].
In this paper, we discuss how shared resources can be used simultaneously while the minimum allocated resources are maintained for each group. We also analyzed users' job histories to examine changes in job processing patterns, as well as the TimeLoss incurred according to job characteristics.

Job History Information
We extracted the job history information through HTCondor's condor_history command. The users' job histories were extracted and analyzed for April to September 2018, before the integration, and for the same period of 2019, after the integration. The attributes used for job categorization are AcctGroup, AcctGroupUser, and CMD; those used for job characteristics are CommittedTime, CumulativeSlotTime, JobCurrentStartDate, JobCurrentStartExecutingDate, JobStartDate, NumJobStarts, and QDate. The meaning of each attribute is summarized in Table 1 [23].

Table 1. Meaning of the job history attributes used in the analysis.

AcctGroup: Group information for submitted jobs
AcctGroupUser: User information for submitted jobs
CMD: Executed command
CommittedTime: The number of seconds of wall clock time that the job has been allocated to a machine
CumulativeSlotTime: Cumulative number of seconds the job has been allocated to a machine
JobCurrentStartDate: Time at which the job most recently began running
JobCurrentStartExecutingDate: Time at which the job most recently began executing
JobStartDate: Time at which the job first began running
NumJobStarts: An integer count of the number of times the job started executing
QDate: Time at which the job was submitted to the job queue
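For reference, the extraction step can be scripted. The following is a minimal sketch in Python, assuming an HTCondor version whose condor_history supports tab-separated autoformat output (-af:t) and that pandas is available; it is an illustration, not the exact tooling used in this work.

    import subprocess
    import pandas as pd

    # Attributes from Table 1; ClassAd attribute names are case-insensitive.
    ATTRS = ["AcctGroup", "AcctGroupUser", "CMD",
             "CommittedTime", "CumulativeSlotTime",
             "JobCurrentStartDate", "JobCurrentStartExecutingDate",
             "JobStartDate", "NumJobStarts", "QDate"]

    def dump_history() -> pd.DataFrame:
        """Extract the job history as one row per finished job."""
        out = subprocess.run(["condor_history", "-af:t"] + ATTRS,
                             capture_output=True, text=True, check=True)
        rows = [line.split("\t") for line in out.stdout.splitlines()]
        # condor_history prints "undefined" for attributes a job never had.
        return pd.DataFrame(rows, columns=ATTRS)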

Status before Integration
Before the integration, each experimental group used dedicated computing resources and could only use the resources assigned to it. For this reason, the congestion level of a cluster varied with the schedules of its group members, which follow events such as conferences and workshops. In other words, the cluster was crowded at certain times but idle most of the time. The trend of job submissions for each group is shown in Figure 2. As shown in Figure 2, Group A actively submitted jobs in mid-April and June, while Group B did so in early April. It is important to note that there is a time mismatch in the utilization of the computing resources dedicated to the two groups. In the extracted job records, we classified jobs whose submission time differed from their start time, assuming the difference was longer than the scheduling cycle reported by the job manager; in the HTCondor job manager, this cycle is called the matchmaking cycle [24]. Such jobs could not run immediately because no slots were available when they were submitted. Of the jobs submitted by Group A, 15.99% accumulated in the job queue, as did 55.46% of the jobs submitted by Group B. As shown in Figure 2a,b, the number of submitted jobs is not proportional to the number of waiting jobs, because the submissions did not overlap within a short time.
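A minimal sketch of this classification, assuming the DataFrame produced by the extraction sketch above; the 60 s threshold is our assumption, matching HTCondor's default NEGOTIATOR_INTERVAL, rather than a value stated in the paper.

    import pandas as pd

    MATCHMAKING_CYCLE = 60  # seconds; HTCondor's default NEGOTIATOR_INTERVAL

    def queued_ratio(df: pd.DataFrame) -> float:
        """Fraction of jobs that waited in the queue longer than one cycle.

        QDate and JobStartDate are epoch seconds taken from condor_history.
        """
        wait = df["JobStartDate"].astype(int) - df["QDate"].astype(int)
        return float((wait > MATCHMAKING_CYCLE).mean())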
Long overlap periods reduce the benefits of cluster consolidation. According to the 2018 job histories, queued jobs overlapped in time for only 3.37% of the total for Group A and 0.29% of the total for Group B; in fact, jobs rarely overlapped between the two groups. The overlapping periods can be seen in Figure 3. Therefore, we expected high efficiency when the two groups were integrated. Despite this situation, individual users perceive a resource shortage as demand increases and want to increase the size of the overall cluster. During this period, the utilization rate of Group A was 10.24% and that of Group B was 20.40%, so the utilization rate of the entire cluster was not high. It is therefore difficult to guarantee that overall cluster utilization would increase with cluster size; in other words, if we provided more resources to each experimental group, resource utilization would remain similar. Integrating the two groups' clusters provides a way to meet the needs of users while increasing utilization across the clusters.

Configuration of the Integrated Cluster
The clusters targeted for integration belong to experimental groups in the same field, so they have similar properties and provide similar functionality in terms of analysis tools and individual jobs. Because the analysis environments are similar, no extra work was required to share worker nodes. The integrated cluster consists of independent job submission nodes, a common job management node, and shared worker nodes. The two groups to be consolidated owned worker nodes with 400 cores and 1656 cores, respectively. To integrate the two groups, we set each group's minimum guaranteed quota to the number of cores of the worker nodes it owned. Storage, unlike the computing resources, was configured identically and made accessible from all worker nodes. The problem caused by different OS versions before the integration was solved by introducing Singularity, a Linux container program [25][26][27]. Figure 4 shows a schematic of the integrated cluster.
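As an illustration of this container approach, a job can carry its expected OS environment by wrapping its executable in a Singularity image. The submit description below is a hypothetical sketch: the image path and analysis script are assumptions, and sites using HTCondor's built-in Singularity integration could configure the image centrally instead.

    # job.sub -- hypothetical submit description; image path and script are assumed
    universe            = vanilla
    executable          = /usr/bin/singularity
    transfer_executable = false
    # Run the analysis inside an image providing the OS the job expects.
    arguments           = exec /path/to/old-os.sif ./analysis.sh
    accounting_group    = group_a
    request_cpus        = 1
    output              = job.out
    error               = job.err
    log                 = job.log
    queue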

Configuration of Submission Node
Because each group's job submission nodes are separated, we can identify a user's group by the submission node. Group authentication was therefore based on the UI server from which the user submitted, not on information such as the user ID. By default, HTCondor does not restrict the group information in a user's submission, so we added settings on each user access node that reject jobs submitted under another group's name, such as GROUP_NAMES = group_a, SUBMIT_REQUIREMENT_NAMES = GROUP, and SUBMIT_REQUIREMENT_GROUP = (AcctGroup =?= "group_a"). These settings force users to specify a valid group name for their jobs. If a user submits a job with the wrong group name, the job manager rejects the job and reports the correct name to the user, for example SUBMIT_REQUIREMENT_GROUP_REASON = "Wrong accounting group. Your group is group_a".
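Collected in one place, the submission-node settings quoted above might look like the following sketch for Group A's node (the comments are our reading of the settings). With this in effect, a user's submit description must state accounting_group = group_a, which sets the job's AcctGroup attribute.

    # Local configuration on Group A's submission node (sketch of the
    # settings quoted in the text)
    GROUP_NAMES = group_a

    # Refuse jobs whose accounting group is not group_a, and tell the
    # user the correct name.
    SUBMIT_REQUIREMENT_NAMES = GROUP
    SUBMIT_REQUIREMENT_GROUP = (AcctGroup =?= "group_a")
    SUBMIT_REQUIREMENT_GROUP_REASON = "Wrong accounting group. Your group is group_a"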

Configuration of the Job Management Node
On the job management node, a job is assigned to a worker node by matching the requirements of the submitted job against the information of the worker nodes. Group information and the quota of each group were set on this node: GROUP_NAMES = group_a, group_b, group_etc, GROUP_QUOTA_group_a = 400, GROUP_QUOTA_group_b = 1656, and GROUP_QUOTA_group_etc = 80. We added a setting that lets a group exceed its quota when the other groups have spare slots (GROUP_ACCEPT_SURPLUS = True), as well as settings that preempt a slot when a group uses more than its own quota while the requesting group is still under its quota (PREEMPT, NEGOTIATOR_CONSIDER_PREEMPTION, and PREEMPTION_REQUIREMENTS; see the sketch below). The question of which slot to preempt then arises. This is controlled by PREEMPTION_RANK = 2592000 - ifThenElse(isUndefined(TotalJobRuntime), 0, TotalJobRuntime), which selects the most recently started job for preemption. These settings allow specific jobs to run beyond their group's quota: when a user of another group requests a slot, a job running over quota is terminated, and the returned slot is assigned to the requesting group. Users may feel that they lose time in this case, but nothing is really lost, because without resource consolidation the job would have been waiting in the queue.
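Collected in one place, the central manager settings described above might look like the following sketch (the quota values and expressions are those quoted in the text; the comments are our reading of them):

    # Central manager (negotiator) configuration sketch
    GROUP_NAMES = group_a, group_b, group_etc

    # Minimum guaranteed quotas: the cores each group contributed.
    GROUP_QUOTA_group_a   = 400
    GROUP_QUOTA_group_b   = 1656
    GROUP_QUOTA_group_etc = 80

    # Let a group run past its quota when other groups have idle slots.
    GROUP_ACCEPT_SURPLUS = True

    # Preempt only jobs of a group over its quota, and only on behalf of
    # a group still under its quota.
    PREEMPT = True
    NEGOTIATOR_CONSIDER_PREEMPTION = True
    PREEMPTION_REQUIREMENTS = True
    PREEMPTION_REQUIREMENTS = $(PREEMPTION_REQUIREMENTS) && \
        ((SubmitterGroupResourcesInUse < SubmitterGroupQuota) && \
         (RemoteGroupResourcesInUse > RemoteGroupQuota))

    # Among candidates, the highest rank is preempted first; subtracting the
    # job runtime from 2,592,000 s (30 days) selects the most recent job.
    PREEMPTION_RANK = 2592000 - ifThenElse(isUndefined(TotalJobRuntime), 0, TotalJobRuntime)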

Characterization of Jobs before and after Integration
We found changes in job processing patterns between the two periods through job statistics before and after the integration. The statistical information is shown in Table 2.

Job Submission Trends
In the same way as for the 2018 job statistics, we checked the trend of jobs submitted by each group in 2019 and the degree of overlap between the two groups. Comparing Figures 2 and 6 shows that the timing of job submissions changed. Similarly, we compared the trends of queued jobs: for Group A, the ratio of jobs queued during submission increased to 32.97%, while for Group B it decreased to 46.98%. The overlapping intervals of the jobs accumulated in the queue are shown in Figure 7: 15.79% of the total for Group A and 1.68% of the total for Group B, both higher than before the integration. This appears to be a natural change due to the increase and decrease in the amount of job processing and the wall time of Groups A and B. In particular, users in Group A submitted larger jobs than in 2018 as the total number of slots grew. In fact, the utilization rate of Group A increased significantly to 70.36%, while that of Group B decreased to 15.33%. In conclusion, the utilization rate of the entire cluster was 25.06%, compared with 16.64% before the integration; despite the side effects of integration, utilization was higher than before.

Characteristics of Preempted Jobs
When a slot is preempted, the job assigned to the slot has to be restarted once an idle slot becomes available. Therefore, jobs with a NumJobStarts value of 2 or more were judged to have been preempted, and their characteristics were examined. The distribution of NumJobStarts used in the analysis is shown in Figure 8. The preempted job ratios were 17.10% in Group A and 0.76% in Group B; we attribute the difference between the two ratios to the roughly fourfold difference in the groups' sizes. We also looked at preempted jobs and the probability of preemption as a function of committed time; the result is shown in Figure 9. In both groups, the longer the execution time, the higher the probability of being preempted. However, Group A maintained a high probability, while Group B tended to have no jobs preempted after a certain period. Redrawing the probability graph of Figure 10 up to the 99th percentile of committed time yields Figure 11: both groups grow roughly linearly, but Group A grows faster than Group B.
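A minimal sketch of this judgment, again assuming the job-history DataFrame from the extraction sketch and the Table 1 column names:

    import pandas as pd

    def preemption_stats(df: pd.DataFrame):
        """Preempted-job ratio and the NumJobStarts distribution (cf. Figure 8).

        A job that started two or more times is judged to have been preempted.
        """
        starts = df["NumJobStarts"].astype(int)
        ratio = float((starts >= 2).mean())
        distribution = starts.value_counts().sort_index()
        return ratio, distribution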

Time Loss due to Preemption
If a slot is preempted, that is, reclaimed because a job was running in excess of its group's quota, the running job is canceled and its time is lost. We calculated these lost times to measure the inefficiency of preemption. As Figure 12 shows, most of the time lost in the preempted slots is small. For Group A, the mean is 19,003 s, and the quartiles are 502, 3149, and 10,363 s. For Group B, the mean is 419 s, and the quartiles are 1, 2, and 25 s.
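The paper does not spell out how TimeLoss is computed; a plausible reconstruction from the Table 1 attributes is the wall clock time a job accumulated beyond what its final run committed, i.e., CumulativeSlotTime minus CommittedTime. The sketch below rests entirely on that assumption:

    import pandas as pd

    def time_loss(df: pd.DataFrame) -> pd.Series:
        """Seconds of slot time lost to preemption for each preempted job.

        Assumption: TimeLoss = CumulativeSlotTime - CommittedTime, i.e., total
        time the job held slots minus the committed wall clock time.
        """
        preempted = df[df["NumJobStarts"].astype(int) >= 2]
        return (preempted["CumulativeSlotTime"].astype(float)
                - preempted["CommittedTime"].astype(float))

    # time_loss(df).describe() gives the mean and quartiles reported above.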

Conclusions
As scientific discoveries through data analysis become the mainstream of research, there is a need for the efficient use of available resources that are currently distributed and used independently. In this study, we presented a practical example of sharing the computing systems used by different experimental groups. In the process of integrating resources, however, the obtainable benefit differed with the size of each group's resources. When two groups with different resources are integrated, the small group benefits from having more resources than before, while the larger group, as expected, sees little benefit. In practice, the small group benefited by up to 527.0% in resource utilization, while the large group benefited by only 127.4% and thus saw no significant benefit. Meanwhile, the probability of a job being preempted and the time loss caused by preemption were greater in the small group than in the large group. Ultimately, all groups obtain a reasonable benefit when resources are consolidated. These benefits will contribute to academic development by increasing the utilization of resources by researchers; from the viewpoint of computing resources, the same resources can be used more efficiently while meeting users' requirements. In addition, the results discussed in this paper may provide a basis for persuading groups of different sizes to consolidate. In the future, we will integrate groups of various sizes and further study the benefits and effects for each group.