# A GPU Scheduling Framework to Accelerate Hyper-Parameter Optimization in Deep Learning Clusters


## Abstract


## 1. Introduction

- Parallelization of the hyper-parameter optimization process: Hermes parallelizes hyper-parameter optimization by time-sharing the GPU between containers running DL jobs. Container preemption is implemented using the model checkpointing feature of TensorFlow [13].
- Convergence-aware scheduling policy: Hermes accelerates hyper-parameter optimization by prioritizing jobs based on their convergence speed. This sharply contrasts with Gandiva’s approach [7]: instead of training all tasks equally, Hermes selects and accelerates the important ones.
- No prior knowledge or modification of user code: In contrast to previous works, Hermes needs no prior knowledge about the jobs, such as the job completion time (JCT) distribution, nor does it attempt to predict the JCT of the jobs. Furthermore, Hermes requires no modification to user code; all changes are transparent to users.
- Real implementation: We have implemented Hermes over Kubernetes [6], one of the most popular open-source platforms for container orchestration.
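To make the first contribution concrete, the sketch below illustrates checkpoint-based preemption with a hypothetical `PreemptibleJob` class. The class name, fields, and JSON checkpoint format are assumptions for illustration; Hermes itself relies on TensorFlow's model checkpointing inside containers.

```python
import json
import tempfile
from pathlib import Path


class PreemptibleJob:
    """Sketch of checkpoint-based preemption (hypothetical class; Hermes
    itself uses TensorFlow's model checkpointing inside containers)."""

    def __init__(self, job_id, ckpt_dir=None):
        self.job_id = job_id
        self.step = 0  # training progress measured in iterations
        base = Path(ckpt_dir or tempfile.gettempdir())
        self.ckpt = base / f"hermes-{job_id}.json"

    def train(self, n_steps):
        # Placeholder for running n_steps training iterations on the GPU.
        self.step += n_steps

    def preempt(self):
        # Persist the minimal state needed to resume, then yield the GPU.
        self.ckpt.write_text(json.dumps({"step": self.step}))

    def resume(self):
        # Restore progress from the latest checkpoint, if one exists.
        if self.ckpt.exists():
            self.step = json.loads(self.ckpt.read_text())["step"]
```

A preempted job can thus be resumed later in a fresh container at the same training step, which is what makes GPU time-sharing between DL jobs practical.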

## 2. Background and Motivation

#### 2.1. Training of Deep Learning Models

#### 2.1.1. Overview of Deep Learning Training

#### 2.1.2. Hyper-Parameter Optimization

#### 2.1.3. Grid Search

#### 2.1.4. Random Search

#### 2.1.5. Bayesian Optimization

#### 2.2. Motivation

## 3. Design and Implementation

#### 3.1. Overall Architecture of Hermes

#### 3.2. Global Scheduler

**Algorithm 1** Placement Algorithm

```
Input: GPUs 𝔾, Jobs 𝕁
for job J ∈ 𝕁 do
    G_cand ← Find_Available_GPU(𝔾)
    if G_cand ≠ null then
        Initialize J
        Enqueue J to G_cand
    end if
end for
```

**Algorithm 2** Find_Available_GPU

```
Input: GPUs 𝔾
Output: G_cand
G_thres ← 4            // G_thres: job count threshold for each GPU
G_cand ← null
for GPU G ∈ 𝔾 do
    if # of jobs in G < G_thres then
        if G_cand is null or # of jobs in G < # of jobs in G_cand then
            G_cand ← G
        end if
    end if
end for
return G_cand
```
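The two routines above can be sketched in Python as follows. Container and Kubernetes details are omitted, and function and variable names are illustrative; each job is placed on the least-loaded GPU, subject to a per-GPU cap of four concurrent jobs.

```python
# Per-GPU cap on concurrent jobs (G_thres in Algorithm 2).
G_THRES = 4


def find_available_gpu(gpus):
    """`gpus` maps a GPU id to its list of resident jobs. Returns the id
    of the GPU with the fewest jobs, or None if every GPU is at the cap."""
    cand = None
    for gpu_id, jobs in gpus.items():
        if len(jobs) < G_THRES:
            if cand is None or len(jobs) < len(gpus[cand]):
                cand = gpu_id
    return cand


def place_jobs(gpus, pending):
    """Greedy placement loop of Algorithm 1."""
    for job in pending:
        cand = find_available_gpu(gpus)
        if cand is not None:
            gpus[cand].append(job)  # initialize J and enqueue it to G_cand
```

For example, with `{"g0": ["a", "b"], "g1": ["c"]}`, a new job is placed on `g1` because it currently holds the fewest jobs.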

#### 3.3. Node Scheduler

**Algorithm 3** Convergence-Aware Scheduling Algorithm

```
Input: Iteration i, Convergences ℂ, GPUs 𝔾
for G ∈ 𝔾 do
    J_C ← currently running job in G
    Q_W ← waiting job queue of G
    𝕁 ← jobs in GPU G
    if 𝕁 = ∅ then
        continue
    end if
    for J ∈ 𝕁 do
        if J is WAITING then
            Enqueue J to Q_W
        end if
    end for
    J_sched ← argmax_{J ∈ Q_W} C_J(i)
    if J_sched needs preemption then
        if J_C is PREEMPTIBLE then
            Preempt J_C
        end if
    end if
    Schedule J_sched
end for
```
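A minimal Python sketch of this scheduling loop follows. The job dictionaries and their field names (`state`, `preemptible`, `running`) are assumptions for illustration, not the Hermes data model; `convergence` maps each job id to its convergence function C_J(i).

```python
def schedule_tick(gpus, convergence, i):
    """At iteration i, each GPU runs the waiting job with the highest
    convergence score C_J(i), preempting the current job if allowed."""
    for gpu in gpus:
        jobs = gpu["jobs"]
        if not jobs:
            continue  # idle GPU: nothing to schedule
        # Build the waiting queue Q_W.
        q_w = [j for j in jobs if j["state"] == "WAITING"]
        if not q_w:
            continue
        # J_sched <- argmax over Q_W of C_J(i).
        j_sched = max(q_w, key=lambda j: convergence[j["id"]](i))
        current = gpu.get("running")
        if current is not None:
            if not current["preemptible"]:
                continue  # cannot preempt the running job J_C
            current["state"] = "WAITING"  # checkpoint and preempt J_C
        j_sched["state"] = "RUNNING"
        gpu["running"] = j_sched
```

The key design point this illustrates is that the scheduler consults per-job convergence feedback on every tick, so a slowly converging job naturally yields the GPU to a faster-converging one.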

#### 3.4. Preemption Module

## 4. Performance Evaluation

#### 4.1. Experiment Setup

#### 4.1.1. Testbed

#### 4.1.2. Workloads

#### 4.1.3. Baselines

#### 4.2. Hyper-Parameter Optimization Speed

#### 4.3. Overhead Analysis

## 5. Related Work

#### 5.1. Deep Learning Scheduling Frameworks

#### 5.2. Hyper-Parameter Optimization Frameworks

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Gu, J.; Chowdhury, M.; Shin, K.G.; Zhu, Y.; Jeon, M.; Qian, J.; Liu, H.; Guo, C. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19); USENIX Association: Boston, MA, USA, 2019; pp. 485–500.
- Hertel, L.; Collado, J.; Sadowski, P.; Ott, J.; Baldi, P. Sherpa: Robust hyperparameter optimization for machine learning. *SoftwareX* **2020**, *12*, 100591.
- Domhan, T.; Springenberg, J.T.; Hutter, F. Speeding up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In Proceedings of the 24th International Conference on Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2015; pp. 3460–3468.
- Vavilapalli, V.K.; Seth, S.; Saha, B.; Curino, C.; O’Malley, O.; Radia, S.; Reed, B.; Baldeschwieler, E.; Murthy, A.C.; Douglas, C.; et al. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (SoCC ’13); ACM Press: Santa Clara, CA, USA, 2013; pp. 1–16.
- Hindman, B.; Konwinski, A.; Zaharia, M.; Ghodsi, A.; Joseph, A.D.; Katz, R.; Shenker, S.; Stoica, I. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation; USENIX Association: Berkeley, CA, USA, 2011; pp. 295–308.
- Cloud Native Computing Foundation. Kubernetes. Available online: https://kubernetes.io (accessed on 1 December 2020).
- Xiao, W.; Bhardwaj, R.; Ramjee, R.; Sivathanu, M.; Kwatra, N.; Han, Z.; Patel, P.; Peng, X.; Zhao, H.; Zhang, Q.; et al. Gandiva: Introspective Cluster Scheduling for Deep Learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18); USENIX Association: Carlsbad, CA, USA, 2018; pp. 595–610.
- Peng, Y.; Bao, Y.; Chen, Y.; Wu, C.; Guo, C. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters. In Proceedings of the Thirteenth EuroSys Conference (EuroSys ’18); ACM Press: Porto, Portugal, 2018; pp. 1–14.
- Zheng, H.; Xu, F.; Chen, L.; Zhou, Z.; Liu, F. Cynthia: Cost-Efficient Cloud Resource Provisioning for Predictable Distributed Deep Neural Network Training. In Proceedings of the 48th International Conference on Parallel Processing (ICPP 2019); ACM Press: Kyoto, Japan, 2019; pp. 1–11.
- Zheng, W.; Tynes, M.; Gorelick, H.; Mao, Y.; Cheng, L.; Hou, Y. FlowCon: Elastic Flow Configuration for Containerized Deep Learning Applications. In Proceedings of the 48th International Conference on Parallel Processing (ICPP 2019); ACM Press: Kyoto, Japan, 2019; pp. 1–10.
- Mahajan, K.; Balasubramanian, A.; Singhvi, A.; Venkataraman, S.; Akella, A.; Phanishayee, A.; Chawla, S. Themis: Fair and Efficient GPU Cluster Scheduling. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20); USENIX Association: Santa Clara, CA, USA, 2020; pp. 289–304.
- Zhang, H.; Stafman, L.; Or, A.; Freedman, M.J. SLAQ: Quality-Driven Scheduling for Distributed Machine Learning. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC ’17); ACM Press: Santa Clara, CA, USA, 2017; pp. 390–404.
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: tensorflow.org (accessed on 29 January 2021).
- TensorFlow. TensorFlow Benchmark. Available online: https://github.com/tensorflow/benchmarks (accessed on 1 December 2020).
- Robbins, H.; Monro, S. A Stochastic Approximation Method. *Ann. Math. Statist.* **1951**, *22*, 400–407.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. Available online: https://openreview.net/forum?id=8gmWwjFyLj (accessed on 29 January 2021).
- Shallue, C.J.; Lee, J.; Antognini, J.; Sohl-Dickstein, J.; Frostig, R.; Dahl, G.E. Measuring the Effects of Data Parallelism on Neural Network Training. *J. Mach. Learn. Res.* **2019**, *20*, 1–49.
- Hinton, G.E. A Practical Guide to Training Restricted Boltzmann Machines. In *Neural Networks: Tricks of the Trade*, 2nd ed.; Montavon, G., Orr, G.B., Müller, K.R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 599–619.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Müller, A.; Nothman, J.; Louppe, G.; et al. Scikit-learn: Machine Learning in Python. arXiv **2018**, arXiv:1201.0490.
- Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. *J. Mach. Learn. Res.* **2012**, *13*, 281–305.
- Hutter, F.; Hoos, H.H.; Leyton-Brown, K. Sequential Model-Based Optimization for General Algorithm Configuration. In *Learning and Intelligent Optimization*; Coello, C.A.C., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 507–523.
- Bergstra, J.S.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for Hyper-Parameter Optimization. In *Advances in Neural Information Processing Systems 24*; Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2011; pp. 2546–2554.
- Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv **2012**, arXiv:1206.2944. Available online: https://arxiv.org/abs/1206.2944 (accessed on 29 January 2021).
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In *Advances in Neural Information Processing Systems 32*; Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F.d., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035.
- Liaw, R.; Bhardwaj, R.; Dunlap, L.; Zou, Y.; Gonzalez, J.E.; Stoica, I.; Tumanov, A. HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline. In Proceedings of the ACM Symposium on Cloud Computing (SoCC ’19); ACM Press: Santa Cruz, CA, USA, 2019; pp. 61–73.
- Bergstra, J.; Yamins, D.; Cox, D.D. Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms. In Proceedings of the 12th Python in Science Conference, Austin, TX, USA, 24–29 June 2013; pp. 13–20.
- Rasley, J.; He, Y.; Yan, F.; Ruwase, O.; Fonseca, R. HyperDrive: Exploring Hyperparameters with POP Scheduling. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (Middleware ’17); ACM Press: Las Vegas, NV, USA, 2017; pp. 1–13.
- Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. *J. Mach. Learn. Res.* **2017**, *18*, 6765–6816.

**Figure 2.** (**a**) Cumulative distribution functions (CDFs) of all the combinations (blue) and of the combinations achieving at least 50% of the loss of the optimal combination (red). Note that the red CDF stagnates around 0.2 because only about 20% of the combinations converged. (**b**) Example loss curves from our experiment.

**Figure 4.** Example of $\mathrm{loss}_{j}(i)$. The blue line shows a “noisy” example loss curve. The red line shows the estimated loss curve, which is less affected by noise.
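The estimated curve in Figure 4 implies some denoising of the raw per-iteration loss. As an illustration only (the paper's exact estimator is not reproduced here), an exponential moving average is one common way to obtain such a smoothed curve:

```python
def estimate_loss_curve(losses, alpha=0.1):
    """Exponentially weighted moving average of a noisy loss curve.
    Illustrative smoother; not necessarily the estimator Hermes uses."""
    estimate = []
    avg = losses[0]  # seed with the first observation
    for loss in losses:
        # Blend each new observation into the running estimate.
        avg = alpha * loss + (1 - alpha) * avg
        estimate.append(avg)
    return estimate
```

A smaller `alpha` suppresses more of the iteration-to-iteration noise at the cost of lagging further behind the true trend.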

**Table 1.** Considered hyper-parameter configurations. 192 random combinations ($\mathrm{Optimizer} \times \mathrm{Batch\ size} \times \mathrm{Learning\ rate} \times \mathrm{Weight\ decay}$) are used for each model.

Name | Values |
---|---|
Optimizer | $\in \{\mathrm{SGD}, \mathrm{Momentum\text{-}SGD}, \mathrm{RMSProp}, \mathrm{Adam}\}$ |
Batch size | $\in \{16, 24, 32, 50\}$ |
Learning rate | $\in \{0.001, 0.0005, 0.0001, 0.00001\}$ |
Weight decay | $\in \{0.1, 0.01, 0.001\}$ |
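As a sanity check, the full cross product of the value sets in Table 1 is 4 × 4 × 4 × 3 = 192, matching the number of combinations used per model. A short sketch enumerating the search space:

```python
from itertools import product

# Hyper-parameter value sets from Table 1.
optimizers = ["SGD", "Momentum-SGD", "RMSProp", "Adam"]
batch_sizes = [16, 24, 32, 50]
learning_rates = [0.001, 0.0005, 0.0001, 0.00001]
weight_decays = [0.1, 0.01, 0.001]

# Full cross product: 4 * 4 * 4 * 3 = 192 combinations per model.
combinations = list(product(optimizers, batch_sizes,
                            learning_rates, weight_decays))
```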

Framework | GoogleNet | VGG16 | VGG19 | ResNet50 |
---|---|---|---|---|
TensorFlow | 30.12 s | 94.26 s | 110.49 s | 66.88 s |
Hermes | 32.73 s | 97.25 s | 112.73 s | 69.46 s |

Frameworks | Scheduling Algorithm | Prior Knowledge | Objective | Consider DL Quality |
---|---|---|---|---|
Gandiva [7] | Time-sharing (RR) | None | Fairness | No |
Tiresias [1] | Gittins index | JCT distribution | Minimize average JCT | No |
Themis [11] | Semi-optimistic auction | None | Finish-time fairness | No |
Optimus [8] | Remaining-time-driven | JCT estimation | Minimize average JCT | Yes |
FlowCon [10] | Growth-efficiency-driven | None | Minimize average JCT | Yes |
SLAQ [12] | Quality-driven | None | Average quality improvement | Yes |
Hermes | Feedback-driven | None | Early feedback | Yes |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Son, J.; Yoo, Y.; Kim, K.-r.; Kim, Y.; Lee, K.; Park, S. A GPU Scheduling Framework to Accelerate Hyper-Parameter Optimization in Deep Learning Clusters. *Electronics* **2021**, *10*, 350.
https://doi.org/10.3390/electronics10030350
