Cost Efficient GPU Cluster Management for Training and Inference of Deep Learning

Expanding the scale of GPU-based deep learning (DL) clusters brings not only accelerated AI services but also significant energy consumption costs. In this paper, we propose a cost efficient deep learning job allocation (CE-DLA) approach that minimizes the energy consumption cost of DL cluster operation while guaranteeing the performance requirements of user requests. To do this, we first categorize the DL jobs into two classes: training jobs and inference jobs. Through architecture-agnostic modeling, our CE-DLA approach maps heterogeneous DL jobs to GPU computing nodes. Second, we design an electricity price-aware DL job allocation so as to minimize the energy consumption cost of the cluster. We show that our approach efficiently avoids the peak-rate time slots of the GPU computing nodes by using a mixed-integer nonlinear problem (MINLP) formulation. We additionally integrate the dynamic right-sizing (DRS) method with our CE-DLA approach, so as to minimize the energy consumption of idle nodes that have no running job. In order to investigate the realistic behavior of our approach, we measure the actual output from NVIDIA-based GPU devices with well-known deep neural network (DNN) models. Given real trace data of the electricity price, we show that the CE-DLA approach outperforms the competitors in terms of both the energy consumption cost and the performance of DL job processing.

The underlying modeling is not dedicated to particular GPU device architectures or DL jobs. We present deferrable DL training job scheduling and integrate it with the dynamic right-sizing (DRS) method. Our approach both ensures the performance requirements (deadline and latency bound) and saves energy consumption cost for DL job processing.
Through large-scale simulations based on real data from NVIDIA-based GPU cards and execution profiling, we show that our method is practical for modern GPU-based clusters. The soft-constrained modeling approach of the CE-DLA method achieves a cost-to-performance ratio about 2 times better on average than the previous performance-driven approach (PA-MBT) and energy-driven approach (EPRONS). In terms of energy consumption cost, our CE-DLA method improves the cost saving by 29% over the competitors (43% over PA-MBT and 15% over EPRONS) while guaranteeing acceptable performance. In future work, we will explore a cost-efficient framework covering all the steps of DL jobs, i.e., data preprocessing, data transmission between nodes in the cluster, DNN model training/inference, and response submission to users.


Introduction
Recently, Artificial Intelligence (AI) services based on deep learning (DL) have expanded dramatically across various areas (e.g., image processing, computer vision, natural language processing, game learning, and self-driving systems), while the non-negligible cost of the AI infrastructure has not yet been studied in detail. Most of the cost of DL application processing comes from the energy consumption of GPU-based cluster operation [1]. For example, the maximum thermal design power (TDP) of the state-of-the-art Ampere-based A100 GPU device (7 nm) is 400 W, which is higher than that of the older Volta-based V100 (12 nm, TDP 300 W) [2]. Generally, the GPU computing nodes in DL clusters incur two types of energy consumption: idle energy consumption and active energy consumption [3]. Idle energy consumption occurs when the node is turned on but has no running DL jobs. Active energy consumption occurs when the node executes an assigned DL job. It is determined by both the characteristics of the DL job (i.e., the number of deep neural network (DNN) model parameters and the input data size) and the hardware specification of the GPU devices deployed in the node (i.e., the number of multi-processing units and the core/memory clock rates) [4]. Due to the complex mixture of such factors, unsophisticated DL job allocation might incur undesirable energy consumption cost for the cluster operation.
In order to reduce the energy consumption of cluster operation, many studies have been presented so far. For reducing the energy consumption and carbon emission [...] First, they do not consider the heterogeneity of DL jobs. Without loss of generality, we can categorize the usual DL jobs into two classes: training jobs and inference jobs [19]. DL training jobs require both feed-forward and back-propagation computations to update the DNN model parameters. The back-propagation computation takes longer than the feed-forward computation because of the chained gradient calculation. Usually, DL training jobs require hours or days to reach an acceptable estimation accuracy with the constructed DNN model. A DL training job is a kind of batch workload, so we should pre-define a deadline for each DL training job. As long as the deadline is satisfied, we are able to freely halt and restart DL training jobs. DL inference jobs require only the feed-forward computation for DNN model estimation. In contrast to DL training jobs, inference jobs require only a few milliseconds or seconds to complete. A DL inference job is a kind of transactional workload, and the associated performance metric is the response latency, which is commonly used for service quality evaluation. Consequently, clusters handling DL training and inference jobs must do double duty, acting both as a high performance computing (HPC) cluster for scientific applications and as a data center (DC) for Internet workloads.
Second, the previous studies do not consider the heterogeneity of GPU device architectures. Generally, the multiple GPU devices deployed in clusters have different architectures due to continuous upgrades [20]. Different GPU devices have different computing capacities and draw different amounts of power. In this case, it is not easy to design a common bottom-up model that is applicable to various GPU device architectures in general. Therefore, we need to design generalized statistical models to estimate the GPU device power usage and the DL job performance, instead of using a model dedicated to a specific GPU architecture. Finally, the previous studies on GPU-based cluster operation do not consider the dynamic electricity price of the grid market. They commonly focus only on training acceleration and estimation accuracy, not on electricity price-aware DL job allocation. Even with the same amount of energy consumption, the actual energy consumption cost for DL job processing may differ according to the variation of the electricity price [21]. For example, the energy consumption cost of a DL training job during the daytime may be higher than that during the early hours of the morning.
In this paper, we propose a novel approach of cost efficient deep learning job allocation (CE-DLA) that minimizes the energy consumption cost of GPU-based cluster operation with DL job processing while guaranteeing the performance requirements of user requests. To do this, we categorize all the DL jobs into two classes, training and inference jobs, and formulate the deadline and the response latency as performance metrics. Our proposed CE-DLA approach is able to cover the allocation of heterogeneous DL jobs on the cluster. Furthermore, we exploit a statistical modeling approach to accommodate the heterogeneous GPU computing nodes in the cluster. Regardless of the GPU architecture diversity, our approach can be applied to common GPU-based clusters with DL job processing. In addition, we consider the dynamic electricity price of the grid market to reduce the energy consumption cost of cluster operation in practice. We conduct the DL job allocation at fine granularity (i.e., time slot based allocation) in light of the electricity price variation, so as to enable additional cost reduction even for the same energy consumption. The contributions of this work are as follows:
• To the best of our knowledge, this paper is the first work that conducts electricity price-aware allocation for both DL training and inference jobs on a GPU-based cluster. In response to the variation of the grid market price, our proposed approach automatically derives the cost-optimal DL job allocation for the given workloads of user requests.
• The statistical power and performance models used in our formulation can be applied to general GPU-based clusters consisting of heterogeneous GPU devices. At the cost of a negligible estimation error, our approach efficiently estimates the GPU device power usage and the running DL job performance without exhaustive profiling of the various GPU architectures.
• In order to reduce the energy consumption of idle GPU computing nodes, we exploit the dynamic right-sizing (DRS) method, which temporarily turns off the nodes that have no DL jobs. When the workloads of user requests are low, we can achieve the maximum cost efficiency via the DRS method.
• Through the mixed integer nonlinear problem (MINLP) formulation, our approach finds the optimal solution considering all the aspects presented above.
• We exploit real trace data of GPU computing nodes and the grid market price, so as to establish a practical simulation environment.
In order to evaluate the performance of our work practically, we first extract real output data from actual GPU computing nodes and well-known DNN models. We deployed four types of GPU devices: GTX1060 (3 GB), GTX1060 (6 GB), GTX1080 (8 GB), and RTX2060 (6 GB) [2]. For DNN models, we exploit ResNet152 [22], VGG19Net [23], and InceptionV3 [24]. For input raw data, we exploit ImageNet as the large-scale dataset [25]. Based on the framework of MS Computational Network Toolkit (CNTK) [26], we develop Windows PowerShell-based scripts to parse the output from the DL job processing. Second, we exploit real trace data of the dynamic electricity price from the Federal Energy Regulatory Commission (FERC) [27]. By using all the retrieved real data, we establish a practical simulation environment and conduct various experiments. We demonstrate that our proposed CE-DLA approach outperforms other competitors for GPU-based cluster operation in terms of energy consumption cost and DL job processing performance.

Figure 1 shows the system structure with our proposed cost efficient deep learning allocation (CE-DLA) approach. The public service users and vendors continuously send requests for DL training jobs (B) and inference jobs (L) to our system. A DL training job request contains the specification of the target DNN model, the dataset, the number of epochs to be trained, and the deadline. A DL inference job request contains the specification of the target DNN model, the (single) input data, and the bound of acceptable response latency. At the same time, our system retrieves the trajectories of the electricity price W from the grid market. In this work, we consider only the real-time electricity price, not the day-ahead one. Based on B, L, and W, the system conducts both the DL job allocation and the GPU-based cluster operation. We present the allocation procedure as follows.

• Step 1: The system begins to calculate the allocation decisions x and (y, l) for the DL training and inference jobs, respectively.
• Step 2: For DL training jobs, the system maps the partial epochs of the jobs to multiple time slots of GPU computing nodes. Each DL training job requires the feed-forward and back-propagation computations. In the feed-forward computation, the user input data is sequentially passed through the DNN model weight parameters, and the output is derived from the last layer of the DNN model. After that, in the back-propagation computation, the model gradients are derived layer by layer based on the loss function value. After all the gradients have been calculated by the chain rule, the DNN model parameters are updated. The system iterates these procedures for all the pre-defined epochs.
• Step 3: For DL inference jobs, the system carries out load balancing of the input workloads over the available GPU computing nodes. In contrast to DL training jobs, each DL inference job requires only the feed-forward computation. Therefore, the DL inference jobs can be handled immediately. The system takes the DL inference jobs as the base workloads and allocates them to the time slots right after they arrive.
• Step 4: If DL training jobs are deferrable, the system tries to halt and resume the assigned training jobs in consideration of the variation of the electricity price, as long as the deadline is not violated.
• Step 5: The system derives the DRS decision s for the GPU computing nodes in the cluster. The system turns off idle nodes that have no running jobs until more time slots are needed to accommodate increasing workloads. The system conducts the turning on and off of nodes carefully because the power state transition incurs non-negligible overheads [28].

The system formulates optimization problems in order to make the allocation and operation decisions. By using the data (i.e., power usage and DL job processing time) measured by the GPU node parser, the system constructs two optimization problems: the hard-constrained and the soft-constrained problem. In the hard-constrained problem, the system strictly tries to satisfy the requirements of the deadline and the response latency bound for DL training and inference jobs. In this case, if the available time slots of the nodes are not enough to accommodate all the jobs, the system simply rejects some of the requests. In the soft-constrained problem, the system only penalizes violations of the deadline and the response latency bound. In this case, the system tries to find the optimal trade-off between the service quality of user requests and the financial benefit of the cluster owner. After the problem formulation, the MINLP solver finds the optimal solution $x^*$, $y^*$, $l^*$, $s^*$ for the given workloads and electricity prices. The system applies the derived optimal solution to the cluster. We assume that our system accurately predicts the requests of the DL jobs and the grid market electricity prices during the control horizon. The detailed forecasting methods and the associated prediction error are not presented in this paper, and we will study such issues in future work.

Proposed System Model
The goal of our work is to derive the optimal decisions for DL training job scheduling, DL inference job balancing, and computing node right-sizing in order to achieve a cost-efficient GPU-based cluster. To formulate the optimization problems, we consider K GPU computing nodes and M DNN models. Let h and H denote the index of a certain time slot and the length of the control horizon, respectively. We assume that the duration of a single time slot is 15 min. The notations used in this work are summarized in Table 1.

Table 1. Notations.
Notation    Description
$s_k(h)$    decision variable for the DRS method

Training and Inference Job Model
The characteristics of DL training and inference jobs are totally different. Generally, the number of invoked training jobs per day (e.g., tens to hundreds) is much smaller than the number of inference jobs (e.g., millions) [29]. The required processing time for DL training jobs is relatively long (i.e., hours to days), while the response latency of each DL inference job is very short (i.e., on the order of milliseconds). Therefore, DL training jobs generally occupy multiple time slots before they complete, while DL inference jobs should be done within the single allocated time slot.
The element $E_{i,j}$ denotes the total number of epochs required to be trained in $B_{i,j}$, the request of the j-th DL training job of the i-th DNN model. The element $Q^{T}_{i,j}$ denotes the time slot index of the deadline for $B_{i,j}$. Therefore, the number of remaining available time slots for $B_{i,j}$ is bounded by $Q^{T}_{i,j}$. Let $L_i(h)$ denote the number of DL inference jobs of the i-th DNN model that arrive at time slot h. While the DL training job is a kind of batch job subject to a deadline, the DL inference job falls in the category of transactional workload subject to a response latency [30]. The upper bound of the acceptable response latency for DL inference jobs of the i-th DNN model is defined as $Q^{I}_{i}$. In this work, we assume that the response latency bound for DL inference jobs depends on the target DNN model type (i.e., the DL service type). That is, all the DL inference jobs of the i-th DNN model have the same value $Q^{I}_{i}$. We define the request sets of DL training and inference jobs as follows:

$$B = \{B_{1,1}, B_{1,2}, \cdots, B_{1,n_1}, B_{2,1}, B_{2,2}, \cdots, B_{2,n_2}, \cdots, B_{M,n_M}\},$$
$$L = \{L_1, L_2, \cdots, L_M\}.$$

Here, $n_i$ represents the number of invoked DL training jobs of the i-th DNN model. From the viewpoint of performance, the only thing we need to consider for DL training jobs is deadline compliance. This indicates that we are able to schedule DL training jobs flexibly as long as the deadline is not violated. We can freely postpone the partial epochs of the DL training jobs within the deadline. Meanwhile, for DL inference jobs, we should consider the short-term response latency, not the long-term deadline. We cannot apply the flexible scheduling approach to DL inference jobs because they require prompt execution. All the DL inference jobs should be done within the time slots in which they arrive. In this case, we may regard DL inference jobs as the base workloads in each time slot.
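The request elements described above can be written down as plain records. The following is a minimal sketch, not the data layout of our implementation; the class and field names are hypothetical, and the inclusive counting of the deadline slot in `slots_left` is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRequest:
    """One DL training request B_{i,j} (field names are illustrative)."""
    model: int          # i: index of the DNN model
    job: int            # j: index of the training job for that model
    epochs: int         # E_{i,j}: total number of epochs to train
    deadline_slot: int  # Q^T_{i,j}: time slot index of the deadline


@dataclass(frozen=True)
class InferenceLoad:
    """Inference workload L_i(h) of model i arriving at time slot h."""
    model: int            # i: index of the DNN model
    slot: int             # h: arrival time slot
    count: int            # number of inference jobs that arrived
    latency_bound: float  # Q^I_i: bound shared by all jobs of model i


def slots_left(req: TrainingRequest, current_slot: int) -> int:
    """Slots from current_slot through the deadline, inclusive (assumed)."""
    return max(0, req.deadline_slot - current_slot + 1)
```

A training request with deadline slot 8 that is inspected at slot 5 still has slots 5, 6, 7, and 8 available under this convention, while an inference load carries no such slack.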

Performance Model
The processing performance of DL training and inference jobs is affected by two factors: the characteristics of the assigned GPU device and the size of the DNN model. The compute capability of the processing units, the core and memory clock rates, and the memory bandwidth of the GPU device may affect the processing time of DL jobs [4]. Note that the GPU architectures deployed in modern clusters are commonly heterogeneous [20], so the processing time of a DL job may differ according to the allocated GPU device type. Generally, it is not easy to establish a general bottom-up performance model for the various commercial GPU devices. Instead, we exploit a statistical model to estimate the processing time of DL jobs allocated to heterogeneous GPU devices [31].
Let $\mu^{T}_{i,j,k}$ denote the training completion time per epoch for the j-th DL training job of the i-th DNN model allocated to the k-th GPU computing node. Then, the number of train-completed epochs per time unit (i.e., a minute) is the reciprocal of $\mu^{T}_{i,j,k}$. Let $\delta$ denote the duration of a single time slot (i.e., 15 min). Then, the number of train-completed epochs during a time slot is $\delta / \mu^{T}_{i,j,k}$ (Equation (3)). Let $\mu^{I}_{i,k}$ denote the service rate for DL inference jobs of the i-th DNN model allocated to the k-th GPU computing node. We assume that the probability that the GPU computing node is busy is always 1 [32]. Then, we can derive the average response latency (Equation (4)) from $\mu^{I}_{i,k}$ and $\lambda_{i,k}$, where $\lambda_{i,k}$ represents the arrival rate of the partial workloads in $L_i$ distributed to the k-th GPU computing node. Let $l_{i,k}$ denote the amount of partial workloads in $L_i$ distributed to the k-th GPU computing node by the load-balancing decision of our CE-DLA approach. Then, the arrival rate $\lambda_{i,k}$ can be derived from $l_{i,k}$ (Equation (5)). We can exploit Equations (3) and (4) to construct the DL job performance constraints. We present the details in Section 2.2.6.
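To make the quantities above concrete, the following minimal sketch computes them in Python. The epochs-per-slot value follows directly from the definitions of $\mu^{T}_{i,j,k}$ and $\delta$. The other two functions rest on assumptions that are not confirmed by the text: the arrival rate is taken as the distributed workload divided by the slot duration, and the latency uses the common single-server queueing form $1/(\mu - \lambda)$. All function names are hypothetical.

```python
def epochs_per_slot(mu_t_minutes: float, delta_minutes: float = 15.0) -> float:
    """Epochs completed in one time slot: delta / mu^T (minutes per epoch)."""
    return delta_minutes / mu_t_minutes


def arrival_rate(l_jobs: float, delta_seconds: float = 900.0) -> float:
    """ASSUMPTION: lambda = l / delta, i.e., jobs spread evenly over the slot."""
    return l_jobs / delta_seconds


def mean_response_latency(mu_i: float, lam: float) -> float:
    """ASSUMPTION: single-server queue form 1 / (mu - lambda), in seconds.

    Returns infinity when the node is overloaded (lambda >= mu).
    """
    if lam >= mu_i:
        return float("inf")
    return 1.0 / (mu_i - lam)
```

For example, a node that needs 5 min per epoch completes 3 epochs per 15 min slot, and a node serving 10 jobs/s under an arrival rate of 2 jobs/s yields a mean latency of 0.125 s under the assumed form.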

Power Consumption Model
If the GPU computing node is in the powered-on state and is active, then its power consumption is proportional to the load of the running DL jobs. Let $p^{T}_{i,j,k}$ denote the power consumption of the j-th DL training job of the i-th DNN model allocated to the k-th GPU computing node. The constants $\alpha^{T}_{i,j,k,1}$ and $\alpha^{T}_{i,j,k,2}$ represent the power model coefficients used to derive $p^{T}_{i,j,k}$. They can be estimated by a statistical modeling approach such as the recursive least squares (RLS) method [33]. The power consumption $p^{T}_{i,j,k}$ at time step h is then given by Equation (6), where $F_k$ is the performance factor of the k-th GPU computing node. We set this value by including the compute capability and the clock rate of the GPU device. Here, $x_{i,j,k}(h)$ is the decision variable that indicates whether the partial epochs of the DL training job $B_{i,j}$ are allocated to the k-th GPU computing node at time slot h or not (i.e., allocated = 1, not allocated = 0). For DL inference jobs, we design the power consumption model based on the utilization of the processing units [34]. Let $p^{I}_{i,k}$ denote the power consumption of DL inference jobs of the i-th DNN model distributed to the k-th GPU computing node. Similar to Equation (6) above, $\alpha^{I}_{i,k,3}$ and $\alpha^{I}_{i,k,4}$ are the associated power model coefficients used to estimate $p^{I}_{i,k}$. Let $u_{i,k}$ denote the average utilization of the k-th GPU computing node with the partial workloads of $L_i$. The power consumption $p^{I}_{i,k}$ at time step h is then defined by Equation (7), where $l_{i,k}(h)$ is the amount of partial workloads in $L_i(h)$ distributed to the k-th GPU computing node at time step h. We can reformulate the average utilization of the node $u_{i,k}(h)$ (Equation (8)). Then, by integrating Equations (7) and (8), we can derive the reformulated power consumption model in terms of $\lambda_{i,k}$ (Equation (9)). Let $y_{i,k}(h)$ denote the indicator that represents whether DL inference jobs are distributed to the GPU computing node or not. That is, $y_{i,k}(h) = 1$ if $l_{i,k}(h) > 0$, and $y_{i,k}(h) = 0$ otherwise (Equation (10)). Based on Equations (5)-(10), we define the total power consumption of the k-th GPU computing node at time slot h (Equation (11)), where $p^{idle}_{k}$ is the static power consumption when the k-th GPU computing node has no DL jobs. Consequently, our proposed CE-DLA approach can tune the power consumption of the GPU computing nodes by adjusting the decision variables x, y, and l, based on Equation (11) above.
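The RLS estimation of power model coefficients mentioned above can be illustrated with a short, dependency-free sketch. It fits a two-coefficient model $p = a_1 + a_2 u$ to measured (utilization, power) pairs. This linear form is an illustrative assumption and is not necessarily the form of Equations (6)-(9); the function name, the forgetting factor, and the initial covariance `p0` are likewise hypothetical choices.

```python
def rls_fit(samples, forgetting=1.0, p0=1e6):
    """Recursive least squares for p = a1 + a2 * u.

    samples: iterable of (u, p) measurements, processed one at a time.
    Returns [a1, a2]. The 2x2 covariance matrix P stays symmetric.
    """
    theta = [0.0, 0.0]
    P = [[p0, 0.0], [0.0, p0]]
    for u, p in samples:
        phi = (1.0, u)  # regressor: intercept and utilization
        # P @ phi
        Pphi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = forgetting + phi[0] * Pphi[0] + phi[1] * Pphi[1]
        gain = [Pphi[0] / denom, Pphi[1] / denom]
        # prediction error with the current coefficients
        err = p - (theta[0] * phi[0] + theta[1] * phi[1])
        theta = [theta[0] + gain[0] * err, theta[1] + gain[1] * err]
        # covariance update: (P - gain * phi^T P) / forgetting
        P = [[(P[r][c] - gain[r] * Pphi[c]) / forgetting for c in range(2)]
             for r in range(2)]
    return theta
```

Because each measurement updates the coefficients in constant time, such an estimator can run alongside the power parser without storing the full trace, which is one reason a recursive method is attractive here.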

Energy Consumption Model
Now, we are able to derive the energy consumption model based on both the performance model and the power model defined above. Let $e_k(h)$ denote the energy consumption of the k-th GPU computing node at time slot h. The power state variable $s_k(h)$ indicates whether the k-th node is in the powered-on or powered-off state at time slot h. The energy consumption $e_k(h)$ is formulated in Equation (12). If the GPU computing node goes into the powered-off state at time slot h ($s_k(h) = 0$), then the involved static power consumption is 0 W. In this paper, we exploit a linear electricity pricing model for implementation simplicity. Let $W(h)$ denote the electricity price presented by the grid market at time slot h. The energy consumption cost of the entire cluster at time slot h is then defined in Equation (13). Finally, the total energy consumption cost of the entire cluster over all the time slots $h = 1, \cdots, H$ is formulated in Equation (14). In order to reduce the energy consumption cost of the cluster, the CE-DLA approach prefers to assign the partial epochs of DL training jobs to the time slots at which the electricity price is low. Owing to the fine-grained structure of DL training jobs (i.e., they consist of multiple short-term iterations), we can freely halt and restart running DL training jobs on the nodes. Furthermore, for time slots at which the electricity price is high, we can temporarily turn off some of the idle GPU computing nodes by using the DRS method, so as to avoid unnecessary static power consumption.
Note that there are non-negligible overheads for halting and restarting the DL training jobs and for turning the nodes on again. In the next section, we present the undesirable state transition costs incurred by those controls.
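The effect of price-aware placement under linear pricing can be sketched as follows. This is a simple greedy illustration for a single deferrable training job, not the MINLP formulation of this paper. It assumes 0-based slot indices, an exclusive deadline, power in W, a slot duration of 0.25 h, and prices per Wh; the function names are hypothetical.

```python
def energy_cost(power_w, on, price_per_wh, delta_h=0.25):
    """Cost of one node over a horizon under linear pricing.

    power_w[h]: power draw in W, on[h]: 1 if powered on else 0,
    price_per_wh[h]: electricity price at slot h.
    """
    return sum(p * s * delta_h * w
               for p, s, w in zip(power_w, on, price_per_wh))


def cheapest_slots(prices, deadline_slot, slots_needed):
    """Greedy choice: the cheapest slots strictly before the deadline.

    Ties are broken by the earlier slot. Raises ValueError when the
    deadline leaves too few slots.
    """
    candidates = sorted(range(deadline_slot), key=lambda h: (prices[h], h))
    if len(candidates) < slots_needed:
        raise ValueError("not enough slots before the deadline")
    return sorted(candidates[:slots_needed])
```

With prices `[5, 1, 4, 2, 9, 1]`, a deadline at slot 5, and two slots of work, the sketch picks slots 1 and 3 and skips the expensive slots 0, 2, and 4. Unlike the full formulation, it ignores the restart overhead that such a gap between slots would incur.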

State Transition Cost Model
The first state transition cost is incurred by the halting-and-restarting process for DL training jobs. When a stopped DL training job resumes, the node re-activates the library of the GPU device, re-uploads the trained DNN model parameters to the GPU global memory, and reads the input dataset from the disk again. The associated cost is defined in Equation (15). Here, the constant $O^{hr}_{i,j,k}$ represents the pre-measured overhead price for restarting a halted DL training job. For implementation simplicity, we do not consider DL training job migration between different GPU computing nodes. Nevertheless, it is easy to modify or extend Equation (15) if we want to allow DL training job migration.
The second state transition cost is incurred by the switching of the DRS process. We may see additional overheads when the GPU computing node transits to the powered-off state from the powered-on state. According to [28], the cost generally contains the additional power consumption cost, the reboot time cost, and the wear-and-tear cost. Similar to Equation (15), the associated cost is defined in Equation (16), where the constant $O^{drs}_{k}$ represents the cost of turning the GPU computing node on. This value depends only on the characteristics of the node.
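Both transition costs can be read as an overhead constant multiplied by the number of off-to-on switches in a binary decision sequence. The following minimal sketch rests on that reading, which is an assumption about the form of Equations (15) and (16). It counts every 0-to-1 transition, including the first start when the initial state is 0; the function names are hypothetical.

```python
def count_rising_edges(bits, initial=0):
    """Number of 0 -> 1 transitions in a 0/1 sequence."""
    prev, edges = initial, 0
    for b in bits:
        if b == 1 and prev == 0:
            edges += 1
        prev = b
    return edges


def transition_cost(bits, overhead, initial=0):
    """Overhead constant times the number of (re)starts or power-ons."""
    return overhead * count_rising_edges(bits, initial)
```

For a training job with allocation sequence `[0, 1, 1, 0, 1]`, two starts are counted, so the halt-and-restart cost is twice the per-restart overhead. The same function applies to a node power state sequence with the power-on overhead.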

Constraints
This section introduces the constraints involved in our problem formulation. The decision variables are x, y, l, and s. Note that x, y, and s are binary variables. The constraints (17)-(19) for the binary decision variables are defined as follows:

$$x_{i,j,k}(h), \; y_{i,k}(h), \; s_k(h) \in \{0, 1\}, \quad \forall i, j, k, h.$$

The additional constraints (20)-(26) restrict the allowable job allocation. The constraints (20) imply two conditions. A DL training job cannot be allocated to a node before it arrives at the node. Furthermore, DL training and inference jobs cannot be allocated to nodes that are in the powered-off state. The constraints (21) imply that a DL training job should be allocated to a single GPU computing node. The DL training job cannot be executed without an assigned node, and it cannot be executed on two or more nodes in parallel. The constraints (22) imply that a single node cannot accommodate both DL training and inference jobs simultaneously. The constraints (23) imply that the partial workloads $l_{i,k}(h)$ cannot exceed the entire workloads $L_i(h)$. The constraints (24) imply that the total sum of the partial workloads of the DL inference jobs should match the amount of the entire requests. The constraints (25) imply that DL inference jobs cannot be distributed to GPU computing nodes that are in the powered-off state. The constraints (26) imply that $y_{i,k}(h) = 1$ only when $l_{i,k}(h) > 0$.
The constraints (27)-(29) ensure acceptable performance for DL job processing. Among them, the constraints (28) read:

$$h \cdot x_{i,j,k}(h) \le Q^{T}_{i,j}, \quad \forall i, j, k, h.$$

The constraints (27) imply that all the epochs of the DL training jobs should be trainable within the allocated time slots. The constraints (28) imply that the DL training jobs cannot be allocated to time slots after the deadline. The constraints (29) imply that the response latency of the distributed DL inference jobs should be below the upper bound $Q^{I}$. In contrast to DL training jobs, DL inference jobs are not allowed to be postponed to later time slots. By reformulating Equation (4), we can derive the constraints (29).
If there is no feasible solution that satisfies the constraints (27)-(29) due to a lack of GPU computing nodes, then the hard-constrained problem formulated in Section 2.2.7 is infeasible. In this case, we should reject some of the requests. In order to mitigate this risk, we additionally formulate the soft-constrained problem in Section 2.2.8.
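Several of the allocation rules described in words above can be checked for one time slot with a few lines of Python. The following is a minimal sketch under assumed data structures (dictionaries keyed by index tuples). It covers only the rules that the prose states unambiguously, and reading the single-node rule as "at most one node per job per slot" is an assumption. The function name and the rule labels are hypothetical.

```python
def violated_rules(h, x, y, l, s, arrivals, deadline):
    """Return the names of the rules violated at time slot h.

    x: {(i, j, k): 0 or 1}  training job (i, j) runs on node k
    y: {(i, k): 0 or 1}     node k serves inference jobs of model i
    l: {(i, k): float}      inference load of model i sent to node k
    s: {k: 0 or 1}          node k is powered on
    arrivals: {i: float}    total inference load L_i(h)
    deadline: {(i, j): int} deadline slot Q^T_{i,j}
    """
    bad = []
    train_nodes = {k for (_, _, k), v in x.items() if v}
    infer_nodes = {k for (_, k), v in y.items() if v}

    # Jobs may only run on powered-on nodes.
    if any(not s[k] for k in train_nodes | infer_nodes):
        bad.append("job on powered-off node")
    # A node hosts either training or inference, not both.
    if train_nodes & infer_nodes:
        bad.append("training and inference share a node")

    # ASSUMPTION: at most one node per training job in a slot.
    nodes_per_job = {}
    for (i, j, _), v in x.items():
        nodes_per_job[(i, j)] = nodes_per_job.get((i, j), 0) + v
    if any(n > 1 for n in nodes_per_job.values()):
        bad.append("training job on several nodes")

    # Constraint (28): h * x <= Q^T, i.e., no training after the deadline.
    if any(v and h > deadline[(i, j)] for (i, j, _), v in x.items()):
        bad.append("training after deadline")

    # The distributed loads of each model must add up to L_i(h).
    for i, total in arrivals.items():
        sent = sum(v for (m, _), v in l.items() if m == i)
        if abs(sent - total) > 1e-9:
            bad.append("inference load not fully distributed")
            break
    # Load may only go to nodes flagged by y.
    if any(v > 0 and not y.get(key, 0) for key, v in l.items()):
        bad.append("load sent to node with y = 0")
    return bad
```

Such a checker is useful for validating a candidate solution returned by a solver or produced by a heuristic before it is applied to the cluster.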

Hard Constrained Problem Formulation
Based on the model and constraints presented above, we formulate the problem with the hard constraints (Problem 1), subject to the constraints (17)-(29).
The advantage of the hard-constrained Problem 1 is that we can strictly ensure the acceptable performance of the DL jobs. However, there are several drawbacks at the same time. First, we may miss the opportunity to find a trade-off between the cost and the performance. We inevitably reject some of the requests even when the violation of the deadline or the response latency bound is slight. For example, if a requested DL training job should be completed within 10 h and the estimated deadline violation is only 15 min, the system may nevertheless reject that request. Second, the defined deadline and response latency bound are sometimes just recommendations, not critical constraints. In this case, strict enforcement of the performance constraints instead degrades the service quality due to too frequent request rejection. To solve such issues, we present the additional soft-constrained problem formulation in the next section.

Soft Constrained Problem Formulation
In the soft-constrained problem formulation, we convert the constraints (28) and (29) to additional cost terms in the objective function. In this approach, we flexibly accommodate violations of the deadline and the response latency bound. By scaling the attached weight values, we are able to gradually adjust the acceptable level of the DL job processing performance. Additionally, we can increase the cluster owner's benefit at the minor cost of slight performance degradation, and reduce the undesirable frequent request rejection. We formulate the problem with the soft constraints (Problem 2), subject to the constraints (17)-(27).
Here, $O^{T}_{i,j,k}$ and $O^{I}_{i,k}$ are the weight values for the violation of the deadline and the response latency bound, respectively. Note that if they are positive infinite values, then the soft-constrained Problem 2 is equivalent to the hard-constrained Problem 1. The max terms in the objective function can be converted to linear forms. For details, see the Appendix in [28]. In addition, the energy consumption model (i.e., Equation (12)) used in the problem formulation contains product terms of the decision variables x, y, l, and s. We need to reformulate these terms in order to make Problems 1 and 2 convex. To do this, we can apply the exponential-transformation technique to the problem. For details, see [35].
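As an illustration of how a max term can be linearized, the following block shows the standard auxiliary-variable rewrite of a penalty $O \cdot \max(0, z)$ in a minimization objective with a nonnegative weight $O$. This is a general optimization identity and is not necessarily the exact conversion used in [28]; the symbols $z$ (the amount of violation) and $t$ (the auxiliary variable) are introduced here only for illustration.

```latex
% Original penalty term inside a minimization objective, with O >= 0:
%   min  ... + O * max(0, z)
% Equivalent form with one auxiliary variable t:
\begin{aligned}
\min \quad & \cdots + O \, t \\
\text{s.t.} \quad & t \ge z, \\
                  & t \ge 0 .
\end{aligned}
```

Because the objective pushes $t$ downward whenever $O > 0$, the optimal $t$ equals $\max(0, z)$, so the two forms have the same optimal value while the second contains only linear terms in $t$.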

Job Allocation Algorithm
The DL job allocation algorithm of the CE-DLA approach is presented in Algorithm 1. The system receives the specifications of the arrived DL training and inference job requests and checks the involved parameters. According to the cluster operation policy, the system constructs the problem P (lines 01-05). If we want to ensure the acceptable performance of the DL job processing, the system sets P as the hard-constrained problem (line 02). Otherwise, the system sets P as the soft-constrained problem (line 03). Then, the system checks the status of the deployed GPU computing nodes in the cluster and extracts the available time slots (line 06). If a feasible solution exists, the system solves the problem P by using the MINLP solver (line 07). We exploit the well-known GUROBI optimizer to do this [36]. If no feasible solution exists, the system rejects some of the requests according to the priority values attached to the DL jobs (line 09). In this work, we do not present the details of establishing the priority policy. After the rejection, the system checks the feasibility of the problem P again (line 10). Finally, the system actuates the cluster control by using the derived optimal solution $x^*$, $y^*$, $l^*$, and $s^*$.

Algorithm 1 (excerpt, lines 07-12)
07:   x*, y*, l*, s* = solve P by using the MINLP solver [36];
08: else
09:   reject some of requests from B and L according to priorities;
10:   goto 06;
11: end if
12: actuate the cluster control by (x*, y*, l*, s*)
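The reject-and-retry control flow described above can be sketched in Python as follows. This is a minimal illustration of the loop structure only. The callables `build_problem`, `is_feasible`, `solve`, and `priority` are hypothetical placeholders supplied by the caller and do not correspond to any real solver API, and rebuilding the problem in every iteration is a simplification of the algorithm.

```python
def allocate(requests, strict, build_problem, is_feasible, solve, priority):
    """Sketch of the allocation loop with priority-based rejection.

    requests: list of DL job requests (training and inference together)
    strict:   True for the hard-constrained problem, False for the soft one
    Returns (solution, kept_requests).
    """
    # Highest priority first, so that pop() drops the least important one.
    kept = sorted(requests, key=priority, reverse=True)
    while True:
        problem = build_problem(kept, hard=strict)  # construct P
        if is_feasible(problem):
            return solve(problem), kept             # solution to actuate
        if not kept:
            raise RuntimeError("infeasible even with no requests")
        kept.pop()                                  # reject lowest priority
```

With a toy feasibility rule that admits at most two requests, four requests with priorities 3, 9, 1, and 5 are reduced to the two with priorities 9 and 5 before the stand-in solver is called.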

Experiments
For the experiments, we measure actual data from our GPU devices and use real trace data of the grid market electricity price. Then, we establish the simulation environment, so as to investigate the practical performance of our proposed CE-DLA approach compared to other competitors.

Deep Learning Job Setup
In order to establish the experimental testbed, we develop Windows PowerShell-based scripts by using the DL framework MS Computational Network Toolkit (CNTK) [26]. For generating the DL training and inference jobs, we exploit three well-known DNN models: ResNet152 [22], VGG19Net [23], and InceptionV3 [24]. In this paper, for experimental simplicity, we focus only on DL jobs for image processing. In order to construct the requests of the DL jobs, we exploit the open large-scale dataset [25]. We set the data batch size to 32 for each DL training job. For the DL inference job requests, we generate requests that each contain the raw data of a single image. We develop a console-based output parser to collect the training time and the inference response latency from each DL job. All the codes are executed on the Anaconda framework [37], which provides a virtual environment with various Python packages.

GPU Computing Node Setup
We measure the actual power and performance output of real GPU computing nodes. We consider four types of NVIDIA GPU devices: GTX1060 (3 GB), GTX1060 (6 GB), RTX2060 (6 GB), and GTX1080 (8 GB) [2]. To measure the power consumption of each DL job, we use the NVIDIA-SMI utility [38], which provides monitoring data for NVIDIA GPU devices. By using our implemented parser, we collect the power usage and the DL job processing time for each pair of GPU device and DNN model. We use these measured data to establish our simulation environment. We assume that the average CPU utilization is always full (that is, almost 100%) and that the power consumption of the CPU device is at its maximum during DL job processing.
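A parser of the kind mentioned above could look like the following minimal sketch. It is not the authors' implementation: it assumes that power samples were logged with `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits` at a fixed sampling interval, and the function names and the sample text are hypothetical.

```python
from typing import List


def parse_power_samples(csv_text: str) -> List[float]:
    """Parse one power reading (in watts) per line; skip blank or non-numeric lines."""
    samples: List[float] = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            samples.append(float(line))
        except ValueError:
            continue  # e.g. a "[N/A]" reading
    return samples


def energy_wh(samples_w: List[float], interval_s: float) -> float:
    """Approximate energy in Wh, assuming each sample holds for one sampling interval."""
    return sum(samples_w) * interval_s / 3600.0


# Hypothetical log captured while one DL job was running (1 s sampling interval assumed).
log_text = "100.0\n120.0\n[N/A]\n140.0\n"
samples = parse_power_samples(log_text)
avg_w = sum(samples) / len(samples)
energy = energy_wh(samples, interval_s=1.0)
print(samples, avg_w, energy)
```

Multiplying the average power by the measured job processing time gives the same energy figure when the sampling interval is constant.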

Simulation Testbed Setup
Based on the data measured from the GPU devices with the three DNN models, we establish a practical simulation environment. We consider a cluster that contains 160 GPU computing nodes (i.e., 40 nodes each of GTX1060 (3 GB), GTX1060 (6 GB), RTX2060 (6 GB), and GTX1080 (8 GB)). For simplicity, we assume that each node contains only one GPU device. The length of the entire horizon H is set to 7 days (i.e., 168 h = 10,080 min = 672 time slots, so that each time slot spans 15 min). Similar to the experimental setup of [4], the amount of DL training and inference jobs, and the involved deadlines and response latency bounds, are generated by using the real job trace and a Gaussian distribution (avg = 50% and std = 7.5% of the possible shortest completion time/response latency).
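The horizon and cluster sizes stated above can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the text; the helper names are illustrative.

```python
HORIZON_DAYS = 7
NUM_SLOTS = 672

horizon_hours = HORIZON_DAYS * 24          # 168 h
horizon_min = horizon_hours * 60           # 10,080 min
SLOT_MIN = horizon_min // NUM_SLOTS        # length of one time slot in minutes

# Cluster inventory from the text: 40 nodes per GPU type, one GPU per node.
cluster = {
    "GTX1060 (3 GB)": 40,
    "GTX1060 (6 GB)": 40,
    "RTX2060 (6 GB)": 40,
    "GTX1080 (8 GB)": 40,
}
total_nodes = sum(cluster.values())


def slot_index(minute: int) -> int:
    """Map a minute offset within the horizon to its time-slot index."""
    return minute // SLOT_MIN


def slot_start_hour(slot: int) -> int:
    """Hour of day (0-23) at which a given time slot starts."""
    return (slot * SLOT_MIN // 60) % 24


print(horizon_hours, horizon_min, SLOT_MIN, total_nodes)
```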
We use the real trace data of the Federal Energy Regulatory Commission (FERC) [27] to draw the trajectory of the grid market electricity price. Specifically, we use the locational based marginal price (LBMP) data ($/Wh) from the Syracuse power generator in the New York Independent System Operator (NYISO) from 1 to 7 June 2018. Figure 2 shows the associated curve. We see a regular pattern in which the electricity price is high during the daytime (10 a.m.-6 p.m.) and low during the night.
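To illustrate why such a daily price pattern matters for deferrable training jobs, the sketch below picks the cheapest contiguous window of time slots for a single job. It is a simplified greedy illustration, not the MINLP formulation of the CE-DLA approach, and the price values are synthetic (a two-level tariff that is high from 10 a.m. to 6 p.m.), not the FERC trace.

```python
from typing import List, Tuple


def cheapest_window(prices: List[int], release: int, deadline: int, duration: int) -> Tuple[int, int]:
    """Return (start_slot, total_price) of the cheapest contiguous window of `duration`
    slots that starts no earlier than `release` and ends no later than `deadline`."""
    if duration <= 0 or release < 0 or deadline > len(prices) or release + duration > deadline:
        raise ValueError("job cannot fit between release and deadline")
    cost = sum(prices[release:release + duration])
    best_start, best_cost = release, cost
    for start in range(release + 1, deadline - duration + 1):
        # Slide the window by one slot: add the new last slot, drop the old first slot.
        cost += prices[start + duration - 1] - prices[start - 1]
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start, best_cost


# Synthetic one-day price profile with 15-min slots: 2 during 10:00-18:00, 1 otherwise.
prices = [2 if 10 <= (s * 15) // 60 < 18 else 1 for s in range(96)]

# A 2-hour job (8 slots) released at 10:00 (slot 40) with a deadline at midnight (slot 96).
start, cost = cheapest_window(prices, release=40, deadline=96, duration=8)
print(start, cost)
```

Here the job is deferred to slot 72 (6 p.m.), the first window that lies entirely outside the high-price period.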

Evaluation Metric
To evaluate our proposed CE-DLA system, we measure five metrics as follows: (1) Deadline violation: For the DL training job B i,j , this performance metric is defined as . If we want to differentiate the priority of each DL training job, we attach weight parameters to the metric. The ideal value of the deadline violation is 0.
(2) Latency bound violation: For the partial workloads of the DL inference job l i,k (h), this performance metric is defined as 0). The average latency bound violation for all DL inference jobs is defined as . If we want to differentiate the priority of each DL inference job, then we attach the weight parameters to the metric. Similar to the deadline violation, the ideal value of the latency bound violation is also 0.
(3) Energy consumption: For the entire cluster, the total energy consumption is derived as $\sum_{\forall k} \sum_{\forall h} e_k(h)$.
(4) Energy consumption cost: For the entire cluster, the total energy consumption cost is derived as $c = \sum_{\forall k} \sum_{\forall h} e_k(h) \cdot W(h)$. Note that the energy consumption cost is determined by both the amount of energy consumption and the electricity price. Therefore, the curve of the energy consumption cost is not always the same as that of the energy consumption. The lower this value, the better the cost efficiency.
(5) Cost-to-performance ratio: For the entire cluster, the cost-to-performance ratio is defined as . This metric gives the number of DL (training and inference) jobs whose performance requirements (deadline and latency bound) can be ensured per unit energy consumption cost (i.e., per 1 USD).
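The energy-related metrics above can be computed as in the following sketch. The nested-list layout for e_k(h), the price list for W(h), and the numeric values are hypothetical; the cost-to-performance ratio is implemented from its verbal description (number of jobs meeting their requirements per unit cost), which is an assumption about the exact formula.

```python
from typing import List


def total_energy(e: List[List[float]]) -> float:
    """Sum of e_k(h) over all nodes k and time slots h."""
    return sum(sum(row) for row in e)


def energy_cost(e: List[List[float]], W: List[float]) -> float:
    """Sum of e_k(h) * W(h) over all nodes k and time slots h."""
    return sum(e_kh * W[h] for row in e for h, e_kh in enumerate(row))


def cost_to_performance(num_satisfied_jobs: int, cost: float) -> float:
    """Jobs whose deadline or latency bound is met, per unit of energy cost (assumed form)."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return num_satisfied_jobs / cost


W = [0.5, 2.0]                       # hypothetical prices: cheap slot, then expensive slot
e_cheap = [[4.0, 0.0], [2.0, 0.0]]   # two nodes, all work placed in the cheap slot
e_peak = [[0.0, 4.0], [0.0, 2.0]]    # same energy, all work placed in the expensive slot

print(total_energy(e_cheap), energy_cost(e_cheap, W), cost_to_performance(12, energy_cost(e_cheap, W)))
print(total_energy(e_peak), energy_cost(e_peak, W), cost_to_performance(12, energy_cost(e_peak, W)))
```

Both schedules consume the same energy, but their costs and cost-to-performance ratios differ, which is why the cost curve need not follow the energy curve.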

Competitors
We implement the existing approaches below, in order to compare them with our proposed CE-DLA approach.
(1) Baseline: This naive approach uses first-in-first-out (FIFO) job allocation for both the DL training and inference jobs. It does not consider the detailed job specification, which contains the deadline and the response latency bound. Owing to its implementation simplicity, this approach is sometimes preferred for small-scale clusters. We use it to obtain the worst-case bound of the performance and the cost against which the other approaches are evaluated.
(2) EPRONS [39]: This approach proposes job allocation for an energy proportional network and server (EPRONS) in the cluster (e.g., data centers) by using a linear programming model. The authors focus on minimizing the energy consumption of latency-sensitive applications, i.e., DL inference jobs. This approach tries to find the optimal DRS decision while guaranteeing the response latency bound of the DL inference jobs. However, it does not consider deadline-sensitive long-term applications, i.e., DL training jobs. In addition, the authors do not explicitly reflect the dynamic electricity price in the job allocation. In the EPRONS approach, we optimize $\min N \cdot P_{server}$ s.t. $K = K^*$, where N is the number of active (powered-on) servers, $P_{server}$ is the amount of power consumption, and K is the scale factor ($K^*$ = optimal scale factor) for the latency-aware packet allocation. In this paper, we use l i,k as K to implement the EPRONS approach. In addition, we do not consider the cost of network links and switches defined in [39] in our model.
(3) PA-MBT [30]: The authors of this work present the performance-aware allocation of mixed batch and transactional (PA-MBT) jobs, i.e., DL training and inference jobs. They propose a performance metric, the relative performance function (RPF), to find the trade-offs between heterogeneous job requests. Their method achieves acceptable deadlines and response latencies; however, it does not consider the amount of energy consumed by each job or the associated electricity payment. Moreover, their approach cannot avoid the unnecessary energy consumption caused by idle GPU computing nodes because it lacks the DRS method. In the PA-MBT approach, we optimize $\max \min_m u_m(w_m)$, where m is the index of the job and u is the RPF that measures the relative distance of the actual job processing time w from its requirement. The RPF value and the involved parameters can be determined based on the workload characteristics (i.e., training-batch or inference-transactional). In this paper, we use µ as w to implement the PA-MBT approach.
Figure 3 shows the DL job training time and inference latency for our deployed GPU device types. Figure 3a shows the average completion time for training 10 iterations in an epoch for each pair of GPU device and DNN model. The RTX2060 (6 GB) has the best performance (≈15 s on average) compared with the other devices, owing to its state-of-the-art architecture for parallel processing. Owing to its larger memory size, the GTX1060 (6 GB) achieves better performance (≈36 s on average) than the GTX1060 (3 GB) (≈42 s on average) in spite of the same core architecture. Figure 3b shows the average response latency of DL inference jobs for each pair of GPU device and DNN model. Similar to Figure 3a, the RTX2060 (6 GB) has the best latency (≈0.025 s on average) while the GTX1060 (3 GB) has the worst one (≈0.062 s on average).
In terms of the response latency, there is no large difference between the GTX1080 and the RTX2060 because the feed-forward computation of a DL inference job is light compared with the back-propagation computation. We use the data shown in Figure 3 to construct our simulation experiments.
Figure 4a shows the average DL training job deadline violation (%) of the CE-DLA approach and the others when the average DL inference job workload is (=10 5 (reqs/min) (moderate)). The baseline approach has the worst result (≈7% at # of training jobs = 280) because it naively assigns the DL training jobs to time slots of the GPU computing nodes without considering the job specification. Except for the baseline approach, the approaches yield a deadline violation of about 2% on average. This is because the total workload of the DL inference jobs is low, and the available time slots of the GPU computing nodes are enough to accommodate all the requests. Figure 4b shows the average DL training job deadline violation (%) when the average DL inference job workload is (=17.5 5 (reqs/min) (high)). Owing to the high workload of the DL inference jobs, the deadline violation of all the approaches is relatively large compared with Figure 4a. In particular, the EPRONS approach yields a deadline violation of about 12% (at # of training jobs = 280), which is worse than the PA-MBT and the CE-DLA approaches. This is because the EPRONS approach prefers to assign the DL jobs to nodes that require low energy consumption regardless of their compute capability. Similar to the CE-DLA approach, the PA-MBT approach shows a low deadline violation even at the high workload, because it finds the earliest available time slots of the nodes without considering the energy consumption. Figure 4c shows the average DL inference job latency violation (%) according to the workloads (at # of training jobs = 175 (moderate)).
When the workload of the DL inference jobs is low (=10 5 (reqs/min)), the associated latency violation is not serious (≈2% on average) in any of the approaches. Figure 4d shows the latency violation (at # of training jobs = 280 (high)). Similar to Figure 4b, the baseline and the EPRONS approaches yield a worse latency than the PA-MBT and the CE-DLA approaches, because they do not explicitly consider the performance of DL job processing. In all cases, the PA-MBT approach shows good performance compared with our proposed CE-DLA approach. Note that these results of the PA-MBT approach are achieved at a non-negligible cost in energy consumption.
Figure 5 compares the proposed CE-DLA approach with the others in terms of both the energy consumption and the energy consumption cost. Figure 5a shows the total energy consumption (kWh) incurred by the CE-DLA approach and the others when the average DL inference job workload is (=10 5 (reqs/min)). The baseline approach yields the worst energy consumption, about 3500 kWh (at # of training jobs = 280), because it assigns the DL training jobs according to their arrival order, not the energy consumption. The EPRONS and our proposed CE-DLA approaches yield almost the same energy consumption (≈600 kWh on average) in all cases, because both of them try to find the energy-optimal time slots for the DL job allocation. In contrast to the performance results in Figure 4, the PA-MBT approach has a worse energy consumption (≈1300 kWh on average) than the EPRONS and the CE-DLA approaches. The PA-MBT approach tries to balance the batch (training) and transactional (inference) workloads while considering the heterogeneous performance requirements, similar to our CE-DLA approach; however, it does not consider the amount of the involved energy consumption. Figure 5b shows the energy consumption according to the DL inference job workloads (at # of training jobs = 175).
Similar to Figure 5a, both the baseline and the PA-MBT approaches yield a relatively high energy consumption for DL job allocation compared with the EPRONS and the CE-DLA approaches. Figure 5c shows the energy consumption cost ($) of all the approaches when the average DL inference job workload is (=10 5 (reqs/min)), with the real trace data of the FERC. The bar shapes of the energy consumption cost in Figure 5c are similar to those in Figure 5a; however, the EPRONS approach has a worse energy consumption cost than our proposed CE-DLA approach. This is because the EPRONS approach only considers the amount of energy consumption, not the grid market electricity price. As mentioned earlier, the energy consumption cost can differ even for the same amount of energy consumption because of the variation of the electricity price. Our proposed CE-DLA approach directly reflects the dynamic electricity price in the DL job allocation, so as to minimize the energy consumption cost as shown in Figure 5c.
Figure 6 compares the cost-to-performance ratio of all the methods. Our proposed CE-DLA method outperforms the competitors in terms of cost efficiency. In Figure 6a, the soft-constrained modeling approach of the CE-DLA method achieves a cost-to-performance ratio (5.26/1 $) that is about 2.5 times the ratio (2.09/1 $) of the PA-MBT method. Although the cost-to-performance ratio of the hard-constrained modeling approach is slightly worse than that of the soft-constrained modeling approach, it still outperforms all the other competitors. In Figure 6, we also see the superiority of the CE-DLA method over the others for DL inference jobs, similar to Figure 6a.
Table 2
Additionally, we investigate the rejection ratio of our proposed CE-DLA method when the workload is extremely high (400 training jobs, 2500 inference requests/min). For implementation simplicity, we assume that only the training jobs are rejected (the inference jobs are not).
Table 3 shows the results. The hard-constrained modeling approach of the CE-DLA method yields a high rejection ratio (98 DL training jobs, 24.5%) similar to that of the PA-MBT (97 DL training jobs, 24.2%). In contrast, the soft-constrained modeling approach achieves a reduced rejection ratio (45 DL training jobs, 11.2%) with a slight violation of the performance requirements of the DL jobs. If we set the weight constants O T i,j,k and O I i,j to large values, the soft-constrained modeling approach may achieve better cost efficiency than the others, but it may also bring undesirable performance degradation (through the violation allowance). Note that both the hard- and soft-constrained modeling approaches yield a high rejection ratio when the workload is extremely high (400 training jobs/2500 inference reqs/min). If the workload is too high and the available nodes are insufficient to accommodate all the requests, it is inevitable to reject many requests regardless of the allocation algorithm design. Our proposed method achieves the financial benefit only when the workload is not too tight to process. We think that almost all algorithms targeting cost-efficient resource allocation have the same issue. This problem stems not from a weakness of the algorithm design but from the lack of available resources. To solve such issues, more servers should be deployed into the cluster.

Conclusions
In this paper, we propose the energy consumption cost efficient deep learning job allocation (CE-DLA) method for GPU-based cluster operation. To the best of our knowledge, this is the first work that conducts performance- and cost-driven allocation for both DL training and inference jobs given the dynamic electricity price. We design the mixed-integer nonlinear programming (MINLP) formulation based on statistical modeling that is not dedicated to certain GPU device architectures or DL jobs. We present the deferrable DL training job scheduling and integrate it with the dynamic right-sizing (DRS) method. Our approach both assures the performance requirements (deadline and latency bound) and saves energy consumption cost for DL job processing. Through large-scale simulation results based on real data from NVIDIA-based GPU cards and execution profiling, we show that our method is practical for modern GPU-based clusters. The soft-constrained modeling approach of the CE-DLA method achieves a cost-to-performance ratio that is about 2 times higher on average than those of the previous performance-driven approach (PA-MBT) and energy-driven approach (EPRONS). In terms of the energy consumption cost, our CE-DLA method improves the cost saving by 29% over the competitors (43% over PA-MBT and 15% over EPRONS) while guaranteeing acceptable performance. In future work, we will explore a cost-efficient framework covering the entire sequence of steps of DL jobs, i.e., data preprocessing, data transmission between nodes in the cluster, DNN model training/inferencing, and the response submission to users.

Conflicts of Interest:
The authors declare no conflict of interest.