Article

Entropy-Guided Hierarchical Scheduling for Elastic Distributed Deep Learning

1 Department of Artificial Intelligence, Kyung Hee University, Yongin 17104, Republic of Korea
2 Department of Computer Science and Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
3 Artificial Intelligence Research Center, Kyung Hee University, Yongin 17104, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(8), 3725; https://doi.org/10.3390/app16083725
Submission received: 3 March 2026 / Revised: 31 March 2026 / Accepted: 8 April 2026 / Published: 10 April 2026
(This article belongs to the Special Issue Edge Computing and Cloud Computing: Latest Advances and Prospects)

Abstract

Shared GPU clusters often execute multiple distributed training jobs concurrently under fluctuating contention. We reinterpret this setting as a two-scale control problem, where the micro scale captures intra-job learning dynamics and the macro scale captures inter-job resource arbitration. We propose an entropy-guided hierarchical framework that links these two scales through a unified uncertainty signal computed from training logits. Unlike existing uncertainty-aware methods that typically use uncertainty for only a single level of decision making, our approach uses the same entropy-based signal to jointly support both intra-job adaptation and inter-job scheduling within a hierarchical control loop. At the micro level, each worker estimates predictive uncertainty via normalized entropy and converts it into stable weights that drive epoch-level controls for uncertainty-aware data sharding, fixed-budget batch-size reallocation, and learning-rate modulation, while remaining compatible with standard synchronous data-parallel training. At the macro level, the same signal is aggregated into a job utility score that guides admission, ordering, and GPU quota assignment under contention. In large-scale workload-driven simulation, our method reduces average job completion time (JCT) by 23.7% and shortens cluster makespan by 15.7% relative to a strong learning-unaware baseline, demonstrating that uncertainty-aligned scheduling can improve cluster-level efficiency while preserving training correctness. We further validate scalability using a calibrated simulator up to 1024 nodes.

1. Introduction

Large-scale deep learning training is increasingly carried out in shared environments rather than in isolation [1,2,3]. In many laboratories and production settings, a single GPU cluster serves as a common infrastructure where multiple distributed jobs begin, pause, and overlap throughout the day [4,5]. In this setting, training behavior is shaped not only by the model and the dataset but also by when and where a job runs [6,7]. Network contention rises and falls, available resources shift, and the scheduler repeatedly changes the parallelism assigned to each job [8,9]. These dynamics introduce two intertwined decision scopes. One concerns within-job choices that maintain efficient learning, and the other concerns cross-job decisions that arbitrate shared GPUs. We refer to these as micro and macro in a job-centric sense. The micro level captures decisions made within a single distributed training job, while the macro level captures decisions made across multiple concurrent jobs in the cluster.
In shared GPU clusters, learning dynamics within a job and resource allocation across jobs become coupled in practice. At the micro level, a distributed job learns through worker-local data streams, gradient synchronization, and hyperparameters tuned under an assumed effective batch and synchronization regime [10,11]. At the macro level, the cluster is shared through macro-level scheduling decisions such as ordering, admission, and the number of workers assigned to each job over time [12,13]. These two levels are coupled because macro-level allocation changes the effective training regime of a job, and the resulting learning behavior determines whether additional resources will actually shorten time to accuracy [14,15,16].
In multi-job execution, the central objective is often time to accuracy [17,18]. A job is considered complete not when it merely consumes its assigned epochs, but when it reaches a target validation metric under fluctuating contention [19]. Yet the information available to the scheduler is typically dominated by system measurements [20,21]. These measurements describe how busy devices are, but they do not describe how much learning progress is being produced at a given moment [22]. When data-parallel training continues with fixed sharding and static hyperparameters under changing contention, a sequence of effects follows [23]. Effective batch size and synchronization behavior drift from the assumptions used to set the learning rate. Workers contribute gradients of different informativeness as their local shards differ in difficulty, while the system continues to average these contributions uniformly [24]. Under these conditions, cluster-level decisions can reduce waiting time while still yielding slow progress in model quality [25]. This appears as a longer time to accuracy and heavier tail behavior in JCT distributions [26].
A central difficulty is that the scheduler and the training system typically lack visibility into worker-level learning utility [27]. The system cannot tell which workers are currently operating on ambiguous examples where additional computation is likely to produce larger updates to the decision boundary [28,29]. The system also cannot tell how the number of classes and dataset complexity affect the scale and persistence of uncertainty during training [30]. In multi-class workloads, uncertainty often remains higher for longer, and the gap between easy shards and hard shards becomes more pronounced. If hard shards receive insufficient exposure under fixed sharding, informative gradients arrive too infrequently and convergence slows [31,32]. If the system attempts to compensate only through aggressive updates under a changed effective batch, training can become unstable, and the intended time-to-accuracy reduction is not achieved. An online signal is needed that reflects learning difficulty under contention and is stable enough to guide both training-time control and cluster-level scheduling [33]. The specific research gap addressed in this work is that existing schedulers typically rely on system-level or single-level optimization signals, and therefore do not consistently connect intra-job learning dynamics with inter-job resource allocation. As a result, they may improve resource efficiency or waiting time, while still lacking a unified learning-aware mechanism for jointly coordinating adaptation within a job and scheduling across jobs under shared-cluster contention.
In this study, we address this problem by using predictive uncertainty computed from logits as a lightweight online signal that bridges micro-level adaptation and macro-level orchestration. Each worker summarizes uncertainty using normalized entropy so that the signal remains comparable across tasks with different numbers of classes, and the resulting worker-level signals are converted into stable weights through clamping and exponential smoothing. These weights drive three micro-level controls for the next epoch: uncertainty-proportional data sharding, per-worker batch-size reallocation under a fixed budget, and job-level learning-rate modulation to stabilize updates as the effective regime shifts. In parallel, the same uncertainty signal is aggregated into a job score used by the cluster scheduler so that macro-level allocation decisions favor configurations expected to reduce time to accuracy rather than merely balance hardware utilization. In workload-driven multi-job simulation, our approach reduces average JCT by 23.7% and shortens cluster makespan by 15.7% relative to a representative baseline (Lucid). These gains are driven primarily by reduced queueing delay, which drops by 30.7% under contention. In large-scale simulation at 1024 nodes, our method further reduces average JCT by 25.2% and makespan by 18.1% relative to the same baseline.
The main contributions of this paper are as follows:
  • A two-scale control formulation for shared-cluster training, together with a unified uncertainty signal that bridges micro-level adaptation and macro-level scheduling.
  • Stable budget-aware control rules for uncertainty-aware data sharding, batch sizing, and learning-rate modulation driven by normalized entropy.
  • Empirical evidence that learning-aware orchestration improves time to accuracy and training stability with negligible overhead in multi-job, multi-class workloads, along with scalability validation in simulation up to 1024 nodes.

2. Related Work

2.1. Cluster Scheduling for Distributed Deep Learning Jobs

Deep learning clusters have motivated schedulers that treat training jobs as long-running workloads with distinct progress characteristics [34]. Gandiva introduced job management primitives such as time slicing and migration to improve responsiveness for interactive and iterative training workflows [4], and it emphasized that early training signals can be useful for prioritization in practice. Tiresias explored scheduling policies for distributed training when JCTs are uncertain and proposed mechanisms that reduce average JCT without requiring complete prior knowledge [35]. These systems establish the foundations for multi-tenant GPU scheduling and motivate designs that react to changing cluster conditions while keeping training jobs practical to deploy [22,36].

2.2. Elastic and Co-Adaptive Scheduling Under Contention

Modern shared-cluster schedulers increasingly treat parallelism as a control variable, adjusting it during execution to improve cluster-level outcomes under contention [37]. Pollux introduced a co-adaptive approach that jointly considers cluster-level allocation and per-job training efficiency through the notion of goodput, enabling dynamic resource reassignment based on observed training behavior [10]. Lyra further studied elastic scheduling in deep learning clusters and showed that adapting resource allocation across mixed workloads can improve overall efficiency and utilization [11]. These directions highlight that inter-job decisions can change the effective training regime, which creates an opportunity for job-aware control signals that remain lightweight and model-agnostic [6,38].

2.3. Intra-Job Adaptation of Batch Size and Learning Rate

Intra-job adaptation is commonly used to stabilize optimization and improve statistical efficiency as training progresses [23]. Prior work has studied adaptive batch sizes as a mechanism for controlling gradient noise and for trading off computation and convergence [39,40], including criteria based on estimates of gradient variance and relationships between batch-size growth and learning-rate schedules [25]. In parallel, practical training systems often adjust learning rate based on training dynamics to maintain stable updates across phases of learning [41]. Intra-job adaptation techniques motivate using lightweight training signals to coordinate batch-size and learning-rate changes under resource variability [26,42,43].

2.4. Predictive Uncertainty as a Learning Signal

Uncertainty measures have a long history as a proxy for informativeness in learning systems [27,44]. Uncertainty sampling in active learning selects examples where the model is least confident [45], and subsequent work has explored uncertainty measures and their properties across tasks [46,47]. Bayesian and ensemble-based approaches further formalize predictive uncertainty for deep models and show how uncertainty can guide data selection and learning progress in practical settings [28,48]. The literature supports the idea that uncertainty derived from predictions can serve as a lightweight signal for guiding where learning effort should be concentrated, which aligns with our use of normalized entropy to coordinate intra-job adaptation and inter-job scheduling [49,50].
Recent studies have also considered broader forms of heterogeneous and large-scale learning beyond shared-cluster scheduling. For example, personalized federated learning has been explored for heterogeneous medical image analysis tasks, highlighting the importance of adaptive coordination under heterogeneous data and training conditions [51]. In addition, recent work on knowledge distillation and teacher–student learning in medical imaging emphasizes efficient model transfer and scalable training strategies in complex learning environments [52]. Recent advances in vision–language models for medical image analysis further illustrate the growing interest in general large-model frameworks for complex multimodal learning settings [53]. Although these directions address different problem settings from shared-cluster job scheduling, they help contextualize the broader need for adaptive learning-aware mechanisms in modern large-scale training systems.

3. Problem Formulation

We consider a shared GPU cluster that runs multiple distributed training jobs concurrently. Let $\mathcal{J}$ denote the set of active jobs. The cluster consists of a set of GPU nodes $\mathcal{N}$ with capacity $G = |\mathcal{N}|$ GPUs. Each job $j \in \mathcal{J}$ trains a model with $C_j$ classes using data parallelism over a set of workers $\mathcal{R}_j$ with cardinality $R_j = |\mathcal{R}_j|$. Time is indexed by epochs $t \in \{0, 1, 2, \ldots\}$, and decisions are made at epoch boundaries.
We focus on reducing time to accuracy under shared resource constraints. Let $A_{j,t}$ denote the validation metric of job $j$ at the end of epoch $t$ and let $A_j^{\mathrm{tar}}$ be a target level. We define the completion epoch of job $j$ as
$$T_j = \min\{\, t \ge 0 : A_{j,t} \ge A_j^{\mathrm{tar}} \,\}. \tag{1}$$
The system aims to reduce completion times across jobs while satisfying the cluster GPU budget at each epoch.
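As a minimal illustration of this definition, the following Python sketch returns the first epoch at which a recorded validation metric reaches its target. The function name `completion_epoch` and the list-based metric history are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def completion_epoch(metrics: Sequence[float], target: float) -> Optional[int]:
    """Return the first epoch t >= 0 whose validation metric reaches the target.

    `metrics[t]` plays the role of A_{j,t} and `target` the role of the
    target level. Returns None if the target is never reached.
    """
    for t, value in enumerate(metrics):
        if value >= target:
            return t
    return None
```

A job whose metric history never reaches the target has no completion epoch, which the sketch signals with `None`.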

3.1. Learning Signal from Predictive Uncertainty

For a sample $x$ processed by job $j$ at epoch $t$, let $p_{j,t}(x) \in \mathbb{R}^{C_j}$ be the predictive distribution produced by the model. We define sample-level uncertainty using normalized entropy:
$$\tilde{h}_{j,t}(x) = -\frac{\sum_{c=1}^{C_j} p_{j,t,c}(x) \log p_{j,t,c}(x)}{\log C_j}. \tag{2}$$
Each worker $r \in \mathcal{R}_j$ processes a shard $\mathcal{D}_{j,r,t}$ during epoch $t$ and reports a worker-level mean uncertainty:
$$H_{j,r,t} = \mathbb{E}_{x \in \mathcal{D}_{j,r,t}}\big[\tilde{h}_{j,t}(x)\big]. \tag{3}$$
The expectation is estimated by a sample mean over minibatches processed during the epoch. To compare workers within the same job, we normalize across workers:
$$\hat{H}_{j,r,t} = \frac{H_{j,r,t}}{\sum_{s \in \mathcal{R}_j} H_{j,s,t} + \varepsilon_H}, \tag{4}$$
where $\varepsilon_H > 0$ is a small positive constant, a numerical stabilizer introduced only to prevent division by zero or unstable normalization when the summed uncertainty becomes very small.
When an additional ambiguity signal is used, it is blended with $\hat{H}_{j,r,t}$ to form a single worker score $u_{j,r,t}$. Otherwise, we set $u_{j,r,t} = \hat{H}_{j,r,t}$.
We define normalized worker weights that summarize where learning difficulty is concentrated within a job. Given the resulting worker score $u_{j,r,t}$, the weights satisfy
$$w_{j,r,t} \ge 0, \qquad \sum_{r \in \mathcal{R}_j} w_{j,r,t} = 1, \qquad w_{j,r,t} = \frac{u_{j,r,t}}{\sum_{s \in \mathcal{R}_j} u_{j,s,t} + \varepsilon_W}, \tag{5}$$
where $\varepsilon_W > 0$ is a small positive constant used only for numerical stability in the denominator. In practice, the weights are computed using this stabilized normalization together with temporal smoothing to avoid oscillations across epochs.

3.2. Micro-Level Control Variables

At each epoch boundary, each job chooses its next-epoch internal configuration based on the weights $w_{j,r,t}$. The configuration is represented by three decision variables for epoch $t+1$.
First, the job chooses a per-epoch sample budget $N_{j,t+1}$ and allocates it across workers as
$$N_{j,r,t+1} = N_{j,t+1}\, w_{j,r,t}, \qquad \sum_{r \in \mathcal{R}_j} N_{j,r,t+1} = N_{j,t+1}. \tag{6}$$
In implementations, integer sample counts are realized by standard rounding and correction while preserving the equality. Second, the job chooses per-worker batch sizes subject to bounds and a job-level batch budget. Let $b_{j,r,t+1}$ denote the batch size of worker $r$ for epoch $t+1$. The batch sizes satisfy
$$b_{\min} \le b_{j,r,t+1} \le b_{\max}, \qquad \sum_{r \in \mathcal{R}_j} b_{j,r,t+1} = B_{j,t+1}, \tag{7}$$
where $B_{j,t+1}$ is the job-level batch budget for epoch $t+1$.
Third, the job adapts its learning rate as a function of job-level uncertainty. We define a job-level uncertainty summary as
$$U_{j,t} = \sum_{r \in \mathcal{R}_j} w_{j,r,t}\, u_{j,r,t}. \tag{8}$$
The learning rate $\eta_{j,t+1}$ is selected by an update rule $\eta_{j,t+1} = f(U_{j,t}, \eta_{j,t})$ that increases stability when uncertainty is high and allows faster progress when uncertainty is low. The specific form of $f$ is part of the proposed control policy.

3.3. Macro-Level Scheduling Variables and Constraints

At each epoch boundary, the scheduler assigns integer GPU quotas $g_{j,t+1}$ to jobs for epoch $t+1$ under the cluster budget
$$\sum_{j \in \mathcal{J}} g_{j,t+1} \le G, \qquad g_{j,t+1} \in \mathbb{Z}_{\ge 0}. \tag{9}$$
The scheduler uses uncertainty-based learning utility as an input signal. We define a short-horizon progress proxy from the reported uncertainty summaries:
$$\Delta U_{j,t} = U_{j,t-1} - U_{j,t}, \tag{10}$$
and define a utility score:
$$S_{j,t} = \lambda\, U_{j,t} + (1 - \lambda) \max(\Delta U_{j,t}, 0), \tag{11}$$
where $\lambda \in [0, 1]$ balances the uncertainty level and recent progress. The GPU allocation decision is produced by a scheduling policy $g_{j,t+1} = \pi(S_{j,t}, \mathcal{J}, G)$ that maps utility scores and system constraints to integer quotas. The specific form of $\pi$ is part of the proposed scheduling policy.

3.4. Linking Macro Allocation to Micro Budgets

The macro-level quota determines the next-epoch processing budget of each job. Let $\rho_{j,t}$ denote the measured per-GPU processing rate of job $j$ at epoch $t$ in samples per epoch per GPU. We set the next-epoch sample budget as
$$N_{j,t+1} = g_{j,t+1}\, \rho_{j,t}. \tag{12}$$
This creates a closed loop. The scheduler changes $g_{j,t+1}$ based on uncertainty-guided utility, which changes $N_{j,t+1}$. The job then distributes $N_{j,t+1}$ across workers and adapts batch sizes and learning rate based on the same uncertainty signal.
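As a purely illustrative instance of this closed loop (all numbers are hypothetical and are not taken from the experiments), suppose the scheduler grants a job four GPUs, the measured rate is 12,500 samples per epoch per GPU, and the job has four workers with weights 0.4, 0.3, 0.2, and 0.1:

```latex
\begin{align*}
N_{j,t+1} &= g_{j,t+1}\,\rho_{j,t} = 4 \times 12{,}500 = 50{,}000, \\
\big(N_{j,r,t+1}\big)_{r} &= N_{j,t+1}\,\big(w_{j,r,t}\big)_{r}
  = (20{,}000,\ 15{,}000,\ 10{,}000,\ 5{,}000).
\end{align*}
```

If the quota were reduced to two GPUs at the same measured rate, the budget would halve to 25,000 samples and every per-worker share would halve with it, while the relative proportions set by the weights would stay unchanged.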

4. Proposed Method

4.1. Overview and Timing of the Control Loop

Figure 1 illustrates our entropy-guided hierarchical method that couples micro-level adaptation and macro-level scheduling through a shared learning signal in a shared GPU cluster. The control loop is organized around epoch boundaries, which provide a natural synchronization point already present in most data-parallel training pipelines and allow lightweight statistics aggregation without interfering with per-step synchronization.
The workflow proceeds as follows. During an epoch, workers execute standard synchronous data-parallel training and locally accumulate uncertainty statistics computed from logits. At the epoch boundary, these per-worker statistics are aggregated within each job to construct stable worker weights. With a one-epoch delay, the micro-level controller then uses these weights to update the next epoch’s internal configuration by adjusting the data-sharding ratios in proportion to uncertainty, tuning per-worker batch sizes within safe bounds, and modulating the job-level learning rate.
At the same boundary, each job also exposes a compact job score derived from the same uncertainty statistics to the macro-level scheduler. The scheduler uses this score to update integer GPU quotas for the next epoch, which determines each job’s data-processing budget under cluster contention. In this way, Figure 1 represents a closed hierarchical loop: macro-level control determines how much compute a job receives, while micro-level control determines how that compute is distributed and used within the job. Both levels are therefore coordinated through the same entropy-based signal, which provides a consistent interface between learning dynamics and scheduling decisions.

4.2. Uncertainty Estimation from Logits

Consider a job $j$ with $C_j$ classes and workers $\mathcal{R}_j$. For a sample $x$ at epoch $t$, the model produces logits $z_{j,t}(x) \in \mathbb{R}^{C_j}$ and probabilities
$$p_{j,t}(x) = \operatorname{softmax}\big(z_{j,t}(x)\big). \tag{13}$$
This choice is deliberate from a systems perspective. Logits and softmax probabilities are already available during training, so uncertainty can be computed without additional forward passes, auxiliary models, or extra labels. We use entropy because it reflects how spread out the predictive distribution is over classes:
$$h_{j,t}(x) = -\sum_{c=1}^{C_j} p_{j,t,c}(x) \log p_{j,t,c}(x), \tag{14}$$
and we normalize it to keep the signal comparable across tasks with different numbers of classes:
$$\tilde{h}_{j,t}(x) = \frac{h_{j,t}(x)}{\log C_j}. \tag{15}$$
Normalized entropy is important in multi-job environments because different jobs may have different class counts, and the scheduler must compare job scores on a consistent scale. When needed, we complement entropy with a margin-based ambiguity measure that captures near ties between the top two classes. Let $p_{(1)}(x)$ and $p_{(2)}(x)$ be the largest and second-largest entries of $p_{j,t}(x)$. We set
$$q_{j,t}(x) = 1 - \big(p_{(1)}(x) - p_{(2)}(x)\big). \tag{16}$$
Entropy reflects global spread over classes, while margin ambiguity focuses on local decision boundaries. In practice, entropy alone is often sufficient, and ambiguity is used as an optional complement when boundary confusion is a dominant mode.
Each worker $r$ processes its local shard $\mathcal{D}_{j,r,t}$. We aggregate uncertainty at the worker level by epoch means:
$$H_{j,r,t} = \mathbb{E}_{x \in \mathcal{D}_{j,r,t}}\big[\tilde{h}_{j,t}(x)\big], \qquad Q_{j,r,t} = \mathbb{E}_{x \in \mathcal{D}_{j,r,t}}\big[q_{j,t}(x)\big]. \tag{17}$$
This aggregation step serves two roles. It compresses per-sample uncertainty into a small set of scalars per worker, which is communication-friendly, and it provides a stable estimate of how informative the worker data stream is over the duration of an epoch.
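The per-sample quantities and their epoch means can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the helper names (`softmax`, `normalized_entropy`, `margin_ambiguity`, `worker_epoch_means`) are hypothetical, logits are passed as plain lists, and at least two classes are assumed.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by log(C), so the value lies in [0, 1]."""
    num_classes = len(probs)
    if num_classes < 2:
        return 0.0  # a single-class problem carries no uncertainty
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(num_classes)


def margin_ambiguity(probs: Sequence[float]) -> float:
    """One minus the gap between the two largest probabilities."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return 1.0 - (top1 - top2)


def worker_epoch_means(logit_rows: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Average both statistics over the samples a worker saw in one epoch."""
    probs = [softmax(row) for row in logit_rows]
    h_mean = sum(normalized_entropy(p) for p in probs) / len(probs)
    q_mean = sum(margin_ambiguity(p) for p in probs) / len(probs)
    return h_mean, q_mean
```

Uniform logits give a normalized entropy and a margin ambiguity of 1, while a sharply peaked prediction drives both toward 0, which is the behavior the two measures are meant to capture.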

4.3. Robust Weight Construction for Uncertainty-Guided Control

We normalize the worker statistics within the job:
$$\hat{H}_{j,r,t} = \frac{H_{j,r,t}}{\sum_{s \in \mathcal{R}_j} H_{j,s,t}}, \qquad \hat{Q}_{j,r,t} = \frac{Q_{j,r,t}}{\sum_{s \in \mathcal{R}_j} Q_{j,s,t}}. \tag{18}$$
Normalization ensures that weights represent relative importance among workers of the same job, independent of the absolute magnitude of uncertainty. We combine them into a single worker score:
$$u_{j,r,t} = \alpha\, \hat{H}_{j,r,t} + (1 - \alpha)\, \hat{Q}_{j,r,t}, \tag{19}$$
where $\alpha$ controls the mixture between normalized entropy and margin ambiguity. From a control perspective, $u_{j,r,t}$ is the raw signal that indicates where learning is currently concentrated inside the job.
Raw signals can oscillate due to minibatch stochasticity and due to changes in effective parallelism under multi-job contention. We therefore introduce guardrails that make the controller robust. We first bound the score
$$u^{\mathrm{clip}}_{j,r,t} = \operatorname{clip}\big(u_{j,r,t},\, w_{\min},\, w_{\max}\big), \tag{20}$$
which prevents a single worker from dominating and also prevents any worker from being starved. We then apply exponential smoothing:
$$\tilde{w}_{j,r,t} = \beta\, w_{j,r,t-1} + (1 - \beta)\, u^{\mathrm{clip}}_{j,r,t}, \tag{21}$$
so that allocations evolve gradually across epochs rather than reacting aggressively to short-term fluctuations. Finally, we renormalize to obtain simplex weights:
$$w_{j,r,t} = \frac{\tilde{w}_{j,r,t}}{\sum_{s \in \mathcal{R}_j} \tilde{w}_{j,s,t}}. \tag{22}$$
These weights are the central interface between learning and systems. Inside a job, they determine how data and compute are redistributed across workers for the next epoch. Across jobs, a job-level summary derived from the same weights drives cluster scheduling decisions.
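The weight-construction pipeline (normalize, blend, clip, smooth, renormalize) can be sketched as below. The function names and every default hyperparameter value here are hypothetical choices for illustration and are not values reported in this paper. The small `eps` in the first normalization follows the stabilizer described in the problem formulation.

```python
from typing import List, Sequence


def normalize(values: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Divide each value by the (stabilized) sum of all values."""
    total = sum(values) + eps
    return [v / total for v in values]


def update_weights(
    H: Sequence[float],             # per-worker mean normalized entropy
    Q: Sequence[float],             # per-worker mean margin ambiguity
    prev_weights: Sequence[float],  # weights from the previous epoch
    alpha: float = 0.7,             # hypothetical mixture coefficient
    beta: float = 0.8,              # hypothetical smoothing coefficient
    w_min: float = 0.05,            # hypothetical lower clip bound
    w_max: float = 0.6,             # hypothetical upper clip bound
) -> List[float]:
    """Return simplex weights for the next epoch."""
    H_hat = normalize(H)
    Q_hat = normalize(Q)
    # Blend the two normalized signals into one raw worker score.
    u = [alpha * h + (1.0 - alpha) * q for h, q in zip(H_hat, Q_hat)]
    # Clip so that no worker dominates and none is starved.
    u_clip = [min(max(x, w_min), w_max) for x in u]
    # Exponential smoothing against the previous epoch's weights.
    w_tilde = [beta * w + (1.0 - beta) * c for w, c in zip(prev_weights, u_clip)]
    # Renormalize onto the simplex; the sum is positive because w_min > 0.
    return normalize(w_tilde, eps=0.0)
```

For example, with two workers whose raw scores are 0.9 and 0.1 and uniform previous weights, clipping caps the first score at 0.6 and smoothing yields weights of roughly 0.553 and 0.447, so the allocation moves toward the more uncertain worker without jumping there in one step.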

4.4. Micro-Level Controls at Epoch Boundaries

We apply three controls at epoch $t+1$. The key design choice is that all controls are applied to the next epoch, which makes them compatible with standard synchronous training and avoids interfering with per-step synchronization.
Let $N_{j,t+1}$ be the total number of samples to be processed by job $j$ in epoch $t+1$ under its current resource budget. We allocate per-worker sample counts by
$$\bar{N}_{j,r,t+1} = N_{j,t+1}\, w_{j,r,t}, \qquad N_{j,r,t+1} = \operatorname{round}\big(\bar{N}_{j,r,t+1}\big). \tag{23}$$
This rule shifts data exposure toward workers whose data stream is currently more uncertain, which increases the proportion of updates drawn from ambiguous examples. Since rounding may violate the exact total, we enforce the budget constraint
$$\sum_{r \in \mathcal{R}_j} N_{j,r,t+1} = N_{j,t+1} \tag{24}$$
using a remainder correction that assigns leftover samples to the largest fractional parts. This keeps the total work per epoch comparable across methods and isolates the effect of redistribution.
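One common way to realize such a budget-preserving integer allocation is the largest-remainder scheme sketched below. Note the assumption: this sketch floors the ideal shares and then hands out the leftover samples by largest fractional part, which is a standard variant and may differ in detail from the round-then-correct procedure described above. The function name `allocate_samples` is hypothetical.

```python
import math
from typing import List, Sequence


def allocate_samples(total: int, weights: Sequence[float]) -> List[int]:
    """Split `total` samples across workers in proportion to `weights`.

    The returned integer counts always sum to `total`.
    """
    weight_sum = sum(weights)  # guards against weights that drift off the simplex
    ideal = [total * w / weight_sum for w in weights]
    counts = [math.floor(x) for x in ideal]
    leftover = max(total - sum(counts), 0)
    # Workers with the largest fractional parts receive the leftover samples.
    by_fraction = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts
```

With a budget of 7 samples and two equally weighted workers, the ideal shares are 3.5 each; the sketch returns 4 and 3, breaking the tie in favor of the lower worker index.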
We map weights to batch sizes within bounds $b_{\min}$ and $b_{\max}$ as
$$b_{j,r,t+1} = \operatorname{clip}\Big(\operatorname{round}\big(b_{\min} + (b_{\max} - b_{\min})\, w_{j,r,t}\big),\, b_{\min},\, b_{\max}\Big). \tag{25}$$
Batch size interacts with gradient noise and synchronization. Prior work has shown that batch size and learning rate are coupled through their effect on optimization stability and gradient noise scale, and that conservative joint adaptation can improve training robustness under changing effective regimes. Under contention, communication delays or reduced replica counts effectively change the optimization regime seen by the job. Our mapping is therefore intended as a bounded control rule that reallocates per-worker batch sizes according to uncertainty while keeping the update within a safe operating range. The role of this rule is not to claim a new optimization theorem, but to provide a lightweight and stable mechanism that is consistent with prior observations on batch-size/learning-rate coupling and large-batch training behavior. The bounds ensure predictable memory usage and avoid sudden jumps that can destabilize training.
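The bounded batch-size mapping can be sketched in a few lines. The function name `batch_sizes` and the bounds in the usage note are hypothetical. The sketch implements only the per-worker mapping; any additional step that reconciles the results with a job-level batch budget is not shown.

```python
from typing import List, Sequence


def batch_sizes(weights: Sequence[float], b_min: int, b_max: int) -> List[int]:
    """Map each worker weight in [0, 1] to an integer batch size in [b_min, b_max]."""
    sizes = []
    for w in weights:
        # Linear interpolation between the bounds, then integer rounding.
        # Python's round() rounds exact halves to the nearest even integer.
        raw = round(b_min + (b_max - b_min) * w)
        sizes.append(min(max(raw, b_min), b_max))
    return sizes
```

With bounds of 32 and 256, weights of 0, 0.5, and 1 map to batch sizes 32, 144, and 256, so a worker with a larger weight always receives a batch at least as large.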
We summarize job uncertainty using the weighted score
$$U_{j,t} = \sum_{r \in \mathcal{R}_j} w_{j,r,t}\, u_{j,r,t}. \tag{26}$$
This scalar captures the uncertainty state of the job at the end of epoch $t$ and serves as a feedback measurement. We then update the learning rate using
$$\eta_{j,t+1} = \eta_{j,t} \left( \frac{U_{j,t}}{U_{j,0}} \right)^{-\gamma}, \tag{27}$$
where $\gamma > 0$ controls the sensitivity of the learning-rate modulation to changes in job-level uncertainty. Larger values of $\gamma$ make the controller react more aggressively to uncertainty fluctuations, whereas smaller values lead to more conservative updates. In our implementation, $\gamma$ is set to provide stable epoch-level adaptation without causing abrupt changes in the optimizer state. When uncertainty rises relative to the initial stage, the learning rate is reduced to prevent overly aggressive updates. When uncertainty declines, the learning rate relaxes to maintain efficient progress. Because the update uses an epoch-level measurement, it remains stable and adds negligible overhead.
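A minimal sketch of this update is given below. It assumes the power-law form with a negative exponent, which is the reading consistent with the behavior described above (a lower rate when uncertainty exceeds its initial level, a higher rate when it falls below). The function name, the default `gamma`, and the `eps` guard are hypothetical.

```python
def update_learning_rate(
    eta: float,          # current learning rate
    U_t: float,          # job-level uncertainty at the end of epoch t
    U_0: float,          # job-level uncertainty at the initial epoch
    gamma: float = 0.5,  # hypothetical sensitivity; must be positive
    eps: float = 1e-12,  # guard against division by zero (an assumption)
) -> float:
    """Scale the learning rate by (U_t / U_0) ** (-gamma)."""
    ratio = max(U_t, eps) / max(U_0, eps)
    return eta * ratio ** (-gamma)
```

When `U_t` equals `U_0` the rate is unchanged; a ratio above 1 shrinks it and a ratio below 1 enlarges it, with `gamma` setting how strongly.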
This design also helps mitigate potential conflicts between local adaptation and global scheduling. Macro-level allocation determines how much resource budget a job receives, while the micro-level controller regulates how that budget is used so that optimization remains stable under the resulting effective training regime. In this sense, the two levels are coordinated through the same uncertainty signal, but they need not have identical objectives at every moment; rather, the shared signal provides a consistent interface that reduces mismatch between learning-aware adaptation and scheduling decisions.

4.5. Macro-Level Scheduling via Uncertainty-Guided Utility

The macro-level scheduler operates at epoch boundaries and uses the same uncertainty signal as the micro-level controller, aggregated at the job level. From (26), each job reports $U_{j,t}$ once per epoch. We define the short-horizon progress proxy
$$\Delta U_{j,t} = U_{j,t-1} - U_{j,t}, \tag{28}$$
and form a job utility score that captures both current learning demand and near-term learning progress
$$S_{j,t} = \lambda\, U_{j,t} + (1 - \lambda)\, \operatorname{ReLU}\big(\Delta U_{j,t}\big), \tag{29}$$
where $\lambda \in [0, 1]$ balances the contribution of the current uncertainty level and the recent uncertainty reduction trend. A larger $\lambda$ places more emphasis on present learning demand, whereas a smaller $\lambda$ gives relatively more weight to short-term progress. In our implementation, $\lambda$ is set to provide a stable compromise between these two signals so that the scheduler remains responsive without overreacting to short-term fluctuations. A higher $S_{j,t}$ indicates that allocating resources to job $j$ is expected to yield a stronger reduction in uncertainty and thus faster convergence.
The use of $S_{j,t}$ as a scheduling priority is based on the following interpretation. A job with persistently high uncertainty still has substantial unresolved learning difficulty, while a positive recent decrease in uncertainty indicates that the job is converting computation into useful progress. The score therefore combines current learning demand with short-horizon responsiveness to additional compute. This does not imply that micro-level utility and macro-level priority are always identical in a strict optimization sense. Rather, the same normalized signal is used to align the two levels heuristically so that jobs are prioritized not only by resource occupancy but also by their expected learning benefit under contention.
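The utility score can be written as a short function. The name `utility_score` and the default value of `lam` are hypothetical; the rectification simply discards epochs in which uncertainty increased.

```python
def utility_score(U_t: float, U_prev: float, lam: float = 0.5) -> float:
    """Blend the current uncertainty level with the rectified recent decrease.

    `U_t` is the job uncertainty at epoch t, `U_prev` the value at epoch t-1,
    and `lam` in [0, 1] trades off level against recent progress.
    """
    delta = U_prev - U_t  # positive when uncertainty went down
    return lam * U_t + (1.0 - lam) * max(delta, 0.0)
```

With `lam = 0.5`, a job whose uncertainty fell from 0.8 to 0.6 scores about 0.4, and a job whose uncertainty rose from 0.6 to 0.8 also scores 0.4: the first earns credit for progress at a lower uncertainty level, the second only for its higher remaining uncertainty.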
At each decision point, the scheduler ranks runnable candidates by utility score rather than submission order. For newly arrived or still pending jobs, $U_{j,t}$ may be unavailable. The scheduler then falls back to a default prior $U_{j,\mathrm{init}}$ or uses a short warm-up measurement. The scheduler selects the highest-scoring job that fits within the currently idle GPUs. This defines an entropy-guided admission rule in which execution order is determined by learning utility.
Let the cluster have $G$ GPUs. The scheduler assigns integer quotas $g_{j,t+1}$ such that
$$\sum_{j \in \mathcal{J}} g_{j,t+1} \le G. \tag{30}$$
We use proportional allocation to translate utility into quotas:
$$\tilde{g}_{j,t+1} = G \cdot \frac{S_{j,t}}{\sum_{k \in \mathcal{J}} S_{k,t}}, \qquad g_{j,t+1} = \operatorname{clip}\big(\operatorname{round}(\tilde{g}_{j,t+1}),\, g_{\min},\, g_{\max}\big), \tag{31}$$
with remainder correction to satisfy the budget. This ensures stable integer allocations while preserving the monotonic relationship between $S_{j,t}$ and allocated resources.
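The quota assignment can be sketched as follows. The correction step shown here is one possible realization and is an assumption: when rounding and clipping overshoot the budget, the sketch repeatedly removes one GPU from the job that is most over-allocated relative to its proportional share. The function name and default bounds are hypothetical.

```python
from typing import List, Sequence


def assign_quotas(
    scores: Sequence[float], G: int, g_min: int = 1, g_max: int = 8
) -> List[int]:
    """Turn job utility scores into integer GPU quotas that sum to at most G."""
    n = len(scores)
    if n * g_min > G:
        raise ValueError("cluster too small to give every job g_min GPUs")
    total = sum(scores)
    # Proportional ideal shares; fall back to an even split if all scores are zero.
    ideal = [G * s / total for s in scores] if total > 0 else [G / n] * n
    quotas = [min(max(round(x), g_min), g_max) for x in ideal]
    # Correction: shave GPUs from the most over-allocated job until feasible.
    while sum(quotas) > G:
        candidates = [i for i in range(n) if quotas[i] > g_min]
        worst = max(candidates, key=lambda i: quotas[i] - ideal[i])
        quotas[worst] -= 1
    return quotas
```

For scores of 0.6, 0.3, and 0.1 on an 8-GPU cluster, the proportional shares are about 4.8, 2.4, and 0.8, which round to quotas of 5, 2, and 1 and already satisfy the budget.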
In practical clusters, quotas must be realized by assigning jobs to concrete GPU nodes. Let $\mathcal{N}$ be the set of nodes. We represent placement by $x_{j,n,t+1} \in \{0, 1\}$ and enforce
$$\sum_{n \in \mathcal{N}} x_{j,n,t+1} = g_{j,t+1}, \qquad \sum_{j \in \mathcal{J}} x_{j,n,t+1} \le 1 \quad \forall n \in \mathcal{N}. \tag{32}$$
Among feasible placements, the macro-level scheduler may use a deterministic ordering of assigned nodes to choose a master rank and may prefer placements that reduce expected contention. Importantly, placement does not change the learning signal itself. Instead, it realizes the entropy-guided decision of which jobs run and how much parallelism they receive.
The key interaction remains that $g_{j,t+1}$ determines each job’s next-epoch processing budget $N_{j,t+1}$, and the micro-level controller then distributes that budget across workers using (23)–(27). This closes the loop between macro-level execution decisions and micro-level adaptation, both driven by the same entropy-based utility.

4.6. End-to-End Scheduling Algorithm

Algorithm 1 summarizes the method as an epoch-level control loop. We use the epoch boundary as the control point because it already provides a natural synchronization barrier in data-parallel training, and it allows uncertainty statistics to be aggregated with negligible overhead. The algorithm applies a one-epoch delay in all control actions, meaning that weights computed at epoch $t$ determine sharding, batch size, and learning rate for epoch $t+1$, which avoids interfering with per-step synchronization. At the cluster level, job scores are computed from the same uncertainty signal and translated into GPU quotas that determine the next-epoch data budget, closing the loop between macro-level allocation and micro-level adaptation.
Algorithm 1 Entropy-guided hierarchical scheduling
Require: Active jobs $\mathcal{J}$, workers $\mathcal{R}_j$ for each job, cluster GPUs $G$
Require: Hyperparameters $\alpha$, $\beta$, $w_{\min}$, $w_{\max}$, $b_{\min}$, $b_{\max}$, $\gamma$, $\lambda$
 1: Initialize $w_{j,r,0} \leftarrow 1/|\mathcal{R}_j|$ and learning rate $\eta_{j,0}$ for all jobs and workers
 2: for each epoch $t = 0, 1, 2, \ldots$ do
 3:   for all jobs $j \in \mathcal{J}$ in parallel do
 4:     for all workers $r \in \mathcal{R}_j$ in parallel do
 5:       Train for one epoch using current shard and batch size
 6:       Accumulate $\tilde{h}_{j,t}(x)$ and optionally $q_{j,t}(x)$ from logits
 7:       Compute $H_{j,r,t}$ and $Q_{j,r,t}$ as epoch means
 8:     end for
 9:     Compute $\hat{H}_{j,r,t}$ and $\hat{Q}_{j,r,t}$ using (18)
10:     Compute $u_{j,r,t}$ using (19)
11:     Compute $w_{j,r,t}$ using (20), (21), and (22)
12:     Compute job uncertainty $U_{j,t}$ using (26)
13:     Compute job score $S_{j,t}$ using (28) and (29)
14:   end for
15:   Scheduler computes next quotas $g_{j,t+1}$ for all jobs using (31) under (30)
16:   for all jobs $j \in \mathcal{J}$ do
17:     Derive next-epoch data budget $N_{j,t+1}$ from quota $g_{j,t+1}$
18:     Compute $N_{j,r,t+1}$ using (23) and enforce (24)
19:     Compute $b_{j,r,t+1}$ using (25)
20:     Update $\eta_{j,t+1}$ using (27)
21:   end for
22: end for
The method is designed as a thin layer on top of an existing data-parallel training stack and a cluster scheduler. Uncertainty computation is performed during the forward pass using logits already produced for the loss, and thus requires no additional model evaluations, forward/backward passes, or extra labels. Each worker accumulates a running sum of normalized entropy and a counter of processed samples, and it may also track a margin-based ambiguity statistic. At the epoch boundary, the job performs a small number of collective operations to aggregate these scalars and to compute the normalized quantities needed for weight construction. Because only compact scalar statistics are exchanged once per epoch, the additional communication cost remains small compared with standard gradient synchronization and does not require changes to gradient all-reduce or per-step training logic. This design keeps the control overhead lightweight while remaining compatible with standard synchronous data-parallel training.
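The per-worker bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions: normalized entropy is taken here as the softmax entropy divided by $\log C$ for $C \ge 2$ classes, which may differ in detail from the paper's own definition, and the names `normalized_entropy`, `EntropyAccumulator`, and `job_mean` are hypothetical. The final reduction stands in for a sum all-reduce over workers.

```python
import math


def normalized_entropy(logits):
    """Softmax entropy divided by log(C), giving a value in [0, 1]."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    z_sum = sum(exps)
    probs = [e / z_sum for e in exps]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(logits))


class EntropyAccumulator:
    """Running sum and sample count kept by one worker during an epoch."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_logits):
        # batch_logits: one list of class logits per sample
        for logits in batch_logits:
            self.total += normalized_entropy(logits)
            self.count += 1

    def state(self):
        # Two scalars are all that must be exchanged at the epoch boundary
        return self.total, self.count


def job_mean(states):
    """Combine per-worker (sum, count) pairs into a job-level mean."""
    total = sum(s for s, _ in states)
    count = sum(c for _, c in states)
    return total / count if count else 0.0
```

Because each worker contributes only a sum and a count, the epoch-boundary exchange stays independent of model size and dataset size.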
Data-sharding adaptation can be implemented by updating the sampler at each epoch boundary. When datasets are represented by index lists, the controller assigns each worker a contiguous or strided subset of size $N_{j,r,t+1}$ while ensuring that the union matches the intended epoch budget. The remainder correction in (24) can be applied deterministically to preserve reproducibility. Batch-size updates can be applied by re-instantiating each worker's data loader with the new $b_{j,r,t+1}$ or by using a loader that supports dynamic batch sizes. The bounds $b_{\min}$ and $b_{\max}$ can be chosen once per job through lightweight profiling and then kept fixed. The learning-rate update in (27) is applied at the epoch boundary through the optimizer parameter group. For macro-level scheduling, each job reports the scalar score $S_{j,t}$ once per epoch, and the scheduler converts it into the next-epoch quota and data budget $N_{j,t+1}$ through a throughput model or measured step time. This separation keeps the system modular. The scheduler consumes only a compact job-level summary, while the training code consumes only the assigned quota and does not require access to other jobs.
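The index-list sharding step can be sketched as below. The exact forms of (23) and (24) are not reproduced here, so this sketch assumes a largest-remainder rule: each worker receives the floor of its weighted share, and leftover samples go to the workers with the largest fractional parts, with ties broken by rank. The function names are hypothetical.

```python
def shard_sizes(weights, budget):
    """Split an epoch budget across workers in proportion to their weights.

    Returns integer sizes that sum exactly to budget. The tie-breaking by
    worker rank keeps the result deterministic across runs.
    """
    total = sum(weights)
    raw = [budget * w / total for w in weights]
    sizes = [int(x) for x in raw]  # floor, since shares are non-negative
    leftover = budget - sum(sizes)
    # Largest fractional part first, then lowest rank
    order = sorted(range(len(weights)), key=lambda r: (-(raw[r] - sizes[r]), r))
    for r in order[:leftover]:
        sizes[r] += 1
    return sizes


def contiguous_shards(indices, sizes):
    """Cut an index list into consecutive, non-overlapping worker shards."""
    shards, start = [], 0
    for n in sizes:
        shards.append(indices[start:start + n])
        start += n
    return shards
```

With three equally weighted workers and a budget of ten samples, the sizes are four, three, and three, and the shards together cover the ten indices exactly once.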

5. Experimental Setup

Testbed. Experiments run on a shared GPU cluster with four NVIDIA RTX A5000 GPUs. All methods share the same physical cluster to induce realistic multi-tenant contention on compute and communication. Table 1 summarizes the hardware and software environment.
Workload. We submit fifty jobs drawn from eight job types. Each job type fixes the model, dataset, base batch size, and training mode. Job types J1 to J3 run for fifty epochs, while J4 to J8 run for twenty epochs. The submission mix follows the configured multiplicities and priority levels, where J1 to J3 have priority one, J4 to J6 have priority two, and J7 to J8 have priority three. Table 2 summarizes all job types and the workload composition.
Methods and metrics. We compare the proposed scheduler against FIFO, shortest job first (SJF), and Lucid. We evaluate system performance using JCT, makespan, and the cumulative distribution function of JCT with emphasis on tail behavior. We evaluate learning quality using Top-1 accuracy and loss for MNIST, Fashion MNIST, CIFAR10, and CIFAR100, and Top-1 and Top-5 accuracy with loss for Tiny ImageNet. Table 3 summarizes the training configuration shared across methods and the control ranges used by the proposed approach.
Large-scale simulation. To study scaling beyond the four-GPU testbed, we evaluate the same workload using a calibrated simulator. The simulator models step time and communication overhead as a function of allocated GPUs and contention, with parameters calibrated from measurements on the real cluster. We evaluate performance at 64, 128, 256, 512 and 1024 nodes using the simulator.

5.1. Experimental Results

5.1.1. Overall System Performance

Table 4 summarizes overall system performance in terms of average JCT, average queueing delay, and workload makespan, all measured in hours. We define JCT as the wall-clock time from job submission to completion, and queueing delay as the time from submission to the first GPU allocation. We define makespan as the elapsed time between the earliest submission and the last job completion in the workload.
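These three definitions translate directly into code. The sketch below assumes each job is logged as a `(submit, start, finish)` tuple of timestamps in hours; that record layout and the function name `summarize` are illustrative choices, not part of the original evaluation scripts.

```python
def summarize(jobs):
    """Return (average JCT, average queueing delay, makespan).

    jobs: list of (submit, start, finish) timestamps, where start is the
    time of the first GPU allocation.
    """
    n = len(jobs)
    # JCT: submission to completion
    avg_jct = sum(finish - submit for submit, _, finish in jobs) / n
    # Queueing delay: submission to first GPU allocation
    avg_queue = sum(start - submit for submit, start, _ in jobs) / n
    # Makespan: earliest submission to last completion
    makespan = max(f for _, _, f in jobs) - min(s for s, _, _ in jobs)
    return avg_jct, avg_queue, makespan
```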
Across the 50-job workload, Entropy achieves the lowest average JCT of 5.69 h, compared to 7.46 h for Lucid, 9.83 h for SJF, and 10.12 h for FIFO. This corresponds to a 23.7% reduction over Lucid and a 43.8% reduction over FIFO. Relative to Lucid, Entropy also reduces average queueing delay from 6.80 to 4.71 h, a 30.7% reduction, indicating that utility-guided admission and quota updates alleviate waiting under multi-tenant contention. Finally, Entropy shortens the workload makespan from 17.27 to 14.56 h, a 15.7% reduction over Lucid, improving cluster-level throughput relative to the baselines. These results suggest that aligning admission and resource allocation with learning utility can simultaneously reduce both per-job latency and end-to-end workload completion time.
Entropy consistently improves the upper tail of the distribution. Table 5 highlights this effect in the percentile summary. Compared to Lucid, Entropy reduces the 95th-percentile JCT from 15.03 to 11.39 h, corresponding to a 24.2% reduction, and reduces the maximum observed JCT from 17.23 to 14.52 h, corresponding to a 15.7% reduction. Entropy also improves the median (P50) from 6.72 to 4.60 h, corresponding to a 31.6% reduction, indicating that the latency improvement is not limited to a few outliers.
Figure 2 shows that Entropy shifts the JCT distribution left across a broad range of quantiles, rather than improving only a small subset of jobs. At the median, Entropy reduces JCT from 6.72 h for Lucid to 4.60 h, corresponding to a 31.6% reduction. The separation persists into the upper tail: the 90th-percentile decreases from 14.26 to 11.05 h, corresponding to a 22.5% reduction, and the 95th-percentile decreases from 15.03 to 11.39 h, corresponding to a 24.2% reduction. Moreover, the maximum observed JCT decreases from 17.23 to 14.52 h, corresponding to a 15.7% reduction, indicating that Entropy tightens the extreme tail under contention. Overall, the CDF indicates that Entropy improves completion times for a large fraction of jobs while also reducing tail risk, which aligns with the percentile summary in Table 5.
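For readers who want to reproduce this kind of summary from their own job logs, the sketch below computes an empirical CDF and percentiles from a list of JCT values. It assumes the nearest-rank percentile convention; the source does not state which convention was used for Table 5, so values from other conventions (for example, linear interpolation) may differ slightly.

```python
import math


def empirical_cdf(values):
    """Return sorted (value, cumulative fraction) pairs."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100 (assumed convention)."""
    xs = sorted(values)
    rank = math.ceil(p * len(xs) / 100)
    return xs[rank - 1]
```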
Figure 3 decomposes the average JCT into queueing delay from submission to start and service time from start to completion. The main improvement of Entropy comes from a substantial reduction in queueing delay. Compared to Lucid, Entropy lowers average queueing delay from 6.80 to 4.71 h, a 2.09 h (30.7%) reduction that more than accounts for the 1.77 h reduction in average JCT from 7.46 to 5.69 h. In contrast, the average service time under Entropy is higher than under Lucid, increasing from 0.66 to 0.98 h and offsetting 0.32 h of the queueing gain. This suggests that the benefit of the proposed method mainly comes from improved admission and GPU-quota decisions under contention, rather than from uniformly shortening execution time after a job starts. A plausible reason is that micro-level adaptation prioritizes learning-aware redistribution and stability, which can introduce conservative batch-size or learning-rate adjustments during some phases of training. As a result, the method may improve overall time to accuracy at the cluster level while not always minimizing raw per-job service time. Potential optimizations include less frequent control updates, threshold-based adaptation, and more conservative tuning of the micro-level control parameters.
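The decomposition can be checked directly from the reported averages, since average JCT is the sum of average queueing delay and average service time:

\[
\begin{aligned}
\text{Lucid:} \quad & 6.80 + 0.66 = 7.46\ \text{h}, \\
\text{Entropy:} \quad & 4.71 + 0.98 = 5.69\ \text{h}, \\
\Delta \text{JCT:} \quad & (4.71 - 6.80) + (0.98 - 0.66) = -2.09 + 0.32 = -1.77\ \text{h}.
\end{aligned}
\]

The net 1.77 h improvement is therefore the queueing gain minus the added service time.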
This result also helps clarify the computational overhead and runtime cost of the proposed method relative to the baseline schedulers. Although the entropy-guided control logic introduces additional adaptation decisions during training, its runtime effect appears as a moderate increase in service time rather than a large penalty in end-to-end latency. In our results, this overhead is outweighed by the larger reduction in queueing delay, so the net impact on overall JCT remains favorable. Therefore, the proposed method should be understood as introducing a lightweight runtime trade-off in exchange for improved cluster-level scheduling efficiency under contention.

5.1.2. Training Accuracy and Convergence

In addition to system-level latency, we examine whether different scheduling policies preserve training accuracy and convergence behavior. Table 6 reports the average best validation accuracy achieved by each job type under each scheduler, computed as the maximum validation accuracy attained during each run.
Entropy achieves the highest average accuracy in 7 out of 8 job types. Compared to Lucid, Entropy improves average best accuracy by 0.10–2.45 percentage points in seven job types, with an average gain of approximately 0.88 percentage points across all eight job types. Compared to FIFO, Entropy improves average best accuracy by 0.21–3.61 percentage points in seven job types, and remains within 0.96 percentage points in the remaining job type. Taken together, these results show that Entropy maintains competitive model quality across job types and often achieves higher accuracy than the baselines. Alongside the improvements in JCT and tail latency, this indicates that the proposed scheduler improves cluster responsiveness while preserving learning outcomes under multi-tenant contention.

5.1.3. Ablation Study

To isolate the contribution of the macro-level and micro-level components, we conduct an ablation study on a workload of 25 jobs drawn from job types J1 to J8. We compare three variants. The macro-only variant enables inter-job scheduling and quota updates only. The micro-only variant enables intra-job adaptation only while keeping inter-job scheduling fixed. The full method enables both components.
Table 7 summarizes the results. The full method achieves the best performance across all system-level metrics. Compared to macro-only, the full method reduces average JCT from 3.32 to 2.39 h, a 28.0% reduction, and reduces makespan from 8.61 to 6.77 h, a 21.4% reduction. Compared to micro-only, the full method reduces average JCT from 4.17 to 2.39 h, a 42.7% reduction, and reduces makespan from 10.04 to 6.77 h, a 32.6% reduction. These results indicate that combining macro-level allocation with micro-level adaptation yields complementary benefits under multi-tenant contention.
Macro-level control reduces queueing delay by directly shaping admission and quota assignment, while micro-only adaptation has limited leverage over waiting time because the macro-level policy is fixed. In contrast, micro-level adaptation primarily improves per-job training efficiency after GPUs are allocated, which helps translate allocated resources into faster progress. The gap between macro-only and the full method suggests that micro-level adaptation contributes additional gains beyond improved admission and quota decisions. Conversely, the gap between micro-only and the full method highlights the importance of learning-aware resource arbitration for improving cluster-level throughput under contention. Taken together, these ablation results support the practical rationale of the design. Although the macro-level and micro-level components contribute differently, the largest gains are obtained when both are coordinated through the shared uncertainty signal.

5.1.4. Scalability Under Large-Scale Simulation

We evaluate scalability using a calibrated simulator that models per-step training time and communication overhead as a function of allocated GPUs. The simulator is parameterized to reflect the target homogeneous A5000 cluster setting, including throughput-related scaling behavior, communication efficiency under multi-GPU allocation, and scheduler-side overhead. Job templates are also configured using representative workload characteristics such as batch size, GPU demand, and training-progress parameters, so that the simulator provides a controlled approximation of large-scale distributed training behavior. The simulator is intended to extend the small-scale real-cluster evaluation to larger node counts under a consistent workload model, using the same scheduler logic and representative job characteristics. Because the simulator abstracts training into a throughput model, the absolute times in Table 8 should be interpreted primarily as relative comparisons across schedulers under controlled scaling, rather than as exact wall-clock durations of specific real-world training runs or production deployments.
Table 8 reports average JCT and makespan in seconds for workloads of 2000 jobs as the cluster scales from 64 to 1024 nodes. As the number of nodes increases, average JCT and makespan decrease across all schedulers, reflecting increased parallel capacity. Entropy achieves the lowest average JCT and makespan at every scale in this simulated setting. At 1024 nodes, Entropy reduces average JCT from 593.3 s under Lucid to 444.0 s, a 25.2% reduction, and reduces makespan from 1324.0 s to 1084.5 s, an 18.1% reduction. These results suggest that the proposed policy preserves its advantage as the cluster size grows, consistently improving both per-job latency and end-to-end workload completion time under the simulator model.

5.1.5. Practical Deployment Discussion

From a practical deployment perspective, the proposed framework is designed to operate as a lightweight control layer on top of existing shared-cluster scheduling and synchronous data-parallel training workflows. Because the control decisions are applied at epoch boundaries, the method can be integrated without intrusive changes to the standard per-step training path. In production GPU clusters, however, deployment would still need to account for additional factors such as workload heterogeneity, varying contention patterns, scheduler policy constraints, and the trade-off between adaptation responsiveness and operational stability. These considerations suggest that deployment-oriented tuning of control frequency and parameter sensitivity will be important for robust operation in real multi-tenant environments.

6. Conclusions

In this study, we presented Entropy, an entropy-guided hierarchical scheduling approach that couples micro-level adaptation and macro-level allocation through a shared uncertainty signal computed from logits and aggregated at epoch boundaries. Across our multi-tenant A5000 testbed, Entropy substantially improves cluster responsiveness and tail latency. In particular, Entropy reduces the 95th-percentile JCT from 15.03 h under Lucid to 11.39 h, corresponding to a 24.2% reduction, and lowers the median JCT from 6.72 to 4.60 h, corresponding to a 31.6% reduction. These gains are driven primarily by shorter queueing delay, indicating that uncertainty-guided admission and quota decisions mitigate contention-induced waiting while maintaining competitive learning outcomes across job types.
The current study also has several limitations. First, the empirical evaluation is centered on controlled multi-tenant settings and primarily vision-oriented workloads, which do not fully represent the diversity of emerging distributed training scenarios, such as NLP, multimodal learning, and large-model fine-tuning. Second, although the proposed framework is designed to operate with lightweight epoch-level control, the current evaluation does not cover all practical systems effects that may arise in real production deployments, including stronger heterogeneity in hardware, workload interference, and broader failure or straggler conditions. Third, the large-scale results are based on a simulator-driven evaluation, which is useful for controlled comparative analysis but does not replace full validation in a production-scale environment.
In future work, we will extend the approach to a broader set of NLP and multimodal workloads, including large language model fine-tuning and other long-context tasks where learning dynamics and resource sensitivity differ from vision benchmarks. We will also evaluate Entropy on more diverse tasks and larger-scale distributed datasets under mixed batch sizes, mixed sequence lengths, and more heterogeneous training regimes. Finally, we will study deployment in heterogeneous clusters that combine different GPU generations and interconnects, and we will explore robustness to practical systems effects, such as interference-aware placement, checkpointing overheads, and straggler mitigation in large-scale distributed training.

Author Contributions

Conceptualization, T.-J.S.; Methodology, T.-J.S.; Software, T.-J.S.; Validation, T.-J.S.; Formal analysis, T.-J.S. and E.-N.H.; Investigation, T.-J.S. and E.-N.H.; Resources, E.-N.H.; Data curation, T.-J.S.; Writing—original draft, T.-J.S.; Writing—review & editing, T.-J.S. and E.-N.H.; Visualization, T.-J.S.; Supervision, E.-N.H.; Project administration, E.-N.H.; Funding acquisition, E.-N.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2026-RS-2024-00438239, 70%) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the Global-Learning & Academic Research Institution for Master’s and PhD Students, and Postdocs (G-LAMP) Program of the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (No. RS-2025-25442355, 30%).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code developed in this study is not publicly available due to institutional restrictions. The experiments were conducted on publicly available datasets, which can be accessed from their respective sources.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Ye, Z.; Gao, W.; Hu, Q.; Sun, P.; Wang, X.; Luo, Y.; Zhang, T.; Wen, Y. Deep learning workload scheduling in gpu datacenters: A survey. ACM Comput. Surv. 2024, 56, 1–38. [Google Scholar] [CrossRef]
  2. Peng, Y.; Bao, Y.; Chen, Y.; Wu, C.; Guo, C. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference, Porto, Portugal, 23–26 April 2018. [Google Scholar]
  3. Mahajan, K.; Balasubramanian, A.; Singhvi, A.; Venkataraman, S.; Akella, A.; Phanishayee, A.; Chawla, S. Themis: Fair and efficient GPU cluster scheduling. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), Santa Clara, CA, USA, 25–27 February 2020. [Google Scholar]
  4. Xiao, W.; Bhardwaj, R.; Ramjee, R.; Sivathanu, M.; Kwatra, N.; Han, Z.; Patel, P.; Peng, X.; Zhao, H.; Zhang, Q.; et al. Gandiva: Introspective cluster scheduling for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, USA, 8–10 October 2018. [Google Scholar]
  5. Gu, J.; Chowdhury, M.; Shin, K.G.; Zhu, Y.; Jeon, M.; Qian, J.; Liu, H.; Guo, C. Tiresias: A GPU cluster manager for distributed deep learning. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), Boston, MA, USA, 26–28 February 2019. [Google Scholar]
  6. Zheng, P.; Pan, R.; Khan, T.; Venkataraman, S.; Akella, A. Shockwave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, USA, 17–19 April 2023. [Google Scholar]
  7. Zheng, P.; Pan, R.; Khan, T.; Venkataraman, S.; Akella, A. Astraea: A fair deep learning scheduler for multi-tenant gpu clusters. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 2781–2793. [Google Scholar] [CrossRef]
  8. Kaur, R.; Asad, A.; Al Abdul Wahid, S.; Mohammadi, F. A Survey of Advancements in Scheduling Techniques for Efficient Deep Learning Computations on GPUs. Electronics 2025, 14, 1048. [Google Scholar] [CrossRef]
  9. Hu, Q.; Ye, Z.; Wang, Z.; Wang, G.; Zhang, M.; Chen, Q.; Sun, P.; Lin, D.; Wang, X.; Luo, Y.; et al. Characterization of large language model development in the datacenter. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), Santa Clara, CA, USA, 16–18 April 2024. [Google Scholar]
  10. Qiao, A.; Choe, S.K.; Subramanya, S.J.; Neiswanger, W.; Ho, Q.; Zhang, H.; Ganger, G.R.; Xing, E.P. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. In Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), Virtual, 14–16 July 2021. [Google Scholar]
  11. Li, J.; Xu, H.; Zhu, Y.; Liu, Z.; Guo, C.; Wang, C. Lyra: Elastic scheduling for deep learning clusters. In Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy, 8–11 May 2023. [Google Scholar]
  12. Sharma, A.; Bhasi, V.M.; Singh, S.; Kesidis, G.; Kandemir, M.T.; Das, C.R. Gpu cluster scheduling for network-sensitive deep learning. arXiv 2024, arXiv:2401.16492. [Google Scholar] [CrossRef]
  13. Strati, F.; Ma, X.; Klimovic, A. Orion: Interference-aware, fine-grained gpu sharing for ml applications. In Proceedings of the Nineteenth European Conference on Computer Systems, Athens, Greece, 23–25 April 2024. [Google Scholar]
  14. Weng, Q.; Yang, L.; Yu, Y.; Wang, W.; Tang, X.; Yang, G.; Zhang, L. Beware of fragmentation: Scheduling GPU-Sharing workloads with fragmentation gradient descent. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23), Santa Clara, CA, USA, 10–12 July 2023. [Google Scholar]
  15. Yang, Z.Y.; Xia, W.K.; Chu, H.Q.; Su, W.H.; Wang, R.F.; Wang, H. A comprehensive review of deep learning applications in cotton industry: From field monitoring to smart processing. Plants 2025, 14, 1481. [Google Scholar] [CrossRef]
  16. Wu, A.Q.; Li, K.L.; Song, Z.Y.; Lou, X.; Hu, P.; Yang, W.; Wang, R.F. Deep learning for sustainable aquaculture: Opportunities and challenges. Sustainability 2025, 17, 5084. [Google Scholar] [CrossRef]
  17. Chen, J.; Pan, X.; Monga, R.; Bengio, S.; Jozefowicz, R. Revisiting distributed synchronous SGD. arXiv 2016, arXiv:1604.00981. [Google Scholar]
  18. Sergeev, A.; Del Balso, M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv 2018, arXiv:1802.05799. [Google Scholar] [CrossRef]
  19. Zhang, H.; Zheng, Z.; Xu, S.; Dai, W.; Ho, Q.; Liang, X.; Hu, Z.; Wei, J.; Xie, P.; Xing, E.P. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 17), Santa Clara, CA, USA, 12–14 July 2017. [Google Scholar]
  20. Jiang, Y.; Zhu, Y.; Lan, C.; Yi, B.; Cui, Y.; Guo, C. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Virtual, 4–6 November 2020. [Google Scholar]
  21. Provatas, N.; Konstantinou, I.; Koziris, N. A Survey on Parameter Server Architecture: Approaches for Optimizing Distributed Centralized Learning. IEEE Access 2025, 13, 30993–31015. [Google Scholar] [CrossRef]
  22. Liang, F.; Zhang, Z.; Lu, H.; Leung, V.; Guo, Y.; Hu, X. Communication-efficient large-scale distributed deep learning: A comprehensive survey. arXiv 2024, arXiv:2404.06114. [Google Scholar]
  23. Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv 2017, arXiv:1706.02677. [Google Scholar]
  24. McCandlish, S.; Kaplan, J.; Amodei, D.; OAID Team. An empirical model of large-batch training. arXiv 2018, arXiv:1812.06162. [Google Scholar] [CrossRef]
  25. Smith, S.L.; Kindermans, P.J.; Ying, C.; Le, Q.V. Don’t decay the learning rate, increase the batch size. arXiv 2017, arXiv:1711.00489. [Google Scholar]
  26. You, Y.; Li, J.; Reddi, S.; Hseu, J.; Kumar, S.; Bhojanapalli, S.; Song, X.; Demmel, J.; Keutzer, K.; Hsieh, C.-J. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv 2019, arXiv:1904.00962. [Google Scholar]
  27. Settles, B. Active Learning Literature Survey; University of Wisconsin-Madison: Madison, WI, USA, 2009. [Google Scholar]
  28. Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016. [Google Scholar]
  29. Şahin, E.; Arslan, N.N.; Özdemir, D. Unlocking the black box: An in-depth review on interpretability, explainability, and reliability in deep learning. Neural Comput. Appl. 2025, 37, 859–965. [Google Scholar] [CrossRef]
  30. Fort, S.; Hu, H.; Lakshminarayanan, B. Deep ensembles: A loss landscape perspective. arXiv 2019, arXiv:1912.02757. [Google Scholar]
  31. Ganaie, M.A.; Hu, M.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151. [Google Scholar] [CrossRef]
  32. Khan, A.A.; Chaudhari, O.; Chandra, R. A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Syst. Appl. 2024, 244, 122778. [Google Scholar] [CrossRef]
  33. Kulichenko, M.; Nebgen, B.; Lubbers, N.; Smith, J.S.; Barros, K.; Allen, A.E.A.; Habib, A.; Shinkle, E.; Fedik, N.; Li, Y.W.; et al. Data generation for machine learning interatomic potentials and beyond. Chem. Rev. 2024, 124, 13681–13714. [Google Scholar] [CrossRef]
  34. Tang, S.; Yu, Y.; Wang, H.; Wang, G.; Chen, W.; Xu, Z.; Guo, S.; Gao, W. A survey on scheduling techniques in computing and network convergence. IEEE Commun. Surv. Tutor. 2023, 26, 160–195. [Google Scholar] [CrossRef]
  35. Liang, F.; Zhang, Z.; Lu, H.; Li, C.; Leung, V.; Guo, Y.; Hu, X. Resource allocation and workload scheduling for large-scale distributed deep learning: A survey. arXiv 2024, arXiv:2406.08115. [Google Scholar] [CrossRef]
  36. Wu, B.; Zhong, Y.; Zhang, Z.; Liu, S.; Liu, F.; Sun, Y.; Huang, G.; Liu, X.; Jin, X. Fast distributed inference serving for large language models. arXiv 2023, arXiv:2305.05920. [Google Scholar] [CrossRef]
  37. Zhang, Y.K.; Zhan, D.C.; Ye, H.J. Capability Instruction Tuning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2025; Volume 39. [Google Scholar]
  38. Choudhury, A.; Wang, Y.; Pelkonen, T.; Srinivasan, K.; Jain, A.; Lin, S.; David, D.; Soleimanifard, S.; Chen, M.; Yadav, A.; et al. MAST: Global scheduling of ML training across Geo-Distributed datacenters at hyperscale. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), Santa Clara, CA, USA, 10–12 July 2024. [Google Scholar]
  39. Shen, L.; Sun, Y.; Yu, Z.; Ding, L.; Tian, X.; Tao, D. On efficient training of large-scale deep learning models. ACM Comput. Surv. 2024, 57, 1–36. [Google Scholar] [CrossRef]
  40. Yang, G.; Hu, E.J.; Babuschkin, I.; Sidor, S.; Liu, X.; Farhi, D.; Lyon, N.; Hernandez, D.; Joshua, Z.; Gao, J.; et al. Tuning large neural networks via zero-shot hyperparameter transfer. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 17084–17097. [Google Scholar]
  41. Balles, L.; Romero, J.; Hennig, P. Coupling Adaptive Batch Sizes with Learning Rates. arXiv 2016, arXiv:1612.05086. [Google Scholar]
  42. He, F.; Liu, T.; Tao, D. Control batch size and learning rate to generalize well: Theoretical and empirical evidence. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  43. Yu, G.; Tan, G.; Huang, H.; Zhang, Z.; Chen, P.; Natella, R.; Zheng, Z.; Lyu, M.R. A survey on failure analysis and fault injection in AI systems. ACM Trans. Softw. Eng. Methodol. 2026, 35, 1–42. [Google Scholar] [CrossRef]
  44. Ovadia, Y.; Fertig, E.; Ren, J.; Nado, Z.; Sculley, D.; Nowozin, S.; Dillon, J.; Lakshminarayanan, B.; Snoek, J. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  45. Aru, J.; Labash, A.; Corcoll, O.; Vicente, R. Mind the gap: Challenges of deep learning approaches to theory of mind. Artif. Intell. Rev. 2023, 56, 9141–9156. [Google Scholar] [CrossRef]
  46. Gal, Y.; Islam, R.; Ghahramani, Z. Deep Bayesian Active Learning with Image Data. In Proceedings of the International Conference on Machine Learning (ICML); PMLR: Sydney, Australia, 2017; Volume 70, pp. 1183–1192. [Google Scholar]
  47. Fakour, F.; Mosleh, A.; Ramezani, R. A structured review of literature on uncertainty in machine learning & deep learning. arXiv 2024, arXiv:2406.00332. [Google Scholar] [CrossRef]
  48. He, W.; Jiang, Z.; Xiao, T.; Xu, Z.; Li, Y. A survey on uncertainty quantification methods for deep learning. ACM Comput. Surv. 2025, 58, 179. [Google Scholar] [CrossRef]
  49. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  50. Gawlikowski, J.; Tassi, C.R.N.; Ali, M.; Lee, J.; Humt, M.; Feng, J.; Kruspe, A.; Triebel, R.; Jung, P.; Roscher, R.; et al. A Survey of Uncertainty in Deep Neural Networks. Artif. Intell. Rev. 2023, 56, 1513–1589. [Google Scholar] [CrossRef]
  51. Sun, Y.; Li, X.; Li, L.; Feng, T.; Zhao, Y.; Yin, S. PHH-FL: Perceptual Hashing Hypernetwork Personalized Federated Learning for Heterogeneous Medical Image Analysis Tasks. IEEE Internet Things J. 2025, 13, 8712–8724. [Google Scholar] [CrossRef]
  52. Li, X.; Li, L.; Li, M.; Yan, P.; Feng, T.; Luo, H.; Zhao, Y.; Yin, S. Knowledge distillation and teacher-student learning in medical imaging: Comprehensive overview, pivotal role, and future directions. Med. Image Anal. 2025, 101, 103819. [Google Scholar] [CrossRef]
  53. Li, X.; Li, L.; Jiang, Y.; Wang, H.; Qiao, X.; Feng, T.; Luo, H.; Zhao, Y. Vision-Language Models in medical image analysis: From simple fusion to general large models. Inf. Fusion 2025, 118, 102995. [Google Scholar] [CrossRef]
Figure 1. System overview of the entropy-guided hierarchical control loop. At epoch boundaries, workers compute normalized entropy from logits and aggregate it to form micro-level weights, which drive sharding, per-worker batch sizes, and job-level learning-rate modulation. The same signal is summarized as a job score and sent to the macro-level scheduler to update GPU quotas for the next epoch.
Figure 2. Empirical CDF of JCT (hours) across 50 jobs.
Figure 3. Decomposition of average JCT (hours) into queueing delay and service time.
Table 1. Cluster and software environment.

| Item | Configuration |
|---|---|
| Cluster size | 4 nodes (one worker per GPU per node) |
| GPU per node | 1 × NVIDIA RTX A5000 (24 GB VRAM) |
| CPU | Intel Xeon Gold 6326 (32 cores) |
| System memory | 32 GB RAM per node |
| Interconnect | 100 GbE |
| Storage | Local NVMe and NFS/Lustre for datasets |
| Operating system | Ubuntu 20.04.6 LTS |
| CUDA/cuDNN | 12.2/9.1.0 |
| PyTorch | 2.6.0 (DDP backend: gloo) |
| All-reduce bucket size | PyTorch default (25 MB) |
| Precision | AMP (fp16) |

Table 2. Job types and workload composition. Mode indicates training from scratch or training with a frozen backbone.

| Job Type | Model | Dataset | Mode | Epochs | Batch | Count |
|---|---|---|---|---|---|---|
| J1 | VGG16 | MNIST | scratch | 50 | 16 | 8 |
| J2 | MobileNet | Fashion MNIST | scratch | 50 | 16 | 8 |
| J3 | ResNet18 | CIFAR10 | scratch | 50 | 16 | 8 |
| J4 | EfficientNet V2 | CIFAR10 | freeze | 20 | 16 | 6 |
| J5 | ResNet50 | CIFAR100 | scratch | 20 | 16 | 5 |
| J6 | ShuffleNet | CIFAR100 | freeze | 20 | 16 | 5 |
| J7 | ConvNeXt Tiny | Tiny ImageNet | scratch | 20 | 16 | 5 |
| J8 | DenseNet121 | Tiny ImageNet | scratch | 20 | 16 | 5 |

Table 3. Training configuration and control hyperparameters.

| Item | Setting |
|---|---|
| Optimizer | Adam |
| Base learning rate | 0.001 |
| Momentum or betas | default Adam betas (0.9, 0.999) |
| Loss | CrossEntropyLoss |
| Initial batch size | 16 |
| Batch size bounds $b_{\min}$, $b_{\max}$ | 8 and 64 |
| Weight smoothing $\beta$ | 0.3 |
| Weight clipping bounds $w_{\min}$, $w_{\max}$ | 0.5 and 1.5 |
| Learning-rate bounds | $10^{-5}$ and $10^{-2}$ |
| Gradient accumulation step bounds | 20 and 64 |

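To make the role of the control hyperparameters in Table 3 concrete, the following Python sketch shows one way normalized entropy could be turned into smoothed, clipped per-worker weights and then into a fixed-budget batch split. It is a minimal illustration under stated assumptions, not the authors' implementation: the raw weight (worker entropy divided by the mean entropy), the exponential-smoothing form, the residual-correction rule, and all function names are hypothetical. Only the constants (smoothing 0.3, weight bounds 0.5 and 1.5, batch-size bounds 8 and 64) are taken from Table 3.

```python
import math

# Constants taken from Table 3.
B_MIN, B_MAX = 8, 64      # per-worker batch size bounds
BETA = 0.3                # weight smoothing factor
W_MIN, W_MAX = 0.5, 1.5   # weight clipping bounds


def normalized_entropy(logits):
    """Shannon entropy of softmax(logits) divided by log(K), so it lies in [0, 1]."""
    if len(logits) < 2:
        return 0.0
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(logits))


def update_weights(prev_weights, worker_entropy):
    """Assumed update: smooth the entropy ratio with BETA, then clip to [W_MIN, W_MAX]."""
    mean_h = sum(worker_entropy) / len(worker_entropy)
    new_weights = []
    for w_prev, h in zip(prev_weights, worker_entropy):
        raw = h / mean_h if mean_h > 0 else 1.0
        w = (1.0 - BETA) * w_prev + BETA * raw
        new_weights.append(min(W_MAX, max(W_MIN, w)))
    return new_weights


def reallocate_batches(weights, total_budget):
    """Split a fixed global batch budget in proportion to weights, within bounds."""
    n = len(weights)
    if not n * B_MIN <= total_budget <= n * B_MAX:
        raise ValueError("budget is infeasible under the batch size bounds")
    s = sum(weights)
    sizes = [min(B_MAX, max(B_MIN, int(total_budget * w / s))) for w in weights]
    # Hand out (or take back) the rounding residual one sample at a time,
    # visiting high-weight workers first when adding and last when removing.
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    diff = total_budget - sum(sizes)
    while diff != 0:
        step = 1 if diff > 0 else -1
        for i in (order if step > 0 else order[::-1]):
            if diff == 0:
                break
            if B_MIN <= sizes[i] + step <= B_MAX:
                sizes[i] += step
                diff -= step
    return sizes


# Four workers, initial batch size 16 each, so the global budget is 64.
weights = update_weights([1.0] * 4, [0.2, 0.4, 0.6, 0.8])
sizes = reallocate_batches(weights, total_budget=64)
print(weights, sizes)
```

In this sketch, workers with higher entropy receive larger batches while the global batch size stays constant, which is one reading of the "fixed-budget batch-size reallocation" described in the abstract.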
Table 4. Average JCT, average queueing delay, and workload makespan (hours).

| Scheduler | Average JCT (h) | Avg Queueing Delay (h) | Makespan (h) |
|---|---|---|---|
| FIFO | 10.12 | 9.33 | 20.00 |
| SFJ | 9.83 | 9.09 | 18.68 |
| Lucid | 7.46 | 6.80 | 17.27 |
| Entropy (Ours) | 5.69 | 4.71 | 14.56 |

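The headline percentages in the abstract can be checked directly against Table 4. The short Python sketch below recomputes the relative reductions and an average service time per scheduler. It assumes that Lucid is the baseline the abstract refers to and that average JCT is the sum of average queueing delay and average service time (the decomposition named in Figure 3). The function names are illustrative only.

```python
# Values copied from Table 4 (hours): (average JCT, average queueing delay, makespan).
TABLE4 = {
    "FIFO": (10.12, 9.33, 20.00),
    "SFJ": (9.83, 9.09, 18.68),
    "Lucid": (7.46, 6.80, 17.27),
    "Entropy (Ours)": (5.69, 4.71, 14.56),
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


def service_time(avg_jct, avg_queueing):
    """Average service time, assuming JCT = queueing delay + service time."""
    return avg_jct - avg_queueing


lucid = TABLE4["Lucid"]
ours = TABLE4["Entropy (Ours)"]
jct_gain = relative_reduction(lucid[0], ours[0])
makespan_gain = relative_reduction(lucid[2], ours[2])
print(f"JCT reduction vs Lucid: {jct_gain:.1f}%")
print(f"Makespan reduction vs Lucid: {makespan_gain:.1f}%")
for name, (jct, queue, _) in TABLE4.items():
    print(f"{name}: service time ~ {service_time(jct, queue):.2f} h")
```

With Lucid taken as the baseline, the sketch reproduces the 23.7% and 15.7% reductions quoted in the abstract. Under the additive assumption, the average service time of the entropy-guided scheduler (about 0.98 h) is higher than that of the other schedulers (0.66 to 0.79 h), which suggests that the improvement in Table 4 comes mainly from shorter queueing delay.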
Table 5. JCT distribution summary (hours).

| Scheduler | P50 | P90 | P95 | P99 | Max |
|---|---|---|---|---|---|
| FIFO | 10.01 | 17.06 | 18.48 | 19.82 | 19.95 |
| SFJ | 10.00 | 17.87 | 18.26 | 18.55 | 18.66 |
| Lucid | 6.72 | 14.26 | 15.03 | 16.44 | 17.23 |
| Entropy (Ours) | 4.60 | 11.05 | 11.39 | 13.13 | 14.52 |

Table 6. Average best accuracy (%) by job type. For each job type, the best average accuracy across schedulers is highlighted in bold.

| Type | FIFO | SFJ | Lucid | Entropy (Ours) |
|---|---|---|---|---|
| J1 | 99.47 | 99.44 | 99.49 | **99.68** |
| J2 | 92.09 | 92.09 | 92.10 | **92.20** |
| J3 | 84.60 | 84.61 | 85.28 | **86.55** |
| J4 | 31.44 | 30.95 | 31.52 | **32.80** |
| J5 | 64.34 | 64.57 | 66.16 | **67.92** |
| J6 | **25.32** | 25.04 | 24.64 | 24.36 |
| J7 | 46.53 | 47.20 | 47.72 | **49.01** |
| J8 | 66.85 | 66.82 | 68.01 | **70.46** |

Table 7. Ablation study results on 25 jobs (J1–J9). All metrics are measured in hours.

| Metric | Macro-Only | Micro-Only | Full Entropy |
|---|---|---|---|
| Average JCT (h) | 3.32 | 4.17 | 2.39 |
| Avg Queueing Delay (h) | 2.03 | 2.91 | 1.77 |
| Makespan (h) | 8.61 | 10.04 | 6.77 |

Table 8. Scalability results from simulation. Average JCT and makespan are measured in seconds for 2000 jobs drawn from nine job types.

| Nodes | Scheduler | Average JCT (s) | Makespan (s) |
|---|---|---|---|
| 64 | FIFO | 16,832.2 | 27,682.8 |
| 64 | SFJ | 8262.8 | 21,993.3 |
| 64 | Lucid | 8022.6 | 18,759.0 |
| 64 | Entropy (Ours) | 6678.6 | 15,035.4 |
| 128 | FIFO | 8358.1 | 13,889.7 |
| 128 | SFJ | 4464.9 | 11,130.0 |
| 128 | Lucid | 4110.9 | 9535.0 |
| 128 | Entropy (Ours) | 3355.2 | 7637.4 |
| 256 | FIFO | 4171.7 | 7101.2 |
| 256 | SFJ | 2566.6 | 5620.6 |
| 256 | Lucid | 2144.3 | 4837.0 |
| 256 | Entropy (Ours) | 1685.5 | 3806.1 |
| 512 | FIFO | 2101.6 | 3751.3 |
| 512 | SFJ | 1448.2 | 2885.4 |
| 512 | Lucid | 1110.2 | 2588.0 |
| 512 | Entropy (Ours) | 854.0 | 1938.6 |
| 1024 | FIFO | 1082.9 | 2087.2 |
| 1024 | SFJ | 773.6 | 1638.0 |
| 1024 | Lucid | 593.3 | 1324.0 |
| 1024 | Entropy (Ours) | 444.0 | 1084.5 |

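One way to read the scalability results in Table 8 is to track how the gap to Lucid changes with cluster size and how close each scheduler comes to ideal scaling. The Python sketch below does both for the Lucid and Entropy rows only. The efficiency measure (JCT speedup divided by the increase in node count) and the function names are assumptions introduced for illustration and are not metrics defined in the article.

```python
# Average JCT in seconds, copied from the Lucid and Entropy rows of Table 8.
AVG_JCT = {
    64: {"Lucid": 8022.6, "Entropy (Ours)": 6678.6},
    128: {"Lucid": 4110.9, "Entropy (Ours)": 3355.2},
    256: {"Lucid": 2144.3, "Entropy (Ours)": 1685.5},
    512: {"Lucid": 1110.2, "Entropy (Ours)": 854.0},
    1024: {"Lucid": 593.3, "Entropy (Ours)": 444.0},
}


def jct_reduction_vs_lucid(nodes):
    """Percentage reduction in average JCT of Entropy relative to Lucid."""
    row = AVG_JCT[nodes]
    return 100.0 * (row["Lucid"] - row["Entropy (Ours)"]) / row["Lucid"]


def scaling_efficiency(scheduler, small, large):
    """JCT speedup from `small` to `large` nodes, divided by the ideal speedup."""
    speedup = AVG_JCT[small][scheduler] / AVG_JCT[large][scheduler]
    return speedup / (large / small)


for nodes in sorted(AVG_JCT):
    print(f"{nodes:>5} nodes: JCT reduction vs Lucid = {jct_reduction_vs_lucid(nodes):.1f}%")

for name in ("Lucid", "Entropy (Ours)"):
    eff = scaling_efficiency(name, 64, 1024)
    print(f"{name}: scaling efficiency 64 -> 1024 nodes = {eff:.2f}")
```

Under these table values, the reduction in average JCT relative to Lucid grows from roughly 17% at 64 nodes to roughly 25% at 1024 nodes, and the entropy-guided scheduler retains a larger share of ideal scaling than Lucid over the same range.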
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, T.-J.; Huh, E.-N. Entropy-Guided Hierarchical Scheduling for Elastic Distributed Deep Learning. Appl. Sci. 2026, 16, 3725. https://doi.org/10.3390/app16083725
