1. Introduction
Large-scale deep learning training is increasingly carried out in shared environments rather than in isolation [1,2,3]. In many laboratories and production settings, a single GPU cluster serves as a common infrastructure where multiple distributed jobs begin, pause, and overlap throughout the day [4,5]. In this setting, training behavior is shaped not only by the model and the dataset but also by when and where a job runs [6,7]. Network contention rises and falls, available resources shift, and the scheduler repeatedly changes the parallelism assigned to each job [8,9]. These dynamics introduce two intertwined decision scopes. One concerns within-job choices that maintain efficient learning, and the other concerns cross-job decisions that arbitrate shared GPUs. We refer to these as micro and macro in a job-centric sense. The micro level captures decisions made within a single distributed training job, while the macro level captures decisions made across multiple concurrent jobs in the cluster.
In shared GPU clusters, learning dynamics within a job and resource allocation across jobs become coupled in practice. At the micro level, a distributed job learns through worker-local data streams, gradient synchronization, and hyperparameters tuned under an assumed effective batch and synchronization regime [10,11]. At the macro level, the cluster is shared through macro-level scheduling decisions such as ordering, admission, and the number of workers assigned to each job over time [12,13]. These two levels are coupled because macro-level allocation changes the effective training regime of a job, and the resulting learning behavior determines whether additional resources will actually shorten time to accuracy [14,15,16].
In multi-job execution, the central objective is often time to accuracy [17,18]. A job is considered complete not when it merely consumes its assigned epochs, but when it reaches a target validation metric under fluctuating contention [19]. Yet the information available to the scheduler is typically dominated by system measurements [20,21]. These measurements describe how busy devices are, but they do not describe how much learning progress is being produced at a given moment [22]. When data-parallel training continues with fixed sharding and static hyperparameters under changing contention, a sequence of effects follows [23]. Effective batch size and synchronization behavior drift from the assumptions used to set the learning rate. Workers contribute gradients of different informativeness as their local shards differ in difficulty, while the system continues to average these contributions uniformly [24]. Under these conditions, cluster-level decisions can reduce waiting time while still yielding slow progress in model quality [25]. This appears as a longer time to accuracy and heavier tails in job completion time (JCT) distributions [26].
A central difficulty is that the scheduler and the training system typically lack visibility into worker-level learning utility [27]. The system cannot tell which workers are currently operating on ambiguous examples where additional computation is likely to produce larger updates to the decision boundary [28,29]. The system also cannot tell how the number of classes and dataset complexity affect the scale and persistence of uncertainty during training [30]. In multi-class workloads, uncertainty often remains higher for longer, and the gap between easy shards and hard shards becomes more pronounced. If hard shards receive insufficient exposure under fixed sharding, informative gradients arrive too infrequently and convergence slows [31,32]. If the system attempts to compensate only through aggressive updates under a changed effective batch, training can become unstable, and the intended time-to-accuracy reduction is not achieved. An online signal is needed that reflects learning difficulty under contention and is stable enough to guide both training-time control and cluster-level scheduling [33]. The specific research gap addressed in this work is that existing schedulers typically rely on system-level or single-level optimization signals, and therefore do not consistently connect intra-job learning dynamics with inter-job resource allocation. As a result, they may improve resource efficiency or waiting time, while still lacking a unified learning-aware mechanism for jointly coordinating adaptation within a job and scheduling across jobs under shared-cluster contention.
In this study, we address this problem by using predictive uncertainty computed from logits as a lightweight online signal that bridges micro-level adaptation and macro-level orchestration. Each worker summarizes uncertainty using normalized entropy so that the signal remains comparable across tasks with different numbers of classes, and the resulting worker-level signals are converted into stable weights through clamping and exponential smoothing. These weights drive three micro-level controls for the next epoch, including uncertainty-proportional data sharding, per worker batch-size reallocation under a fixed budget, and job-level learning-rate modulation to stabilize updates as the effective regime shifts. In parallel, the same uncertainty signal is aggregated into a job score used by the cluster scheduler so that macro-level allocation decisions favor configurations expected to reduce time to accuracy rather than merely balance hardware utilization. In workload-driven multi-job simulation, our approach reduces average JCT by 23.7% and shortens cluster makespan by 15.7% relative to a representative baseline (Lucid). These gains are driven primarily by reduced queueing delay, which drops by 30.7% under contention. In large-scale simulation at 1024 nodes, our method further reduces average JCT by 25.2% and makespan by 18.1% relative to the same baseline.
The main contributions of this paper are as follows:
A two-scale control formulation for shared-cluster training, together with a unified uncertainty signal that bridges micro-level adaptation and macro-level scheduling.
Stable budget-aware control rules for uncertainty-aware data sharding, batch sizing, and learning-rate modulation driven by normalized entropy.
Empirical evidence that learning-aware orchestration improves time to accuracy and training stability with negligible overhead in multi-job, multi-class workloads, along with scalability validation in simulation up to 1024 nodes.
3. Problem Formulation
We consider a shared GPU cluster that runs multiple distributed training jobs concurrently. Let $\mathcal{J}$ denote the set of active jobs. The cluster consists of a set $\mathcal{N}$ of GPU nodes with total capacity $G$ GPUs. Each job $j \in \mathcal{J}$ trains a model with $K_j$ classes using data parallelism over a set of workers $\mathcal{R}_j$ with cardinality $|\mathcal{R}_j|$. Time is indexed by epochs $t = 1, 2, \ldots$, and decisions are made at epoch boundaries.
We focus on reducing time to accuracy under shared resource constraints. Let $A_j(t)$ denote the validation metric of job $j$ at the end of epoch $t$ and let $A_j^{\star}$ be a target level. We define the completion epoch of job $j$ as

$$T_j = \min\{\, t : A_j(t) \ge A_j^{\star} \,\}.$$

The system aims to reduce completion times across jobs while satisfying the cluster GPU budget at each epoch.
3.1. Learning Signal from Predictive Uncertainty
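For a sample $x$ processed by job $j$ at epoch $t$, let $p_j(x,t) = \big(p_{j,1}(x,t), \ldots, p_{j,K_j}(x,t)\big)$ be the predictive distribution produced by the model, where $K_j$ is the number of classes of job $j$. We define sample-level uncertainty using normalized entropy:

$$u_j(x,t) = -\frac{1}{\log K_j} \sum_{k=1}^{K_j} p_{j,k}(x,t) \log p_{j,k}(x,t).$$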
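Each worker $r \in \mathcal{R}_j$ processes a shard $\mathcal{D}_{j,r}(t)$ during epoch $t$ and reports a worker-level mean uncertainty:

$$U_{j,r}(t) = \mathbb{E}_{x \in \mathcal{D}_{j,r}(t)}\big[ u_j(x,t) \big].$$

The expectation is estimated by a sample mean over minibatches processed during the epoch. To compare workers within the same job, we normalize across workers using a small positive constant $\varepsilon$:

$$\tilde{U}_{j,r}(t) = \frac{U_{j,r}(t)}{\sum_{r' \in \mathcal{R}_j} U_{j,r'}(t) + \varepsilon}.$$

Here, $\varepsilon$ denotes a numerical stabilizer introduced only to prevent division by zero or unstable normalization when the summed uncertainty becomes very small.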
When an additional ambiguity signal $\tilde{A}_{j,r}(t)$ is used, it is blended with $\tilde{U}_{j,r}(t)$ to form a single worker score $s_{j,r}(t)$. Otherwise, we set $s_{j,r}(t) = \tilde{U}_{j,r}(t)$.
We define normalized worker weights that summarize where learning difficulty is concentrated within a job. Let $s_{j,r}(t)$ denote the resulting worker score; then the weights satisfy

$$w_{j,r}(t) = \frac{s_{j,r}(t)}{\sum_{r' \in \mathcal{R}_j} s_{j,r'}(t) + \varepsilon},$$

where $\varepsilon$ is a small positive constant used only for numerical stability in the denominator. In practice, the weights are computed using this stabilized normalization together with temporal smoothing to avoid oscillations across epochs.
3.2. Micro-Level Control Variables
At each epoch boundary, each job chooses its next-epoch internal configuration based on the weights $w_{j,r}(t)$. The configuration is represented by three decision variables for epoch $t+1$.
First, the job chooses a per-epoch sample budget $N_j(t+1)$ and allocates it across workers as

$$n_{j,r}(t+1) = w_{j,r}(t)\, N_j(t+1), \qquad \sum_{r \in \mathcal{R}_j} n_{j,r}(t+1) = N_j(t+1).$$

In implementations, integer sample counts are realized by standard rounding and correction while preserving the equality. Second, the job chooses per-worker batch sizes subject to bounds and a job-level batch budget. Let $b_{j,r}(t+1)$ denote the batch size of worker $r$ for epoch $t+1$. The batch sizes satisfy

$$b_{\min} \le b_{j,r}(t+1) \le b_{\max}, \qquad \sum_{r \in \mathcal{R}_j} b_{j,r}(t+1) = B_j(t+1),$$

where $B_j(t+1)$ is the job-level batch budget for epoch $t+1$.
Third, the job adapts its learning rate as a function of job-level uncertainty. We define a job-level uncertainty summary as

$$\bar{U}_j(t) = \sum_{r \in \mathcal{R}_j} w_{j,r}(t)\, U_{j,r}(t), \qquad \eta_j(t+1) = f\big(\bar{U}_j(t)\big).$$

The learning rate is selected by an update rule that increases stability when uncertainty is high and allows faster progress when uncertainty is low. The specific form of $f$ is part of the proposed control policy.
3.3. Macro-Level Scheduling Variables and Constraints
At each epoch boundary, the scheduler assigns integer GPU quotas $g_j(t+1)$ to jobs for epoch $t+1$ under the cluster budget

$$\sum_{j \in \mathcal{J}} g_j(t+1) \le G.$$

The scheduler uses uncertainty-based learning utility as an input signal. We define a short-horizon progress proxy from the reported uncertainty summaries:

$$\Delta_j(t) = \bar{U}_j(t-1) - \bar{U}_j(t),$$

and define a utility score:

$$S_j(t) = \lambda\, \bar{U}_j(t) + (1-\lambda)\, \Delta_j(t),$$

where $\lambda$ balances the uncertainty level and recent progress. The GPU allocation decision is produced by a scheduling policy $\pi$ that maps utility scores and system constraints to integer quotas. The specific form of $\pi$ is part of the proposed scheduling policy.
3.4. Linking Macro Allocation to Micro Budgets
The macro-level quota determines the next-epoch processing budget of each job. Let $\rho_j(t)$ denote the measured per-GPU processing rate of job $j$ at epoch $t$ in samples per epoch per GPU. We set the next-epoch sample budget as

$$N_j(t+1) = g_j(t+1)\, \rho_j(t).$$

This creates a closed loop. The scheduler changes the quota $g_j(t+1)$ based on uncertainty-guided utility, which changes the budget $N_j(t+1)$. The job then distributes $N_j(t+1)$ across workers and adapts batch sizes and learning rate based on the same uncertainty signal.
4. Proposed Method
4.1. Overview and Timing of the Control Loop
Figure 1 illustrates our entropy-guided hierarchical method that couples micro-level adaptation and macro-level scheduling through a shared learning signal in a shared GPU cluster. The control loop is organized around epoch boundaries, which provide a natural synchronization point already present in most data-parallel training pipelines and allow lightweight statistics aggregation without interfering with per step synchronization.
The workflow proceeds as follows. During an epoch, workers execute standard synchronous data-parallel training and locally accumulate uncertainty statistics computed from logits. At the epoch boundary, these per worker statistics are aggregated within each job to construct stable worker weights. With a one-epoch delay, the micro-level controller then uses these weights to update the next epoch’s internal configuration by adjusting the data-sharding ratios in proportion to uncertainty, tuning per worker batch sizes within safe bounds, and modulating the job-level learning rate.
At the same boundary, each job also exposes a compact job score derived from the same uncertainty statistics to the macro-level scheduler. The scheduler uses this score to update integer GPU quotas for the next epoch, which determines each job’s data-processing budget under cluster contention. In this way, Figure 1 represents a closed hierarchical loop: macro-level control determines how much compute a job receives, while micro-level control determines how that compute is distributed and used within the job. Both levels are therefore coordinated through the same entropy-based signal, which provides a consistent interface between learning dynamics and scheduling decisions.
4.2. Uncertainty Estimation from Logits
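Consider a job $j$ with $K_j$ classes and workers $r \in \mathcal{R}_j$. For a sample $x$ at epoch $t$, the model produces logits $z_j(x,t) \in \mathbb{R}^{K_j}$ and probabilities

$$p_{j,k}(x,t) = \frac{\exp\big(z_{j,k}(x,t)\big)}{\sum_{k'=1}^{K_j} \exp\big(z_{j,k'}(x,t)\big)}.$$

This choice is deliberate from a systems perspective. Logits and softmax probabilities are already available during training, so uncertainty can be computed without additional forward passes, auxiliary models, or extra labels. We use entropy because it reflects how spread the predictive distribution is over classes:

$$H_j(x,t) = -\sum_{k=1}^{K_j} p_{j,k}(x,t) \log p_{j,k}(x,t),$$

and we normalize it to keep the signal comparable across tasks with different numbers of classes:

$$u_j(x,t) = \frac{H_j(x,t)}{\log K_j}.$$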
Normalized entropy is important in multi-job environments because different jobs may have different class counts, and the scheduler must compare job scores on a consistent scale. When needed, we complement entropy with a margin-based ambiguity measure that captures near ties between the top two classes. Let $p_{(1)}$ and $p_{(2)}$ be the largest and second largest entries of $p_j(x,t)$. The ambiguity score $a_j(x,t)$ is defined from the margin $p_{(1)} - p_{(2)}$, so that it is largest when the top two classes are nearly tied.
Entropy reflects global spread over classes, while margin ambiguity focuses on local decision boundaries. In practice, entropy alone is often sufficient, and ambiguity is used as an optional complement when boundary confusion is a dominant mode.
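Each worker $r$ processes its local shard $\mathcal{D}_{j,r}(t)$. We aggregate uncertainty at the worker level by epoch means:

$$U_{j,r}(t) = \frac{1}{|\mathcal{D}_{j,r}(t)|} \sum_{x \in \mathcal{D}_{j,r}(t)} u_j(x,t), \qquad A_{j,r}(t) = \frac{1}{|\mathcal{D}_{j,r}(t)|} \sum_{x \in \mathcal{D}_{j,r}(t)} a_j(x,t),$$

where $u_j(x,t)$ is the normalized entropy and $a_j(x,t)$ is the optional margin ambiguity of sample $x$.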
This aggregation step serves two roles. It compresses per sample uncertainty into a small set of scalars per worker, which is communication-friendly, and it provides a stable estimate of how informative the worker data stream is over the duration of an epoch.
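The per-sample computation and the epoch-level aggregation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function and class names are hypothetical, and the margin ambiguity is assumed to take the common form of one minus the gap between the top two probabilities, which the text does not spell out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def margin_ambiguity(probs):
    """ASSUMED form: 1 - (largest probability - second largest probability)."""
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])


class WorkerUncertainty:
    """Running sums kept by one worker during an epoch (hypothetical helper)."""

    def __init__(self):
        self.entropy_sum = 0.0
        self.ambiguity_sum = 0.0
        self.count = 0

    def update(self, logits):
        """Accumulate statistics for one sample from its logits."""
        probs = softmax(logits)
        self.entropy_sum += normalized_entropy(probs)
        self.ambiguity_sum += margin_ambiguity(probs)
        self.count += 1

    def epoch_means(self):
        """Return (mean normalized entropy, mean ambiguity) for the epoch."""
        if self.count == 0:
            return 0.0, 0.0
        return self.entropy_sum / self.count, self.ambiguity_sum / self.count
```

Only the two means and the sample counter need to leave the worker at the epoch boundary, which is what keeps the exchanged state small.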
4.3. Robust Weight Construction for Uncertainty-Guided Control
We normalize the worker statistics within the job:

$$\tilde{U}_{j,r}(t) = \frac{U_{j,r}(t)}{\sum_{r'} U_{j,r'}(t) + \varepsilon}, \qquad \tilde{A}_{j,r}(t) = \frac{A_{j,r}(t)}{\sum_{r'} A_{j,r'}(t) + \varepsilon}. \tag{18}$$

Normalization ensures that weights represent relative importance among workers of the same job, independent of the absolute magnitude of uncertainty. We combine them into a single worker score:

$$s_{j,r}(t) = \alpha\, \tilde{U}_{j,r}(t) + (1-\alpha)\, \tilde{A}_{j,r}(t), \tag{19}$$

where $\alpha$ controls the mixture between normalized entropy and margin ambiguity. From a control perspective, $s_{j,r}(t)$ is the raw signal that indicates where learning is currently concentrated inside the job.
Raw signals can oscillate due to minibatch stochasticity and due to changes in effective parallelism under multi-job contention. We therefore introduce guardrails that make the controller robust. We first bound the score

$$\bar{s}_{j,r}(t) = \min\big(\max\big(s_{j,r}(t),\, s_{\min}\big),\, s_{\max}\big), \tag{20}$$

which prevents a single worker from dominating and also prevents any worker from being starved. We then apply exponential smoothing:

$$\hat{s}_{j,r}(t) = \beta\, \hat{s}_{j,r}(t-1) + (1-\beta)\, \bar{s}_{j,r}(t), \tag{21}$$

so that allocations evolve gradually across epochs rather than reacting aggressively to short-term fluctuations. Finally, we renormalize to obtain simplex weights:

$$w_{j,r}(t) = \frac{\hat{s}_{j,r}(t)}{\sum_{r'} \hat{s}_{j,r'}(t)}. \tag{22}$$

These weights are the central interface between learning and systems. Inside a job, they determine how data and compute are redistributed across workers for the next epoch. Across jobs, a job-level summary derived from the same weights drives cluster scheduling decisions.
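The weight-construction pipeline above (normalize, blend, clamp, smooth, renormalize) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names, the default values of `alpha`, `lower`, `upper`, and `beta`, and the convention that `beta` weights the previous smoothed score are all hypothetical choices, not settings reported in the paper.

```python
def normalize(values, eps=1e-8):
    """Divide by the stabilized sum so the entries sum to (almost) one."""
    total = sum(values) + eps
    return [v / total for v in values]


def update_worker_weights(entropy_means, prev_smoothed, ambiguity_means=None,
                          alpha=0.7, lower=0.05, upper=0.60, beta=0.8):
    """One epoch-boundary update of the worker weights.

    All default values are illustrative assumptions.
    Returns (weights, smoothed_scores); feed smoothed_scores back next epoch.
    """
    u = normalize(entropy_means)
    if ambiguity_means is None:
        scores = u                      # entropy-only score
    else:
        a = normalize(ambiguity_means)
        scores = [alpha * ui + (1.0 - alpha) * ai for ui, ai in zip(u, a)]
    # Clamp so no worker dominates and none is starved.
    clamped = [min(max(s, lower), upper) for s in scores]
    # Exponential smoothing against the previous epoch's smoothed scores.
    smoothed = [beta * p + (1.0 - beta) * c
                for p, c in zip(prev_smoothed, clamped)]
    # Renormalize onto the simplex.
    return normalize(smoothed), smoothed


# Example: four workers, starting from uniform smoothed scores.
previous = [0.25, 0.25, 0.25, 0.25]
weights, previous = update_worker_weights([0.9, 0.3, 0.2, 0.1], previous)
```

With a large `beta`, a worker whose entropy spikes in one epoch gains weight only gradually, which is the intended damping effect.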
4.4. Micro-Level Controls at Epoch Boundaries
We apply three controls at epoch $t+1$. The key design choice is that all controls are applied to the next epoch, which makes them compatible with standard synchronous training and avoids interfering with per-step synchronization.
Let $N_j(t+1)$ be the total number of samples to be processed by job $j$ in epoch $t+1$ under its current resource budget. We allocate per-worker sample counts by

$$n_{j,r}(t+1) = \operatorname{round}\big(w_{j,r}(t)\, N_j(t+1)\big). \tag{23}$$

This rule shifts data exposure toward workers whose data stream is currently more uncertain, which increases the proportion of updates drawn from ambiguous examples. Since rounding may violate the exact total, we enforce the budget constraint

$$\sum_{r \in \mathcal{R}_j} n_{j,r}(t+1) = N_j(t+1) \tag{24}$$

using a remainder correction that assigns leftover samples to the largest fractional parts. This keeps the total work per epoch comparable across methods and isolates the effect of redistribution.
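A remainder correction of this kind is commonly implemented as a largest-remainder allocation. The sketch below shows one such variant, assuming floor-based rounding followed by a pass that hands leftover samples to the workers with the largest fractional parts; the function name is hypothetical.

```python
import math


def allocate_samples(weights, budget):
    """Split an integer sample budget across workers in proportion to weights.

    Floors each proportional share, then gives the leftover samples to the
    workers with the largest fractional parts so the counts sum to the budget.
    """
    total_w = sum(weights)
    shares = [budget * w / total_w for w in weights]
    counts = [math.floor(s) for s in shares]
    leftover = budget - sum(counts)
    # Worker indices ordered by descending fractional part; ties keep worker order.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Example: 50,000 samples across four workers.
counts = allocate_samples([0.32, 0.24, 0.23, 0.21], 50_000)
```

Because ties are broken by worker index, the result is deterministic for a given weight vector, which helps reproducibility.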
We map the weights $w_{j,r}(t)$ to per-worker batch sizes $b_{j,r}(t+1)$ that stay within the bounds $b_{\min}$ and $b_{\max}$.
Batch size interacts with gradient noise and synchronization. Prior work has shown that batch size and learning rate are coupled through their effect on optimization stability and gradient noise scale, and that conservative joint adaptation can improve training robustness under changing effective regimes. Under contention, communication delays or reduced replica counts effectively change the optimization regime seen by the job. Our mapping is therefore intended as a bounded control rule that reallocates per worker batch sizes according to uncertainty while keeping the update within a safe operating range. The role of this rule is not to claim a new optimization theorem, but to provide a lightweight and stable mechanism that is consistent with prior observations on batch-size/learning-rate coupling and large-batch training behavior. The bounds ensure predictable memory usage and avoid sudden jumps that can destabilize training.
We summarize job uncertainty using the weighted score

$$\bar{U}_j(t) = \sum_{r \in \mathcal{R}_j} w_{j,r}(t)\, U_{j,r}(t). \tag{26}$$

This scalar captures the uncertainty state of the job at the end of epoch $t$ and serves as a feedback measurement. We then update the learning rate $\eta_j(t+1)$ as a function of $\bar{U}_j(t)$, where $\gamma$ controls the sensitivity of the learning-rate modulation to changes in job-level uncertainty. Larger values of $\gamma$ make the controller react more aggressively to uncertainty fluctuations, whereas smaller values lead to more conservative updates. In our implementation, $\gamma$ is set to provide stable epoch-level adaptation without causing abrupt changes in the optimizer state. When uncertainty rises relative to the initial stage, the learning rate is reduced to prevent overly aggressive updates. When uncertainty declines, the learning rate relaxes to maintain efficient progress. Because the update uses an epoch-level measurement, it remains stable and adds negligible overhead.
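To make the qualitative behavior concrete, the sketch below shows one possible modulation rule. The exact update used by the method is not reproduced here, so the exponential form, the clipping range, and all names and default values are assumptions chosen only to match the description: the rate shrinks when job-level uncertainty exceeds its initial level and relaxes when uncertainty falls.

```python
import math


def job_uncertainty(weights, worker_means):
    """Weighted job-level uncertainty: sum over workers of weight * mean."""
    return sum(w * u for w, u in zip(weights, worker_means))


def modulated_learning_rate(base_lr, uncertainty, initial_uncertainty,
                            gamma=1.0, min_scale=0.25, max_scale=1.5):
    """ASSUMED rule: scale base_lr by exp(-gamma * (U - U0)), then clip.

    gamma is the sensitivity; the clip keeps the change within a bounded range.
    """
    scale = math.exp(-gamma * (uncertainty - initial_uncertainty))
    scale = min(max(scale, min_scale), max_scale)
    return base_lr * scale


# Example: uncertainty rose from 0.60 to 0.75, so the rate is reduced.
u_now = job_uncertainty([0.5, 0.3, 0.2], [0.9, 0.7, 0.45])
lr_next = modulated_learning_rate(0.1, u_now, initial_uncertainty=0.60)
```

If the training stack is PyTorch (an assumption, since the paper does not name the framework), the resulting value would be written to `group["lr"]` for each `group` in `optimizer.param_groups` at the epoch boundary.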
This design also helps mitigate potential conflicts between local adaptation and global scheduling. Macro-level allocation determines how much resource budget a job receives, while the micro-level controller regulates how that budget is used so that optimization remains stable under the resulting effective training regime. In this sense, the two levels are coordinated through the same uncertainty signal, but they need not have identical objectives at every moment; rather, the shared signal provides a consistent interface that reduces mismatch between learning-aware adaptation and scheduling decisions.
4.5. Macro-Level Scheduling via Uncertainty-Guided Utility
The macro-level scheduler operates at epoch boundaries and uses the same uncertainty signal as the micro-level controller, aggregated at the job level. From (26), each job reports $\bar{U}_j(t)$ once per epoch. We define the short-horizon progress proxy

$$\Delta_j(t) = \bar{U}_j(t-1) - \bar{U}_j(t) \tag{28}$$

and form a job utility score that captures both current learning demand and near-term learning progress

$$S_j(t) = \lambda\, \bar{U}_j(t) + (1-\lambda)\, \Delta_j(t), \tag{29}$$

where $\lambda$ balances the contribution of the current uncertainty level and the recent uncertainty reduction trend. A larger $\lambda$ places more emphasis on present learning demand, whereas a smaller $\lambda$ gives relatively more weight to short-term progress. In our implementation, $\lambda$ is set to provide a stable compromise between these two signals so that the scheduler remains responsive without overreacting to short-term fluctuations. A higher $S_j(t)$ indicates that allocating resources to job $j$ is expected to yield a stronger reduction in uncertainty and thus faster convergence.
The use of $S_j(t)$ as a scheduling priority is based on the following interpretation. A job with persistently high uncertainty still has substantial unresolved learning difficulty, while a positive recent decrease in uncertainty indicates that the job is converting computation into useful progress. The score therefore combines current learning demand with short-horizon responsiveness to additional compute. This does not imply that micro-level utility and macro-level priority are always identical in a strict optimization sense. Rather, the same normalized signal is used to align the two levels heuristically so that jobs are prioritized not only by resource occupancy but also by their expected learning benefit under contention.
At each decision point, the scheduler ranks runnable candidates by utility score rather than submission order. For newly arrived or still pending jobs, $S_j(t)$ may be unavailable. The scheduler then falls back to a default prior or uses a short warm-up measurement. The scheduler selects the highest-scoring job that fits within the currently idle GPUs. This defines an entropy-guided admission rule in which execution order is determined by learning utility.
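The admission rule can be illustrated with a short sketch. It assumes the utility is a convex combination of the current job-level uncertainty and its most recent decrease, and it uses a fixed default prior for jobs that have not yet reported a score; the function names, the dictionary layout, and the default values are hypothetical.

```python
def utility_score(current_uncertainty, previous_uncertainty, lam=0.5):
    """ASSUMED form: lam * current uncertainty + (1 - lam) * recent decrease."""
    progress = previous_uncertainty - current_uncertainty
    return lam * current_uncertainty + (1.0 - lam) * progress


def pick_next_job(pending, idle_gpus, default_score=0.5):
    """Return the highest-scoring pending job that fits in the idle GPUs.

    Each job is a dict with 'name', 'gpus', and an optional 'score'.
    Jobs without a score yet fall back to default_score (a prior).
    Returns None when no pending job fits.
    """
    best, best_score = None, float("-inf")
    for job in pending:
        if job["gpus"] > idle_gpus:
            continue                      # does not fit right now
        score = job.get("score")
        if score is None:
            score = default_score         # newly arrived job, no signal yet
        if score > best_score:
            best, best_score = job, score
    return best


pending = [
    {"name": "A", "gpus": 4, "score": utility_score(0.8, 0.9)},
    {"name": "B", "gpus": 2, "score": utility_score(0.6, 0.7)},
    {"name": "C", "gpus": 1, "score": None},
]
chosen = pick_next_job(pending, idle_gpus=2)
```

In this example job A is skipped because it needs four GPUs, and the unscored job C is admitted ahead of B because the default prior exceeds B's score; a lower prior would reverse that choice.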
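Let the cluster have $G$ GPUs. The scheduler assigns integer quotas $g_j(t+1)$ such that

$$\sum_{j \in \mathcal{J}} g_j(t+1) \le G. \tag{30}$$

We use proportional allocation to translate utility into quotas:

$$g_j(t+1) = \left\lfloor G \cdot \frac{S_j(t)}{\sum_{j' \in \mathcal{J}} S_{j'}(t)} \right\rfloor, \tag{31}$$

with remainder correction to satisfy the budget. This ensures stable integer allocations while preserving the monotonic relationship between $S_j(t)$ and allocated resources.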
In practical clusters, quotas must be realized by assigning jobs to concrete GPU nodes. Let
be the set of nodes. We represent placement by
and enforce
Among feasible placements, the macro-level scheduler may use a deterministic ordering of assigned nodes to choose a master rank and may prefer placements that reduce expected contention. Importantly, placement does not change the learning signal itself. Instead, it realizes the entropy-guided decision of which jobs run and how much parallelism they receive.
The key interaction remains that $g_j(t+1)$ determines each job’s next-epoch processing budget $N_j(t+1)$, and the micro-level controller then distributes that budget across workers using (23)–(27). This closes the loop between macro-level execution decisions and micro-level adaptation, both driven by the same entropy-based utility.
4.6. End-to-End Scheduling Algorithm
Algorithm 1 summarizes the method as an epoch-level control loop. We use the epoch boundary as the control point because it already provides a natural synchronization barrier in data-parallel training, and it allows uncertainty statistics to be aggregated with negligible overhead. The algorithm applies a one-epoch delay in all control actions, meaning that weights computed at epoch $t$ determine sharding, batch size, and learning rate for epoch $t+1$, which avoids interfering with per-step synchronization. At the cluster level, job scores are computed from the same uncertainty signal and translated into GPU quotas that determine the next-epoch data budget, closing the loop between macro-level allocation and micro-level adaptation.
Algorithm 1. Entropy-guided hierarchical scheduling
Require: Active jobs $\mathcal{J}$, workers $\mathcal{R}_j$ for each job, cluster GPUs $G$
Require: Hyperparameters $\alpha$, $\beta$, $\gamma$, $\lambda$, $\varepsilon$, and the clamping and batch-size bounds
Initialize $w_{j,r}(0)$ and learning rate $\eta_j(0)$ for all jobs and workers
for each epoch $t$ do
    for all jobs $j \in \mathcal{J}$ in parallel do
        for all workers $r \in \mathcal{R}_j$ in parallel do
            Train for one epoch using current shard and batch size
            Accumulate $u_j(x,t)$ and optionally $a_j(x,t)$ from logits
            Compute $U_{j,r}(t)$ and $A_{j,r}(t)$ as epoch means
        end for
        Compute $\tilde{U}_{j,r}(t)$ and $\tilde{A}_{j,r}(t)$ using (18)
        Compute $s_{j,r}(t)$ using (19)
        Compute $w_{j,r}(t)$ using (20), (21), and (22)
        Compute job uncertainty $\bar{U}_j(t)$ using (26)
        Compute job score $S_j(t)$ using (28) and (29)
    end for
    Scheduler computes next quotas $g_j(t+1)$ for all jobs using (31) under (30)
    for all jobs $j \in \mathcal{J}$ do
        Derive next-epoch data budget $N_j(t+1)$ from quota $g_j(t+1)$
        Compute $n_{j,r}(t+1)$ using (23) and enforce (24)
        Compute $b_{j,r}(t+1)$ using (25)
        Update $\eta_j(t+1)$ using (27)
    end for
end for
The method is designed as a thin layer on top of an existing data-parallel training stack and a cluster scheduler. Uncertainty computation is performed during the forward pass using logits already produced for the loss, and thus requires no additional model evaluations, forward/backward passes, or extra labels. Each worker accumulates running sums of normalized entropy, and it may also track a margin-based ambiguity statistic, along with a counter of processed samples. At the epoch boundary, the job performs a small number of collective operations to aggregate these scalars and to compute the normalized quantities needed for weight construction. Because only compact scalar statistics are exchanged once per epoch, the additional communication cost remains small compared with standard gradient synchronization and does not require changes to gradient all-reduce or per step training logic. This design keeps the control overhead lightweight while remaining compatible with standard synchronous data-parallel training.
Data-sharding adaptation can be implemented by updating the sampler at each epoch boundary. When datasets are represented by index lists, the controller assigns each worker a contiguous or strided subset of size $n_{j,r}(t+1)$ while ensuring that the union matches the intended epoch budget. The remainder correction in (24) can be applied deterministically to preserve reproducibility. Batch-size updates can be applied by re-instantiating each worker data loader with the new $b_{j,r}(t+1)$ or by using a loader that supports dynamic batch sizes. The bounds $b_{\min}$ and $b_{\max}$ can be chosen once per job through lightweight profiling and then kept fixed. The learning-rate update in (27) is applied at the epoch boundary through the optimizer parameter group. For macro-level scheduling, each job reports the scalar score $S_j(t)$ once per epoch, and the scheduler converts it into the next-epoch quota and data budget $N_j(t+1)$ through a throughput model or measured step time. This separation keeps the system modular. The scheduler consumes only a compact job-level summary, while the training code consumes only the assigned quota and does not require access to other jobs.
5. Experimental Setup
Testbed. Experiments run on a shared GPU cluster with four NVIDIA RTX A5000 GPUs. All methods share the same physical cluster to induce realistic multi-tenant contention on compute and communication. Table 1 summarizes the hardware and software environment.
Workload. We submit fifty jobs drawn from eight job types. Each job type fixes the model, dataset, base batch size, and training mode. Job types J1 to J3 run for fifty epochs, while J4 to J8 run for twenty epochs. The submission mix follows the configured multiplicities and priority levels, where J1 to J3 have priority one, J4 to J6 have priority two, and J7 to J8 have priority three. Table 2 summarizes all job types and the workload composition.
Methods and metrics. We compare the proposed scheduler against FIFO, shortest job first (SJF), and Lucid. We evaluate system performance using JCT, makespan, and the cumulative distribution function (CDF) of JCT with emphasis on tail behavior. We evaluate learning quality using Top-1 accuracy and loss for MNIST, Fashion MNIST, CIFAR10, and CIFAR100, and Top-1 and Top-5 accuracy with loss for Tiny ImageNet. Table 3 summarizes the training configuration shared across methods and the control ranges used by the proposed approach.
Large-scale simulation. To study scaling beyond the four-GPU testbed, we evaluate the same workload using a calibrated simulator. The simulator models step time and communication overhead as a function of allocated GPUs and contention, with parameters calibrated from measurements on the real cluster. We evaluate performance at 64, 128, 256, 512 and 1024 nodes using the simulator.
5.1. Experimental Results
5.1.1. Overall System Performance
Table 4 summarizes overall system performance in terms of average JCT, average queueing delay, and workload makespan, all measured in hours. We define JCT as the wall-clock time from job submission to completion, and queueing delay as the time from submission to the first GPU allocation. We define makespan as the elapsed time between the earliest submission and the last job completion in the workload.
Across the 50-job workload, Entropy achieves the lowest average JCT of 5.69 h, compared to 7.46 h for Lucid, 9.83 h for SJF, and 10.12 h for FIFO. This corresponds to a 23.7% reduction over Lucid and a 43.8% reduction over FIFO. Relative to Lucid, Entropy also reduces average queueing delay from 6.80 to 4.71 h, a 30.7% reduction, indicating that utility-guided admission and quota updates alleviate waiting under multi-tenant contention. Finally, Entropy shortens the workload makespan from 17.27 to 14.56 h, a 15.7% reduction over Lucid, improving cluster-level throughput relative to the baselines. These results suggest that aligning admission and resource allocation with learning utility can simultaneously reduce both per-job latency and end-to-end workload completion time.
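The relative reductions quoted above follow directly from the reported averages. As a quick arithmetic check, using only the values stated in the text:

```latex
\begin{align*}
\text{JCT vs. Lucid:}\quad & \frac{7.46 - 5.69}{7.46} = \frac{1.77}{7.46} \approx 23.7\% \\
\text{JCT vs. FIFO:}\quad & \frac{10.12 - 5.69}{10.12} = \frac{4.43}{10.12} \approx 43.8\% \\
\text{Queueing delay vs. Lucid:}\quad & \frac{6.80 - 4.71}{6.80} = \frac{2.09}{6.80} \approx 30.7\% \\
\text{Makespan vs. Lucid:}\quad & \frac{17.27 - 14.56}{17.27} = \frac{2.71}{17.27} \approx 15.7\%
\end{align*}
```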
Entropy consistently improves the upper tail of the distribution. Table 5 highlights this effect in the percentile summary. Compared to Lucid, Entropy reduces the 95th-percentile JCT from 15.03 to 11.39 h, corresponding to a 24.2% reduction, and reduces the maximum observed JCT from 17.23 to 14.52 h, corresponding to a 15.7% reduction. Entropy also improves the median (P50) from 6.72 to 4.60 h, corresponding to a 31.6% reduction, indicating that the latency improvement is not limited to a few outliers.
Figure 2 shows that Entropy shifts the JCT distribution left across a broad range of quantiles, rather than improving only a small subset of jobs. At the median, Entropy reduces JCT from 6.72 h for Lucid to 4.60 h, corresponding to a 31.6% reduction. The separation persists into the upper tail: the 90th-percentile decreases from 14.26 to 11.05 h, corresponding to a 22.5% reduction, and the 95th-percentile decreases from 15.03 to 11.39 h, corresponding to a 24.2% reduction. Moreover, the maximum observed JCT decreases from 17.23 to 14.52 h, corresponding to a 15.7% reduction, indicating that Entropy tightens the extreme tail under contention. Overall, the CDF indicates that Entropy improves completion times for a large fraction of jobs while also reducing tail risk, which aligns with the percentile summary in Table 5.
Figure 3 decomposes the average JCT into queueing delay from submission to start and service time from start to completion. The main improvement of Entropy comes from a substantial reduction in queueing delay. Compared to Lucid, Entropy lowers average queueing delay from 6.80 to 4.71 h, a 30.7% reduction, which more than accounts for the 1.77 h reduction in average JCT from 7.46 to 5.69 h. In contrast, the average service time under Entropy is higher than under Lucid, increasing from 0.66 to 0.98 h. This suggests that the benefit of the proposed method mainly comes from improved admission and GPU-quota decisions under contention, rather than from uniformly shortening execution time after a job starts. A plausible reason is that micro-level adaptation prioritizes learning-aware redistribution and stability, which can introduce conservative batch-size or learning-rate adjustments during some phases of training. As a result, the method may improve overall time to accuracy at the cluster level while not always minimizing raw per-job service time. Potential optimizations include less frequent control updates, threshold-based adaptation, and more conservative tuning of the micro-level control parameters.
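The decomposition can be verified from the reported averages, since average JCT is the sum of average queueing delay and average service time:

```latex
\begin{align*}
\text{Lucid:}\quad & 6.80 + 0.66 = 7.46\ \text{h} \\
\text{Entropy:}\quad & 4.71 + 0.98 = 5.69\ \text{h} \\
\text{Change in JCT:}\quad & (4.71 - 6.80) + (0.98 - 0.66) = -2.09 + 0.32 = -1.77\ \text{h}
\end{align*}
```

The 2.09 h saving in queueing delay is partly offset by the 0.32 h increase in service time, leaving the net 1.77 h reduction.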
This result also helps clarify the computational overhead and runtime cost of the proposed method relative to the baseline schedulers. Although the entropy-guided control logic introduces additional adaptation decisions during training, its runtime effect appears as a moderate increase in service time rather than a large penalty in end-to-end latency. In our results, this overhead is outweighed by the larger reduction in queueing delay, so the net impact on overall JCT remains favorable. Therefore, the proposed method should be understood as introducing a lightweight runtime trade-off in exchange for improved cluster-level scheduling efficiency under contention.
5.1.2. Training Accuracy and Convergence
In addition to system-level latency, we examine whether different scheduling policies preserve training accuracy and convergence behavior. Table 6 reports the average best validation accuracy achieved by each job type under each scheduler, computed as the maximum validation accuracy attained during each run.
Entropy achieves the highest average accuracy in 7 out of 8 job types. Compared to Lucid, Entropy improves average best accuracy by 0.10–2.45 percentage points in seven job types, with an average gain of approximately 0.88 percentage points across all eight job types. Compared to FIFO, Entropy improves average best accuracy by 0.21–3.61 percentage points in seven job types, and remains within 0.96 percentage points in the remaining job type. Taken together, these results show that Entropy maintains competitive model quality across job types and often achieves higher accuracy than the baselines. Alongside the improvements in JCT and tail latency, this indicates that the proposed scheduler improves cluster responsiveness while preserving learning outcomes under multi-tenant contention.
5.1.3. Ablation Study
To isolate the contribution of the macro-level and micro-level components, we conduct an ablation study on a workload of 25 jobs drawn from job types J1 to J9. We compare three variants. The macro-only variant enables inter-job scheduling and quota updates only. The micro-only variant enables intra-job adaptation only while keeping inter-job scheduling fixed. The full method enables both components.
Table 7 summarizes the results. The full method achieves the best performance across all system-level metrics. Compared to macro-only, the full method reduces average JCT from 3.32 to 2.39 h, a 28.0% reduction, and reduces makespan from 8.61 to 6.77 h, a 21.4% reduction. Compared to micro-only, the full method reduces average JCT from 4.17 to 2.39 h, a 42.7% reduction, and reduces makespan from 10.04 to 6.77 h, a 32.6% reduction. These results indicate that combining macro-level allocation with micro-level adaptation yields complementary benefits under multi-tenant contention.
Macro-level control reduces queueing delay by directly shaping admission and quota assignment, while micro-only adaptation has limited leverage over waiting time because the macro-level policy is fixed. In contrast, micro-level adaptation primarily improves per-job training efficiency after GPUs are allocated, which helps translate allocated resources into faster progress. The gap between macro-only and the full method suggests that micro-level adaptation contributes additional gains beyond improved admission and quota decisions. Conversely, the gap between micro-only and the full method highlights the importance of learning-aware resource arbitration for improving cluster-level throughput under contention. Taken together, these ablation results support the practical rationale of the design. Although the macro-level and micro-level components contribute differently, the largest gains are obtained when both are coordinated through the shared uncertainty signal.
5.1.4. Scalability Under Large-Scale Simulation
We evaluate scalability using a calibrated simulator that models per-step training time and communication overhead as a function of allocated GPUs. The simulator is parameterized to reflect the target homogeneous A5000 cluster setting, including throughput-related scaling behavior, communication efficiency under multi-GPU allocation, and scheduler-side overhead. Job templates are also configured using representative workload characteristics such as batch size, GPU demand, and training-progress parameters, so that the simulator provides a controlled approximation of large-scale distributed training behavior. The simulator is intended to extend the small-scale real-cluster evaluation to larger node counts under a consistent workload model, using the same scheduler logic and representative job characteristics. Because the simulator abstracts training into a throughput model, the absolute times in Table 8 should be interpreted primarily as relative comparisons across schedulers under controlled scaling, rather than as exact wall-clock durations of specific real-world training runs or production deployments.
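To illustrate what a throughput-style model of this kind can look like, the sketch below computes a per-step time as compute time plus a communication term that depends on the number of allocated GPUs. It is a hypothetical stand-in, not the calibrated simulator: the `JobTemplate` fields, the parameter values, and the use of the standard ring all-reduce factor 2(g − 1)/g for the communication term are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobTemplate:
    """Hypothetical job template; field names and values are illustrative."""
    global_batch: int               # samples processed per training step
    samples_per_sec_per_gpu: float  # assumed single-GPU throughput
    comm_sec: float                 # assumed cost of moving one gradient copy
    total_steps: int                # steps needed to finish the job


def step_time(job: JobTemplate, gpus: int) -> float:
    """Per-step time in seconds for a synchronous data-parallel step."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    # Compute shrinks as the global batch is split across GPUs.
    compute = job.global_batch / (job.samples_per_sec_per_gpu * gpus)
    # Ring all-reduce moves 2 * (g - 1) / g of the gradient per worker;
    # the term is zero when a single GPU is allocated.
    comm = job.comm_sec * 2.0 * (gpus - 1) / gpus
    return compute + comm


def service_time(job: JobTemplate, gpus: int) -> float:
    """Service time in seconds if the allocation stays fixed."""
    return job.total_steps * step_time(job, gpus)


job = JobTemplate(global_batch=256, samples_per_sec_per_gpu=1000.0,
                  comm_sec=0.05, total_steps=1000)
for g in (1, 2, 4, 8):
    speedup = step_time(job, 1) / step_time(job, g)
    print(f"gpus={g}: step={step_time(job, g):.4f} s, speedup={speedup:.2f}x")
```

Under these assumed parameters the speedup is sublinear in the number of GPUs, which is the kind of diminishing return that makes the choice of allocation matter to a scheduler.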
Table 8 reports average JCT and makespan in seconds for workloads of 2000 jobs as the cluster scales from 64 to 1024 nodes. As the number of nodes increases, average JCT and makespan decrease across all schedulers, reflecting increased parallel capacity. Entropy achieves the lowest average JCT and makespan at every scale in this simulated setting. At 1024 nodes, Entropy reduces average JCT from 593.3 s under Lucid to 444.0 s, a 25.2% reduction, and reduces makespan from 1324.0 s to 1084.5 s, an 18.1% reduction. These results suggest that the proposed policy preserves its advantage as the cluster size grows, consistently improving both per-job latency and end-to-end workload completion time under the simulator model.
5.1.5. Practical Deployment Discussion
From a practical deployment perspective, the proposed framework is designed to operate as a lightweight control layer on top of existing shared-cluster scheduling and synchronous data-parallel training workflows. Because the control decisions are applied at epoch boundaries, the method can be integrated without intrusive changes to the standard per-step training path. In production GPU clusters, however, deployment would still need to account for additional factors such as workload heterogeneity, varying contention patterns, scheduler policy constraints, and the trade-off between adaptation responsiveness and operational stability. These considerations suggest that deployment-oriented tuning of control frequency and parameter sensitivity will be important for robust operation in real multi-tenant environments.
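The integration pattern described above can be sketched as a small hook that only accumulates a statistic during training steps and invokes control logic once per epoch. This is a minimal illustration under stated assumptions: the class `EpochBoundaryController`, its method names, and the callback are hypothetical, and the signal is taken here to be the mean Shannon entropy of the softmax over logits, which may differ from the exact aggregation used by the proposed method.

```python
import math
from typing import Callable, Sequence


def softmax_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EpochBoundaryController:
    """Hypothetical hook: cheap per-step accumulation, control at epoch end."""

    def __init__(self, on_epoch_end: Callable[[float], object]):
        self._on_epoch_end = on_epoch_end  # assumed control callback
        self._sum = 0.0
        self._count = 0

    def observe(self, batch_logits: Sequence[Sequence[float]]) -> None:
        """Called once per step; only updates running sums."""
        for logits in batch_logits:
            self._sum += softmax_entropy(logits)
            self._count += 1

    def end_epoch(self) -> object:
        """Called at the epoch boundary; resets state and runs the callback."""
        mean = self._sum / self._count if self._count else 0.0
        self._sum, self._count = 0.0, 0
        return self._on_epoch_end(mean)


# Usage: the callback stands in for whatever adaptation or reporting
# a deployment attaches to the epoch boundary.
controller = EpochBoundaryController(lambda h: {"mean_entropy": h})
controller.observe([[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]])
decision = controller.end_epoch()
print(decision)
```

Keeping the per-step work to a running sum is what leaves the training loop unchanged; any policy decision is deferred to `end_epoch`.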
6. Conclusions
In this study, we presented Entropy, an entropy-guided hierarchical scheduling approach that couples micro-level adaptation and macro-level allocation through a shared uncertainty signal computed from logits and aggregated at epoch boundaries. Across our multi-tenant A5000 testbed, Entropy substantially improves cluster responsiveness and tail latency. In particular, Entropy reduces the 95th-percentile JCT from 15.03 h under Lucid to 11.39 h, corresponding to a 24.2% reduction, and lowers the median JCT from 6.72 to 4.60 h, corresponding to a 31.6% reduction. These gains are driven primarily by shorter queueing delay, indicating that uncertainty-guided admission and quota decisions mitigate contention-induced waiting while maintaining competitive learning outcomes across job types.
The current study also has several limitations. First, the empirical evaluation is centered on controlled multi-tenant settings and primarily vision-oriented workloads, which do not fully represent the diversity of emerging distributed training scenarios, such as NLP, multimodal learning, and large-model fine-tuning. Second, although the proposed framework is designed to operate with lightweight epoch-level control, the current evaluation does not cover all practical systems effects that may arise in real production deployments, including stronger heterogeneity in hardware, workload interference, and broader failure or straggler conditions. Third, the large-scale results are based on a simulator-driven evaluation, which is useful for controlled comparative analysis but does not replace full validation in a production-scale environment.
In future work, we will extend the approach to a broader set of NLP and multimodal workloads, including large language model fine-tuning and other long-context tasks where learning dynamics and resource sensitivity differ from vision benchmarks. We will also evaluate Entropy on more diverse tasks and larger-scale distributed datasets under mixed batch sizes, mixed sequence lengths, and more heterogeneous training regimes. Finally, we will study deployment in heterogeneous clusters that combine different GPU generations and interconnects, and we will explore robustness to practical systems effects such as workload interference, checkpointing overheads, and stragglers in large-scale distributed training.