On Optimized Scheduling Scheme for Rapid Pod Autoscaling in Kubernetes

Zhou, Bowen; Mondal, Subrota Kumar; Cheng, Yuning; Kabir, H. M. Dipu

doi:10.3390/app16052481

¹ School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, Macao 999078, China

² AI and Cyber Futures Institute, Charles Sturt University, Orange, NSW 2800, Australia

³ Rural Health Research Institute, Charles Sturt University, Orange, NSW 2800, Australia

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2026, 16(5), 2481; https://doi.org/10.3390/app16052481

Submission received: 18 December 2025 / Revised: 10 February 2026 / Accepted: 27 February 2026 / Published: 4 March 2026

Abstract

Kubernetes, an open-source project initiated by Google for managing and organizing containers in cloud platforms, has become the preferred choice for deploying large-scale containerized microservice architectures. Kubernetes employs a scheduler that considers constraints defined by workload owners and cluster managers to identify the most suitable node to host a given task. Although it can be configured in a multitude of ways, the default scheduler that comes with Kubernetes is not fully capable of efficiently handling the demands of Horizontal Pod Autoscaling (HPA), particularly when deploying a large number of similar pods simultaneously. This article focuses on the optimization of the Kubernetes scheduler to allocate and manage resources more efficiently in rapid Pod autoscaling scenarios. The scheduling mechanisms of Kubernetes offer considerable potential for improvement. This article introduces a custom scheduler that reduces redundant scoring steps using a caching mechanism, thereby accelerating the scheduling process for horizontal scaling of pods. The article begins with an in-depth literature review, followed by the development of novel algorithms to address existing gaps in the default scheduler. The custom scheduler is then subjected to rigorous simulation and testing phases to ensure its robustness and efficiency. Experimental results demonstrate the effectiveness of the proposed approach in improving the scheduling performance for HPA in Kubernetes.

Keywords:

Kubernetes; pod; cache; custom scheduler; HPA; rapid autoscaling

1. Introduction

Many businesses have adopted virtualization technology to improve the efficiency and reduce the cost of their information technology infrastructure [1]. This technology enables the rational allocation and distribution of infrastructural resources among numerous applications. Traditionally, virtual machines (VMs) have been used for this purpose [2]. Running an application on a VM typically means packaging it into a self-contained image that includes the application itself, all required dependencies, and the operating system it relies on. However, embedding a full operating system inside each VM leads to larger images and longer boot times. A lighter virtualization approach based on Linux containers helps address these limitations [3]. With containers, applications and their dependencies can be bundled into portable deployment units, while multiple containers share access to the host OS kernel. Because of this shared-kernel model, container images are usually much smaller and can be started significantly faster. Docker is an open-source project that provides a widely used implementation of Linux containers [4]. As container adoption has grown, many organizations have relied on Docker-based container frameworks to support continuous deployment and large-scale container management. Several orchestration systems are widely recognized in industry, including Apache Mesos, Docker Swarm, and Kubernetes [5]. In recent years, container orchestration platforms, especially Kubernetes, have become extremely popular [6]. Kubernetes has become the go-to choice for deploying, scaling, and managing containerized applications across environments, including data centers and cloud infrastructure. Its strength lies not in abstracting hardware but in scheduling tasks and distributing workloads efficiently among many nodes to achieve efficient resource usage and high availability.

Container orchestrators rely heavily on scheduling [7], the process that controls which computing node inside a cluster each container is allocated to at a given time. Effective resource scheduling is driven by storage optimization. However, as deployments become more complex and larger in scale, scheduling becomes challenging; while the default algorithms and strategies implemented within Kubernetes are strong, they may not always be the optimal fit for every situation or workload. Inadequate scheduling can result in the inefficient use of resources or the underutilization of nodes, which may lead to service disruptions. Given this context, it becomes crucial to revisit and optimize the scheduling mechanisms within Kubernetes. There is a need to develop strategies that are adaptable, responsive, and tailored to diverse requirements, so that applications not only remain available but also deliver peak performance.

1.1. Challenges

We now delve into the key challenges faced during the development of our custom Kubernetes scheduler and the significant contributions made to address these challenges:

Performance Efficiency: One of the primary challenges is enhancing the scheduler’s performance, particularly in environments requiring rapid scaling. The default Kubernetes scheduler at times struggles with scalability and speed due to its complex calculations and the absence of a state-persistent mechanism.
Cache Utilization: Implementing a caching mechanism that not only speeds up the scheduling process by avoiding redundant computations but also ensures that the cache remains up-to-date with the latest state of the cluster is a significant technical hurdle.
Integration with Existing Systems: Ensuring that the custom scheduler integrates seamlessly with the existing Kubernetes infrastructure without disrupting ongoing operations is crucial.
Maintaining Consistency and Reliability: The scheduler needs to make optimal scheduling decisions consistently under varying conditions while maintaining high reliability and fault tolerance.

To this end, our goal is to develop a caching mechanism, optimize the scheduling strategy, and enhance scalability, performance, and reliability.

1.2. Positioning and Contributions

In essence, we have the following contributions. This work targets rapid scale-out bursts in Kubernetes where a workload creates many homogeneous (or near-homogeneous) replica pods within a short interval. In this setting, the scheduler repeatedly executes an expensive Score phase over largely similar feasible-node sets, resulting in redundant computation and increased scheduling latency. We propose a rotating score-cache that reuses the node-score ranking across similar pods while controlling staleness and preserving placement diversity. Our main contributions are:

(1): A ScoreKey that captures scheduling-relevant pod features (resource requests and constraint hashes) to define similarity for score reuse;
(2): A rotating cache that returns only the current top-ranked node for a ScoreKey and pops it after each successful scheduling to avoid repeatedly selecting the same node;
(3): Correctness and freshness safeguards via feasible-set change detection and periodic cache refresh after a configurable consumption threshold; and
(4): An evaluation on a real GKE cluster comparing against the default scheduler, Koordinator, and YuniKorn under burst deployment.
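The interplay of contributions (1) to (3) can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper: the class names `ScoreKey` and `RotatingScoreCache`, the field layout, the SHA-256 feasible-set digest, and the `refresh_after` parameter are all hypothetical choices made for illustration, and `score_fn` stands in for the full Score phase.

```python
from collections import deque
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class ScoreKey:
    """Hypothetical pod-equivalence signature: requests plus a constraint hash."""
    cpu_millis: int
    memory_bytes: int
    constraint_hash: str


@dataclass
class _Entry:
    ranking: deque       # node names, best score first
    feasible_hash: str   # digest of the feasible set the ranking was built from
    consumed: int = 0    # nodes handed out since the last rescore


def feasible_set_hash(nodes):
    """Order-independent digest of the feasible node names."""
    return hashlib.sha256("\n".join(sorted(nodes)).encode()).hexdigest()


class RotatingScoreCache:
    def __init__(self, refresh_after):
        self.refresh_after = refresh_after  # consumption threshold (assumed knob)
        self._entries = {}

    def next_node(self, key, feasible_nodes, score_fn):
        """Return the current top-ranked node for `key` and pop it from the ranking."""
        if not feasible_nodes:
            return None
        digest = feasible_set_hash(feasible_nodes)
        entry = self._entries.get(key)
        stale = (
            entry is None
            or not entry.ranking                     # ranking exhausted
            or entry.feasible_hash != digest         # feasible set changed
            or entry.consumed >= self.refresh_after  # periodic refresh
        )
        if stale:
            # Only here is the (expensive) scoring function actually invoked.
            ranked = sorted(feasible_nodes, key=score_fn, reverse=True)
            entry = _Entry(deque(ranked), digest)
            self._entries[key] = entry
        entry.consumed += 1
        return entry.ranking.popleft()
```

In this sketch, consecutive pods that share a `ScoreKey` receive different nodes because each call pops the head of the ranking, and scoring is repeated only when one of the three staleness conditions holds.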

1.3. Difference to Prior Reuse/Batching Proposals

While prior work shares the high-level motivation of avoiding redundant scheduler computation, practical designs often do not persistently reuse the score ranking across a burst. In contrast, our approach explicitly maintains a reusable ranking keyed by a pod-equivalence signature (ScoreKey) and consumes it via a rotating “pop” mechanism to preserve diversity. Moreover, we incorporate explicit invalidation/refresh policies (feasible-set hashing and consumption-based refresh) to balance reuse benefits against staleness risks in dynamic clusters.

1.4. Paper Organization

This paper is structured to provide a comprehensive understanding of the challenges and advancements in Kubernetes scheduling, with a focus on developing a more efficient scheduling algorithm. Here is a breakdown of the document’s structure:

-: Background (Section 2): This section delves into the foundational concepts necessary for understanding Kubernetes and its components. It covers operating system-level virtualization, container orchestration, and the architecture of Kubernetes.
-: Scheduling in Kubernetes (Section 3): A detailed exploration of the current scheduling mechanisms in Kubernetes. This section discusses how the default scheduler operates and identifies its limitations, particularly in scenarios requiring rapid scaling.
-: Methodology (Section 4): This core part of the paper describes the methodology used to analyze the deficiencies of the default Kubernetes scheduler and to develop a custom scheduling solution. It includes the Deficiency Analysis of Default Scheduling Algorithm, an examination of where and why the current scheduler may fail to perform efficiently; and Custom Scheduling Algorithm Design, a presentation of the new scheduling algorithm, including the design and implementation of the caching mechanism and scheduling logic.
-: Experiment and Evaluation (Section 5): In this section, we describe the experimental setup and the methods used to evaluate the efficiency of the new scheduler compared to the default Kubernetes scheduler. This section details the hardware and software configurations, the experimental methodology, and the criteria for performance evaluation.
-: Conclusion and Future Work (Section 6): This section summarizes the findings and contributions of the paper. It reflects on the implications of the new scheduler for Kubernetes and suggests areas for future research.

2. Background

Virtualization is a technique for delivering virtualized services by abstracting and sharing resources that ultimately run on physical hardware [8]. In practice, it partitions the capabilities of a given resource—such as networking, storage, compute servers, or applications—so they can be used by multiple users or across distinct execution environments [9]. In particular, system-level virtualization makes it possible to create isolated execution spaces, namely containers, within a single operating system instance. On Linux platforms, this isolation is commonly achieved through mechanisms such as cgroups, namespaces, and chroot. This approach underpins container-based cloud offerings, often termed Container as a Service (CaaS) [10]. Today, all major cloud providers support deploying containers, for example, through Google Container Engine, Amazon Elastic Container Service (ECS), and Microsoft Azure Container Service. Moreover, organizations can also build container clusters on their own infrastructure by using widely adopted orchestrators such as Docker Swarm or Kubernetes [5].

A container consists of one or more processes that are isolated from the rest of the system, encompassing the application, its necessary libraries, and configuration files. Containers provide environments that ensure consistent execution, offer lightweight management of their lifecycle, and deliver performance that approaches that of hardware more closely than traditional virtual machines [11]. Currently, Docker is the most recognized container management tool. It is a layered platform that includes various software components designed for building, shipping, and running applications in containers, such as Docker Daemon, containerd, and Docker registry [12].

Figure 1a illustrates the relationships among container technologies used to manage container lifecycles on a Kubernetes worker node. At the foundation are OCI runtimes (e.g., runC, crun, and Kata runtime) that create and execute container processes [13]. Above them, container runtime daemons such as containerd and CRI-O manage images and container lifecycles, typically invoking an OCI runtime for low-level execution. The Open Container Initiative (OCI) specifications help ensure portability by standardizing image formats and runtime expectations.

At the cluster level, container orchestration automates scheduling, deployment, availability, load balancing, and networking. In Figure 1a, kubelet communicates with the container runtime daemon via the Container Runtime Interface (CRI). Kubernetes is compatible with any runtime that implements CRI, such as containerd or CRI-O.

2.1. Container Orchestration

As previously described, a container orchestrator is responsible for managing and organizing microservice architectures on a large scale, focusing on the automation and lifecycle management of containers and services within a cluster. Generally, end-users dispatch their jobs to the cluster manager master, a central component of the orchestration system. The cluster manager master allocates these tasks to the worker nodes within the compute cluster for execution. It’s important to recognize that jobs or applications typically consist of one or more services, and these services are made up of one or more diverse tasks that run on containers. The compute cluster represents a network of interconnected nodes, which may be either physical or virtual machines located in various environments, such as cloud or private clusters.

Container orchestration platforms are divided into two types: self-hosted and cloud-managed. Platforms like Borg, Mesos, and Kubernetes [16] fall under the self-hosted category, requiring comprehensive setup, configuration, and management on either physical or virtual setups. On the other hand, cloud-managed orchestrators are provided as services by cloud providers and demand only limited setup. Google Kubernetes Engine (GKE), Azure Kubernetes Service (AKS), and Amazon Elastic Kubernetes Service (EKS) are examples of cloud-managed solutions. Figure 1b displays the distinct elements of both self-hosted and cloud-managed orchestrators. The principal functions of these components are outlined briefly here, with more extensive descriptions available in [7], which discusses a standard framework for container orchestration systems.

A fundamental component of container orchestrators discussed in this paper is the scheduling mechanism. This module determines the optimal node within the cluster to execute incoming tasks at a specific moment. Essentially, scheduling involves deciding where and when to place a container based on system conditions such as resource availability, node preferences, and data locality. It also considers factors like energy usage, response times, or overall job completion time. Additionally, a rescheduler component is available that relocates tasks based on preemptive needs or to enhance load distribution and resource efficiency across the system.

It’s important to note that not every component is essential for a container orchestration system to function effectively. The resource allocation module secures cluster resources using either a fixed or adaptive method over time. Task distribution is managed by the load balancing module, which allocates tasks across containers using various strategies like cost-effectiveness, energy conservation, or priority, commonly employing a round-robin approach but allowing for alternative methods. The autoscaling module adjusts the scale of resources either horizontally by modifying the number of nodes, or vertically by adjusting resources per task, based on workload demands.

Kubernetes, for example, uses the Horizontal Pod Autoscaler to scale workloads by tuning the number of pods according to defined CPU and memory indicators.

In addition, the admission control component verifies that the cluster can accommodate incoming user workloads without violating configured quota limits. At the same time, the accounting component records and manages resource assignments on a per-user basis, while the monitoring component continuously observes current resource consumption and gathers the data needed to keep the system stable and healthy, including support for fault-tolerant operation.

2.2. Kubernetes Architecture

Kubernetes, often abbreviated as K8s, is a free, open-source platform designed to automate the setup, scaling, and management of applications housed in containers across various servers. It originated from Google’s Borg project [17], which accumulated over ten years of expertise in managing scalable applications in real-world operations. Since its public release in 2014, Kubernetes has continuously expanded its reach and capabilities.

Systems running on Kubernetes often incorporate various auxiliary tools from the Kubernetes ecosystem to enhance and streamline their operations. This expanding ecosystem includes tools such as Helm for deployment management (https://helm.sh/, accessed on 17 December 2025), Istio for service mesh oversight (http://istio.io/, accessed on 17 December 2025), Prometheus for monitoring (https://prometheus.io, accessed on 17 December 2025), Grafana for analytics (https://grafana.com, accessed on 17 December 2025), and Kibana for logging [18]. A comprehensive overview of the Kubernetes ecosystem is available at https://landscape.cncf.io (accessed on 17 December 2025). While detailing their specific functionalities is outside the scope of this text, it is important to highlight that numerous scheduling strategies rely on these tools to obtain real-time system data or to forecast resource utilization [19,20,21,22].

Architecturally, a Kubernetes (K8s) cluster is composed of a group of nodes, which are either physical or virtual machines, working collectively as a unified system (refer to Figure 2). These nodes are categorized by role, specifically into master nodes that orchestrate the cluster operations, and worker nodes that supply the computational resources. The cluster uses software-defined overlay networks, like Flannel or Calico, to assign each pod and service a unique IP address [23]. Pods represent the fundamental operational units within the cluster.

The master node coordinates the cluster, whereas the worker nodes supply the necessary resources. While a single master node can manage a cluster, three master nodes are common in setups requiring high availability (HA). Kubernetes employs an event-driven, declarative framework built from loosely coupled components. The master node consists of several key components:

-: etcd: This consistent key-value store persistently maintains the cluster’s state, including its intended state.
-: Scheduler: It allocates pods across available worker nodes.
-: API Server: This serves as the communication hub for issuing commands and managing Kubernetes objects, which are durable entities denoting the cluster’s state. The API server facilitates a RESTful HTTP API that describes objects in JSON or YAML formats. Commands can also be sent to the API server using Kubernetes’ command-line interface (CLI), kubectl.
-: Controller Manager: It watches the cluster state through the API server and drives the system towards the desired state. Well-known Kubernetes controllers such as ReplicaSet, Deployment, Job, and DaemonSet offer various functionalities, including maintaining availability, enabling rollbacks, managing task execution, and ensuring a pod is active on every node.

These controllers interact through modifications tracked by the API server and react to system events with the help of informers. They monitor the progress of deployments and execute required measures to guarantee their proper functioning. When a new pod needs to be deployed, the scheduler determines the optimal worker node to host the pod, based on the system logic.

Conversely, worker nodes handle the operation of application pods. Specifically, kubelet serves as the node agent tasked with managing the lifecycle of pods and overseeing the status of both the pods and the node itself.

3. Scheduling in Kubernetes

The Kubernetes scheduler performs three main functions [24]: 1. watching the cluster for pods that have not yet been scheduled, 2. determining the most appropriate node for each pending pod, and 3. binding that pod to the chosen node. In our customized scheduler, “most appropriate” specifically means the node with the largest amount of free RAM. Accordingly, the node-selection step reflects our modification, which explicitly favors the node offering the highest available memory. This goal is implemented through the following steps:

(1): Identify all nodes that can host the pod, i.e., nodes whose available CPU and memory satisfy the pod’s requirements.
(2): For every eligible node, retrieve the relevant metric information (memory).
(3): Select only the node(s) with the strongest (largest) metric value.

The pod life cycle is illustrated in Figure 3.
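The three selection steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the `Node` record, its field names, and the `pick_node` helper are hypothetical, and in a real cluster the free-memory metric would be obtained from the cluster's metrics pipeline rather than passed in directly.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millis: int    # unallocated CPU, in millicores
    free_memory_bytes: int  # unallocated memory, in bytes


def pick_node(nodes, cpu_millis, memory_bytes):
    """Return the eligible node with the most free memory, or None if none fits."""
    # Step 1: keep only nodes whose free CPU and memory satisfy the pod.
    eligible = [
        n for n in nodes
        if n.free_cpu_millis >= cpu_millis and n.free_memory_bytes >= memory_bytes
    ]
    if not eligible:
        return None
    # Steps 2 and 3: read the memory metric and select the largest value.
    return max(eligible, key=lambda n: n.free_memory_bytes)
```

Note that `max` returns the first of several equally good nodes; a tie-breaking rule would have to be added if ties matter.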

3.1. User Specifications

Users have the ability to set various parameters defining the criteria that the scheduler must meet. These parameters include constraints at the node, namespace, or pod level that function as criteria for admission control. Below, a concise overview is provided.

At the node level, control over which pods are deployed on specific nodes can be managed through affinity and taint mechanisms [21]. Affinity attracts certain pods towards specific nodes, whereas a taint repels them. The counterpart of a taint is a toleration, which is declared on the pod rather than on the node: a pod can be placed on a tainted node only if it carries a toleration that matches that taint; otherwise, the scheduler assigns the pod to a node whose taints it does tolerate.
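In Kubernetes, taints are set on nodes and tolerations are declared on pods. The matching rule can be sketched as follows. This is a simplified Python model rather than Kubernetes code: the `Taint` and `Toleration` records and the helper names are illustrative, and only the hard effects (`NoSchedule`, `NoExecute`) are treated as blocking placement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key matches every taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration, taint):
    """True if this single toleration matches this single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


def can_schedule(pod_tolerations, node_taints):
    """A pod may land on a node only if every hard taint is tolerated."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)
```

A `PreferNoSchedule` taint is a soft preference, so in this sketch it never excludes a node outright.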

At the pod level, it’s possible to define the resource requirements for a container through the use of request and limit properties. These settings determine the minimum and maximum resources—typically CPU and memory—that a pod requires to operate. The request setting is enforced strictly, ensuring that the sum of resources allocated to existing containers, plus any new requests, does not surpass the node’s resource capacity. However, a container may temporarily exceed the limits specified for it.

It is important to note that under certain circumstances, a pod may need to be terminated due to resource shortages. For instance, if a node with 2 CPU cores already has 1.5 cores allocated, any new pod requesting more than 0.5 cores would push the node’s utilization above its capacity. Kubernetes handles this situation differently based on the Quality of Service (QoS) class assigned to the pods. Pods in the best-effort class, which specify neither requests nor limits, are the first candidates for termination. Conversely, pods in the guaranteed class have their request and limit values set equal, which makes them the least likely to be terminated. Additionally, there is a burstable class, covering pods that set at least one request or limit without meeting the guaranteed criteria, which secures them the requested minimum level of resources. Properly setting these values is crucial to prevent undesirable terminations during critical operations.
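The classification into the three QoS classes can be sketched as a small Python function. This is an illustrative model under stated assumptions: each container is represented as a plain dictionary with optional `requests` and `limits` maps, quantities are assumed to be already normalized to comparable numbers, and only CPU and memory are considered.

```python
def qos_class(containers):
    """Classify a pod, given as a list of container dicts, into a QoS class.

    Each container may carry 'requests' and 'limits' maps keyed by
    'cpu' and 'memory' (hypothetical, simplified representation).
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a pod whose only container sets a CPU request of 250 millicores and nothing else is classified as Burstable, because it has a request but no matching limits.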

Lastly, within the scope of a Kubernetes namespace, it is possible to set the resource requests and limits for pods. These parameters can be established through the use of LimitRange and ResourceQuota properties assigned to a namespace.

3.2. Internal Workflow

In Kubernetes, the process for scheduling resources operates as follows: A user submits a request to create a pod with specific computational resources. This request is received by the API server on the Kubernetes master node. The kube-scheduler, which is part of the master node, picks up the pending pod and decides on the appropriate worker node to host it. The resulting binding is recorded through the API server, and the kubelet running on the designated worker node, which watches for pods bound to its node, starts the pod. If the pod’s specification includes defined computing resources, the kubelet employs cgroups and tc to allocate the necessary resources. Once the pod is successfully set up, the kubelet reports its operational status to the API server. Users can alter the resource configurations at any time by sending updated YAML files to the master node. It’s important to note that the Kubernetes API server employs optimistic concurrency to handle simultaneous update attempts, rejecting the later one to avoid conflict. Essentially, the scheduler, a core component of the Kubernetes control plane located on the master node, ensures that pods are optimally placed on nodes. Although multiple schedulers can exist within a cluster, the kube-scheduler serves as the default scheduler in Kubernetes, providing a standard for scheduling operations.
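The optimistic concurrency behavior mentioned above can be illustrated with a toy in-memory store. This is only a sketch of the general technique: the `ObjectStore` and `Conflict` names are hypothetical, and the integer version counter is a simplification of the opaque resource version that Kubernetes attaches to each object.

```python
class Conflict(Exception):
    """Raised when an update was prepared against an outdated version."""


class ObjectStore:
    def __init__(self):
        self._objects = {}  # name -> (version, spec)

    def get(self, name):
        """Return (version, spec); version 0 means the object does not exist yet."""
        return self._objects.get(name, (0, None))

    def update(self, name, spec, expected_version):
        """Apply the update only if nobody else has written in the meantime."""
        current_version, _ = self.get(name)
        if expected_version != current_version:
            raise Conflict(
                f"{name}: expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._objects[name] = (new_version, spec)
        return new_version
```

If two clients read the same version and both try to write, the first write succeeds and bumps the version, and the second is rejected with `Conflict`; the rejected client is expected to re-read the object and retry.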

Default Scheduler

Kubernetes features a fundamental component known as the default scheduler, which functions as a standard dispatcher to assign pods to appropriate worker nodes as shown in Figure 4. The operation of the default scheduler is illustrated as follows:

(1): There exists a queue within the scheduler, known as the podQueue, which continually monitors the API Server for any updates.
(2): Upon the creation of a pod, its metadata is initially recorded in the etcd via the API Server.
(3): Operating akin to a controller, the default scheduler observes these updates and reacts by adjusting the state accordingly. It specifically scans for pods that haven’t been assigned to any node in etcd, adding each detected unassigned pod to the podQueue.
(4): The primary function of the scheduler is to methodically remove pods from the podQueue and allocate them to the most fitting nodes available for their execution.
(5): Once a pod is assigned to a node, this binding is updated in the etcd and communicated to the kubelet on the respective worker node.
(6): The kubelet, which operates on the worker node and keeps track of pod assignments, then initiates the execution of the newly assigned pod, thus beginning its operation on the node.

The scheduler’s core logic rotates through the nodes using a round-robin method, and for each pod awaiting assignment, it executes steps of filtering and scoring to determine the optimal node.

Filtering step: The selection process during the filtering stage involves narrowing down the worker nodes that meet the specific requirements of a pod, based on a predefined set of rules. This stage primarily utilizes predicates—boolean functions that determine if a pod is compatible with a given worker node. These requirements are defined using node labels within the pod’s configuration. For instance, including the disktype:ssd label in a Kubernetes pod’s YAML file dictates that the pod should only be hosted on nodes equipped with solid-state drives. Several policies are implemented at this stage in Kubernetes, including PodFitsHostPorts, which verifies the availability of a requested port on the node; PodFitsResources, ensuring the node possesses the necessary available resources such as CPU or memory; PodFitsHost, which mandates the pod to be deployed on a specified node; and CheckNodeCondition, which assesses the node’s health, ensuring factors like network availability and kubelet readiness.
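The idea of predicates as boolean functions can be sketched in Python as follows. The `Node` and `Pod` records and the four predicate functions are simplified stand-ins written for illustration; they mirror the intent of the policies named above but are not the Kubernetes implementations.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict
    free_cpu_millis: int
    free_memory_bytes: int
    used_ports: set = field(default_factory=set)
    ready: bool = True


@dataclass
class Pod:
    cpu_millis: int
    memory_bytes: int
    node_selector: dict = field(default_factory=dict)
    host_ports: set = field(default_factory=set)


def matches_node_selector(pod, node):
    # e.g. node_selector={"disktype": "ssd"} keeps only SSD-labelled nodes
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


def fits_resources(pod, node):
    return (node.free_cpu_millis >= pod.cpu_millis
            and node.free_memory_bytes >= pod.memory_bytes)


def fits_host_ports(pod, node):
    return not (pod.host_ports & node.used_ports)


def node_is_ready(pod, node):
    return node.ready


PREDICATES = (matches_node_selector, fits_resources, fits_host_ports, node_is_ready)


def filter_nodes(pod, nodes):
    """Keep the nodes for which every predicate returns True."""
    return [n for n in nodes if all(p(pod, n) for p in PREDICATES)]
```

Because each predicate is an independent boolean function, adding a new filtering rule in this sketch amounts to appending one more function to `PREDICATES`.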

Scoring step: This phase evaluates each potential node by assigning scores based on specific criteria and metrics. Nodes that are better suited for a pod receive higher scores. The node that accumulates the highest score is selected for pod placement. During this phase, nodes are assessed based on various aspects such as resource availability, current load, and overall node condition. Scoring can follow a range of predefined policies or can be tailored through custom policies defined by the user [25]. Commonly, nodes are scored on how much free CPU and memory they have available, with the LeastRequestedPriority policy as a typical method. Other strategies may focus on ensuring balanced resource utilization across nodes (BalancedResourceAllocation policy), spreading pods from the same service across different nodes (SelectorSpreadPriority policy), or distributing pods randomly (RandomDistribution policy). Furthermore, node preferences can be specified through node affinity or anti-affinity configurations, such as with the NodeAffinityPriority policy.
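As an illustration of scoring by free resources, the following Python sketch uses one common formulation of a least-requested score: the free fraction of each resource is scaled to the range 0 to 10, and the CPU and memory scores are averaged. The function names are illustrative, and the exact formula used by a given Kubernetes version may differ.

```python
def least_requested_score(requested, capacity):
    """Score in [0, 10]: higher when a larger fraction of capacity is still free."""
    if capacity <= 0 or requested > capacity:
        return 0.0
    return (capacity - requested) * 10 / capacity


def node_score(cpu_requested, cpu_capacity, mem_requested, mem_capacity):
    """Average of the CPU and memory least-requested scores."""
    cpu = least_requested_score(cpu_requested, cpu_capacity)
    mem = least_requested_score(mem_requested, mem_capacity)
    return (cpu + mem) / 2
```

For example, a node with 4000 millicores and 8 GiB of which 3000 millicores and 4 GiB are requested scores (2.5 + 5.0) / 2 = 3.75, whereas an otherwise identical node with only 1000 millicores and 2 GiB requested scores 7.5 and would be preferred.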

4. Methodology

The default scheduling algorithm in Kubernetes can result in significant scheduling time overhead, particularly when the cluster is large or numerous pods enter the scheduling queue simultaneously [26]. This issue primarily stems from the preference (scoring) phase of the scheduling process, where multiple priority strategies are applied to the list of nodes that may host the pod. These strategies typically employ a serialized scoring mechanism to rank each node according to its suitability for the pod based on various criteria.

4.1. Deficiency Analysis of Default Scheduling Algorithm

When the cluster size is substantial, or there is a high volume of pods requiring scheduling, the serialized nature of this scoring process becomes a bottleneck. Each priority strategy needs to evaluate all nodes against specific metrics or conditions, which can be computationally intensive and time-consuming. As a result, the scheduler cannot quickly produce a reasonable allocation plan, which increases latency in pod placement.

Furthermore, the complexity of node evaluation increases with the number of affinity rules, taints, and tolerations, as well as resource requests and limits defined for each pod. Each additional parameter can markedly increase the time it takes to score and rank nodes, particularly in diverse and resource-intensive environments.

Optimizations such as parallel processing of node scoring, caching frequently accessed data, and reducing the complexity of affinity rules could mitigate these issues. Additionally, adopting more dynamic and adaptive scheduling policies that prioritize nodes based on real-time data and previous scheduling decisions might enhance performance and reduce time overheads in large-scale Kubernetes deployments.

4.1.1. Default Scoring Algorithm

During the Scoring phase, each node that passed the Filtering phase is then scored based on a set of scoring functions or rules [27]. These rules consider several factors:

NodeSelectorPriority: This scheduling strategy ranks nodes based on the degree of label matching between the nodes and the pods. Nodes that have a higher match score with the labels specified in the pod specifications receive a higher priority. This approach ensures that pods are scheduled on nodes that meet specific labeling criteria, enhancing the effectiveness of resource utilization and node selection based on predefined attributes.

NodeAffinityPriority: This strategy evaluates the node affinity rules specified in a pod’s definition. If a node meets the criteria outlined in the node affinity settings, it is assigned a higher priority. Node affinity allows pods to specify conditions on node characteristics that must be met for the pod to be placed on the node, such as requiring nodes in specific clusters or zones that have certain performance characteristics.

PodFitsResourcesPriority: This strategy calculates the amount of resources remaining on each node and prioritizes nodes based on the availability of these resources. Nodes with the most available resources are given the highest priority, ensuring that pods are scheduled on nodes that are capable of meeting their resource demands without overloading any single node.

Balanced Resource Allocation Priority: During load balancing, this strategy considers whether the resource utilization rate across all nodes is uniform. If so, it will distribute the load evenly. This strategy is particularly useful for pods that require long-running stable environments, as it aims to ensure that no single node is overburdened, thus maintaining performance and reliability across the cluster.

Each priority strategy corresponds to its own scoring function, which is assigned a weight and returns a score for a given node. The final score of a node is the weighted sum of these per-function scores, as shown in Equation (1):

FinalScoreNode = \sum_{i = 1}^{n} {weight}_{i} \times {priorityFunc}_{i}

(1)

In Equation (1), FinalScoreNode denotes the final score of the node, n is the number of priority functions, ${weight}_{i}$ denotes the weight of the i-th function, and ${priorityFunc}_{i}$ denotes the score returned by that function. Together, the pre-selection (filtering) and preference (scoring) strategies select the most suitable node for a Pod. They measure node suitability using conditions such as resource utilization, node affinity, and node selectors, so as to meet the needs of different application scenarios.

After calculating the final scores for each node, these scores are normalized into a range between 0 and 10 to maintain a standard scale for evaluation. Based on these normalized scores, all the nodes are then ranked in a priority queue. The node that emerges at the top of this queue, having the highest score, is selected as the outcome of the scheduling decision and proceeds to the subsequent phase of the scheduling process.
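The weighted sum in Equation (1) and the subsequent normalization can be illustrated with a minimal Python sketch. The two priority functions, their weights, and the node names are hypothetical, and the normalization shown (dividing by the maximum and scaling to 10) is only one plausible scheme, not necessarily the one used by Kubernetes.

```python
from typing import Callable, Dict, List, Tuple

# A priority function maps a node name to a raw score (hypothetical signature).
PriorityFunc = Callable[[str], float]


def final_scores(nodes: List[str],
                 funcs: List[Tuple[float, PriorityFunc]]) -> Dict[str, float]:
    """Equation (1): weighted sum of priority-function scores for each node."""
    return {n: sum(w * f(n) for w, f in funcs) for n in nodes}


def normalize(scores: Dict[str, float], top: float = 10.0) -> Dict[str, float]:
    """Rescale so the best node receives `top` (assumed scheme: divide by max)."""
    best = max(scores.values(), default=0.0)
    if best <= 0:
        return {n: 0.0 for n in scores}
    return {n: top * s / best for n, s in scores.items()}


def rank(scores: Dict[str, float]) -> List[str]:
    """Order nodes by descending score, breaking ties by node name."""
    return sorted(scores, key=lambda n: (-scores[n], n))


# Hypothetical inputs: free CPU cores and a 0/1 label-match indicator.
free_cpu = {"node-a": 4.0, "node-b": 2.0, "node-c": 1.0}
label_match = {"node-a": 0.0, "node-b": 1.0, "node-c": 0.0}
funcs = [(1.0, lambda n: free_cpu[n]), (3.0, lambda n: label_match[n])]

raw = final_scores(list(free_cpu), funcs)
scaled = normalize(raw)
order = rank(scaled)
```

In this example node-b wins despite having less free CPU, because the label-match function carries a larger weight.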

4.1.2. Horizontal Pod Autoscaling (HPA) Burst Scaling Scenarios

Definition and Background

The Horizontal Pod Autoscaler is a Kubernetes control-plane mechanism that automatically adjusts the number of replicas of a workload (e.g., Deployment/ReplicaSet) based on observed metrics. In each control loop, HPA evaluates one or more signals (e.g., CPU utilization, memory utilization, or custom/external metrics such as QPS, request latency, or queue length) and computes a target replica count. When the current replica count is below the target, HPA issues scale-up actions, which results in the creation of additional Pods that must be scheduled and bound to nodes.

What We Mean by “HPA Scenarios”

In this paper, HPA scenarios refer to short-time burst scale-out events where a workload quickly increases replicas and produces a large batch of Pods within a small time window. Concretely, these scenarios have the following properties:

Burstiness: Many Pods are created nearly back-to-back (often within seconds), producing a spike of pending Pods in the scheduler queue.
Homogeneity: The new Pods are typically replicas of the same template and hence share nearly identical scheduling-relevant properties (resource requests/limits, labels/selectors, and often the same affinity/toleration/topology constraints).
Strict latency requirement: Service quality depends on how quickly new replicas become Ready; therefore, scheduling latency becomes part of end-to-end scale-out latency.

Typical Application Scenarios

HPA burst scaling commonly appears in cloud-native deployments where load changes faster than manual capacity planning. Representative scenarios include:

Traffic spikes in online services: Sudden increases in requests (e.g., flash sales, ticketing events, time-limited promotions) cause CPU/QPS/latency metrics to cross the HPA threshold, triggering rapid replica expansion.
Event-driven backlogs: Message queues or stream processors scale out when queue length grows (custom metrics), producing many workers with identical Pod templates.
Multi-tenant microservices: A shared platform experiences correlated load increases across multiple services, creating concurrent bursts of similar replicas.
Failover and recovery: Node restarts or transient failures may temporarily reduce capacity; when capacity recovers, controllers may recreate many missing replicas quickly.

Why Scheduling Becomes a Bottleneck in HPA Bursts

Although HPA is responsible for computing the desired replica count, the actual scale-out speed is bounded by the time it takes for those new Pods to be scheduled and bound. During a burst, the scheduler must handle a large number of pending Pods while the cluster state is simultaneously changing due to recent bindings. This creates an unfavorable setting: high arrival rate of Pods and expensive per-Pod scheduling work.

Default Scheduler Pipeline

For each pending Pod, the default Kubernetes scheduler performs a standard pipeline:

Snapshot and feasibility: Obtain a scheduling snapshot and run PreFilter/Filter (and extenders if enabled) to produce a feasible node set F.
Scoring: Run enabled Score plugins over (typically all) feasible nodes to produce node scores, normalize them, and compute a total score for each node.
Selection and binding: Select a host (including tie-breaking) and then execute the binding cycle (Reserve/Permit/Bind and related hooks).

The computational hot spot in HPA bursts is frequently the Score stage, because it scales with the number of feasible nodes and must evaluate multiple plugins for each node.

Limitations of the Default Scheduler Under HPA Burst Workloads

In burst scale-out, many incoming Pods are replicas of the same template. However, the default scheduler processes each Pod largely independently, which leads to several inefficiencies:

Redundant scoring for near-identical Pods. When a workload creates many replicas, the scheduling-relevant inputs of consecutive Pods (requests, constraints, and often feasible node sets) are highly similar. Nevertheless, the default scheduler re-executes the full scoring pipeline for each Pod, repeatedly evaluating the same Score plugins over largely the same node set. This causes redundant computation that does not directly improve the decision quality for every replica.
Score-stage cost grows with cluster size and plugin complexity. Let $| F |$ denote the number of feasible nodes and let $P$ be the set of enabled Score plugins. The per-Pod scoring cost is roughly proportional to $| F | \cdot | P |$ , and some plugins (e.g., those involving affinity/topology reasoning or non-trivial resource calculations) are significantly more expensive than simple resource checks. In a burst, this cost is multiplied by the number of new replicas, amplifying scheduler CPU consumption and latency.
Queueing and throughput degradation under burst arrivals. HPA can create Pods faster than the scheduler can complete scoring and binding, which increases the pending queue length. As queue length grows, Pods spend more time waiting before entering the scheduling cycle, increasing the overall scale-out time (create → scheduled → ready). In practice, this queueing delay can dominate for large bursts.
Limited reuse across consecutive scheduling events. While the scheduler maintains internal caches for certain computations, the default pipeline does not provide a general mechanism to reuse node-score rankings across consecutive Pods that are scheduling-equivalent. As a result, even when consecutive Pods would yield nearly identical node rankings, the scheduler recomputes scores from scratch rather than amortizing the cost across a burst.

HPA burst scale-out produces a large number of highly similar Pods in a short interval, making the scheduler’s per-Pod scoring overhead a key contributor to end-to-end scaling latency. The default scheduler’s largely independent processing of each Pod leads to redundant scoring work, reduced throughput, and queueing delay, motivating scheduler-side optimizations that can amortize repeated computations while preserving correctness under dynamic cluster conditions.
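To make the $| F | \cdot | P |$ cost model concrete, the following Python sketch counts Score-plugin invocations for a burst. The burst size, node count, and plugin count are hypothetical, and the model assumes every plugin invocation costs the same, which the discussion above notes is not true in practice.

```python
def burst_score_calls(replicas: int, feasible_nodes: int, score_plugins: int) -> int:
    """Score-plugin invocations when each replica is scored independently.

    Implements R * |F| * |P|, treating all plugin invocations as equal cost.
    """
    if min(replicas, feasible_nodes, score_plugins) < 0:
        raise ValueError("counts must be non-negative")
    return replicas * feasible_nodes * score_plugins


# Hypothetical burst: 200 replicas, 100 feasible nodes, 8 enabled Score plugins.
per_pod = burst_score_calls(1, 100, 8)
whole_burst = burst_score_calls(200, 100, 8)
```

Under these assumed numbers a single Pod needs 800 plugin invocations, while the burst needs 160,000, most of which re-evaluate the same plugins over largely the same node set.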

4.2. Custom Scheduling Algorithm Design

The proposed custom scheduler algorithm is designed to optimize the scheduling process by reusing previously computed scores. This design addresses the efficiency issues in rapid scaling scenarios by avoiding redundant computations and reducing resource usage and time overhead. The core idea is to maintain a cache of node scores in a priority queue, allowing for swift retrieval of the highest-scored node for upcoming pod assignments.

4.2.1. Similarity Definition and Cache Applicability

A central requirement for score reuse is to precisely define when two Pods can safely share cached scoring results. In Kubernetes, feasibility and scoring can be affected by resource requests, selectors, (anti-)affinity rules, topology constraints, tolerations, and scheduler profile configuration. We therefore define a conservative equivalence signature, ScoreKey, constructed from scheduling-relevant Pod fields.

ScoreKey (Pod Equivalence Signature)

For a Pod p, we define:

\begin{matrix} ScoreKey (p) = & (schedulerName, priorityClass, runtimeClass, \\ & cpuMilli, memBytes, gpuCount, \\ & H (nodeSelector), H (affinity), H (tolerations), H (topologySpread)) . \end{matrix}

(2)

where cpuMilli, memBytes, and gpuCount are aggregated resource requests (containers plus pod overhead), and $H (\cdot)$ denotes a stable deep-hash over the corresponding structured field. Two Pods are considered similar if and only if their ScoreKeys are identical:

p_{1} \sim p_{2} \Leftrightarrow ScoreKey (p_{1}) = ScoreKey (p_{2}) .

(3)

This strict equality is intentionally conservative: if any scheduling-relevant constraint differs, score reuse is disabled for that pair.
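A minimal Python sketch of Equations (2) and (3) is given below. It is an illustration rather than the authors' implementation: the `PodSpec` class is a hypothetical, simplified stand-in for the Kubernetes Pod object, and $H (\cdot)$ is assumed to be SHA-256 over a canonical JSON encoding.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import Any, Tuple


def stable_hash(field: Any) -> str:
    """H(.): hash of a JSON-serializable field, independent of dict key order."""
    canonical = json.dumps(field, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class PodSpec:
    """Hypothetical, simplified view of the scheduling-relevant Pod fields."""
    scheduler_name: str
    priority_class: str
    runtime_class: str
    cpu_milli: int        # aggregated CPU request (containers plus overhead)
    mem_bytes: int        # aggregated memory request
    gpu_count: int
    node_selector: dict
    affinity: dict
    tolerations: list     # list order is preserved, so reordering changes the key
    topology_spread: list


def score_key(p: PodSpec) -> Tuple:
    """Equation (2): scalar fields plus hashes of the structured constraint fields."""
    return (
        p.scheduler_name, p.priority_class, p.runtime_class,
        p.cpu_milli, p.mem_bytes, p.gpu_count,
        stable_hash(p.node_selector), stable_hash(p.affinity),
        stable_hash(p.tolerations), stable_hash(p.topology_spread),
    )


def similar(p1: PodSpec, p2: PodSpec) -> bool:
    """Equation (3): Pods are similar iff their ScoreKeys are identical."""
    return score_key(p1) == score_key(p2)


base = PodSpec("cache-scheduler", "default", "", 500, 256 * 1024**2, 0,
               {"disk": "ssd", "zone": "a"}, {}, [], [])
same = replace(base, node_selector={"zone": "a", "disk": "ssd"})  # keys reordered
bigger = replace(base, cpu_milli=1000)                            # different request
```

Here `base` and `same` share a ScoreKey because the selector differs only in key order, whereas `bigger` falls into a different equivalence class. Treating reordered list fields as different keys errs on the conservative side.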

Feasibility-Context Fingerprint (feasibleHash)

Even for identical Pods, cached rankings may become invalid when the feasible node set changes due to cluster dynamics. Let F be the feasible node set returned by the standard Filter stage for Pod p. We compute an order-independent fingerprint:

feasibleHash (F) = H (sort (\{ n . name \mid n \in F \})) .

(4)

A cached ranking is applicable only when both the ScoreKey matches and the feasibility context matches:

Applicable (p, F, E) \Leftrightarrow (E . key = ScoreKey (p)) \land (E . feasibleHash = feasibleHash (F)) .

(5)

If the feasibility fingerprint differs, the cache entry is invalidated and the scheduler recomputes scores from scratch.
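Equations (4) and (5) can be sketched in Python as follows. The choice of SHA-256 and of a newline separator is an assumption made for illustration (node names are DNS-style names and cannot contain newlines), and `CacheEntry` is a hypothetical, reduced form of the cache entry.

```python
import hashlib
from typing import Hashable, Iterable, NamedTuple


def feasible_hash(node_names: Iterable[str]) -> str:
    """Equation (4): order-independent fingerprint of the feasible node set."""
    joined = "\n".join(sorted(node_names))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class CacheEntry(NamedTuple):
    """Hypothetical reduced entry: only the fields needed for the applicability test."""
    key: Hashable
    feasible_hash: str


def applicable(pod_key: Hashable, feasible: Iterable[str], entry: CacheEntry) -> bool:
    """Equation (5): reuse only if both the ScoreKey and the feasible set match."""
    return entry.key == pod_key and entry.feasible_hash == feasible_hash(feasible)


entry = CacheEntry(key="k1", feasible_hash=feasible_hash(["node-a", "node-b"]))
hit = applicable("k1", ["node-b", "node-a"], entry)      # same set, other order
miss_nodes = applicable("k1", ["node-a"], entry)         # node-b dropped out
miss_key = applicable("k2", ["node-a", "node-b"], entry) # different ScoreKey
```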

4.2.2. Caching Mechanism

We introduce a score reuse layer that accelerates the Score phase for bursts of similar pods. Instead of recomputing scores for every pod, the scheduler caches a sorted list of node-score results for an equivalence class of pods. For each scheduling cycle, if the cache contains a valid entry for the current pod’s equivalence key and the current feasible node set, the scheduler returns only the top-ranked node and advances the cached cursor by popping that element. This “rotating” behavior reduces repeated scoring while preventing repeated selection of a single node across many replicas.

Cache Entry Structure and Update Rules

For each ScoreKey k, the cache stores one entry $E_{k} = (feasibleHash, L, initialN, popped)$, where L is a list of node-score records sorted in descending order by the total framework score (with a deterministic tie-break on node name), initialN is the list length at insertion time, and popped counts how many candidates have been consumed since the last recomputation.

Insertion (Seeding)

On a cache miss or after invalidation, the scheduler executes the standard Kubernetes scoring pipeline once over all feasible nodes F (including all enabled Score plugins and normalization), producing a list of scored nodes. The scheduler then sorts this list into L and seeds the cache entry with $popped = 0$ and $initialN = | L |$ under the current feasibility fingerprint.

Selection (Pop-and-Advance Top-1)

On a cache hit, the scheduler returns only the first element of L (Top-1) as the scheduling decision and removes it from L (pop). This “pop-and-advance” policy enables repeated similar Pods to reuse the remaining high-quality candidates without re-running full scoring, while avoiding repeatedly choosing the same node for every replica.

Invalidation and Bounded Refresh

A cache entry is discarded and recomputed when any of the following holds: (i) feasibility context changes (feasibleHash mismatch), (ii) the cached list becomes empty ($| L | = 0$), or (iii) the consumed fraction exceeds a threshold $\tau$:

\frac{popped}{initialN} > \tau,

(6)

where we use $\tau = 0.5$ by default. This consumption-based refresh bounds staleness under dynamic resource changes.

Capacity Control and Concurrency

The cache enforces a maximum number of entries (MaxEntries); when the capacity is reached, an existing entry is evicted (a simple eviction policy; replaceable by LRU without changing the algorithmic interface). To avoid redundant recomputation under concurrent scheduling workers, we use an in-flight guard so that at most one recomputation per ScoreKey proceeds at a time, while other workers wait and reuse the published result.
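The capacity control described above could be realized, for example, with the LRU variant that the text mentions as a drop-in replacement. The sketch below is illustrative only: the class name `BoundedScoreCache` is hypothetical, and it does not reproduce the simple eviction policy of the actual implementation.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class BoundedScoreCache:
    """Keeps at most `max_entries` entries; evicts the least recently used one."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is not None:
            self._entries.move_to_end(key)   # mark as most recently used
        return entry

    def put(self, key: Hashable, entry: Any) -> None:
        self._entries[key] = entry
        self._entries.move_to_end(key)
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # drop the least recently used key

    def __len__(self) -> int:
        return len(self._entries)


cache = BoundedScoreCache(max_entries=2)
cache.put("key-a", ["node-1", "node-2"])
cache.put("key-b", ["node-3"])
cache.get("key-a")                 # key-a becomes most recently used
cache.put("key-c", ["node-4"])     # capacity exceeded: key-b is evicted
```

Because the interface is limited to `get` and `put`, the eviction policy can be changed without touching the scheduling logic that consumes the cache.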

4.2.3. Cache Content

Our score reuse mechanism maintains a cache that stores ranked node candidates for each class of similar Pods. Intuitively, each cache entry can be viewed as a pre-computed ranking table for a specific Pod equivalence class, together with minimal metadata to ensure that reusing this ranking remains safe under cluster dynamics.

Cache Key

Each entry is indexed by a pod-equivalence signature $k = ScoreKey (p)$, which summarizes scheduling-relevant Pod attributes (resource requests and constraint hashes). Only Pods that produce the same ScoreKey share the same cache entry.

Cache Entry Structure

For a given key k, the cache stores an entry:

E_{k} = (feasibleHash, L, initialN, popped, ts),

(7)

where:

feasibleHash is a fingerprint of the current feasible node set produced by the Filter stage (computed by hashing the sorted feasible node names). It binds the cached ranking to the feasibility context under which it was computed, and prevents reusing rankings when feasibility changes.
L is the core cached content: an ordered list of scored node records sorted by descending total score. Each element $r_{i} \in L$ corresponds to one feasible node and stores:

$r_{i} = (n_{i}, s_{i}, \mathbf{s}_{i}),$

(8)

where $n_{i}$ is the node identity (e.g., node name), $s_{i}$ is the final aggregated score used for ranking, and $\mathbf{s}_{i}$ (optional) is a vector of per-plugin scores (useful for debugging and analysis but not required for selection). The list is sorted as:

$L = {sort}_{\downarrow} (\{ r_{i} \}),$

(9)

primarily by $s_{i}$ (descending). If multiple nodes have the same total score, we apply a deterministic tie-breaker on node name to ensure stable, repeatable behavior.
initialN records the size of L at insertion time, i.e., $initialN = | L |$ when the entry is seeded. This value is used to compute a consumption ratio for bounded refresh.
popped counts how many candidates have already been consumed from the front of the list since the last recomputation. It reflects how far the cache entry has progressed through its ranking.
ts (optional) is a timestamp for the last recomputation, used for diagnostics and performance measurement (not strictly required by the algorithm).

Interpretation: A Reusable “Ranking Table” with a Moving Pointer

The list L represents a reusable ranking of feasible nodes for ScoreKey k. Each successful scheduling decision consumes the current best candidate and advances the entry by removing the front element. Therefore, $E_{k}$ behaves like a ranking table with a moving pointer: the scheduler returns the best remaining node without recomputing scores, while naturally rotating among high-quality nodes during a burst of similar Pods.

Seeding on Miss (What Gets Inserted)

On a cache miss or after invalidation, the scheduler executes the standard Kubernetes scoring pipeline once over the current feasible node set F, producing scored records $\{ r_{i} \}$. These records are then sorted into L and inserted into the cache together with the current feasibleHash, setting $popped = 0$ and $initialN = | L |$.

Safe Reuse and Bounded Staleness (Why Metadata Is Needed)

Reusing a cached ranking is allowed only when the cached feasibility context matches the current one, i.e., when feasibleHash is unchanged. In addition, to bound staleness under changing resource usage, we periodically force recomputation based on how much of the ranking has been consumed. Specifically, we refresh the entry when

\frac{popped}{initialN} > \tau,

(10)

where $\tau$ is a configurable threshold (default $\tau = 0.5$). This ensures that the cache does not persist indefinitely and remains responsive to cluster dynamics.
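As a worked example of the refresh rule, the following Python sketch counts how many Pods one seeded ranking can serve before the consumed fraction exceeds the threshold. It assumes that the check is evaluated after every pop and that feasibleHash stays unchanged throughout. Both are simplifications, since bindings during a burst may well alter the feasible set earlier.

```python
def pods_served_per_seed(initial_n: int, tau: float = 0.5) -> int:
    """Pods placed from one seeded ranking before a refresh is triggered.

    Assumes the test popped / initialN > tau runs after each pop and that
    the feasible node set does not change in the meantime.
    """
    if initial_n < 1:
        raise ValueError("initial_n must be at least 1")
    popped = 0
    while popped < initial_n:          # stops when the list is exhausted
        popped += 1                    # one Pod consumes the current Top-1
        if popped / initial_n > tau:   # refresh condition from the text
            break
    return popped


served_8 = pods_served_per_seed(8)            # 5/8 is the first fraction above 0.5
served_5 = pods_served_per_seed(5)            # 3/5 is the first fraction above 0.5
served_all = pods_served_per_seed(4, tau=1.0) # never exceeds 1.0: list runs empty
```

Under these assumptions, a ranking over 8 feasible nodes with the default threshold serves 5 Pods per full scoring pass, so the number of full scoring passes in a long burst shrinks by roughly that factor.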

4.2.4. Scheduling Logic

Our scheduler preserves the standard Kubernetes scheduling architecture and modifies only the decision-making within the scheduling cycle. Each scheduling attempt consists of two phases: (i) a scheduling cycle, which computes a scheduling decision (i.e., a SuggestedHost), and (ii) a binding cycle, which commits that decision through the Kubernetes binding pipeline. In ScheduleOne, the scheduler dequeues one pending Pod, builds the corresponding scheduling framework, runs the scheduling cycle to determine the target node, and then executes the binding cycle asynchronously. The binding cycle follows the default Kubernetes sequence (Permit→PreBind→Bind→PostBind). If binding fails, the scheduler performs the standard cleanup (e.g., Unreserve/ForgetPod when applicable) and re-queues the Pod according to the default failure handling logic. Our optimization does not bypass any of these standard safety checks; instead, it accelerates the expensive part of the scheduling cycle by reusing scoring results when the workload exhibits repeated “similar” Pods.

(1): Feasible node computation (Filter stage remains unchanged).

For each Pod p, we first take a fresh scheduling snapshot and execute the standard filtering pipeline to compute the feasible node set F. This includes the normal Kubernetes predicate chain (e.g., PreFilter/Filter plugins) and any configured extenders. The result is a set (or list) of nodes that satisfy all hard constraints for p given the current cluster state. If $| F | = 0$, the Pod is declared unschedulable for the current snapshot. If $| F | = 1$, we immediately select that single node as the scheduling decision. Therefore, feasibility is always computed online and is never approximated by cached results.

(2): Similarity modeling with ScoreKey.

When multiple feasible nodes exist ($| F | > 1$), the default scheduler would execute the full Score pipeline over all nodes in F. Our method exploits the observation that bursty scale-out often generates Pods whose scheduling-relevant specifications are identical or highly repetitive (e.g., replicated Deployments/Jobs). To safely reuse scoring outcomes across such Pods, we construct a ScoreKey for each Pod. The ScoreKey is a compact signature of scheduling-relevant Pod features and is used as the cache key for score reuse. Concretely, the key includes:

The scheduler/profile identity (to avoid cross-profile interference),
Priority and runtime class information,
Aggregated resource requests (CPU, memory, GPU) computed from containers and overhead, and
Hashed representations of constraint-bearing fields (e.g., nodeSelector, affinity, tolerations, and topologySpreadConstraints).

Two Pods are considered “similar” if and only if they produce the same ScoreKey. This strict equality design prioritizes correctness: if any relevant constraint differs, the key changes and score reuse is disabled for that pair of Pods.

(3): Feasibility-context binding via feasibleHash.

Even for identical Pods, score reuse can become unsafe when the feasibility context changes. After several bindings, some nodes may no longer satisfy resource constraints, or cluster events (e.g., taints, node updates, or capacity changes) may alter which nodes are feasible. To capture this context efficiently, we compute a feasibility-set fingerprint feasibleHash from the current feasible node set F. We hash the sorted list of feasible node names to obtain an order-independent signature. This feasibleHash is stored together with the cached score ranking and serves as a staleness guard: cached rankings are reused only when the current feasibleHash matches the stored one.

(4): Cache-aware scoring with rotating Top-1 selection.

For each scheduling cycle with $| F | > 1$, we invoke a cache-aware Top-1 routine rather than scoring all nodes every time. The routine operates as follows:

Compute $k = ScoreKey (p)$ and $h = feasibleHash (F)$ .
Query the score cache for an entry associated with k that is valid under feasibility context h.
Cache hit: If a valid entry exists, it contains a ranked list L of nodes sorted by their total framework score (descending). We return only the first element of L (the current Top-1 candidate) and remove it from the list (“pop”). This yields a pop-and-advance behavior: subsequent Pods with the same ScoreKey rotate to the next-best candidates without re-running the full scoring pipeline.
Cache miss or invalid entry: If no valid cache entry exists (first time for this key, or entry expired/invalidated), we execute the standard Kubernetes scoring pipeline once over all nodes in F (i.e., the original scoring function including enabled Score plugins and normalization). The resulting list is sorted into a ranked list L, stored in the cache together with feasibility context h, and then we pop and return the Top-1 element as the scheduling decision.

This design reduces scheduling overhead after warm-up because repeated similar Pods avoid repeated full scoring, while still producing a deterministic top candidate under the same feasibility context.

(5): Invalidation and refresh (bounded staleness).

Because node scores can drift as resources are consumed, cache reuse is deliberately bounded by explicit invalidation and refresh rules. A cached entry for key k is discarded and recomputed when any of the following conditions holds:

Feasible-set mismatch: If the stored feasibleHash differs from the current h, the cached ranking is considered stale and is immediately invalidated. This ensures that we do not select a node that is no longer feasible or ignore newly feasible nodes.
Consumption-based refresh: Each cache entry tracks how many nodes have been popped relative to the initial list size. When the consumed fraction exceeds a configured threshold (default: refresh after more than 50% of the cached nodes have been popped), we invalidate the entry so that the next scheduling attempt for the same ScoreKey triggers a full recomputation. This policy limits long-lived reuse and keeps rankings responsive to cluster dynamics.
Empty ranking: If the cached list is exhausted, the entry is removed and recomputed upon the next request.

Together, these rules implement a practical trade-off between performance (fewer full scoring passes) and timeliness (bounded reuse).

(6): Concurrency control.

Kubernetes scheduling can involve concurrent workers. To avoid redundant recomputation under concurrent cache misses, we employ an in-flight guard: at most one recomputation per ScoreKey proceeds at a time, while other goroutines wait for the result. This ensures that the cache remains consistent and that concurrent similar Pods benefit from a single scoring pass rather than duplicating expensive work.
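The in-flight guard can be sketched in Python with threads standing in for goroutines. This is an illustrative pattern, not the authors' code: the `InFlightGuard` class, its `waiters` helper (added only so the demonstration is deterministic), and the placeholder scoring function are all hypothetical.

```python
import threading
import time
from typing import Any, Callable, Dict, Hashable, List, Optional


class _Call:
    """State shared by the leader and the waiters of one in-flight computation."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[Exception] = None
        self.waiters = 0


class InFlightGuard:
    """Allows at most one recomputation per key; other callers wait and reuse it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Dict[Hashable, _Call] = {}

    def waiters(self, key: Hashable) -> int:
        with self._lock:
            call = self._pending.get(key)
            return call.waiters if call is not None else 0

    def do(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._pending.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._pending[key] = call
            else:
                call.waiters += 1
        if leader:
            try:
                call.result = compute()        # the single expensive scoring pass
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._pending[key]
                call.done.set()                # publish the result to all waiters
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result


release = threading.Event()
scoring_passes: List[int] = []


def expensive_scoring() -> List[str]:
    """Placeholder for a full scoring pass over the feasible nodes."""
    scoring_passes.append(1)
    release.wait()                             # hold the leader until waiters queue up
    return ["node-a", "node-b"]


guard = InFlightGuard()
results: List[List[str]] = []
results_lock = threading.Lock()


def worker() -> None:
    ranking = guard.do("score-key-1", expensive_scoring)
    with results_lock:
        results.append(ranking)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
while guard.waiters("score-key-1") < 3:        # wait until three workers are queued
    time.sleep(0.001)
release.set()
for t in threads:
    t.join()
```

In this run four workers request the same ScoreKey concurrently, yet the scoring function executes once and all four receive the same ranking.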

(7): Output and integration with the default pipeline.

The cache-aware Top-1 routine returns a length-one priority list, from which the scheduler selects the SuggestedHost. The remainder of the scheduling cycle (e.g., Reserve/Permit) and the entire binding cycle remain unchanged. Therefore, our method is an integration-level optimization of the Score stage: it reuses and rotates score rankings but does not modify the semantics of filtering, binding, or plugin execution.

To be more specific, the pseudocode of our cached scheduler can be outlined as follows (Algorithm 1):

Algorithm 1: Rotating Score-Cache Reuse for Similar Pods
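Since the algorithm listing is not reproduced in this text, the following Python sketch reconstructs the routine from steps (1) to (5) above. It is an approximation under stated assumptions, not the authors' listing: `score_fn` is a hypothetical stand-in for the standard Score pipeline, the refresh check is assumed to run after each pop, and MaxEntries and the in-flight guard are omitted for brevity.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Optional, Tuple

# Stand-in for the standard Score pipeline: feasible nodes -> total score per node.
ScoreFn = Callable[[List[str]], Dict[str, float]]


@dataclass
class Entry:
    feasible_hash: str
    ranking: List[Tuple[str, float]]   # sorted by score descending, then node name
    initial_n: int
    popped: int = 0


def feasible_hash(nodes: List[str]) -> str:
    return hashlib.sha256("\n".join(sorted(nodes)).encode("utf-8")).hexdigest()


class RotatingScoreCache:
    def __init__(self, score_fn: ScoreFn, tau: float = 0.5) -> None:
        self.score_fn = score_fn
        self.tau = tau
        self.entries: Dict[Hashable, Entry] = {}
        self.full_scorings = 0         # counts full passes, for illustration only

    def top1(self, key: Hashable, feasible: List[str]) -> Optional[str]:
        if not feasible:
            return None                # unschedulable for this snapshot
        if len(feasible) == 1:
            return feasible[0]         # single candidate: no scoring needed
        h = feasible_hash(feasible)
        entry = self.entries.get(key)
        if entry is None or entry.feasible_hash != h or not entry.ranking:
            # Cache miss or stale entry: run one full scoring pass and seed.
            scores = self.score_fn(feasible)
            self.full_scorings += 1
            ranking = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
            entry = Entry(h, ranking, len(ranking))
            self.entries[key] = entry
        node, _ = entry.ranking.pop(0)  # pop-and-advance Top-1
        entry.popped += 1
        if not entry.ranking or entry.popped / entry.initial_n > self.tau:
            del self.entries[key]       # bounded staleness: force a refresh
        return node


# Hypothetical static scores for four nodes.
static = {"n1": 90.0, "n2": 80.0, "n3": 70.0, "n4": 60.0}
cache = RotatingScoreCache(lambda nodes: {n: static[n] for n in nodes})
placements = [cache.top1("score-key-1", list(static)) for _ in range(6)]
```

With four feasible nodes and the default threshold, six similar Pods are placed on n1, n2, n3, n1, n2, n3 using two full scoring passes instead of six. In a real cluster the second pass would see updated resource usage, and filtering would still run for every Pod.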

4.2.5. Correctness and Freshness Considerations and Safety Mechanisms

Score reuse can be unsafe if node feasibility or relative ranking changes across scheduling events. We therefore implement two safeguards.

(i): Feasible-set binding: Cache reuse is permitted only when the hash of the current feasible node set matches the hash stored with the cache entry; otherwise, the entry is dropped and recomputed. This ensures that nodes that become infeasible due to resource consumption or state changes are not selected from stale rankings.
(ii): Consumption-based refresh: Even if feasibility remains unchanged, score ordering may drift as resources change. To limit staleness, we recompute the cached ranking after a configurable fraction of the list has been consumed (default: refresh after more than 50% of the cached nodes have been popped). This provides a practical trade-off between reuse benefits and responsiveness to cluster dynamics.

4.3. Related Work: Result Reuse in Kubernetes Scheduling

Equivalence Cache (eCache). Kubernetes previously experimented with an equivalence-class cache (eCache) to speed up scheduling by reusing predicate (feasibility) evaluation results for Pods with the same scheduling requirements. However, the implementation introduced practical drawbacks such as lock contention and non-trivial invalidation/maintenance complexity, and it was ultimately removed with plans to redesign the mechanism [28]. Therefore, eCache did not evolve into a stable, widely adopted approach for reusing scoring results in fast autoscaling bursts.

KEP-5598 Opportunistic Batching. Recent Kubernetes releases introduce opportunistic batching to reduce redundant work when scheduling many compatible Pods [29,30,31]. The scheduler identifies Pods with an equivalent scheduling signature and can reuse filtering and scoring results across back-to-back scheduling cycles, with the official documentation noting a short cache lifetime (e.g., expiring after a fixed interval) and several current applicability restrictions (e.g., excluding certain constraints and invalidating the cache when placing more than one Pod per node) [30].

This work. Our approach shares the same high-level goal, avoiding redundant scoring during rapid scale-up, but differs in how reuse is performed and how staleness is controlled. Instead of a short-lived batch cache, we maintain a per-ScoreKey score cache that stores a ranked list of node candidates (scored under the standard framework) together with a feasibility-context fingerprint. On each cache hit, the scheduler returns the current Top-1 node and removes it from the cached ranking (pop-and-advance), enabling repeated similar Pods to reuse precomputed scores while naturally rotating placements across high-scoring nodes. To preserve correctness under cluster dynamics, the cache is invalidated and recomputed when the feasible node set changes, or when the consumed portion of the cached ranking exceeds a threshold (default $\tau = 0.5$).

5. Experiment

To evaluate the performance of the custom cached scheduler developed in this research, a comprehensive experimental design is outlined below. The aim of these experiments is to compare the proposed scheduler with the default Kubernetes scheduler and with schedulers widely used in industry and academia in terms of efficiency, latency, throughput, and performance in a real HPA scenario.

5.1. Experimental Environment

5.1.1. Cloud Platform and Cluster Provisioning

All experiments were conducted on Google Kubernetes Engine (GKE) using a managed, production-grade Kubernetes control plane. The cluster was created in the Tokyo region to reduce network latency and to align with the availability of compute quotas during the experiment period.

Cloud provider: Google Cloud Platform (GCP);
Managed Kubernetes: Google Kubernetes Engine (GKE);
Cluster name: sched-exp;
Location (zone): asia-northeast1-a;
Kubernetes version (nodes): v1.33.5-gke.2019000.

The cluster was accessed and operated through standard Kubernetes tooling (e.g., kubectl) and the Google Cloud SDK (gcloud). Experiments were executed by applying Kubernetes manifests and running automation scripts from a client machine (Cloud Shell) that had authenticated access to the target cluster.

5.1.2. Node Pool Configuration

A dedicated node pool was used as the primary execution substrate for the benchmarks. This pool was configured with a general-purpose VM type and a moderate boot disk size to keep the environment reproducible under quota constraints (the configuration is presented in Table 1).

Autoscaling Configuration

To support autoscaling-oriented scenarios, node pool autoscaling was enabled with an intended minimum of 5 nodes and an upper bound of 8 nodes (bounded by regional quota). Importantly, GKE cluster autoscaling is triggered by unschedulable pods (i.e., pending pods whose resource requests cannot be placed), rather than by request traffic directly. Therefore, scenarios that do not create pending pods may not trigger node scale-out even under high application load.

5.1.3. Schedulers Under Comparison

The evaluation compares four schedulers: the default Kubernetes scheduler, two representative alternative schedulers (YuniKorn and Koordinator), and the proposed custom scheduler (cache-based score reuse). All schedulers were deployed as in-cluster components and selected via the schedulerName field in pod specifications (Table 2).

Apache YuniKorn (Baseline Scheduler)

Apache YuniKorn is a standalone scheduler designed for Kubernetes that targets multi-tenant and batch-style workloads by introducing a queue-centric scheduling model. Conceptually, it separates scheduling into a scheduler core plus a Kubernetes-facing shim: the core maintains scheduling state and policies, while the shim translates Kubernetes objects (Pods/Namespaces) into YuniKorn scheduling entities and feeds scheduling decisions back to Kubernetes [32,33]. Compared with the default Kubernetes scheduler (which primarily optimizes per-pod placement via filtering/scoring), YuniKorn emphasizes:

(i): Hierarchical queues and fairness (e.g., enforcing capacity/fair-share between tenants or workloads);
(ii): Placement rules to map applications/users/namespaces into specific queues in a policy-driven way;
(iii): Workload-level semantics (e.g., treating a set of Pods as one “application” for queueing and scheduling) [32,33].

In our evaluation, YuniKorn serves as a baseline representing a production-grade alternative scheduler with explicit multi-tenant policy support. Deployment is typically done by installing the YuniKorn scheduler components (core + k8s shim) into the cluster, and benchmark workloads can target it by setting spec.schedulerName to the YuniKorn scheduler name (e.g., yunikorn) [33].

Koordinator (Baseline Scheduling & QoS System)

Koordinator is a Kubernetes-oriented QoS-based scheduling and resource orchestration system that goes beyond “where to place a pod” and aims to improve cluster-level efficiency and workload QoS under mixed workloads (latency-sensitive services plus best-effort/batch jobs). At a high level, Koordinator commonly includes a control-plane that manages resource policies and scheduling decisions (e.g., manager/scheduler-related components) and node-side agents that enforce runtime resource governance [34]. A distinguishing feature is that Koordinator explicitly incorporates load-aware/interference-aware signals into scheduling decisions (i.e., scheduling with awareness of node utilization/pressure rather than only static requests/limits), which is useful when evaluating “scheduling-duration-only” optimizations against a baseline that targets runtime QoS and stability [35]. Koordinator therefore acts as a complementary baseline: while our scheduler focuses on reducing scheduling latency via reuse/caching, Koordinator represents a broader class of systems that trade additional control-plane/node-side logic for better utilization balance and QoS isolation under contention.

In our experiments, we deploy Koordinator as an additional scheduler baseline (together with its required control components) and direct benchmark workloads to it via spec.schedulerName (e.g., koord-scheduler) when applicable.

5.1.4. Proposed Cache Scheduler Deployment

Our scheduler (cache-scheduler) is implemented by modifying the upstream kube-scheduler source code of Kubernetes v1.33.5, which matches the Kubernetes version used by our GKE cluster. Therefore, the required configuration and deployment artifacts are essentially the same as those of a standard “out-of-tree” custom scheduler deployment: we build a scheduler binary from the v1.33.5 codebase, package it as a container image, push it to Docker Hub, deploy it as an additional scheduler instance, and route selected Pods to it using spec.schedulerName. The only behavioral difference lies in our code-level changes (score reuse and rotation), whose source code is openly available on GitHub [36]; the control-plane integration (RBAC, leader election, scheduler configuration) follows the conventional kube-scheduler deployment model.

Required Manifests and Concrete Deployment Steps

To run cache-scheduler alongside the default scheduler, we use four Kubernetes manifests and apply them in a fixed order. The artifacts are:

(i): A dedicated Namespace for isolating scheduler resources;
(ii): A ConfigMap containing a KubeSchedulerConfiguration that registers the scheduler name;
(iii): ServiceAccount plus RBAC bindings (cluster-scoped reads and leader-election permissions);
(iv): A Deployment that launches the customized scheduler image with the mounted configuration.

Finally, workloads opt in by setting spec.schedulerName.

(1)

Namespace: create an isolated namespace to host all resources of the custom scheduler (e.g., scheduler-system). This keeps the deployment independent from kube-system and avoids name collisions with default components.

(2)

Scheduler configuration (ConfigMap): store a v1 scheduler component config (kubescheduler.config.k8s.io/v1). The key fields are:

profiles[0].schedulerName=cache-scheduler: Binds this scheduler instance to Pods with the same spec.schedulerName.
leaderElection.leaderElect=true: Enables leader election (safe even for single replica; prevents dual-active scheduling if scaled).
leaderElection.resourceNamespace=scheduler-system and resourceName=cache-scheduler: The lock (Lease) is created and maintained in the custom namespace.

(3)

RBAC (ServiceAccount + bindings): Run the scheduler under a dedicated Service Account (scheduler-system/cache-scheduler). In Kubernetes v1.33.5, a scheduler requires:

Core scheduler permissions (watch/list pods, nodes, PV/PVC, create bindings, update Pod status, etc.). In practice, this is commonly satisfied by binding the service account to the built-in ClusterRole system:kube-scheduler (when present in the cluster).
Informer read permissions observed necessary in our GKE setup:
(a)
Listing/watching StorageClass objects at cluster scope;
(b)
Listing/watching ConfigMap objects, in particular access to the control-plane kube-system/extension-apiserver-authentication configmap which is used by client-go authentication plumbing.
Leader election permissions in the scheduler namespace: get/create/update/patch on leases.coordination.k8s.io for the lock named cache-scheduler.

(4)

Scheduler Deployment: Deploy the customized scheduler as a single-replica Deployment in scheduler-system. The key fields are:

serviceAccountName: cache-scheduler to attach the RBAC identity.
image: mrboen123/cache-scheduler:<tag> pointing to our v1.33.5-based build. The <tag> used in the evaluation is v1.33.5-r2.
args: Start with --config=/etc/kubernetes/scheduler-config.yaml and a verbosity level (e.g., --v=3).
volumeMounts: Mount the config file from the ConfigMap at the path referenced by --config.

(5)

Workload opt-in: For any workload that should be scheduled by cache-scheduler, set:

spec:
  schedulerName: cache-scheduler

This single field ensures the default scheduler ignores the Pod, and our custom scheduler becomes responsible for scheduling it.

(6)

Apply order and verification commands: The concrete deployment procedure is:

1.: Apply namespace, configuration, RBAC, and deployment manifests in order:
kubectl apply -f 01-namespace.yaml
kubectl apply -f 02-configmap.yaml
kubectl apply -f 03-rbac.yaml
kubectl apply -f 04-deployment.yaml
2.: Wait for the scheduler Pod to become Running:
kubectl -n scheduler-system get pods -l app=cache-scheduler -o wide
3.: Check scheduler logs for successful startup and leader acquisition (no RBAC forbidden errors):
kubectl -n scheduler-system logs deploy/cache-scheduler --tail=200
4.: Launch an opt-in smoke-test Pod and confirm it is scheduled (a Scheduled event is generated and a node is assigned):
kubectl -n sched-test describe pod <pod-name>

Files used in our experiments. We maintain the following manifest files as part of the experimental artifact:
01-namespace.yaml (namespace scheduler-system);
02-configmap.yaml (KubeSchedulerConfiguration with schedulerName=cache-scheduler);
03-rbac.yaml (service account and RBAC including leases permissions);
04-deployment.yaml (deployment of mrboen123/cache-scheduler:<tag> mounting the configuration).

In addition, benchmark workloads include a patch or template snippet that sets spec.schedulerName: cache-scheduler for the Pods under test.
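As an illustration of steps (2) and (5), the following minimal Python sketch builds the scheduler ConfigMap and an opt-in smoke-test Pod as JSON documents, which kubectl apply -f accepts in the same way as YAML. The configuration fields are the ones listed above; the ConfigMap name, the data key scheduler-config.yaml, the Pod name smoke-test, and all helper names are assumptions made for illustration and are not taken from the published manifests.

```python
import json

SCHEDULER_NAME = "cache-scheduler"
NAMESPACE = "scheduler-system"


def scheduler_config() -> dict:
    """KubeSchedulerConfiguration carrying the key fields of step (2)."""
    return {
        "apiVersion": "kubescheduler.config.k8s.io/v1",
        "kind": "KubeSchedulerConfiguration",
        "leaderElection": {
            "leaderElect": True,
            "resourceNamespace": NAMESPACE,
            "resourceName": SCHEDULER_NAME,
        },
        "profiles": [{"schedulerName": SCHEDULER_NAME}],
    }


def config_map(config_key: str = "scheduler-config.yaml") -> dict:
    """Wrap the configuration in a ConfigMap (JSON text is also valid YAML).

    The ConfigMap name and the data key are assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": SCHEDULER_NAME, "namespace": NAMESPACE},
        "data": {config_key: json.dumps(scheduler_config(), indent=2)},
    }


def opt_in_pod(name: str = "smoke-test", namespace: str = "sched-test") -> dict:
    """Smoke-test Pod that opts in via spec.schedulerName (step (5))."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedulerName": SCHEDULER_NAME,
            "containers": [
                {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(config_map(), indent=2))
    print(json.dumps(opt_in_pod(), indent=2))
```

Generating manifests from one place keeps the scheduler name consistent between the profile, the leader-election lock, and the workloads that opt in; a mismatch between profiles[0].schedulerName and spec.schedulerName leaves Pods pending.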

5.2. Experiment Design

5.2.1. Scheduling Performance Benchmark Design

Objective

We designed a reproducible benchmark to compare the scheduling behavior of our custom scheduler against multiple baselines under bursty scale-out. The benchmark measures (i) scheduling latency, (ii) end-to-end readiness latency (best-effort), (iii) placement distribution and fairness, and (iv) coarse-grained resource consumption snapshots, using the same workload template and cluster environment across schedulers. The benchmark is implemented as a script invoked as N=<PodNumber> SCHED=<SchedulerName> RUNID=<RunId> ./bench_plus_deploy.sh.

Controlled Workload and Burst Generation

Each trial deploys a single Deployment whose replica count is scaled from 0 to N (default $N = 200$) to trigger a burst of pending Pods. The Pod template uses a minimal pause container (image registry.k8s.io/pause:3.9) with small resource requests (cpu: 10m, memory: 16Mi) to minimize runtime-side noise and emphasize scheduler-side overhead. The terminationGracePeriodSeconds is set to 0 to accelerate cleanup between trials.

Scheduler Selection and Run Isolation

To evaluate different schedulers with the same workload, the benchmark parameterizes the schedulerName of the Pod template: if a scheduler name is provided, it is injected into the Deployment; otherwise, the field is removed so that Pods use the default scheduler. Each trial is assigned a unique run identifier RUNID, and Pods are labeled with bench-run=RUNID and bench-scheduler={name} to prevent mixing samples across runs and to simplify post-processing. Additionally, an applicationId label is attached for YuniKorn; it is harmless for the other schedulers.
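The parameterization described above can be sketched in Python as a pure function over a Deployment manifest. This is an illustrative sketch, not the benchmark script itself: the function name prepare_deployment, the label value "default" used when no scheduler is given, and the choice of RUNID as the applicationId value are all assumptions.

```python
import copy
from typing import Optional


def prepare_deployment(template: dict, run_id: str,
                       scheduler: Optional[str]) -> dict:
    """Return a copy of the Deployment prepared for one benchmark trial."""
    dep = copy.deepcopy(template)  # never mutate the shared template
    pod_tpl = dep["spec"]["template"]
    labels = pod_tpl.setdefault("metadata", {}).setdefault("labels", {})

    # Run-isolation labels used later to select only this trial's Pods.
    labels["bench-run"] = run_id
    labels["bench-scheduler"] = scheduler or "default"  # assumed value
    # Label consumed by YuniKorn; other schedulers ignore it.
    labels["applicationId"] = run_id  # assumed value

    if scheduler:
        pod_tpl["spec"]["schedulerName"] = scheduler
    else:
        # Removing the field lets the API server default it,
        # so the default scheduler picks up the Pods.
        pod_tpl["spec"].pop("schedulerName", None)
    return dep
```

Because the labels live in the Pod template, changing RUNID also changes the template hash, which is what makes the Deployment controller create a new ReplicaSet for each trial.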

Trial Procedure (Cleanup → Burst → Wait Conditions → Export)

Each trial follows the same sequence:

Namespace and Deployment preparation: Create the namespace if needed; apply/update the Deployment manifest with replicas initially set to 0.
Forced refresh and cleanup: Patch the Pod template labels with a new RUNID to force the controller to create a new ReplicaSet; scale the Deployment to 0 and wait for old Pods to be deleted.
Burst scale-out: Scale replicas to N; poll until at least N Pods with bench-run=RUNID are observed to handle controller creation delays.
Synchronization points: Wait until all Pods satisfy the PodScheduled condition. Then wait for the Ready condition as a best-effort signal, bounded by a timeout.
Data export: Export all Pods of the run as JSON (pods_RUNID.json). If the Metrics Server is available, also snapshot node-level resource usage via kubectl top nodes (topnodes_RUNID.txt).

Latency Metrics

We compute per-Pod latency metrics from Kubernetes timestamps available in Pod objects, without instrumenting the scheduler:

Scheduling latency $T_{sched}$ :
PodScheduled.lastTransitionTime − metadata.creationTimestamp.
Ready latency $T_{ready}$ :
Ready.lastTransitionTime − metadata.creationTimestamp.
Post-scheduling latency $T_{post}$ :
Ready.lastTransitionTime − PodScheduled.lastTransitionTime.

For each metric, we report summary statistics: the mean, percentiles (p50/p90/p95/p99), and the maximum value. Pods missing the PodScheduled or Ready condition are counted rather than silently dropped, to avoid bias.
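The three latency definitions can be computed directly from an exported Pod list. The sketch below assumes the standard Pod JSON layout (metadata.creationTimestamp and status.conditions[].lastTransitionTime in RFC 3339 form); the helper names and the nearest-rank percentile rule are assumptions, since the post-processing script's exact percentile method is not stated.

```python
import math
from datetime import datetime
from typing import List, Optional


def parse_ts(ts: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2026-01-01T00:00:05Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def condition_time(pod: dict, cond_type: str) -> Optional[datetime]:
    """Transition time of a condition that is currently True, else None."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type and cond.get("status") == "True":
            return parse_ts(cond["lastTransitionTime"])
    return None


def seconds_between(start: Optional[datetime],
                    end: Optional[datetime]) -> Optional[float]:
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


def pod_latencies(pod: dict) -> dict:
    """T_sched, T_ready and T_post in seconds (None if a condition is missing)."""
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    sched = condition_time(pod, "PodScheduled")
    ready = condition_time(pod, "Ready")
    return {
        "sched": seconds_between(created, sched),
        "ready": seconds_between(created, ready),
        "post": seconds_between(sched, ready),
    }


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(pods: List[dict], metric: str) -> dict:
    """Mean, percentiles, max, and the count of Pods missing the metric."""
    values = [pod_latencies(p)[metric] for p in pods]
    present = [v for v in values if v is not None]
    summary = {"missing": len(values) - len(present)}
    if present:
        summary["mean"] = sum(present) / len(present)
        summary["max"] = max(present)
        for q in (50, 90, 95, 99):
            summary[f"p{q}"] = percentile(present, q)
    return summary
```

A typical use would load pods_RUNID.json, pass its items list to summarize once per metric, and report the missing count next to the statistics.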

Placement Distribution and QoS-Oriented Indicators

To characterize distribution quality (i.e., whether placements concentrate on a subset of nodes), we compute:

Pods per node: Min/max/mean, standard deviation, and coefficient of variation (CV).
Fairness: Jain’s fairness index over (i) Pod counts per node, (ii) aggregated requested CPU per node, and (iii) aggregated requested memory per node. Higher Jain values indicate more even distribution.
Skew inspection: The top-5 nodes with the highest Pod counts are printed for quick diagnosis.

These indicators provide an interpretable proxy for “distribution QoS” under burst scheduling, complementary to latency metrics.
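For reference, both distribution indicators have short closed forms. The sketch below uses the usual definition of Jain's index, $J = (\sum_i x_i)^2 / (n \sum_i x_i^2)$, and the population standard deviation for CV; the function names, the choice of population (rather than sample) standard deviation, and the handling of all-zero input are assumptions.

```python
import math
from typing import Sequence


def jain_index(values: Sequence[float]) -> float:
    """Jain's fairness index: 1.0 is perfectly even, 1/n is fully concentrated."""
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    if sum_sq == 0:
        return 1.0  # assumed convention: an empty allocation is trivially even
    return (total * total) / (len(values) * sum_sq)


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Population standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


if __name__ == "__main__":
    pods_per_node = [44, 41, 39, 40, 36]  # hypothetical counts for 5 workers
    print(jain_index(pods_per_node), coefficient_of_variation(pods_per_node))
```

With the population standard deviation, the two indicators are linked by $J = 1/(1 + \mathrm{CV}^2)$, so a CV of 0.091 corresponds to a Jain index of about 0.992. Nodes that received no Pods should be included as zeros; otherwise both indicators overstate evenness.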

Repeatability and Comparison Across Schedulers

We repeat the above trial multiple times per scheduler (e.g., default, Koordinator, YuniKorn, and our scheduler), each time using a fresh RUNID to isolate samples. Final results are compared using the same exported artifacts and the same post-processing logic, ensuring consistent measurement and fair cross-scheduler comparison.

The burst scheduling benchmark is presented in Algorithm 2.

5.2.2. HPA Benchmark Design

Purpose

This experiment evaluates scheduler performance under an HPA-driven scale-out scenario by measuring how quickly the service returns to a target quality-of-service (QoS) after an induced overload. The benchmark is fully automated by HPA.sh and produces run-isolated artifacts for reproducibility.

Workload and Autoscaling Configuration

We deploy the standard CPU-bound web workload php-apache (registry.k8s.io/hpa-example) as a Kubernetes Deployment, exposed via a ClusterIP Service. The Horizontal Pod Autoscaler (HPA) is configured using autoscaling/v2 with: (i) minReplicas and maxReplicas (default 3 and 20), and (ii) a CPU utilization target averageUtilization (default 50%). To compare schedulers, the Pod template explicitly sets schedulerName to the scheduler under test via SCHEDULER_NAME.

Algorithm 2: Burst Scheduling Benchmark

Run Isolation and Artifacts

Each run is identified by a sanitized RUNID and stored under:

OUTDIR/NS/RUNID/TS,

where TS is a UTC timestamp. The script exports applied manifests, Fortio logs (fortio.log), per-stage CSVs (find.csv, hold.csv), HPA and Pod timelines (hpa_timeline.csv, pods_timeline.csv), and post-run snapshots (state_after.txt, events.txt, optional top_nodes.txt/top_pods.txt).

Load Generation and SLA Definition

Client load is generated by Fortio (fortio/fortio) executed as in-cluster Kubernetes Jobs to reduce external network noise. Each Fortio job reports request count, achieved QPS, p99 latency, and an error percentage derived from non-200 responses. A window is considered SLA OK iff:

$$p99 \leq \texttt{SLA\_P99} \;\land\; \mathrm{err}\% \leq \texttt{SLA\_ERR\_PCT}.$$

In our updated configuration we tighten the latency SLA to SLA_P99 = 1.0 s (with SLA_ERR_PCT defaulting to 1% unless overridden).

Two-Stage Protocol: FIND Then HOLD

The benchmark follows a two-stage protocol:

(1): FIND (threshold discovery). Concurrency is ramped from FIND_START to FIND_MAX in steps of FIND_STEP, and each level is exercised for FIND_DURATION. The first concurrency $C^{\star}$ that violates the SLA is selected as the hold concurrency. If no violation is observed up to FIND_MAX, the script conservatively sets $C^{\star} = \texttt{FIND\_MAX}$.
(2): HOLD (time-to-recover measurement). The script applies constant stress at concurrency $C^{\star}$ for HOLD_TOTAL using fixed windows of length HOLD_WINDOW. In the updated run configuration, HOLD_TOTAL is extended to 15 min to increase the probability of observing both degradation and recovery under autoscaling dynamics. Recovery is declared after RECOVERY_STREAK consecutive SLA-OK windows following a first observed breach; in our updated configuration we set RECOVERY_STREAK=1 to measure the earliest return to the SLA target once autoscaling takes effect.
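The FIND stage amounts to a linear ramp that stops at the first SLA violation. The following Python sketch illustrates that control flow only; run_level is a hypothetical stand-in for launching one Fortio job at a given concurrency for FIND_DURATION and returning its (p99 in seconds, error percentage), and the function names are assumptions.

```python
from typing import Callable, Tuple


def sla_ok(p99_s: float, err_pct: float,
           sla_p99: float = 1.0, sla_err_pct: float = 1.0) -> bool:
    """A window is SLA-OK iff both the p99 and the error-rate bounds hold."""
    return p99_s <= sla_p99 and err_pct <= sla_err_pct


def find_hold_concurrency(run_level: Callable[[int], Tuple[float, float]],
                          find_start: int, find_max: int,
                          find_step: int) -> int:
    """Return the first concurrency that violates the SLA, else find_max."""
    concurrency = find_start
    while concurrency <= find_max:
        p99_s, err_pct = run_level(concurrency)  # hypothetical load run
        if not sla_ok(p99_s, err_pct):
            return concurrency
        concurrency += find_step
    # No violation observed: fall back conservatively to the maximum level.
    return find_max
```

For example, with a synthetic service whose p99 grows as concurrency / 8 seconds, a ramp from 2 to 20 in steps of 2 stops at concurrency 10, the first level whose p99 (1.25 s) exceeds the 1.0 s bound.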

TTR Definition and Interpretation Modes

Let $t_{0}$ be the epoch timestamp at the start of the HOLD stage. The script computes:

$$\mathrm{TTR} = t_{recover} - t_{0},$$

where $t_{recover}$ is the end time of the first SLA-OK window that satisfies the consecutive-OK requirement after an SLA breach is observed. To avoid ambiguity in cases where no breach occurs, HPA.sh reports ttr_mode in summary.txt:

ttr_mode = recovered: an SLA breach occurred and the service recovered; ttr_sec is the measured recovery time.
ttr_mode = no_degradation: no SLA breach occurred during HOLD; the script reports ttr_sec = 0 (TTR not applicable because the service never degraded).
ttr_mode = no_recovery_within_hold: an SLA breach occurred but no recovery was observed within the HOLD horizon; ttr_sec = −1 denotes right-censoring.
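The three modes can be expressed as a single pass over the HOLD windows. The sketch below is one reading of the definition above, written in Python rather than taken from HPA.sh: each window is reduced to its end time and an SLA-OK flag, and $t_{recover}$ is taken as the end of the window that completes the required streak. The function name and the input representation are assumptions.

```python
from typing import List, Tuple


def compute_ttr(t0: float, windows: List[Tuple[float, bool]],
                recovery_streak: int = 1) -> Tuple[str, float]:
    """Return (ttr_mode, ttr_sec) for chronologically ordered HOLD windows.

    Each window is (end_time_epoch, sla_ok).
    """
    breached = False
    streak = 0
    for end_time, ok in windows:
        if not ok:
            breached = True
            streak = 0  # any violation resets the consecutive-OK counter
        elif breached:
            streak += 1
            if streak >= recovery_streak:
                return "recovered", end_time - t0
    if not breached:
        return "no_degradation", 0.0
    return "no_recovery_within_hold", -1.0  # right-censored observation
```

SLA-OK windows that occur before the first breach do not count toward the streak, which is what separates no_degradation from recovered. Downstream analysis should treat ttr_sec = -1 as censored rather than averaging it with measured values.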

Timeline Collection (HPA and Pod Readiness)

Two background pollers sample system dynamics every POLL_INTERVAL (default 5 s): (i) hpa_timeline.csv logs currentReplicas and desiredReplicas, and (ii) pods_timeline.csv logs total Pods and Ready Pods for the application. These time series help attribute QoS recovery to (a) HPA decisions, (b) Pod creation and readiness progression, and (c) scheduling throughput.

Scheduling/Ready Latency for HPA-Created Pods

To connect service-level recovery to scheduler behavior, the script extracts Pods created after HOLD start ($\mathrm{creationEpoch} \geq t_{0}$) and computes: (i) scheduling latency $T_{sched}$ (creation → PodScheduled), (ii) ready latency $T_{ready}$ (creation → Ready), and (iii) post-scheduling startup $T_{startup}$ (PodScheduled → Ready). Per-Pod latencies are exported in pod_latencies.tsv, and summary statistics (min/p50/p90/p99/max) are written to pod_latency_summary.txt.

The HPA benchmark with FIND + HOLD and TTR modes is presented in Algorithm 3.

Algorithm 3: HPA Benchmark with FIND+HOLD and TTR Modes (as implemented in HPA.sh)

5.3. Experiment Result

5.3.1. Scheduling Performance Benchmark

We report results of the burst scheduling benchmark where a single Deployment is scaled from 0 to 200 replicas ($N = 200$) on a 5-worker cluster to evaluate the scheduling performance. We evaluated four schedulers: the default kube-scheduler, Koordinator, YuniKorn, and our Cache-scheduler. Each configuration is repeated three times. In all runs, all 200 Pods are successfully scheduled and become Ready (missing_sched = 0, missing_ready = 0).

Metrics

Let $T_{sched}$ denote scheduling latency (creation → PodScheduled), and $T_{ready}$ denote end-to-end readiness latency (creation → Ready). We also report placement quality using Jain’s fairness index (higher is better) and the coefficient of variation (CV) of Pods-per-node (lower is better).

Overall Comparison

Table 3 summarizes the average across three runs. Cache-scheduler achieves the lowest mean scheduling latency (801.7 ms), reducing $\bar{T}_{sched}$ by 31.8% compared to the default scheduler (1175.0 ms), and improves the scheduling tail by reducing $p99(T_{sched})$ from 2673.3 ms to 2000.0 ms (25.2%). Koordinator achieves a similar scheduling-latency reduction (816.7 ms mean), while YuniKorn exhibits substantially higher scheduling latency (1955.0 ms mean) and a much larger tail.
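As a check on the relative reductions quoted above, the arithmetic from the reported averages of the default scheduler and Cache-scheduler is:

```latex
\[
\frac{1175.0 - 801.7}{1175.0} = \frac{373.3}{1175.0} \approx 31.8\%,
\qquad
\frac{2673.3 - 2000.0}{2673.3} = \frac{673.3}{2673.3} \approx 25.2\%.
\]
```

Both figures are reductions relative to the default scheduler's value, so they express how much of the baseline latency is removed rather than a speedup factor.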

For readiness, Cache-scheduler slightly improves $\bar{T}_{ready}$ (4818.3 ms vs. 4958.3 ms for default), while $p99(T_{ready})$ remains essentially unchanged (8670.0 ms vs. 8673.3 ms). This indicates that the scheduling-stage speedup does not necessarily translate into proportional improvements in Ready tail latency, which also includes runtime-side effects (image pull, container start, networking, etc.).

Placement Quality (Distribution QoS)

Cache-scheduler preserves a near-uniform placement distribution comparable to the default scheduler (Jain $\approx 0.992$, CV $\approx 0.091$). Koordinator shows a more skewed placement (Jain 0.9806, CV 0.141). YuniKorn produces the most imbalanced placement (Jain 0.8793, CV 0.370), indicating a strong concentration of Pods on a subset of nodes in this burst scenario.

Latency comparison under burst deployment ($N = 200$) is presented in Figure 5.

Placement distribution quality is presented in Figure 6.

5.3.2. HPA Scenario Benchmark

We compare four schedulers (default kube-scheduler, Koordinator, YuniKorn, and our Cache-scheduler) under an HPA-driven scale-out workload (php-apache) using the updated benchmark configuration: SLA_P99 = 1.0 s, HOLD_TOTAL = 15 min, and RECOVERY_STREAK = 1.

The benchmark first runs a FIND stage to determine the hold concurrency $C^{\star}$ (the first concurrency that violates the SLA), then applies constant stress at $C^{\star}$ during the HOLD stage and reports time-to-recover (TTR). In this dataset, all four schedulers observe the first SLA breach already at $C^{\star} = 10$, enabling an apples-to-apples comparison under the same stress level.

Time-to-Recover (TTR)

Table 4 shows that Cache-scheduler achieves the fastest recovery (115 s), followed by YuniKorn (152 s) and Koordinator (191 s), while the default scheduler is slowest (346 s). Relative to the default scheduler, Cache-scheduler reduces TTR by 66.8% (3.01× speedup), YuniKorn by 56.1% (2.28×), and Koordinator by 44.8% (1.81×). Figure 7 visualizes this comparison.
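The percentages and speedup factors above follow from the four reported TTR values. A short Python sketch of the two formulas (the function names are illustrative):

```python
def relative_reduction_pct(baseline: float, value: float) -> float:
    """Share of the baseline that is removed, in percent."""
    return (baseline - value) / baseline * 100.0


def speedup(baseline: float, value: float) -> float:
    """How many times faster than the baseline."""
    return baseline / value


# TTR values in seconds, as reported in the text above.
ttr_s = {"default": 346, "Koordinator": 191, "YuniKorn": 152, "Cache-scheduler": 115}

if __name__ == "__main__":
    base = ttr_s["default"]
    for name, ttr in ttr_s.items():
        if name == "default":
            continue
        print(f"{name}: -{relative_reduction_pct(base, ttr):.1f}% "
              f"({speedup(base, ttr):.2f}x)")
```

The two measures are equivalent views of the same ratio: a reduction of r percent corresponds to a speedup of 1 / (1 − r/100).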

Autoscaling Dynamics vs. QoS Recovery

All runs trigger scale-out quickly: the desired replica count starts increasing within 17–28 s after the run begins, and reaches the maximum desired replicas (20) at around 96–107 s. Ready replicas also reach 20 at around 99–111 s. Notably, the gap between “all replicas ready” and “QoS recovered” differs substantially: Cache-scheduler recovers almost immediately after the system reaches 20 ready replicas (only ∼10 s later), whereas the default scheduler requires an additional ∼247 s. This indicates that, beyond scale-out completion, scheduler-side throughput and placement behavior can still influence how quickly the service returns to the target p99 latency under sustained load.

QoS Degradation During HOLD

In addition to recovery time, we report the maximum observed p99 during HOLD as a sanity indicator (Figure 8). Koordinator shows the highest peak p99 (2.61 s), suggesting a stronger transient degradation before recovery. Cache-scheduler maintains a peak p99 comparable to the default scheduler (1.87 s vs. 1.96 s) while recovering substantially faster.

6. Conclusions

6.1. Achievements

This paper presents a Kubernetes scheduler optimization that targets HPA bursty scale-out workloads by reducing redundant scoring computation across repeated, scheduling-equivalent Pods. The main achievements are summarized as follows:

A keyed score-reuse mechanism integrated into kube-scheduler. We implemented a scheduler-side score reuse layer that identifies scheduling-equivalent Pods via a ScoreKey derived from scheduling-relevant PodSpec fields (resource requests and hashed constraints). The mechanism reuses standard framework scoring outputs without modifying individual scheduling plugins, preserving compatibility with the Kubernetes scheduling framework.
Rotating Top-1 selection to avoid repeated placement on a single node. Instead of repeatedly selecting the same cached best node for all replicas, our cache stores a ranked node list and applies a pop-and-advance policy: each successful scheduling decision returns the current Top-1 candidate and removes it from the cached list. This design naturally rotates placements among high-scoring nodes in a burst, improving diversity without introducing additional optimization passes.
Bounded staleness via feasibility-context validation and refresh. We introduced two explicit safety controls to maintain correctness under dynamic cluster conditions. First, cache applicability is guarded by a feasibility-context fingerprint (feasibleHash) computed from the current feasible node set; mismatches trigger immediate invalidation and recomputation. Second, we bound reuse over time by a consumption-based refresh rule (default threshold $τ = 0.5$ ), which forces periodic full recomputation after consuming a significant portion of cached candidates.
Evaluation on a real GKE cluster, including HPA burst scenarios. We evaluated the proposed scheduler on a production-grade Google Kubernetes Engine (GKE) cluster with one control plane and five worker nodes. In addition to rapid deployment bursts (e.g., creating 200 Pods), we also conducted experiments under HPA-driven scale-out scenarios to validate effectiveness in realistic autoscaling conditions. We compared against the default scheduler as well as representative alternative schedulers (Koordinator and YuniKorn). The results demonstrate that reusing ranked scoring results can reduce average scheduling latency after warm-up while maintaining placement fairness comparable to the default scheduler.
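To make the interaction between the first three mechanisms concrete, the following Python sketch models a score cache with pop-and-advance selection, a feasible-set fingerprint, and a consumption-based refresh at threshold τ. It is a simplified illustration and not the Go implementation inside kube-scheduler: the ScoreKey is treated as an opaque string, score_fn stands in for a full framework scoring pass, and the class and function names, the SHA-256 fingerprint, and the exact refresh condition (consumed fraction ≥ τ, checked before each selection) are assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


def feasible_hash(nodes: Sequence[str]) -> str:
    """Order-independent fingerprint of the feasible node set."""
    return hashlib.sha256("\n".join(sorted(nodes)).encode()).hexdigest()


@dataclass
class CacheEntry:
    ranked: List[str]    # remaining candidates, best first
    initial_size: int    # number of candidates right after scoring
    fingerprint: str     # feasible-set fingerprint at scoring time


class ScoreCache:
    def __init__(self, score_fn: Callable[[Sequence[str]], Dict[str, float]],
                 tau: float = 0.5):
        self.score_fn = score_fn          # stand-in for a full scoring pass
        self.tau = tau
        self.entries: Dict[str, CacheEntry] = {}
        self.full_scoring_passes = 0

    def _refresh(self, key: str, feasible: Sequence[str]) -> CacheEntry:
        scores = self.score_fn(feasible)
        self.full_scoring_passes += 1
        # Highest score first; node name breaks ties deterministically.
        ranked = sorted(feasible, key=lambda n: (-scores[n], n))
        entry = CacheEntry(ranked, len(ranked), feasible_hash(feasible))
        self.entries[key] = entry
        return entry

    def select(self, key: str, feasible: Sequence[str]) -> Optional[str]:
        """Return the node for the next Pod of this ScoreKey class."""
        if not feasible:
            return None
        entry = self.entries.get(key)
        if entry is None or entry.fingerprint != feasible_hash(feasible):
            # Cold start, or the feasible set changed: invalidate and rescore.
            entry = self._refresh(key, feasible)
        else:
            consumed = (entry.initial_size - len(entry.ranked)) / entry.initial_size
            if consumed >= self.tau or not entry.ranked:
                # Bounded staleness: rescore after consuming a tau fraction.
                entry = self._refresh(key, feasible)
        # Pop-and-advance: hand out the current Top-1 and remove it.
        return entry.ranked.pop(0)
```

With four feasible nodes and τ = 0.5, this sketch performs one full scoring pass for every two placements. With a static score_fn the refreshed ranking starts again from the same top node; in a real scheduler the recomputed scores reflect the Pods already placed, so the ranking shifts between refreshes.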

6.2. Limitations

While the proposed design improves scheduling throughput in bursty workloads, it has several limitations:

Conservative similarity definition may reduce cache hit rate. ScoreKey uses strict equality over hashed constraint fields and aggregated resource requests. Semantically equivalent but syntactically different specifications (e.g., reordered constraints) may not match, lowering reuse opportunities. This is an intentional correctness-first choice, but it may underutilize reuse in practice.
Feasible-set fingerprint is a coarse proxy for cluster dynamics. Our feasibleHash guards reuse based on changes in the feasible node set. However, scores can drift even when feasibility remains unchanged (e.g., resource headroom decreases but still passes filters). We mitigate this using consumption-based refresh, yet the fingerprint does not directly capture finer-grained score drift.
Top-1 determinism and interaction with tie-breaking. The cache-aware routine returns a single Top-1 candidate rather than a full priority list. This can reduce the randomness introduced by reservoir sampling among equally-scored nodes in the default scheduler. Although pop-and-advance provides diversity across consecutive replicas, the selection may still be more deterministic than the default behavior.
Workload dependence and warm-up cost. The benefits depend on the presence of repeated, scheduling-equivalent Pods (typical in replica bursts). For heterogeneous workloads with low repetition, cache hits will be rare and performance will approach the default scheduler. Additionally, the first Pod in each ScoreKey class still incurs a full scoring pass (warm-up).
Limited experimental scope. Our evaluation is conducted on a specific cluster scale and configuration (GKE, 5 workers) and a limited set of workloads and scheduler configurations. Results may vary with larger clusters, different plugin sets, different node heterogeneity, or different autoscaling policies.

6.3. Future Work

There are several promising directions to extend this work:

Richer similarity modeling and canonicalization. We plan to improve ScoreKey robustness by canonicalizing constraint specifications (e.g., normalizing ordering) and by selectively including additional scheduling-relevant fields (e.g., certain annotations or resource classes). This could increase cache hit rates while maintaining safety.
Stronger freshness controls beyond feasible-set hashing. Future versions could incorporate lightweight resource-state summaries into the applicability check (e.g., hashing coarse resource headroom buckets) or adopt plugin-aware staleness signals. This would allow refresh decisions to reflect score drift more directly rather than relying mainly on set changes and consumption thresholds.
Adaptive refresh and diversity policies. Instead of a fixed $τ$ , the scheduler could adapt the refresh threshold based on observed cluster volatility, scheduling latency targets, or workload characteristics. Similarly, diversity policies could be extended (e.g., mixing Top-k sampling from the cached ranking) to better emulate default tie-breaking while retaining reuse benefits.
Scaling evaluation and broader baselines. We will extend experiments to larger clusters and more diverse workloads, and evaluate interactions with additional scheduling features such as preemption, extenders, and heterogeneous node pools. We also plan to evaluate end-to-end autoscaling latency (HPA decision → Pod ready) under controlled load traces.
Engineering hardening and observability. Additional engineering work includes improved cache eviction policies (e.g., LRU), richer metrics for cache hit rate and recomputation causes, and tracing hooks to help operators understand when score reuse is effective or when it is being invalidated.

Author Contributions

Conceptualization, B.Z. and S.K.M.; Methodology, B.Z. and S.K.M.; Software, B.Z. and S.K.M.; Validation, B.Z. and S.K.M.; Formal analysis, B.Z., S.K.M., Y.C. and H.M.D.K.; Investigation, B.Z., S.K.M. and Y.C.; Resources, B.Z. and S.K.M.; Data curation, B.Z. and S.K.M.; Writing—original draft, B.Z., S.K.M., Y.C. and H.M.D.K.; Writing—review & editing, B.Z., S.K.M., Y.C. and H.M.D.K.; Visualization, B.Z. and S.K.M.; Supervision, S.K.M. and H.M.D.K.; Project administration, S.K.M. and H.M.D.K.; Funding acquisition, S.K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Science and Technology Development Fund of Macao, Macao SAR, China under grant 0033/2022/ITP and in part by the Faculty Research Grant Projects of Macau University of Science and Technology, Macao SAR, China under grant FRG-22-020-FI.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors gratefully acknowledge the funding sources. The authors would also like to thank the anonymous reviewers for their quality reviews and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

VM	Virtual Machine
CaaS	Container as a Service
ECS	Elastic Container Service
OS	Operating System
LXC	Linux Containers
OCI	Open Container Initiative
API	Application Programming Interface
CRI	Container Runtime Interface
GKE	Google Kubernetes Engine
AKS	Azure Kubernetes Service
EKS	Elastic Kubernetes Service
CLI	Command-Line Interface
QoS	Quality of Service
HA	High Availability
K8s	Kubernetes
HPA	Horizontal Pod Autoscaler

References

Kun, H.; Hongjun, C. The Applied Research on the Virtualization Technology in Cloud Computing. In Proceedings of the 1st International Workshop on Cloud Computing and Information Security, Shanghai, China, 9–11 November 2013; Atlantis Press: Dordrecht, The Netherlands, 2013; pp. 526–529.
Xiao, Z.; Song, W.; Chen, Q. Dynamic resource allocation using virtual machines for cloud computing environment. IEEE Trans. Parallel Distrib. Syst. 2012, 24, 1107–1117.
Bentaleb, O.; Belloum, A.S.; Sebaa, A.; El-Maouhab, A. Containerization technologies: Taxonomies, applications and challenges. J. Supercomput. 2022, 78, 1144–1181.
Merkel, D. Docker: Lightweight linux containers for consistent development and deployment. Linux J. 2014, 239, 2.
Al Jawarneh, I.M.; Bellavista, P.; Bosi, F.; Foschini, L.; Martuscelli, G.; Montanari, R.; Palopoli, A. Container orchestration engines: A thorough functional and performance comparison. In Proceedings of the ICC 2019-2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; IEEE: New York, NY, USA, 2019; pp. 1–6.
Malviya, A.; Dwivedi, R.K. A comparative analysis of container orchestration tools in cloud computing. In Proceedings of the 2022 9th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 23–25 March 2022; IEEE: New York, NY, USA, 2022; pp. 698–703.
Rodriguez, M.A.; Buyya, R. Container-based cluster orchestration systems: A taxonomy and future directions. Softw. Pract. Exp. 2019, 49, 698–719.
Rashid, A.; Chaturvedi, A. Virtualization and its role in cloud computing environment. Int. J. Comput. Sci. Eng. 2019, 7, 1131–1136.
Pahl, C.; Brogi, A.; Soldani, J.; Jamshidi, P. Cloud container technologies: A state-of-the-art review. IEEE Trans. Cloud Comput. 2017, 7, 677–692.
Ambrosino, G.; Fioccola, G.B.; Canonico, R.; Ventre, G. Container mapping and its impact on performance in containerized cloud environments. In Proceedings of the 2020 IEEE International Conference on Service Oriented Systems Engineering (SOSE), Oxford, UK, 13–16 April 2020; IEEE: New York, NY, USA, 2020; pp. 57–64.
Morabito, R. A performance evaluation of container technologies on internet of things devices. In Proceedings of the 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), San Francisco, CA, USA, 10–14 April 2016; IEEE: New York, NY, USA, 2016; pp. 999–1000.
Boettiger, C. An introduction to Docker for reproducible research. ACM SIGOPS Oper. Syst. Rev. 2015, 49, 71–79.
Stanojevic, P.; Usorac, S.; Stanojev, N. Container manager for multiple container runtimes. In Proceedings of the 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia, 27 September–1 October 2021; IEEE: New York, NY, USA, 2021; pp. 991–994.
Casalicchio, E.; Iannucci, S. The state-of-the-art in container technologies: Application, orchestration and security. Concurr. Comput. Pract. Exp. 2020, 32, e5668.
Carrión, C. Kubernetes scheduling: Taxonomy, ongoing issues and challenges. ACM Comput. Surv. 2022, 55, 138.
Burns, B.; Grant, B.; Oppenheimer, D.; Brewer, E.; Wilkes, J. Borg, omega, and kubernetes. Commun. ACM 2016, 59, 50–57.
Verma, A.; Pedrosa, L.; Korupolu, M.; Oppenheimer, D.; Tune, E.; Wilkes, J. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems, Bordeaux, France, 21–24 April 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1–17.
Chhajed, S. Learning ELK Stack; Packt Publishing Ltd.: Birmingham, UK, 2015.
Carvalho, M.; Macedo, D.F. QoE-aware container scheduler for co-located cloud environments. In Proceedings of the 2021 IFIP/IEEE International Symposium on Integrated Network Management (IM), Virtual, 17–21 May 2021; IEEE: New York, NY, USA, 2021; pp. 286–294.
Nguyen, T.T.; Yeom, Y.J.; Kim, T.; Park, D.H.; Kim, S. Horizontal pod autoscaling in kubernetes for elastic container orchestration. Sensors 2020, 20, 4621.
Santos, J.; Wauters, T.; Volckaert, B.; De Turck, F. Towards network-aware resource provisioning in kubernetes for fog computing applications. In Proceedings of the 2019 IEEE Conference on Network Softwarization (NetSoft), Paris, France, 24–28 June 2019; IEEE: New York, NY, USA, 2019; pp. 351–359.
Wojciechowski, Ł.; Opasiak, K.; Latusek, J.; Wereski, M.; Morales, V.; Kim, T.; Hong, M. Netmarks: Network metrics-aware kubernetes scheduler powered by service mesh. In Proceedings of the IEEE INFOCOM 2021-IEEE Conference on Computer Communications, Virtual, 10–13 May 2021; IEEE: New York, NY, USA, 2021; pp. 1–9.
Qi, S.; Kulkarni, S.G.; Ramakrishnan, K. Assessing container network interface plugins: Functionality, performance, and scalability. IEEE Trans. Netw. Serv. Manag. 2020, 18, 656–671.
Menouer, T. KCSS: Kubernetes container scheduling strategy. J. Supercomput. 2021, 77, 4267–4293.
Pérez de Prado, R.; García-Galán, S.; Muñoz-Expósito, J.E.; Marchewka, A.; Ruiz-Reyes, N. Smart containers schedulers for microservices provision in cloud-fog-IoT networks. Challenges and opportunities. Sensors 2020, 20, 1714.
Rejiba, Z.; Chamanara, J. Custom scheduling in kubernetes: A survey on common problems and solution approaches. ACM Comput. Surv. 2022, 55, 151.
Senjab, K.; Abbas, S.; Ahmed, N.; Khan, A.u.R. A survey of Kubernetes scheduling algorithms. J. Cloud Comput. 2023, 12, 87.
Kubernetes SIG Scheduling. Remove Equivalence Cache (eCache) from the Scheduler Code Base. GitHub Issue #71013, Kubernetes/Kubernetes. 2018. Available online: https://github.com/kubernetes/kubernetes/issues/71013 (accessed on 26 January 2026).
Kubernetes Enhancements. KEP-5598: Opportunistic Batching. Kubernetes Enhancement Proposal (KEP), SIG Scheduling. 2025. Available online: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/5598-opportunistic-batching/README.md (accessed on 26 January 2026).
The Kubernetes Authors. Scheduler Performance Tuning: Enabling Opportunistic Batching. Kubernetes Documentation. 2026. Available online: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduler-perf-tuning/ (accessed on 26 January 2026).
The Kubernetes Authors. Kubernetes v1.35: Timbernetes (The World Tree Release). Kubernetes Blog. 2025. Available online: https://kubernetes.io/blog/2025/12/17/kubernetes-v1-35-release/ (accessed on 26 January 2026).
Apache YuniKorn Project. YuniKorn Kubernetes Shim: Design/Kubernetes Shim Design. 2026. Available online: https://yunikorn.apache.org/docs/next/archived_design/k8shim (accessed on 29 January 2026).
Apache YuniKorn Project. YuniKorn on Kubernetes: Scheduler Shim Overview (Documentation Page). 2026. Available online: https://yunikorn.apache.org/docs/ (accessed on 29 January 2026).
Alibaba Cloud. ACK Koordinator (FKA ack-slo-Manager): Product Overview and Architecture. 2025. Available online: https://www.alibabacloud.com/help/en/ack/product-overview/ack-koordinator-fka-ack-slo-manager (accessed on 29 January 2026).
Koordinator Project. Load-Aware Scheduling (Koordinator Documentation). 2026. Available online: https://koordinator.sh/docs/user-manuals/load-aware-scheduling (accessed on 29 January 2026).
AdriftVin. k8s-Cache-Scheduler: A Custom Scheduler with Score Cache for Kubernetes (Based on kube-Scheduler v1.33.5). GitHub Repository. Release v1.33.5-r2. 2026. Available online: https://github.com/AdriftVin/k8s-cache-scheduler (accessed on 29 January 2026).

Figure 1. Container lifecycle and container orchestration [14,15].

Figure 2. Kubernetes architecture (Adapted from https://kubernetes.io/docs/concepts/architecture/, accessed on 17 December 2025).

Figure 3. Kubernetes Pod creation-state diagram. Solid arrows represent request messages or calls initiated by one component to another; dashed arrows represent return messages or responses.

Figure 4. Kubernetes default scheduler [15,16].

Figure 5. Latency comparison under burst deployment (N = 200).

Figure 6. Placement distribution quality (a higher Jain index and a lower CV indicate a more even spread).

Figure 7. Time-to-recover (TTR) under HPA-driven scale-out (lower is better).

Figure 8. HOLD-stage p99 latency over time at $C^{\star} = 10$ (the first-breach concurrency for all schedulers).

Table 1. GKE node pool configuration used in the experiments.

Item	Configuration
Node pool name	`exp-pool`
Machine type	`e2-standard-2`
Boot disk size	30 GB
Node OS/runtime	GKE default node image; container runtime via `containerd`
Nodes during main runs	5 worker nodes
Planned upper bound	up to 8 nodes (subject to quota)

Table 2. Schedulers evaluated and their scheduler names.

Scheduler	Namespace	Scheduler Name
Default Kubernetes scheduler	`kube-system`	`default-scheduler`
YuniKorn scheduler	`yunikorn`	`yunikorn`
Koordinator scheduler	`koordinator-system`	`koord-scheduler`
Proposed cache scheduler	`scheduler-system`	`cache-scheduler`
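
Kubernetes routes a Pod to a particular scheduler through the `spec.schedulerName` field, so the names in Table 2 are the values a workload would reference. The following is a minimal sketch, assuming the workload is expressed as a Deployment; the helper `build_deployment`, the workload name `burst-demo`, and the container image are hypothetical placeholders and are not taken from the experiments. Some schedulers may also expect extra labels or annotations, which are omitted here.

```python
import json

# Scheduler names as listed in Table 2.
SCHEDULER_NAMES = {
    "default": "default-scheduler",
    "yunikorn": "yunikorn",
    "koordinator": "koord-scheduler",
    "cache": "cache-scheduler",
}


def build_deployment(scheduler_key, replicas, name="burst-demo"):
    """Return a Deployment manifest (as a dict) pinned to one scheduler.

    The workload name and image are illustrative placeholders.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # This field selects which scheduler places the Pods.
                    "schedulerName": SCHEDULER_NAMES[scheduler_key],
                    "containers": [
                        {"name": "app", "image": "registry.k8s.io/pause:3.9"}
                    ],
                },
            },
        },
    }


manifest = build_deployment("cache", replicas=200)
print(json.dumps(manifest, indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed manifest could be saved to a file and applied directly; switching the scheduler under test then only requires changing the key passed to the helper.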

Table 3. Burst deployment results (N = 200). Values are averaged over three runs. Latencies are in milliseconds. ↑: higher is better; ↓: lower is better.

Scheduler	$\bar{T}_{sched}$ (ms)	$p99(T_{sched})$ (ms)	$\bar{T}_{ready}$ (ms)	$p99(T_{ready})$ (ms)	Jain ↑	CV ↓
Default	1175.0	2673.3	4958.3	8673.3	0.9919	0.090
Koordinator	816.7	2000.0	4856.7	11,670.0	0.9806	0.141
YuniKorn	1955.0	4336.7	11,540.0	26,003.3	0.8793	0.370
Cache-scheduler	801.7	2000.0	4818.3	8670.0	0.9917	0.091
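
The two placement indicators in Table 3 can be computed from per-node Pod counts. The sketch below is illustrative only: it assumes Jain's fairness index in its standard form and a CV based on the population standard deviation, and the per-node counts are hypothetical rather than measured. Under those assumptions the two metrics are linked by Jain = 1 / (1 + CV²), and the reported pairs in Table 3 are consistent with that identity to within rounding.

```python
import math


def jain_index(counts):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly even."""
    total = sum(counts)
    return total * total / (len(counts) * sum(c * c for c in counts))


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean."""
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return math.sqrt(variance) / mean


# Hypothetical per-node Pod counts for 200 Pods on 5 nodes (not measured data).
pods_per_node = [40, 42, 38, 41, 39]
jain = jain_index(pods_per_node)
cv = coefficient_of_variation(pods_per_node)
print(f"Jain = {jain:.4f}, CV = {cv:.3f}")

# With the population standard deviation, Jain = 1 / (1 + CV^2).
assert math.isclose(jain, 1.0 / (1.0 + cv * cv))

# (Jain, CV) pairs copied from Table 3; each agrees with the identity to ~1e-3.
table3 = {
    "Default": (0.9919, 0.090),
    "Koordinator": (0.9806, 0.141),
    "YuniKorn": (0.8793, 0.370),
    "Cache-scheduler": (0.9917, 0.091),
}
for name, (reported_jain, reported_cv) in table3.items():
    implied_jain = 1.0 / (1.0 + reported_cv ** 2)
    assert abs(reported_jain - implied_jain) < 1e-3, name
```

Since the two columns carry the same information under these assumptions, either one is enough to rank the schedulers by evenness of spread.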

Table 4. HPA benchmark results (`SLA_P99` = 1.0 s, `RECOVERY_STREAK` = 1, $C^{\star} = 10$).

Scheduler	TTR (s) ↓	Speedup vs. Default	Max p99 (s) ↓	Mean p99 (s) ↓	OK-Rate ↑	Ready-20 Time (s) ↓
Default	346	1.00×	1.962	1.273	0.40	99
Koordinator	191	1.81×	2.613	1.291	0.43	105
YuniKorn	152	2.28×	1.823	1.166	0.33	111
Cache-scheduler	115	3.01×	1.872	1.178	0.43	105
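
The "Speedup vs. Default" column in Table 4 is consistent with dividing the default scheduler's TTR by each scheduler's TTR. A short sketch that reproduces the column from the reported TTR values (the function name `speedup_vs_baseline` is introduced here for illustration):

```python
# TTR values in seconds, copied from Table 4.
ttr_seconds = {
    "Default": 346,
    "Koordinator": 191,
    "YuniKorn": 152,
    "Cache-scheduler": 115,
}


def speedup_vs_baseline(ttr, baseline="Default"):
    """Return baseline TTR divided by each scheduler's TTR (higher is faster)."""
    base = ttr[baseline]
    return {name: base / value for name, value in ttr.items()}


speedups = speedup_vs_baseline(ttr_seconds)
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

Read the other way round, a ratio of 3.01 means the cache scheduler's TTR is roughly 67% shorter than the default scheduler's (115 s versus 346 s).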


Cite as: Zhou, B.; Mondal, S.K.; Cheng, Y.; Kabir, H.M.D. On Optimized Scheduling Scheme for Rapid Pod Autoscaling in Kubernetes. Appl. Sci. 2026, 16, 2481. https://doi.org/10.3390/app16052481
