Proceeding Paper

An Experimental Evaluation of Latency-Aware Scheduling for Distributed Kubernetes Clusters †

by Radoslav Furnadzhiev 1,2
1 Department of Computer Systems and Technologies, Faculty of Electronics and Automation, Technical University of Sofia, Plovdiv Branch, 1797 Sofia, Bulgaria
2 Center of Competence “Smart Mechatronics, Eco- and Energy Saving Systems and Technologies”, 4000 Plovdiv, Bulgaria
Presented at the 14th International Scientific Conference TechSys 2025—Engineering, Technology and Systems, Plovdiv, Bulgaria, 15–17 May 2025.
Eng. Proc. 2025, 100(1), 25; https://doi.org/10.3390/engproc2025100025
Published: 9 July 2025

Abstract

Kubernetes clusters are deployed across data centers for geo-redundancy and low-latency access, resulting in new challenges in scheduling workloads optimally. This paper presents a practical evaluation of network-aware scheduling in a distributed Kubernetes cluster that spans multiple network zones. A custom scheduling plugin is implemented within the scheduling framework to incorporate real-time network telemetry (inter-node ping latency) into pod placement decisions. The evaluation methodology combines a custom scheduler plugin, realistic network latency measurements, and representative distributed benchmarks to quantify the impact of scheduling on traffic patterns. The results provide strong empirical confirmation of the findings previously established through simulation, offering a validated path forward to integrate not only network metrics, but also other performance-critical metrics such as energy efficiency, hardware utilization, and fault tolerance.

1. Introduction

Kubernetes has become the de facto standard for orchestrating containerized workloads across cloud and edge environments. As organizations deploy Kubernetes clusters across multiple regions or data centers for geo-redundancy and low-latency access, new challenges emerge in scheduling workloads optimally. The default Kubernetes scheduler primarily considers CPU and memory availability and ignores network topology—assuming a flat, homogeneous network. In practice, multi-region clusters exhibit non-uniform network latencies and bandwidth between nodes [1].
Data-intensive frameworks such as Apache Spark [2] perform massive data shuffles between executors. Shuffle-heavy Spark workloads introduce significant scheduling challenges in Kubernetes-based clusters due to high network traffic between executors. If these executors are scheduled on high-latency links, network I/O becomes a major bottleneck that slows job completion [3]. Similarly, distributed deep learning training exchanges model gradients every iteration; insufficient network bandwidth or added latency in this communication phase reduces scaling efficiency [4]. Recent research [1,5,6,7,8] proposes extending the scheduler with plugins that account for inter-node latency and network costs in addition to the usual resource-based criteria. In simulation studies, these network-aware scheduling algorithms achieved significantly lower inter-region communication delays while maintaining high resource utilization. This aligns with a broader trend in scheduling research: moving from single-objective resource allocation to multi-objective optimization that includes metrics such as network latency, energy consumption, and reliability.
Building on this background, this paper presents a practical evaluation of network-aware scheduling in an operational Kubernetes cluster that spans multiple network zones. A custom scheduling plugin is implemented within the Kubernetes scheduling framework to incorporate real-time network telemetry (inter-node ping latency) into pod placement decisions. Three isolated VLAN regions, each with its own subnet, are connected via a firewall, allowing the latency between regions to be controlled. The behavior of the network-aware scheduler is compared against the default scheduler using two representative distributed workloads:
  • Batch analytics workload: TPC-H [9] benchmark queries running on Apache Spark.
  • Distributed machine learning workload: Synchronous data-parallel training with PyTorch [10].
During these experiments, detailed performance metrics (execution times, throughput) are collected, and network flow data is captured via NetFlow logs. This approach allows not only quantifying improvements in job completion times, but also observing how network traffic patterns change under network-optimized scheduling.

2. Materials and Methods

The assessment methodology combines a custom network-aware scheduler plugin, live network latency measurements, and representative distributed benchmarks to quantify the impact of network-aware scheduling. The scheduler extension, implemented within the Kubernetes scheduler framework, operates during the score phase to prioritize node assignments that minimize inter-pod communication delays. Network topology awareness is provided by a deployment that systematically measures inter-node latencies via ICMP ping. These measured values are consolidated into a matrix, which is refreshed periodically to reflect the current state of the network.
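The score-phase logic described above can be illustrated with a short sketch. Python is used here for brevity (the scheduler framework itself is Go-based), and all function and variable names are illustrative rather than taken from the actual plugin:

```python
def score_node(candidate, peer_nodes, latency_ms, max_latency_ms=100.0):
    """Score a candidate node (0-100, higher is better) by its average
    measured RTT to nodes already hosting pods from the same workload group."""
    if not peer_nodes:
        return 100.0  # first pod of the group: every node is equally good
    avg = sum(latency_ms[(candidate, p)] for p in peer_nodes) / len(peer_nodes)
    # Map zero latency to 100 and max_latency_ms (or worse) to 0.
    return max(0.0, 100.0 * (1.0 - avg / max_latency_ms))

# Latency matrix as produced by the ping measurements (milliseconds, symmetric).
latency_ms = {
    ("r1-n1", "r1-n2"): 0.2,
    ("r1-n1", "r2-n1"): 15.0,
    ("r1-n1", "r3-n1"): 25.0,
}
latency_ms.update({(b, a): v for (a, b), v in latency_ms.items()})
```

Normalizing measured RTTs into the scheduler's 0–100 score range is one simple choice; how latency is weighted against CPU and memory scores is a separate design decision of each plugin variant.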

2.1. Scheduler Plugin Design

A custom scheduler plugin for Kubernetes is developed, based on the official scheduler framework (Scheduler SDK). The algorithms proposed in [1] are implemented as individual plugins to the secondary scheduler, with the selection determined by a configuration parameter:
  • Ant Colony Optimization (ACO)—A probabilistic technique for solving optimization problems where ants (artificial agents) construct solutions by incrementally assigning pods to nodes.
  • Non-dominated Sorting Genetic Algorithm-II (NSGA-II)—An evolutionary algorithm that evolves scheduling decisions with objectives such as minimizing the overall network distance between pods belonging to the same group.
  • Integer Linear Programming (ILP)—A Simplex solver that finds the best scheduling solution based on constraints for latency and resource usage.
  • Simulated Annealing (SA)—An algorithm that employs an initial greedy allocation strategy followed by a metallurgy-inspired annealing process that optimizes assignments based on CPU, memory, and latency.
The secondary scheduler deployed to the cluster is utilized only for designated workloads. Pods belonging to the same workload are marked with a custom annotation.
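As one example of the strategies listed above, the simulated annealing variant can be sketched roughly as follows. This is a self-contained Python illustration with assumed inputs; the real plugin operates on live cluster state, and the cost function and cooling schedule shown here are illustrative choices, not the published implementation:

```python
import math
import random

def anneal_placement(pods, nodes, latency_ms, capacity, cpu_req,
                     steps=2000, t0=50.0, cooling=0.995, seed=0):
    """Greedy initial assignment, then annealing: randomly move single pods,
    accepting worse placements with a temperature-dependent probability."""
    rng = random.Random(seed)

    def cost(assign):
        # Total pairwise latency between co-scheduled pods of the same group.
        return sum(latency_ms[(assign[a], assign[b])]
                   for i, a in enumerate(pods) for b in pods[i + 1:]
                   if assign[a] != assign[b])

    def feasible(assign):
        used = {n: 0 for n in nodes}
        for p in assign:
            used[assign[p]] += cpu_req
        return all(used[n] <= capacity[n] for n in nodes)

    # Greedy start: place each pod on the first node with spare CPU.
    assign = {}
    for p in pods:
        assign[p] = next(n for n in nodes
                         if sum(cpu_req for q in assign if assign[q] == n)
                         + cpu_req <= capacity[n])

    best, best_cost, t = dict(assign), cost(assign), t0
    for _ in range(steps):
        cand = dict(assign)
        cand[rng.choice(pods)] = rng.choice(nodes)
        if not feasible(cand):
            continue
        delta = cost(cand) - cost(assign)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            assign = cand
            c = cost(assign)
            if c < best_cost:
                best, best_cost = dict(assign), c
        t *= cooling
    return best, best_cost
```

A production cost function would also penalize memory pressure and resource imbalance; only the latency term is shown to keep the sketch short.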

2.2. Workload Benchmarks and Metrics

To assess network-aware scheduling efficacy, two network-sensitive workloads are selected:
  • Apache Spark TPC-H Benchmark [11]: Apache Spark (v3.5.0) is used in Kubernetes mode, executing all TPC-H SQL queries on a 10 GB synthetic data set. The distributed architecture comprises a central driver pod that coordinates multiple executor pods. Performance is evaluated by comparing query execution times under both default Kubernetes scheduling and the network-aware scheduler variants. Multiple iterations are conducted to establish statistical validity. Primary metrics include query completion time and Spark’s internal shuffle metrics, used to correlate performance improvements with network conditions.
  • Distributed PyTorch Training: A distributed training workflow is developed using the PyTorch Distributed Data Parallel module, featuring a synthetic CNN architecture designed to emphasize communication costs. The topology consisted of one parameter server pod coordinating with four worker pods. Performance comparison between default and network-aware scheduling is focused on epoch duration and overall training time. Key metrics included processing throughput and per-iteration communication overhead, with particular attention to gradient synchronization efficiency across network links.
Experimental runs under controlled conditions are performed for both benchmarks, eliminating interference from other significant workloads to isolate scheduling decision effects. Performance results from multiple iterations are visualized using box plots, which effectively demonstrate both median performance improvements and variability patterns. For Spark, these visualizations represent query runtime distributions under each scheduling approach. Similarly, for PyTorch, the box plots illustrate the training duration across scheduling scenarios. This analytical approach enables a comprehensive assessment of both central tendency improvements and execution stability variations attributable to network conditions or other environmental factors.
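The box-plot quantities referred to above can be computed with the standard library alone. The run times below are invented placeholders to show the shape of the analysis, not measured results:

```python
import statistics

def five_number_summary(samples):
    """Return (min, Q1, median, Q3, max) -- the quantities a box plot encodes."""
    q1, med, q3 = statistics.quantiles(samples, n=4)
    return min(samples), q1, med, q3, max(samples)

# Hypothetical runtimes (seconds) for one TPC-H query, ten runs per scheduler.
default_sched = [41.2, 44.8, 39.5, 52.1, 47.3, 43.0, 55.6, 40.9, 49.2, 46.4]
network_aware = [33.1, 34.2, 32.8, 35.0, 33.7, 34.9, 33.3, 34.1, 32.9, 33.6]

for name, runs in [("default", default_sched), ("network-aware", network_aware)]:
    lo, q1, med, q3, hi = five_number_summary(runs)
    print(f"{name:14s} median={med:.1f}s IQR={q3 - q1:.1f}s")
```

A lower median indicates the central-tendency improvement, while a narrower interquartile range captures the execution-stability gains the box plots illustrate.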

2.3. Experimental Setup and Network Traffic Collection

The evaluation of the proposed scheduling approach is conducted in a controlled experimental environment: a hardware testbed and software setup designed to emulate a distributed Kubernetes cluster with three distinct network zones and configurable inter-zone latency.
The Kubernetes cluster comprises 24 virtual machines distributed across four Dell PowerEdge R730 physical servers (Dell, Round Rock, TX, USA), each equipped with dual Intel Xeon E5-2683v4 processors (Intel, Santa Clara, CA, USA) and 256 GB RAM. The cluster architecture includes a single master node (4 CPU cores, 6 GB RAM) and seven worker nodes (6 CPU cores, 32 GB RAM per node) in each region.
The network connectivity between virtual machines was managed using a Cisco Catalyst 3750x switch (Cisco, San Jose, CA, USA) with a C3KX-SM-10G service module and an OPNSense firewall. Each region’s nodes are assigned to a separate VLAN with its own subnet, while the OPNSense firewall facilitates inter-VLAN routing. A WAN connection between regions is simulated by adding latency in routing through traffic shaping [12], as described in Table 1. The C3KX-SM-10G Service Module allows network traffic metadata to be exported via the NetFlow protocol without negatively impacting cluster performance.
For network analysis, a Switched Port Analyzer (SPAN) session was configured to mirror VLAN traffic to service module interfaces [13]. By capturing and analyzing network flow data, the impact of scheduling on traffic patterns is directly observed. All NetFlow data is exported to an Elasticsearch database using Filebeat [14].
A Kubernetes DaemonSet, an abstraction that deploys a pod on every cluster node, is created to periodically measure inter-node latencies using ICMP pings. Each DaemonSet pod sends timestamped pings to every other node’s IP address and computes an average round-trip time. These latency measurements are aggregated into a matrix and published in a Kubernetes ConfigMap (a shared key-value store accessible via the Kubernetes API). In the live cluster, this ConfigMap is updated every 30 seconds to reflect current network conditions.
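The aggregation step of such a DaemonSet agent can be sketched as follows. The node names and sample values are illustrative; the real agent collects RTTs with ICMP and publishes the result through the Kubernetes API rather than only building a JSON string:

```python
import json
import statistics

def build_latency_matrix(samples):
    """Collapse raw RTT samples {(src, dst): [ms, ...]} into per-pair
    averages, mirrored so the matrix is symmetric."""
    matrix = {}
    for (src, dst), rtts in samples.items():
        avg = round(statistics.mean(rtts), 2)
        matrix.setdefault(src, {})[dst] = avg
        matrix.setdefault(dst, {})[src] = avg
    return matrix

# RTT samples one ping pod might have collected over a 30-second window.
samples = {
    ("r1-n1", "r2-n1"): [15.1, 15.3, 14.9],
    ("r1-n1", "r3-n1"): [25.4, 24.8, 25.0],
}
matrix = build_latency_matrix(samples)

# Serialized form suitable for a ConfigMap data field.
configmap_payload = json.dumps(matrix, sort_keys=True)
```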

3. Results

The proposed network-aware scheduling algorithms are compared against the default Kubernetes scheduler in controlled experiments using the two proposed workloads. Each scheduler variant is tested across multiple iterations to assess both median performance and variance, reflecting not only efficiency gains but also improvements in predictability and stability. The benchmark implemented in Apache Spark is focused on the execution time of all TPC-H queries, each one with varying I/O and communication intensities, while PyTorch experiments focus on per-epoch training time and throughput. Furthermore, NetFlow-based traffic analysis offers insight into the degree of traffic localization achieved with each scheduling algorithm.

3.1. TPC-H Query Execution Time Analysis

To evaluate the scheduling algorithms under data analytics workloads, the full suite of TPC-H queries is executed in Apache Spark running in Kubernetes mode. The entire set of queries is executed multiple times with the default scheduler as well as the four network-aware plugin algorithms: ACO, NSGA-II, ILP, and SA. The results, illustrated through box plots, capture both central tendency (median execution time) and variability, providing a view of the variance of execution times between all runs.
The default Kubernetes scheduler consistently demonstrated higher execution times and greater variance across all tested queries, reflecting executor pods being placed on different nodes without network awareness. This performance degradation stems from its lack of awareness of the underlying network topology: it often places executors across high-latency regions, incurring excessive time costs for inter-node shuffles. In contrast, the network-aware plugins delivered the best performance, reducing query execution time by 10–28% with reduced variance. Execution times across all 22 TPC-H queries in Figure 1 reveal a consistent performance advantage of all network-aware scheduling strategies over the default Kubernetes scheduler. Queries Q17, Q18, and Q21 [11], complex nested queries that process large amounts of data and shuffle heavily between nodes, show the most significant improvement. ILP and SA consistently yield the lowest query execution times across most TPC-H queries, demonstrating effective and consistent placements. ACO and NSGA-II, though slightly less effective, also demonstrate clear improvements with reduced variability and stable performance. In contrast, the default scheduler exhibits wider variance, often resulting from randomly distributed pod placements across high-latency regions.
The node-to-node traffic heatmaps in Figure 2 and Figure 3 are generated from the captured NetFlow data, presenting a matrix visualization in which nodes are ordered by regional subnet allocation (10.0.1.0/24—Region 1, 10.0.2.0/24—Region 2, 10.0.3.0/24—Region 3), with color intensity corresponding to inter-node data exchange volumes during the experiment. In Figure 2, high-volume inter-node traffic appears scattered across regions, indicating poor data locality and extensive cross-region shuffles. In Figure 3, the high-volume communication remains largely within a subset of co-located nodes, reflecting improved locality.
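The heatmap matrices can be derived from the exported flow records with a simple aggregation. The flow records and addresses below are hypothetical examples of the NetFlow fields involved:

```python
def traffic_matrix(flows, nodes):
    """Aggregate NetFlow-style records (src_ip, dst_ip, bytes) into a
    node-to-node byte-volume matrix ordered by regional subnet."""
    m = {a: {b: 0 for b in nodes} for a in nodes}
    for src, dst, nbytes in flows:
        if src in m and dst in m[src]:  # ignore traffic to/from non-cluster IPs
            m[src][dst] += nbytes
    return m

# Hypothetical flow records exported during one experiment.
nodes = ["10.0.1.11", "10.0.1.12", "10.0.2.11"]
flows = [
    ("10.0.1.11", "10.0.1.12", 5_000_000),  # intra-region shuffle
    ("10.0.1.12", "10.0.1.11", 4_200_000),
    ("10.0.1.11", "10.0.2.11", 350_000),    # small cross-region exchange
]
m = traffic_matrix(flows, nodes)
```

Rendering `m` with nodes sorted by subnet produces exactly the regional block structure visible in the heatmaps: bright intra-region blocks under network-aware scheduling, diffuse cross-region cells under the default scheduler.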

3.2. PyTorch Distributed Training Results

For the machine learning workload, a synthetic CNN model was trained using PyTorch with a Distributed Data Parallel module across eight pods—one parameter server and seven workers. The total time required to complete 10 training epochs under each scheduling policy was measured. As with Spark, each test scenario was repeated ten times to ensure statistical confidence. The results in Figure 4 show dramatic improvements when using network-aware scheduling. Training under the default scheduler took more than 2500 s in some cases, largely due to gradient synchronization overhead between nodes in different network zones. Metaheuristic strategies such as ACO and NSGA-II provided substantial improvements, while ILP and SA delivered the best results, cutting training time by more than half and producing very consistent times.
The PyTorch node-to-node traffic heatmaps in Figure 5 and Figure 6 visually corroborate the substantial runtime improvements observed. These visualizations capture communication patterns during training, particularly highlighting gradient synchronization between workers and the parameter server. Figure 5 shows that under default scheduling, communication exhibits dispersed patterns across all zones, with significant cross-regional traffic resulting from the arbitrary assignment of worker pods to geographically distant nodes. Figure 6 shows a high volume of network traffic between only three nodes of the same region, showing clear improvement in network locality.
The results demonstrate the network latency sensitivity of this PyTorch distributed training benchmark, revealing how communication patterns and their corresponding latency implications directly impact overall performance metrics when worker pods are arbitrarily distributed across network regions.
With network-aware scheduling enabled in Figure 3 and Figure 6, the heatmaps show traffic clustered within regions, confirming that pod placement closely aligns with latency-optimized objectives. These localized communication patterns explain the reduced shuffle overhead in Spark and faster gradient synchronization in PyTorch, ultimately validating the design goals of the custom plugins.

4. Conclusions

The empirical results strongly support the hypothesis that network-aware scheduling improves the efficiency and predictability of distributed workloads in Kubernetes clusters spanning multiple network zones. By incorporating latency measurements into the scheduling process, the custom plugins consistently reduced the execution time for both analytics and machine learning tasks. The results were particularly pronounced in the workloads chosen due to the high communication demands. In particular, both ILP and SA strategies achieved superior performance in most test cases, providing the best results through an exhaustive search of the solution space.
The insights gained from the NetFlow data collected provide strong empirical confirmation of the findings previously established through simulation. This allows for direct observation of how scheduling decisions affect communication patterns in a live cluster and how those patterns influence performance outcomes across varied workloads. The close alignment between empirical results and previous simulation-based studies confirms the practical relevance of network-aware scheduling. This research quantifies the improvements found in previous research through simulation, offering a validated path forward to integrate not only network metrics but also other performance-critical telemetry such as energy efficiency, hardware utilization, and fault tolerance.

Funding

This research is supported by the Bulgarian Ministry of Education and Science under the second stage of the National Program “Young Scientists and Postdoctoral Students—2”. The equipment used was funded by the European Regional Development Fund within the OP “Research, Innovation and Digitalization Programme for Intelligent Transformation 2021–2027”, Project No. BG16RFPR002-1.014-0005 Center of Competence “Smart Mechatronics, Eco- and Energy Saving Systems and Technologies”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACO: Ant Colony Optimization
CNN: Convolutional Neural Network
I/O: Input/output
ICMP: Internet Control Message Protocol
ILP: Integer Linear Programming
NSGA-II: Non-dominated Sorting Genetic Algorithm-II
SA: Simulated Annealing
SDK: Software Development Kit
SPAN: Switched Port Analyzer
SQL: Structured Query Language
VLAN: Virtual Local Area Network

References

  1. Furnadzhiev, R.; Shopov, M.; Kakanakov, N. Efficient Orchestration of Distributed Workloads in Multi-Region Kubernetes Cluster. Computers 2025, 14, 114. [Google Scholar] [CrossRef]
  2. Apache Software Foundation. Apache Spark™—Unified Engine for Large-Scale Data Analytics. 2025. Available online: https://spark.apache.org/ (accessed on 19 April 2025).
  3. Zhu, C.; Han, B.; Zhao, Y. A comparative performance study of spark on kubernetes. J. Supercomput. 2022, 78, 13298–13322. [Google Scholar] [CrossRef]
  4. Zhang, Z.; Chang, C.; Lin, H.; Wang, Y.; Arora, R.; Jin, X. Is Network the Bottleneck of Distributed Training? In Proceedings of the Workshop on Network Meets AI & ML; Association for Computing Machinery: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  5. Centofanti, C.; Tiberti, W.; Marotta, A.; Graziosi, F.; Cassioli, D. Latency-Aware Kubernetes Scheduling for Microservices Orchestration at the Edge. In Proceedings of the 2023 IEEE 9th International Conference on Network Softwarization (NetSoft), Madrid, Spain, 19–23 June 2023; pp. 426–431. [Google Scholar]
  6. Santos, J.; Wauters, T.; Volckaert, B.; Turck, F.D. Towards Network-Aware Resource Provisioning in Kubernetes for Fog Computing Applications. In Proceedings of the 2019 IEEE Conference on Network Softwarization (NetSoft), Paris, France, 24–28 June 2019; pp. 351–359. [Google Scholar]
  7. Zhang, X.; Li, L.; Wang, Y.; Chen, E.; Shou, L.; Zhang, X. Zeus: Improving Resource Efficiency via Workload Colocation for Massive Kubernetes Clusters. IEEE Access 2021, 9, 105192–105204. [Google Scholar]
  8. Lin, M.; Xi, J.; Bai, W.; Wu, J. Ant Colony Algorithm for Multi-Objective Optimization of Container-Based Microservice Scheduling in Cloud. IEEE Access 2019, 7, 83088–83100. [Google Scholar]
  9. Transaction Processing Performance Council (TPC). TPC-H Homepage. 2025. Available online: https://www.tpc.org/tpch/ (accessed on 19 April 2025).
  10. The Linux Foundation. PyTorch. 2025. Available online: https://pytorch.org/ (accessed on 19 April 2025).
  11. Transaction Processing Performance Council (TPC). TPC BENCHMARKTM H (Decision Support) Standard Specification. 2022. Available online: https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-H_v3.0.1.pdfl (accessed on 19 April 2025).
  12. Deciso, B.V. Traffic Shaping—OPNsense Documentation. 2025. Available online: https://docs.opnsense.org/manual/shaping.html (accessed on 19 April 2025).
  13. Cisco Systems Inc. Cisco Catalyst 3K-X Service Module: Enabling Flexible NetFlow in the Access. 2011. Available online: http://www.ibbconsult.de/download/pdf/20CiscoCAT.pdf (accessed on 19 April 2025).
  14. Elasticsearch, B.V. NetFlow Input | Elastic Documentation. 2025. Available online: https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-netflow (accessed on 19 April 2025).
Figure 1. Bar chart illustrating the comparative performance of different scheduling algorithms on all TPC-H queries. The bar height shows the mean execution time, and the whiskers show the value distribution.
Figure 2. Node-to-node traffic heatmap during one TPC-H Spark experiment without network-aware scheduling.
Figure 3. Node-to-node traffic heatmap during one TPC-H Spark experiment with network-aware scheduling.
Figure 4. Total PyTorch training execution time (in seconds) under different scheduling algorithms.
Figure 5. Node-to-node traffic heatmap during one distributed PyTorch training experiment without network-aware scheduling.
Figure 6. Node-to-node traffic heatmap for one distributed PyTorch training experiment with network-aware scheduling.
Table 1. Extra network latency values added via traffic shaping.
          Region 1  Region 2  Region 3
Region 1      0 ms     15 ms     25 ms
Region 2     15 ms      0 ms     30 ms
Region 3     25 ms     30 ms      0 ms
