A Simulation-Based Latency-Aware Autoscaling Model for LMS Platforms in Kubernetes Environments

Marković, Marko; Soleša, Dragan; Karabašević, Darjan

doi:10.3390/electronics15112336

by Marko Marković ¹,*, Dragan Soleša ² and Darjan Karabašević ¹,³,*

¹ Faculty of Applied Management, Economics and Finance in Belgrade, University Business Academy in Novi Sad, Jevrejska 24, 11000 Belgrade, Serbia

² Faculty of Economics and Engineering Management in Novi Sad, University Business Academy in Novi Sad, Cvećarska 2, 21000 Novi Sad, Serbia

³ College of Global Business, Korea University, Sejong 30019, Republic of Korea

* Authors to whom correspondence should be addressed.

Electronics 2026, 15(11), 2336; https://doi.org/10.3390/electronics15112336

Submission received: 15 April 2026 / Revised: 22 May 2026 / Accepted: 26 May 2026 / Published: 28 May 2026

Abstract

Modern distance learning platforms represent important infrastructure in contemporary higher education, particularly during periods of intensive use such as examinations, assignment deadlines, and simultaneous access to learning materials. In such situations, Learning Management System (LMS) platforms may face sudden traffic spikes that can lead to increased latency, reduced availability, and service degradation. Traditional autoscaling mechanisms in Kubernetes commonly rely on CPU or memory utilization, which may react too late when overload first appears at the network or application layer. This paper proposes a simulation-based latency-aware autoscaling model for LMS platforms in Kubernetes-like cloud-native environments. The model uses network latency measured at the ingress layer as an early control signal for adaptive horizontal scaling. The proposed architecture conceptually integrates the NGINX Ingress Controller, Prometheus-based telemetry, a Custom Metrics Adapter, and the Horizontal Pod Autoscaler within a closed feedback loop based on the MAPE-K paradigm. The model was evaluated through a Python-based simulation that replicates bursty load conditions in an LMS environment, supporting up to 2000 concurrent users or requests per second. The simulation results indicate that the latency-aware approach can initiate scaling earlier than a traditional CPU-based approach under the defined workload assumptions. In the simulated environment, the latency-aware model reduced the time to the first scaling action from approximately 90 s in the CPU-based baseline to approximately 12–15 s under the same workload assumptions. This result should not be interpreted as a direct reduction in application response time, but as an earlier activation of the scaling mechanism in the simulation. 
Since the validation was carried out in a simulated environment, rather than in a real Kubernetes cluster, these results should be interpreted within the limits of the simulation assumptions. In future research, the proposed model can be implemented in a real Kubernetes cluster using NGINX Ingress, Prometheus, HPA, and load generation tools such as Locust, JMeter, or k6.

Keywords:

automatic scaling; microservices architecture; network latency; Kubernetes orchestration; LMS platforms

1. Introduction

Modern higher education institutions have increasingly adopted Learning Management Systems (LMSs) to support distance learning, as these platforms are widely used for distributing and tracking educational materials through digital channels [1,2]. Most institutions rely on ready-made, open-source technical solutions that can be tailored to their needs and that support the integration of various educational services [3]. For an LMS to meet all the requirements of an institution, it must be extended with plugins or through the integration of API services. This shows that LMS platforms have become an important part of modern education and support many key educational processes [4]. The increasing demand for such systems and the extent of their use lead to a key problem regarding the performance and infrastructure of an LMS. Numerous studies indicate that the choice of infrastructure (e.g., local installation versus a cloud environment) significantly affects the distribution of information, the response time of certain operations, and the overall loading time of the LMS environment.

In many cases, cloud solutions differ significantly in performance from local implementations [5]. In addition, institutions report that poor technology choices can hinder the successful implementation and use of a distance learning platform, leading to platform overload and occasional outages. Such situations often encourage the adoption of container and orchestration technologies to improve the availability and sustainability of the system [6]. The system must withstand the full range of user activity and be prepared for a sudden surge in concurrent users and requests [7]. Such load periods are identified as the main cause of traffic congestion and of the increased latency that users observe during page loading [8,9]. In high-demand architectures, systems can suffer significant degradation or even partial failure, which is why special architectural strategies are necessary to protect them during sudden load spikes [10]. In distributed web services, a sudden load that leads to failures is known as the Thundering Herd effect. It occurs when too many clients send identical requests to the system at the same time, producing an abrupt spike in requests instead of the preferable gradual increase. In practice, the components responsible for receiving and processing requests become overloaded and may show increased execution time or latency, or experience downtime because too few working units are available [11]. From a technical standpoint, a large number of simultaneous requests reach the load balancer, which at that moment becomes a chokepoint in the processing of user requests. It accepts all requests from users and distributes them to a limited number of instances [12].
Although it distributes requests across available instances, it cannot by itself increase the resources needed to process them. When the rate of incoming requests exceeds what the available instances can process, requests pile up and the waiting queues grow very large, so that most of the delay comes from waiting to be processed rather than from the processing itself. As backend instances are gradually scaled up, their internal state changes. This can lead to more errors and additional strain on a system that is already burdened with traffic control, monitoring, and overload recovery. All of these issues need to be identified when designing an architecture that must handle a large number of requests and sessions [13]. Microservices architectures suffer the same consequences, and the pressure is even more pronounced because the network communication between services, as well as the mechanisms for monitoring, routing, and error handling, are also burdened. A significant limitation of many implementations is that auto-scaling is driven by resource utilization thresholds: cloud deployments often add or remove instances automatically when CPU utilization crosses a certain threshold [14]. The problem can be viewed as a mismatch between the layers of the system: the network and traffic layer is the first to detect and suffer the consequences of a sudden spike in load, while the computing layer reacts only once the system is already overloaded.
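The queue build-up described above can be illustrated with a minimal sketch. It is not taken from the paper's simulation: the function name `simulate_backlog`, the one-second time step, and all numeric values are assumptions chosen only to show how a backlog grows when the arrival rate exceeds a fixed processing capacity.

```python
def simulate_backlog(arrivals_per_s, replicas, capacity_per_replica):
    """Return the number of queued requests after each one-second step.

    Assumption: every replica serves a fixed number of requests per second
    and the replica count does not change during the run.
    """
    capacity = replicas * capacity_per_replica
    backlog = 0
    history = []
    for arrivals in arrivals_per_s:
        # Requests that cannot be served within this second remain queued.
        backlog = max(0, backlog + arrivals - capacity)
        history.append(backlog)
    return history


# Hypothetical burst: 3 s at 100 req/s, 5 s at 1000 req/s, 3 s at 100 req/s.
arrivals = [100] * 3 + [1000] * 5 + [100] * 3
history = simulate_backlog(arrivals, replicas=2, capacity_per_replica=200)
print(history)
```

In this toy run the backlog keeps draining for several seconds after the burst has ended, which is the waiting-time effect the paragraph refers to: the load balancer keeps distributing requests, but it cannot add capacity.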

The primary goal of this research is to design and analyze a simulation-based model that integrates the network layer with the computing layer in a Kubernetes-like cloud-native environment. In this case, network intelligence is integrated at the very entry point of the Kubernetes environment through the Ingress/traffic-management component, while the computing layer is realized through container orchestration and automatic scaling mechanisms within the platform. This approach is based on research showing that distributed systems under heavy load experience sudden spikes that can lead to overload, service degradation, or temporary service unavailability. Therefore, it is necessary to manage the load in a balanced way and to combine load management with auto-scaling in an integrated mechanism capable of continuous monitoring [15]. Traffic management and routing at the entry point must therefore play an even greater role in maintaining service availability under variable load.

The motivation for this research arises from the need for LMSs to be resilient during critical moments of the educational process that involve high loads. It is particularly important to address scenarios in which classical auto-scaling techniques based on resource metrics do not respond in time to prevent system degradation or service failure.

The research gap lies in the fact that existing studies largely analyze the performance of LMS platforms from the perspective of infrastructure choice, cloud implementation, containerization, or the application of standard auto-scaling processes, while integrated approaches in which latency is used as a trigger for adaptive scaling are significantly less frequently addressed. In this context, latency can be treated as an early control variable because it reflects increased user activity before CPU utilization reaches the scaling threshold. In other words, the connection between the intelligence of the network layer for traffic management and the resource orchestration mechanisms in the context of the resilience of LMSs to sudden spikes in load has not been sufficiently explored.

Analyzing all these assumptions, the work is focused on designing and analyzing a model that integrates the network with the computing layer within a Kubernetes-like simulated environment, with the aim of recognizing traffic congestion anomalies at the earliest possible stage and responding as efficiently as possible. Instead of relying on CPU utilization, the proposed approach uses the measured network latency at the very entry point as an early signal to activate horizontal scaling mechanisms.

The key contributions of this work are:

The proposed conceptual model for LMS environments integrates ingress-layer latency and orchestration mechanisms in a Kubernetes-like simulated cloud-native environment in order to examine autoscaling behavior under sudden load spikes.

In the proposed architecture, latency is introduced at the entry point as a control mechanism for activating adaptive horizontal scaling. In the simulation, this signal enabled earlier activation of the scaling mechanism than the CPU-based baseline under the same workload assumptions.

A management mechanism based on the MAPE-K paradigm (Monitor-Analyze-Plan-Execute over Knowledge) is defined, which enables monitoring of the system’s state, analysis of load conditions, planning of responses, and execution of scaling actions in a closed control loop [16].

A simulation-based evaluation of the proposed approach was conducted in a Python 3.13 environment using NumPy 2.2 and Matplotlib 3.10.0. The results were compared with a CPU-threshold-based autoscaling baseline implemented within the same Python simulation. Both models use the same workload, replica limits, and simulated service-capacity assumptions, while they differ only in the metric used to trigger scaling. The baseline, therefore, represents a CPU-based autoscaling policy, not a full monolithic or centralized production architecture [17].

The contribution of this paper is not the introduction of latency-based autoscaling as a completely new concept, but its application and simulation-based evaluation in the context of LMS platforms exposed to sudden and intensive traffic peaks.

The novelty of this paper should therefore be interpreted in a limited and specific sense. More specifically, the contribution of the paper is limited to the definition of a simplified threshold-based feedback controller and its use in a controlled Python simulation to compare CPU utilization and ingress-layer latency as autoscaling triggers under identical LMS burst-load assumptions. The study does not claim to introduce latency-based autoscaling as a completely new autoscaling paradigm, nor does it provide production-level validation in a real Kubernetes cluster. The paper does not propose an optimized, predictive, or production-ready autoscaling algorithm. Therefore, the results should be understood as evidence of the behavior of the proposed control logic under defined simulation assumptions, rather than as general proof of performance improvement in Kubernetes-based LMS deployments.

The paper is structured as follows. This introduction provides the motivation for the research, identifies the research gap, and presents the goals and contributions of the work. The second part examines the relevant literature on microservice architecture, containerization, Kubernetes autoscaling, and traffic management in cloud systems, with a particular focus on the limitations of existing approaches. The third part presents the proposed architecture along with a description of the model based on the MAPE-K loop. The fourth part presents the experimental environment, research methodology, and load scenarios. The fifth part examines and analyzes the results of the simulation-based research. The conclusion provides guidelines for future work, the limitations of the research, and a summary of the entire study.

2. Literature Review

2.1. Architectural Evolution: From Monolithic to Microservice Systems

Monolithic architecture represents the most commonly used approach in software system design, forming a single logical unit in which data access, code logic, and rules are interconnected. The advantage of this approach is the simple development and deployment of the system. However, the relevant literature shows that monolithic architecture has significant limitations in maintenance, flexibility, and scalability, which are most noticeable under conditions of increasing load and system complexity [18,19]. A major problem with monolithic architecture is the tight coupling between system components and the inability to modify or maintain just one part of the application. Furthermore, under increased load, it is often necessary to scale the entire application, even when only one of its subsystems has become a critical bottleneck. As a result, resource costs increase and the efficiency of system management decreases [20]. The microservices approach was developed to address these limitations. Microservice architecture allows an application to be divided into multiple independent services, each of which has a precisely defined functionality. Research indicates that this approach to application design enhances modularity, allows for independent deployment of components, and enables selective scaling of only those parts of the system that are under load at that moment [21]. Research comparing distributed and centralized approaches also indicates better system resilience and significantly greater fault tolerance for functionally separated services within Kubernetes environments.

However, microservice architecture is not a universal solution. Although it brings advantages in terms of scalability and fault isolation, it also increases the complexity of network communication, traffic monitoring, routing, and coordination between services, because in-process interactions are replaced by distributed network calls [22]. The transition from a monolithic to a microservices architecture therefore concerns not only the structure of the code but also a general operational shift in the design of distributed systems, in which all modules must coordinate with a larger number of network-connected components under dynamic load.

2.2. Containerization and Orchestration in Cloud-Native Environments

The development of microservice architectures is largely associated with the adoption of containerization. Containerization is a technique for packaging an entire application, including all libraries and configurations, into a single container, i.e., an isolated and portable environment. Compared to virtual machines, containers do not require a separate operating system to be launched, which reduces resource consumption, shortens startup times, and makes deployment more efficient [23]. One of the best-known containerization technologies is Docker, which is often used for microservices architectures because it allows each container to behave the same way in different environments [24]. While containers simplify the deployment and isolation of applications, operating them in systems with a larger number of applications can pose challenges. Containers must be organized so that they run on the right machines, communicate with each other, scale when the load begins to increase, and remain available at all times [25]. Today, Kubernetes (K8s) is the standard for container orchestration. Kubernetes is designed to let microservices operate in the same predefined manner across environments, adapt automatically to the environment and traffic, and continue serving requests when the number of users grows [26,27]. In this environment, the basic unit of management is the Pod, the smallest deployable unit, which can contain one or more containers that share the same network resources and storage. The number of pods is adjusted according to the load, and the pod is also the primary controlled unit for scaling [28].
Pods run on nodes, the machines that make up the cluster infrastructure, whose hardware characteristics and available resources directly affect the system’s capacity as well as the efficiency of orchestration and resource allocation [29,30]. It is important to emphasize that merely adopting Kubernetes does not guarantee optimal system performance. Orchestration provides flexibility, availability, and control, but actual efficiency depends on its configuration, the method of resource allocation, and the management of traffic flows. Therefore, in the analysis of cloud-native systems, Kubernetes should not be viewed in isolation, but rather in interaction with load metrics and the network layer [31].

2.3. Autoscaling in Kubernetes Environments

For adaptation in a Kubernetes environment to be effective, automatic scaling techniques are needed that align the number of active system instances with the current load. When pods are created and removed dynamically during scaling, stable communication between components is ensured through Kubernetes services. The Horizontal Pod Autoscaler (HPA) is the standard mechanism for automatically adjusting the number of pods based on load metrics. It most commonly relies on utilization thresholds, with CPU and memory metrics used most frequently [32], which can lead to delayed responses during sudden spikes in demand. The advantage of this approach lies in its simple implementation and good integration with the Kubernetes ecosystem. However, several authors note the limitations of the resource-based approach in situations of sudden and short-term increases in the number of requests. In such conditions, network and application symptoms of overload, such as increased latency and congestion at the entry point, can appear before the CPU load reaches the threshold necessary to activate the HPA mechanism. This means that the system can enter a state of performance degradation before standard scaling responds at all [33]. For this reason, increasing attention is being paid to approaches that, in addition to infrastructure metrics, also consider metrics related to traffic and service quality. Although such approaches are conceptually significant, their application in the context of LMSs and burst-load scenarios is still insufficiently systematized. In particular, it remains under-researched to what extent latency measured at the very entry point of the system can serve as an early and operationally useful signal for activating the scaling mechanism [34].
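As an illustration of how a metric value turns into a replica count, the sketch below applies the proportional rule documented for the Kubernetes HPA, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with a tolerance band around the target. The function name, the replica limits, and the example values are assumptions for illustration; the rule is metric-agnostic, so the same calculation applies whether the metric is CPU utilization or an ingress latency exposed as a custom metric.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Proportional HPA-style rule, clamped to the allowed replica range.

    min_replicas, max_replicas and the example inputs are illustrative
    assumptions, not values taken from the paper.
    """
    ratio = current_metric / target_metric
    # Inside the tolerance band no scaling action is taken.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# CPU-based trigger: 90% average utilization against a 50% target.
cpu_result = desired_replicas(2, current_metric=90, target_metric=50)

# Latency-based trigger: 400 ms measured against a 200 ms target.
latency_result = desired_replicas(2, current_metric=400, target_metric=200)

print(cpu_result, latency_result)
```

The formula reacts only to the metric it is given; if that metric rises late, the computed replica count rises late as well, which is the limitation discussed above.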

2.4. Network Layer and Load Balancing in Cloud-Native Systems

In cloud-native systems, the network layer serves not only for connectivity but also as an active control point, as it affects the performance, availability, and resilience of the system. Microservices communicate exclusively over the network, and all their requests must be routed and monitored under conditions of dynamic scaling and variable load [35]. As Kubernetes environments grow in scale and complexity, it is essential to ensure that the network is precisely configured and that traffic is managed with regard to latency, throughput, and service reliability, especially in distributed systems that adjust in real time and in which sudden spikes in demand are expected [36]. A key distinction is made between layer 4 and layer 7 load balancing. Layer 4 load balancing operates at the transport layer (TCP/UDP) and distributes connections without understanding the application content, while layer 7 load balancing operates at the application layer (most commonly HTTP/HTTPS) and can make decisions based on requests, URL paths, routes, headers, or other parameters [37]. Analyses show that the efficiency of cloud load balancing is measured not solely by resource utilization but also by latency and scalability, metrics that are directly related to the application layer and are therefore better supported by layer 7 mechanisms. This is particularly relevant for web applications, where requests vary in type and complexity, so distributing load per request protects the user experience more precisely than distributing it per connection. In Kubernetes environments, the Ingress Controller operates at layer 7 and is the main entry point for HTTP/HTTPS traffic to applications running within the cluster.
This means that all external traffic first arrives at the Ingress Controller and only then is forwarded to the intended service. The controller is therefore not just a pass-through for requests but a central point that performs routing, security enforcement, load distribution, and access control to resources. It is a key infrastructure component for the performance and availability of the system, and its efficiency is measured through latency under given request conditions [38]. Poorly configured network settings increase latency and reduce throughput even in applications that have sufficient compute resources, whereas a well-optimized network infrastructure avoids this source of degradation. In other words, the network layer in such an architecture directly affects how quickly and reliably the user receives a response, which is why the network becomes a key factor rather than just a background mechanism. The NGINX Ingress Controller operates at layer 7, that is, at the application level of the network model, and analyzes HTTP/HTTPS requests rather than just network connections. With this layer 7 capability, NGINX Ingress can make routing decisions based on the URL, headers, request type, or other parameters, thereby enabling flexible traffic management within a Kubernetes cluster [39]. Unlike mechanisms that operate at a lower layer and forward traffic without inspecting its content, layer 7 enables functionalities such as intelligent load balancing, TLS termination, and the application of security policies. This makes it a key element for controlling the performance, security, and availability of services in the cloud. On the other hand, layer 4 load balancers do not have insight into the response time of individual requests, as they operate exclusively at the transport level and have no visibility into the application layer [40].
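The difference between the two layers can be sketched in a few lines. The example below is purely illustrative: the `Request` type, the backend and service names, and the hash-based selection are assumptions, and it does not model how NGINX Ingress is implemented. It only shows that a layer 4 decision sees the connection tuple, while a layer 7 decision can inspect the request path.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    client_ip: str
    client_port: int
    path: str


def l4_pick(request, backends):
    """Layer 4 style: only the connection tuple is visible, not the path."""
    key = f"{request.client_ip}:{request.client_port}".encode()
    return backends[zlib.crc32(key) % len(backends)]


def l7_pick(request, routes, default):
    """Layer 7 style: route on the URL path prefix of the HTTP request."""
    for prefix, service in routes:
        if request.path.startswith(prefix):
            return service
    return default


# Hypothetical backends and routes for an LMS deployment.
backends = ["pod-a", "pod-b", "pod-c"]
routes = [("/quiz", "quiz-service"), ("/files", "file-service")]

quiz = Request("10.0.0.1", 5000, "/quiz/attempt")
upload = Request("10.0.0.1", 5000, "/files/upload")

# Same connection tuple gives the same layer 4 choice, whatever the path is.
print(l4_pick(quiz, backends) == l4_pick(upload, backends))
# Layer 7 separates the two requests by path.
print(l7_pick(quiz, routes, "lms-web"), l7_pick(upload, routes, "lms-web"))
```

Because the layer 7 component handles each request individually, it is also the natural place to time requests, which is what makes ingress-level latency available as a control signal.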

This is particularly significant in systems with sudden spikes in load, where the consequences of overload first become visible at the network layer, while mechanisms based solely on resource metrics often respond with a delay. In this sense, the literature supports the view that the network layer in cloud-native environments should not be regarded as a secondary infrastructure mechanism, but rather as an active control point that can play an important role in adaptive system management.

2.5. Autonomous Computing and Feedback Loops

Autonomous systems are designed to operate with minimal human intervention and support by continuously adapting to changes in their environment and their own state. In terms of networking and orchestration, these systems are often characterized as self-managing because they can automatically respond to any changes and variable loads, which justifies their goal of maintaining performance and availability without manual intervention. This is important in cloud environments, where workloads can often vary and where static resource allocation or manual scaling are often insufficient to maintain consistent service [41].

For this reason, modern cloud platforms increasingly utilize automated processes such as autoscaling and policy-driven orchestration in order to keep the system stable and efficient under resource-constrained conditions. Such automation involves the use of the MAPE-K model, which is regarded as a reference framework for designing systems with a closed control loop.

MAPE-K describes a continuous cycle in which the system monitors its state, analyzes deviations from expected behavior, plans actions, and then executes them to keep the system operating optimally, using a knowledge base that stores information about previous actions. This principle of autonomous computing is associated with specific mechanisms in distributed systems. Monitoring involves the collection of metrics through sensors, that is, parameters that describe the system state and external demand, of which the most critical are resource utilization, information flow, and latency. The analysis of the collected data serves to uncover potential system disruptions. In cloud environments, such analysis often boils down to checking whether the defined thresholds have been exceeded or whether there is a risk of system degradation [42].

The planning phase specifies actions such as changing constraints or redistributing load in order to restore the system to a desirable functional state. The execution phase then applies the planned decisions through mechanisms that keep the environment running smoothly, for example, through APIs that create or remove instances. Modern autoscaling can therefore be viewed as a practical application of the MAPE-K closed control loop, as it operates through continuous monitoring of the system’s state, analysis of data, and dynamic resource allocation. Research in the Kubernetes environment has shown that the Horizontal Pod Autoscaler (HPA) functions as a reactive control loop, as scaling decisions are made only after the recorded metrics exceed defined thresholds. The choice of the observed metric (CPU utilization versus latency), together with the defined control policies, has a crucial impact on the system’s response and its ability to sustain itself during unpredictable changes in load [43].
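The four phases and the shared knowledge can be sketched as a small threshold-based loop. This is a minimal illustration under stated assumptions, not the authors' controller: the `Knowledge` fields, the 200 ms threshold, the 30 s cooldown, the one-replica step, and the restriction to scale-up actions are all hypothetical choices made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    # All values below are illustrative assumptions.
    latency_threshold_ms: float = 200.0
    max_replicas: int = 10
    cooldown_s: float = 30.0
    last_scale_time: float = float("-inf")
    history: list = field(default_factory=list)


def monitor(sample):
    """Monitor: read the latency metric from a telemetry sample."""
    return sample["latency_ms"]


def analyze(latency_ms, k):
    """Analyze: compare the metric with the stored threshold."""
    return latency_ms > k.latency_threshold_ms


def plan(overloaded, replicas, now, k):
    """Plan: add one replica unless in cooldown or at the upper limit."""
    in_cooldown = now - k.last_scale_time < k.cooldown_s
    if overloaded and not in_cooldown and replicas < k.max_replicas:
        return replicas + 1
    return replicas


def execute(target, replicas, now, k):
    """Execute: apply the change and record it in the knowledge base."""
    if target != replicas:
        k.last_scale_time = now
        k.history.append((now, target))
    return target


def mape_k_step(sample, replicas, now, k):
    return execute(plan(analyze(monitor(sample), k), replicas, now, k),
                   replicas, now, k)


knowledge = Knowledge()
replicas = 2
# Hypothetical latency samples taken every 10 s.
for now, latency in zip(range(0, 50, 10), [120, 250, 300, 310, 320]):
    replicas = mape_k_step({"latency_ms": latency}, replicas, now, knowledge)

print(replicas, knowledge.history)
```

The cooldown stored in the knowledge base is what prevents the loop from adding a replica on every overloaded sample; a scale-down rule would follow the same pattern with a lower threshold.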

If the selected metrics do not represent the actual state of the application, a delayed response or an inadequate assessment can lead to the failure of the entire system. An important engineering challenge therefore lies not only in the control loop itself but also in how it is shaped: how the metrics are monitored, how their data are analyzed, and according to what rules actions are executed. It is precisely these elements that determine whether the system will function stably or whether a heavy load will negatively impact the autonomy of the entire system [44].

The presentation of the MAPE-K model, including the mathematical formalization and pseudocode of the adaptive scaling mechanism, is provided in the next chapter, where the model is presented as a closed control loop with functions for monitoring, analysis, planning, and execution of operations. Figure 1 shows the Control System for Latency Management in a cloud environment. The figure illustrates the feedback loop used for latency-based control, starting from metric monitoring and continuing through analysis, planning, and execution. The monitoring component collects latency and resource-related data, while the analysis and planning components determine whether a scaling action is required. The execution component then applies the selected action by adjusting the number of active replicas.

2.6. Limitations of Existing Approaches and Positioning of Research

Existing studies show that autoscaling in cloud-native environments is most often based on infrastructure metrics, usually CPU and memory utilization. This approach is easy to implement in Kubernetes HPA, but it is often insufficient in situations where the first signs of overload appear at the network level. This is particularly important for systems that experience sudden traffic spikes, since latency and traffic congestion can increase before CPU utilization reaches the defined scaling threshold. A review of the literature indicates that latency is often used as a performance indicator for assessing system behavior, while less attention has been paid to its use as a direct signal for autoscaling. Predictive scaling and custom metrics have been considered in various cloud-native environments and microservice architectures, but their application to LMS platforms is still not sufficiently examined. LMSs have a specific workload pattern, especially during examinations, assignment submission deadlines, and simultaneous access to learning materials. In such situations, a large number of users generate many requests within a short time interval, which can increase service response time and reduce the availability of the LMS platform. For this reason, this paper focuses on the use of ingress-layer latency as an early scaling signal in a simulated LMS burst-load scenario, without claiming that latency-based autoscaling is an entirely new concept. To avoid ambiguity, the contribution of this paper is limited to the simulation-based comparison of two autoscaling control signals under the same LMS burst-load assumptions: CPU utilization and ingress-layer latency. The study does not compare a real Kubernetes deployment against a monolithic production system, nor does it isolate the full architectural benefit of microservices. 
Instead, it examines whether latency measured at the simulated ingress layer can activate horizontal scaling earlier than CPU utilization in the defined workload scenario.

3. Research Methodology

3.1. Research Approach and Experimental Design

This research uses a simulation-based experimental approach to investigate the behavior of the proposed latency-aware autoscaling model under controlled load conditions. The model was not deployed in a real Kubernetes cluster but was implemented as a Python-based simulation that represents the logical components of a cloud-native LMS environment. These components include ingress-level latency monitoring, metrics collection, scaling decision logic, and horizontal replica adjustment. In this paper, the term Kubernetes environment refers to a Kubernetes-like cloud-native model represented in simulation rather than a deployment in a real production Kubernetes cluster. The simulated LMS workload represents aggregated request patterns related to common LMS activities, including simultaneous user login, access to course materials, opening of online assessments, and assignment submission. These activities were not modeled as separate LMS functional modules but were represented through gradual and burst request patterns in order to capture the short-term load behavior typical of examinations, deadlines, and simultaneous access to learning resources.

In the simulation, the LMS workload was interpreted as an aggregated request stream composed of several typical academic activities rather than as a set of separately modeled LMS functions. The burst phase was assumed to be dominated by simultaneous user logins and access to examination or course-content pages, since these actions commonly occur within a narrow time window at the beginning of an online examination or scheduled learning activity. During the middle part of the scenario, the workload was assumed to include repeated content access, page navigation, assessment interactions, and background requests generated by the LMS interface. Toward the latter part of the scenario, assignment submission or file-upload activity was considered as an additional source of short-term pressure on the system. These activities were not assigned separate request weights or independent service models in the present simulation. Instead, they were represented as a combined traffic pattern in order to keep the comparison focused on the selected autoscaling trigger, namely CPU utilization versus ingress-layer latency.

The reason for using a simulation environment is to provide repeatable conditions in which different loads and traffic patterns can be examined, together with latency thresholds, cooldown periods, and scaling conditions. This also avoids the additional variability introduced by real cluster configuration, node performance, container startup behavior, and provider infrastructure. In this way, the simulation allows different scaling mechanisms to be compared under the same service load conditions.

The experimental research included two approaches. The baseline model is defined as a CPU-threshold-based autoscaling policy within the same Python simulation environment. It is not used as a representation of a complete monolithic LMS architecture. The difference between the proposed model and the base case lies in the scaling trigger, where the base case relies on CPU usage, whereas the proposed model relies on ingress-layer latency. This setup makes it possible to isolate the impact of the selected control signal. The first approach therefore represents a traditional CPU-based scaling model, where scaling decisions are made only after a defined infrastructure load threshold is exceeded. The second approach is based on measuring network latency at the ingress layer, which is then used as a trigger for activating adaptive horizontal scaling. The same load scenario was applied to both models in order to compare system behavior during sudden load spikes. The primary reported variables in this simulation were average latency, replica count, threshold-crossing time, and time to the first scaling action after a load spike. Throughput, error rate, and full recovery time are recognized as important service-quality indicators, but they were not evaluated with the same depth in the present simulation and are therefore treated as limitations and targets for future real-cluster validation. This approach made it possible to assess how the selected scaling trigger affects autoscaling reaction and latency behavior under burst-load conditions.

3.2. Architecture of the Experimental Environment

Figure 2 shows the architecture of the experimental environment, which is designed as a multilayer model that illustrates the key functions of its components. This model is based on the assumption that traditional approaches relying on CPU and memory metrics are not always adequate for maintaining quality of service under dynamic and unpredictable load spikes. At the Data Plane layer, all traffic passes through the NGINX Ingress Controller, which performs TLS termination and routes requests to the appropriate microservices using layer 7 routing. In this model, the ingress layer is not only a network gateway but also a key point for measuring performance, as it enables the collection of latency-related data that reflect the user experience. The ingress layer therefore becomes the source of important information about system behavior under load. The Telemetry Pipeline layer bridges network telemetry and orchestration decision-making. NGINX Ingress provides latency-related metrics [45], while Prometheus collects data periodically at defined time intervals. Unlike the standard approach, where HPA relies mainly on resource metrics, the proposed model uses a Custom Metrics Adapter that normalizes the data through defined PromQL queries. In this way, latency is transformed into an indicator suitable for control logic, enabling the transition from a resource-based to a performance-oriented management model. Within the Control Plane layer, the Horizontal Pod Autoscaler is represented as part of a closed MAPE-K cycle. In the monitoring phase, the system tracks deviations from the defined latency threshold, after which the analysis phase assesses the degree of performance degradation. If a deviation from the defined value is detected, an appropriate system response in the form of scaling is planned and then executed by adding or removing pods in the simulation model. This closed control loop enables dynamic adaptation of instances in accordance with the load and perceived quality of service. In this way, Figure 2 explains how the network and orchestration layers are connected, with latency serving as the primary control signal for resource allocation.

3.3. Configuration of the Experimental Environment

In order to make the testing procedure repeatable, an experimental environment was defined through several parameters, including the logical structure of the cluster, available resources, network communication, and automatic scaling mechanisms. This environment does not represent a production Kubernetes cluster. Instead, it is modeled as a simulation-based cloud-native LMS environment that allows the behavior of different scaling mechanisms to be examined under controlled conditions. The simulation environment is modeled as a Kubernetes-like cluster consisting of one control-plane node and three worker nodes. In the simulation, each worker node has limited memory and processing capacity. This allows the observation of how horizontal scaling affects system stability and resource availability. The service itself starts with the minimum number of replicas defined in the simulation, while the upper replica limit is predefined according to the modeled capacity and available system resources. The network layer was configured to represent HTTPS communication through an ingress component and application-layer traffic routing, with bandwidth limitations modeled in line with typical cloud-native environments [46]. Network behavior included baseline latency, variations in delay during traffic growth, and the interval at which metrics were collected. These values were used as parameters of the Prometheus-like telemetry model. A latency threshold was defined as the main control parameter for activating the latency-aware scaling mechanism. To reduce oscillatory behavior, a cooldown period and limits on the minimum and maximum number of pods were also introduced. Load scenarios were generated in a Python environment as sequences of requests with gradual growth and sudden traffic spikes [47]. This enabled the examination of system behavior during stable traffic, short-term overload, and recovery after a load spike. Locust, JMeter, and k6 were not used in the current simulation. 
They are listed only as recommended load-generation tools for future validation in a real Kubernetes cluster. The defined configuration allows for reproducibility of the experiment and positions the results within the context of a limited, yet representative, cloud-native LMS environment.
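The load scenarios described above (stable traffic, gradual growth, a sudden spike, and recovery) can be sketched as a per-second request-rate profile. This is a minimal illustration, not the authors' generator: the function name `lms_load_profile`, the phase durations, and the base and ramp rates are hypothetical; only the 2000 requests-per-second peak comes from the text.

```python
def lms_load_profile(stable_s=20, ramp_s=20, burst_s=15, recovery_s=25,
                     base_rps=200, ramp_rps=800, peak_rps=2000):
    """Return a list with one requests-per-second value per simulated second.

    All durations and the base/ramp rates are illustrative assumptions;
    only peak_rps=2000 mirrors the upper load bound stated in the paper.
    """
    profile = []
    # Phase 1: stable traffic at the base rate.
    profile += [base_rps] * stable_s
    # Phase 2: gradual linear growth from base_rps up to ramp_rps.
    for i in range(ramp_s):
        profile.append(round(base_rps + (ramp_rps - base_rps) * (i + 1) / ramp_s))
    # Phase 3: sudden burst, e.g. the start of an online examination.
    profile += [peak_rps] * burst_s
    # Phase 4: recovery back to the base rate.
    profile += [base_rps] * recovery_s
    return profile


if __name__ == "__main__":
    load = lms_load_profile()
    print(len(load), min(load), max(load))
```

Because the same list can be fed to both the CPU-based and the latency-aware policy, a profile of this kind keeps the comparison focused on the scaling trigger.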

Table 1 presents the main parameters used to define the simulation environment and the control logic of the proposed model.

The parameters used in the simulation approximate a realistic load scenario for the LMS platform during a period of load growth. The upper bound of the load range of up to 2000 concurrent users or requests per second represents aggregated values across scenarios, most commonly when students are accessing online exams, just before an assignment deadline, or during concurrent access to course materials. The latency threshold of 500 ms was used as a practical latency boundary, as users might already notice delays above this value during LMS operations such as logging in, opening assessment pages, navigating learning materials, or uploading files. The 15 s metric collection interval was a compromise between timely detection and the need to avoid excessive telemetry noise for the tracking-based autoscaling process. The 10 s cooldown period was introduced to avoid repeated scaling decisions resulting from short-term latency fluctuations. The replica range of 1–6 pods represents a small to medium-sized institutional LMS installation, where horizontal scaling is possible but is still limited by finite computing capacity. Finally, the simulated pod activation delay represents the time lag between the scaling decision and the time an additional replica becomes available to process LMS requests. These assumptions do not claim to reproduce a specific production LMS installation but provide a controlled and repeatable approximation of the bursty load conditions typically observed during exams, assignment submissions, and synchronized access to learning resources.
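One way to keep these assumptions explicit and reusable is to collect them in a single immutable configuration object. The sketch below is illustrative only: the class and field names are hypothetical, the numeric defaults are the values stated in the paragraph above, and the pod activation delay is left unset because its value is not given in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SimulationConfig:
    """Hypothetical container for the simulation parameters named in the text."""
    latency_threshold_ms: float = 500.0   # latency threshold from the text
    metric_interval_s: float = 15.0       # metric collection interval from the text
    cooldown_s: float = 10.0              # cooldown period from the text
    min_replicas: int = 1                 # lower bound of the 1-6 replica range
    max_replicas: int = 6                 # upper bound of the 1-6 replica range
    max_load_rps: int = 2000              # upper bound of the load range
    # Not stated in this section, so it must be supplied by the caller.
    pod_activation_delay_s: Optional[float] = None

    def __post_init__(self):
        # Basic sanity checks so that invalid configurations fail early.
        if not 1 <= self.min_replicas <= self.max_replicas:
            raise ValueError("require 1 <= min_replicas <= max_replicas")
        if self.latency_threshold_ms <= 0 or self.metric_interval_s <= 0:
            raise ValueError("threshold and interval must be positive")
        if self.cooldown_s < 0:
            raise ValueError("cooldown must be non-negative")


if __name__ == "__main__":
    print(SimulationConfig())
```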

In this study, the delay in activating the pod is considered a simulation parameter in the model rather than the actual Kubernetes pod start-up time. It refers to the gap between the point at which a scaling decision is made and the time when a new pod is ready to process requests. Such a simplification was made since the model does not simulate all the processes performed in a real Kubernetes cluster.

3.4. Telemetry Scaling Control Model

The proposed model is based on the assumption that traditional approaches relying mainly on CPU and memory utilization may not respond quickly enough during unpredictable load spikes [48]. For this reason, this paper examines an approach that uses network latency measured at the entry point of the system as the main trigger for adaptive resource management. At the system entry point, all traffic passes through the NGINX Ingress Controller, which performs TLS termination and routes requests to the appropriate microservices using layer 7 routing. At the data plane layer, data on system behavior are also collected by measuring latency and other indicators that reflect the user experience. A telemetry pipeline connects the networking and orchestration layers: NGINX Ingress exposes raw metrics at the entry point, Prometheus periodically collects and aggregates them, and the Custom Metrics Adapter converts them into a format suitable for the Kubernetes Custom Metrics API. In this way, latency becomes a signal suitable for the control logic of the Horizontal Pod Autoscaler. Within the control plane layer, the Horizontal Pod Autoscaler operates as a closed-loop adaptation cycle. In the monitoring phase, deviations of measured latency from the defined threshold are tracked, and the analysis phase assesses the degree of degradation. If the allowed values are exceeded, the planning phase defines appropriate scaling actions, which the execution phase implements through the Deployment controller by adding or removing pods. With this approach, the model integrates the network and orchestration layers, with latency set as the primary trigger for resource allocation.

Figure 3. Conceptual architectural model of automatic horizontal scaling based on application-level metrics.

3.5. Formal Model of a Closed Control Loop

As part of this research, a formal model of an adaptive feedback-based control mechanism has also been defined. It dynamically manages the modeled infrastructure capacity of the LMS under variable load, as presented in Figure 4. The control model comprises the phases of monitoring, analysis, planning, and execution [49], which together ensure continuous adjustment of the number of microservice replicas. The monitoring phase collects performance metrics of the simulated microservice cluster, primarily network latency and CPU utilization, and forwards them to the decision-making layer. In the analysis phase, the measured latency values are compared with the defined threshold to determine whether the system is currently in a stable or overloaded state. If performance degradation is detected, a horizontal scale-out is activated; if the system is under reduced load, capacity reduction is initiated and unnecessary resources are removed. To minimize oscillatory behavior [50], the model includes stabilization mechanisms: a cooldown, which is the minimum period between two scaling decisions, and limits on the minimum and maximum number of active replicas. In the simulation, scaling is performed by increasing or decreasing the number of active replicas by one step, while respecting the defined minimum and maximum replica limits and the cooldown period. The decision made in the control mechanism is then forwarded to the execution layer, where the number of active replicas is modified within the simulation model. In a real Kubernetes implementation, this action would correspond to changing the replica count through the Kubernetes API, thereby closing the feedback loop.
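For contrast with the one-step rule used in the simulation, the Kubernetes documentation describes the Horizontal Pod Autoscaler's core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance of 1.0. The sketch below is a simplified rendering of that rule, not the simulation used in this paper: it ignores stabilization windows, pod readiness, and missing metrics, and the function name and the 1 to 6 replica limits are assumptions taken from the simulated configuration.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=6, tolerance=0.1):
    """Simplified proportional HPA rule (assumption: no stabilization window).

    tolerance=0.1 mirrors the commonly documented default; min/max mirror
    the 1-6 replica range of the simulated configuration.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica limits.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 replicas, 900 ms measured latency against a 500 ms target.
    print(hpa_desired_replicas(2, 900, 500))
```

With 2 replicas and 900 ms measured against a 500 ms target, this rule requests 4 replicas in a single decision, whereas the one-step rule in the simulation would move from 2 to 3.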

3.6. Formal Representation of the MAPE-K Model

Let the state of the system at time t be defined by a vector:

S(t) = {L(t), C(t), M(t), P(t)}

(1)

where

L(t) denotes the measured latency,

C(t) denotes processor utilization,

M(t) denotes memory utilization,

P(t) denotes the number of active pods in the system.

The monitoring phase can be represented as a function of collecting metrics:

\mathrm{Mon}(t) \to S(t)

(2)

The analysis phase examines whether the desired operating condition of the system has been violated:

A(t) = \begin{cases} 1, & \text{if } L(t) > L_{thr} \\ 0, & \text{otherwise} \end{cases}

(3)

where L_{thr} is a predefined latency threshold. In this model, latency is used as the primary control signal for the proposed scaling mechanism, while CPU utilization is used for comparison with the traditional scaling approach.

The planning phase defines the target number of pods according to the current latency state, cooldown condition, and the minimum and maximum replica limits:

\mathrm{Plan}(t) = \begin{cases} \min(P_{max}, P(t) + \Delta p), & \text{if } A(t) = 1 \text{ and } T(t) \geq T_{cooldown} \\ \max(P_{min}, P(t) - \Delta p), & \text{if } A(t) = 0 \text{ and } T(t) \geq T_{cooldown} \\ P(t), & \text{otherwise} \end{cases}

(4)

where Δp indicates the scaling step, P_{min} and P_{max} represent the minimum and maximum number of active pods, T(t) denotes the time elapsed since the last scaling decision (current_time - last_scale_time in Algorithm 1), and T_{cooldown} represents the minimum time interval between two scaling decisions. In this simulation, the scaling step is defined as one pod.

The execution phase applies a planned action to the system:

\mathrm{Exec}(t): P(t + 1) = \mathrm{Plan}(t)

(5)

The knowledge component K contains historical metric values, decision rules, threshold values, cooldown information, and information about previous scaling actions. Therefore, the complete model can be viewed as

\text{MAPE-K} = (\text{Monitoring}, \text{Analysis}, \text{Planning}, \text{Execution}, \text{Knowledge base})

(6)

This formalization presents the adaptive scaling mechanism as a closed control loop, in which the decision to act depends on monitoring and analyzing system behavior in real time. In the simulation model, the number of active pods changes only when the latency and cooldown conditions are met, while the number of replicas remains within the defined minimum and maximum limits.
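Equations (3) to (5) can be read as three small pure functions. The sketch below is one possible rendering under stated assumptions: the function names are hypothetical, and `elapsed_s` stands for T(t), interpreted here as the time since the last scaling decision.

```python
def analyze(latency_ms, threshold_ms):
    """Eq. (3): return 1 if the latency threshold is violated, otherwise 0."""
    return 1 if latency_ms > threshold_ms else 0


def plan(pods, violated, elapsed_s, cooldown_s, p_min, p_max, step=1):
    """Eq. (4): target pod count given the analysis result and cooldown state.

    elapsed_s is an assumed reading of T(t): time since the last scaling decision.
    """
    if elapsed_s < cooldown_s:
        return pods                      # cooldown not yet elapsed: hold
    if violated == 1:
        return min(p_max, pods + step)   # scale out, bounded by P_max
    return max(p_min, pods - step)       # A(t) = 0: scale in, bounded by P_min


def execute(pods, latency_ms, threshold_ms, elapsed_s, cooldown_s, p_min, p_max):
    """Eq. (5): P(t + 1) = Plan(t)."""
    return plan(pods, analyze(latency_ms, threshold_ms),
                elapsed_s, cooldown_s, p_min, p_max)


if __name__ == "__main__":
    # 650 ms against a 500 ms threshold, cooldown elapsed: 1 pod becomes 2.
    print(execute(1, 650, 500, elapsed_s=15, cooldown_s=10, p_min=1, p_max=6))
```

Note that Eq. (4) treats every non-violating sample, including L(t) equal to the threshold, as a scale-in candidate once the cooldown has elapsed.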

3.7. Adaptive Scaling Engine Pseudocode

Algorithm 1 presents the proposed adaptive scaling logic. It does not execute real Kubernetes API calls; instead, it models the behavior of horizontal scaling by changing the number of active replicas according to latency, cooldown, and replica limit conditions.

Algorithm 1: Latency-aware adaptive scaling

Input:
    L_thr           // Latency threshold
    P_min, P_max    // Minimum and maximum number of replicas
    cooldown        // Minimum time interval between two scaling decisions
Initialize:
    P = P_min
    last_scale_time = 0
Loop:
    measure L(t), C(t), M(t)
    current_time = t
    if current_time - last_scale_time >= cooldown then
        if L(t) > L_thr and P < P_max then
            P = P + 1
            execute scale_out(P)
            last_scale_time = current_time
        else if L(t) < L_thr and P > P_min then
            P = P - 1
            execute scale_in(P)
            last_scale_time = current_time
        end if
    end if
    store metrics and decisions in knowledge base K
End Loop

This algorithm illustrates adaptive horizontal scaling driven by latency in a feedback control loop [51]. Initially, the number of replicas equals the minimum value. The system then continuously measures latency, CPU usage, and memory consumption. Once the cooldown interval has elapsed, the current latency value is compared with the predefined threshold. If the latency exceeds the threshold and the maximum number of replicas has not been reached, the system scales out by adding one replica. Otherwise, if the latency is below the threshold and the number of replicas exceeds the predefined minimum, the system scales in by removing one replica. In this way, scaling remains bounded by the minimum and maximum number of replicas and by the cooldown interval, which helps reduce unnecessary scaling oscillations in the simulated scenario.
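A runnable rendering of Algorithm 1 might look like the sketch below. It is an illustration under stated assumptions rather than the authors' simulation code: the function name, the record layout of the knowledge base, and the latency trace in the usage example are hypothetical, and CPU and memory samples are omitted because the decision logic uses latency only.

```python
def run_latency_aware_scaler(samples, l_thr=500.0, p_min=1, p_max=6, cooldown=10.0):
    """Replay Algorithm 1 over (time_s, latency_ms) samples.

    Returns the knowledge base K as a list of per-sample records.
    Defaults mirror the threshold, replica range, and cooldown from the text.
    """
    pods = p_min
    last_scale_time = 0.0
    knowledge = []
    for t, latency_ms in samples:
        action = "none"
        # Scaling is only considered once the cooldown has elapsed.
        if t - last_scale_time >= cooldown:
            if latency_ms > l_thr and pods < p_max:
                pods += 1
                action = "scale_out"
                last_scale_time = t
            elif latency_ms < l_thr and pods > p_min:
                pods -= 1
                action = "scale_in"
                last_scale_time = t
        # Store the metric and the decision in the knowledge base.
        knowledge.append({"t": t, "latency_ms": latency_ms,
                          "pods": pods, "action": action})
    return knowledge


if __name__ == "__main__":
    # Hypothetical latency trace, not data from the paper.
    trace = [(0, 120), (15, 650), (20, 700), (30, 620), (45, 300), (60, 280)]
    for record in run_latency_aware_scaler(trace):
        print(record)
```

In this hypothetical trace, the sample at 20 s exceeds the threshold but produces no action, because only 5 s have passed since the scale-out at 15 s.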

The proposed control logic should be interpreted as a threshold-based feedback controller rather than as an optimal or predictive autoscaling algorithm. Its purpose is to examine whether ingress-layer latency can serve as an earlier scaling trigger than CPU utilization under the defined simulation assumptions. The model does not optimize the latency threshold, cooldown period, metric-collection interval, or pod activation delay. These parameters were fixed in the present experiment, and their sensitivity is recognized as an important limitation of the current study. Therefore, the pseudocode should be understood as a simplified decision model for simulation-based evaluation, rather than as a complete production-ready Kubernetes autoscaling implementation.

4. Discussion

The simulation results indicate that, under the defined workload and parameter assumptions, the latency-aware autoscaling signal activated scaling earlier than the CPU-based signal and kept simulated latency closer to the defined threshold. Figure 5 shows the relationship between average latency and the number of concurrent users for the two autoscaling control signals evaluated in the same Python simulation: CPU utilization and ingress-layer latency. Therefore, the comparison should be interpreted as a simulation-based comparison of two autoscaling signals under the same workload and parameter assumptions, rather than as a comparison of two complete architectural models. The figure compares the CPU-based baseline policy and the latency-aware policy under the same simulated increase in user load. The key point shown in the figure is the moment when the traditional model reaches its capacity limit, after which latency increases sharply. In contrast, the proposed model keeps latency closer to the defined threshold under simulated conditions because scaling is activated earlier based on ingress-layer latency.

To avoid ambiguity in the interpretation of the presented figures, it is important to distinguish between conceptual representations, simulated outputs, and inferred indicators. Figure 1, Figure 2, Figure 3 and Figure 4 are conceptual and methodological representations of the proposed architecture, control loop, and adaptive scaling logic. They do not present measured performance data. Figure 5 and Figure 6 are based on outputs generated by the Python simulation under the defined workload and parameter assumptions. The latency values, replica counts, workload growth, and scaling behavior shown in these figures are therefore simulated values rather than direct measurements from a real Kubernetes cluster. The reported time to the first scaling action was inferred from the simulated timeline by identifying the moment at which the scaling condition was satisfied and the first replica adjustment was triggered. No production Kubernetes deployment was used in this study, and, therefore, the figures should not be interpreted as directly measured operational metrics from a real LMS. Instead, they illustrate the expected behavior of the proposed control logic under the selected simulation assumptions.

In the CPU-based baseline policy, latency increases sharply when the simulated workload reaches the modeled capacity limit, which in this case is around 600 simultaneous users. At this critical load point, the response time rises above 2000 ms, violating the defined SLA threshold of 500 ms. This behavior is consistent with the M/M/1 queueing model, in which delay increases sharply as the system approaches maximum resource utilization [52]. In contrast, the latency-aware policy shows more stable latency behavior in the simulated scenario, even as the number of users increases. The spike in latency over short intervals of around 60 ms represents early signals of a load surge that activate automatic scaling mechanisms. This differs from the classical approach, in which the system reacts only after overload has occurred, whereas the proposed solution responds as soon as the first signs of performance degradation appear. The main reason for this behavior is that the proposed model uses network latency as an early indicator of overload, thereby enabling the activation of scaling mechanisms before the overload occurs. The mechanism is modeled on latency measured at layer 7, which serves as the trigger, in contrast to traditional approaches that use CPU and memory metrics in state analysis. While resource-based metrics often respond with a delay, the proposed model uses the latency signal as an early indicator, reducing the reaction time of the control mechanism in the simulated scenario.
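The M/M/1 behavior mentioned above can be made concrete with the standard mean response time W = 1 / (μ - λ), valid for arrival rate λ below service rate μ. The sketch below is illustrative only: the service rate of 600 requests per second is an assumption chosen to echo the capacity limit of around 600 reported in the text, and treating users as requests per second is a further simplification.

```python
def mm1_mean_response_time(arrival_rate, service_rate):
    """Mean time in an M/M/1 system, W = 1 / (mu - lambda), in seconds.

    Returns infinity when the queue is unstable (lambda >= mu).
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    mu = 600.0  # assumed service rate, not a value reported in the paper
    for lam in (300.0, 540.0, 598.0, 599.5):
        w_ms = 1000.0 * mm1_mean_response_time(lam, mu)
        print(f"lambda={lam:6.1f} req/s -> W={w_ms:7.1f} ms")
```

Under this assumed service rate, the mean response time stays in the low milliseconds at half load, reaches 500 ms at 598 requests per second, and reaches 2000 ms at 599.5 requests per second, which shows how steeply delay grows near saturation.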

The dynamics of this process are shown in Figure 6, which provides detailed insight into the behavior of the control mechanism over simulated time during traffic growth. Traffic was simulated up to 2000 requests per second over a time interval of 60 s, while the number of active replicas was shown as a step function, with a new replica added for every 350 requests per second. The figure also illustrates when the latency-aware mechanism reacts to threshold violation and how the number of replicas changes after the scaling conditions are met.
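The replica curve described for Figure 6, one additional replica per 350 requests per second, can be sketched as follows. The function name is hypothetical, and starting from the minimum replica count with the first increment at 350 requests per second is an assumption, since the exact offsets are not stated.

```python
def replicas_for_load(rps, rps_per_replica=350, p_min=1, p_max=6):
    """Replica count as a stepwise function of the request rate.

    Assumption: the count starts at p_min and grows by one for each full
    block of rps_per_replica requests per second, capped at p_max.
    """
    extra = int(rps // rps_per_replica)
    return max(p_min, min(p_max, p_min + extra))


if __name__ == "__main__":
    for rps in (0, 349, 350, 700, 1400, 2000):
        print(rps, replicas_for_load(rps))
```

Under these assumptions, the upper load bound of 2000 requests per second maps to 6 replicas, which matches the upper end of the 1 to 6 replica range.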

The simulation results suggest that the latency-aware model reacts earlier to the increase in load. In the displayed graph, the first scaling reaction is visible after approximately 12–15 s, when the simulated ingress-layer latency exceeds the defined threshold. This value represents the time to the first scaling action in the simulation. It should not be interpreted as end-user application response time or as the complete recovery time of the system. Complete recovery would require additional measurement of the time needed for latency to return below the defined threshold after new replicas become available. Additional replicas are then added within the interval of approximately 20–30 s, following the modeled traffic growth and the defined cooldown condition. The number of replicas remains stable while traffic stays at a manageable average level, illustrating the importance of the stabilization mechanism in preventing unnecessary scaling fluctuations. Within the simulation, the number of active replicas does not exceed the predefined upper limit. This approach allows the model to avoid excessive scaling within the selected simulation configuration, compared with the CPU-based baseline.

In a broader sense, viewed from the perspective of distributed systems, the obtained results are consistent with the general assumption that cloud-native architectures benefit from horizontal load distribution [53]. When new replicas are added, the concentration of requests on individual instances is reduced, which can reduce bottleneck effects and support more stable system behavior. In the context of this simulation, autoscaling should therefore be interpreted as a mechanism for adjusting modeled replica capacity in response to the selected control signal, rather than as direct evidence of improved user experience or production-level service stability. The obtained results are consistent with existing research that also points to the limitations of the classic Horizontal Pod Autoscaler when it relies solely on CPU and memory metrics [54]. This research further suggests that network latency collected at the ingress point can support earlier scaling decisions in the simulated LMS workload scenario [55]. The proposed model may also be applicable in larger distributed environments, but this would require additional validation. As the number of nodes, replicas, and services increases, the volume of telemetry data may also rise, which can complicate decision-making in a short time frame.

However, the proposed model in this research is not without limitations, as its efficiency and use depend to a considerable extent on the proper configuration of parameters, such as latency thresholds, intervals for collecting network traffic metrics, and cooldown periods. In larger and more demanding distributed computing systems, another challenge may be the increased amount of telemetry data, variations in network latency, and delays in launching new replicas, which can lead to temporary performance degradation due to the delayed availability of additional replicas. In such situations, it is necessary to anticipate the expansion of the model, including hierarchical aggregation of metrics, per-service scaling, and forecasting mechanisms [56]. Despite these limitations, the simulation results indicate that the proposed model can contribute to autoscaling by shifting the response from a purely resource-based approach to a performance-oriented management approach based on early latency signals. This simulated capability for earlier recognition of potential degradation and earlier scaling activation represents a key contribution of the proposed model within the scope of the present study.

Another limitation is that the research lacks sensitivity analysis regarding the latency threshold, cooldown duration, collection frequency, and simulated pod activation delay. The chosen values of these parameters could have a significant effect on autoscaling behavior, especially when applied to systems with varying workload intensity. Thus, the presented outcomes should be interpreted as evidence for the selected configuration only and cannot be used to prove the optimality of the suggested values in other LMS/Kubernetes systems.

Conceptually, lower latency thresholds would probably activate scaling earlier, which could reduce latency peaks but also increase the risk of unnecessary scaling and higher resource consumption. Higher latency thresholds would reduce the number of scaling actions, but they could also delay responses and allow more visible service degradation before additional replicas are activated. Similarly, shorter metric collection intervals could improve detection speed but may also make the system more sensitive to short-term fluctuations and telemetry noise. Longer collection intervals would provide smoother measurements but could delay the first scaling action. The cooldown period has a similar stabilizing role: shorter cooldown values may improve responsiveness, while longer cooldown values may reduce oscillations but slow down adaptation during sudden load spikes. Finally, longer simulated pod activation delays would likely increase the time between the scaling decision and the actual availability of additional processing capacity. Therefore, systematic sensitivity analysis across these parameters remains an important direction for future work.
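A sensitivity analysis of this kind could start from a small parameter sweep such as the sketch below, which reports the time of the first threshold violation seen by the metric collector for each combination of latency threshold and collection interval. Everything here is hypothetical: the function names, the linear latency ramp, and the parameter grids are illustrative, and cooldown and pod activation delay are deliberately ignored.

```python
def time_to_first_scale(latency_at, threshold_ms, interval_s, horizon_s):
    """Return the first collection time at which latency exceeds the threshold.

    Assumption: metrics are sampled at interval_s, 2 * interval_s, ...
    Returns None if no violation is observed within horizon_s.
    """
    t = interval_s
    while t <= horizon_s:
        if latency_at(t) > threshold_ms:
            return t
        t += interval_s
    return None


def sweep(latency_at, thresholds_ms, intervals_s, horizon_s=120):
    """Evaluate time_to_first_scale over a grid of thresholds and intervals."""
    return {(thr, iv): time_to_first_scale(latency_at, thr, iv, horizon_s)
            for thr in thresholds_ms for iv in intervals_s}


def ramp(t):
    """Hypothetical latency ramp in ms: 100 ms baseline plus 50 ms per second."""
    return 100 + 50 * t


if __name__ == "__main__":
    for key, value in sorted(sweep(ramp, [300, 500, 800], [5, 15]).items()):
        print(key, value)
```

For this illustrative ramp, a 15 s collection interval yields a first detection at 15 s for all three thresholds, while a 5 s interval separates them, which is the kind of interaction between parameters that a systematic analysis would need to quantify.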

5. Conclusions

This paper presents a simulation-based approach to automatic scaling of an LMS based on a microservice architecture. The research focused on examining the limitations of traditional scaling mechanisms that rely solely on CPU utilization, which may be an inadequate indicator during sudden traffic spikes typical of LMS usage during knowledge assessments. The proposed model conceptually integrates Kubernetes HPA, Prometheus Adapter, and NGINX Ingress Controller, using network latency as the primary signal to trigger the automatic scaling mechanism. This principle enables the transition from a reactive resource management model to one that responds earlier to workload changes. The simulation results indicate that the proposed model activated the scaling mechanism earlier in the selected simulation configuration by reducing the time to the first scaling action from approximately 90 s in the CPU-based baseline to approximately 12–15 s in the latency-aware model. This improvement refers to earlier autoscaling activation in the simulation and should not be interpreted as a measured reduction in end-user response time or full system recovery time. Under the defined simulation assumptions, this approach indicates earlier autoscaling activation during sudden load spikes. The study did not directly measure service availability, end-user experience, error rate, throughput, or full recovery time. These aspects remain targets for future real-cluster validation. Within the selected simulation configuration, the latency-aware policy maintained simulated latency closer to the defined threshold than the CPU-based baseline. The practical relevance of this work lies in its potential direction for future implementation in real cloud-native LMS environments, where availability, stability, throughput, error rate, and user experience would need empirical evaluation. 
The application of a network-based latency mechanism should therefore be interpreted as a simulated indication of earlier scaling activation, while its effect on user experience requires empirical validation in a real LMS deployment. However, claims about error reduction, throughput improvement, and full-service stability require additional validation in a real Kubernetes cluster. In addition to the advantages it offers, the proposed model also has certain limitations. The research was conducted in a controlled simulation environment, which may limit the variability observed in real-world conditions. Accordingly, the findings should be interpreted within the boundaries of the simulation assumptions and not as conclusive proof of performance in an actual Kubernetes cluster. The outcomes of this study should therefore be understood as simulation-based observations. They illustrate how the proposed latency-aware control logic behaves under a defined set of assumptions related to LMS workload, latency threshold, metric collection interval, cooldown period, replica limits, and simulated pod activation delay. These observations do not constitute direct evidence of performance improvement in a real Kubernetes deployment. Instead, the main contribution of the study is the definition of a simplified threshold-based feedback controller and its use in a controlled Python simulation to compare CPU-based and ingress-latency-based scaling triggers under identical LMS burst-load assumptions. The effectiveness of the model depends largely on proper parameter and infrastructure configuration, such as latency thresholds, metrics collection intervals, and cooldown periods. In more complex environments, additional challenges may include variations in network latency, delays in launching new instances, and an increased volume of collected data, all of which can directly affect the stability of the control mechanism. 
Future research should first validate the proposed model in a real Kubernetes cluster using load-generation tools such as Locust, JMeter, or k6. Further development may include the integration of predictive mechanisms based on artificial intelligence and the introduction of hierarchical and distributed control loops to improve system performance. Future research may also extend the proposed model to edge or hybrid cloud environments, where latency has an even more pronounced impact on system operation.

Author Contributions

Conceptualization, M.M. and D.S.; methodology, M.M. and D.K.; data curation, M.M. and D.K.; writing—original draft preparation, M.M., D.S., and D.K.; writing—review and editing, M.M., D.S., and D.K.; supervision, D.S. and D.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Conceptual control system for latency management in a cloud environment.

Figure 2. Conceptual adaptive Kubernetes-like system architecture with control loop.

Figure 3. Conceptual architectural model of automatic horizontal scaling based on application-level metrics.

Figure 4. Formal model of a closed control loop for adaptive scaling of microservices.

Figure 5. Simulated latency comparison between CPU-based and latency-aware autoscaling policies. The presented values were generated by the Python simulation and do not represent direct measurements from a real Kubernetes cluster.

Figure 6. Simulated time dynamics of horizontal scaling during increasing load. The presented values were generated by the Python simulation and do not represent direct measurements from a real Kubernetes cluster.

Table 1. Simulation environment and control parameters.

Category	Parameter	Value
Architecture	Type of environment	Simulation-based Kubernetes-like cloud-native LMS model
Architecture	Implementation environment	Python simulation
Cluster	Logical cluster structure	1 control-plane node and 3 worker nodes
Resources	Worker node capacity	4 vCPU, 8 GB RAM
Network	Incoming communication	HTTPS (TLS 1.3) via NGINX Ingress Controller
Network	Internal communication	gRPC
Telemetry	Metrics collection model	Prometheus-like metrics collection
Telemetry	Metrics adapter	Custom Metrics Adapter model
Scaling	Scaling mechanism	Horizontal Pod Autoscaler logic
Scaling	Primary control signal	Ingress-layer latency
Scaling	Baseline comparison signal	CPU utilization
Scaling	Replica range	1–6 pods
Scaling	Scaling action	Increase or decrease the number of active replicas
Scaling	Simulated pod activation delay	Modeled as the delay between a scaling decision and the availability of a new replica
Control parameters	SLA latency threshold	500 ms
Control parameters	Collection interval	15 s
Control parameters	Cooldown period	10 s
Persistence	Storage layer	Redis and PostgreSQL model
Testing	Load generation	Python simulation
Testing	Load pattern	Gradual growth and burst traffic
Testing	Load range	Up to 2000 concurrent users or 2000 requests per second (RPS)
Future validation	Recommended load-generation tools	Locust, JMeter, k6
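
The scaling rows of Table 1 (Horizontal Pod Autoscaler logic, replica range of 1–6 pods, 500 ms latency threshold) can be tied together with the ratio rule that the Kubernetes documentation gives for the HPA: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that rule to ingress latency. Whether the authors' simulation uses exactly this rule is not stated in this section, so the function name and its use here are assumptions; the 10% tolerance band is the Kubernetes default and is not taken from Table 1.

```python
import math


def desired_replicas(current_replicas: int,
                     current_latency_ms: float,
                     target_latency_ms: float = 500.0,   # SLA threshold (Table 1)
                     min_replicas: int = 1,              # replica range (Table 1)
                     max_replicas: int = 6,
                     tolerance: float = 0.1) -> int:     # ASSUMPTION: Kubernetes default
    """Replica count suggested by the documented HPA ratio rule."""
    ratio = current_latency_ms / target_latency_ms
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    proposed = math.ceil(current_replicas * ratio)
    # Clamp the proposal to the configured replica range.
    return max(min_replicas, min(max_replicas, proposed))


# Two replicas observing 900 ms against a 500 ms target: ceil(2 * 1.8) = 4.
print(desired_replicas(2, 900.0))
```

A proportional rule of this kind can add several replicas in one step when latency is far above the threshold, whereas a pure threshold trigger adds a fixed number per decision; which behavior is preferable for LMS burst loads would need to be checked in a real cluster.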
