Systematic Review

ML-Based Autoscaling for Elastic Cloud Applications: Taxonomy, Frameworks, and Evaluation

by Vishwanath Srikanth Machiraju 1, Vijay Kumar 2 and Sahil Sharma 3,*
1 Microsoft, Hyderabad 500032, India
2 Department of Information Technology, Dr B R Ambedkar National Institute of Technology, Jalandhar 144008, India
3 School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry BT48 7JL, UK
* Author to whom correspondence should be addressed.
Math. Comput. Appl. 2026, 31(2), 49; https://doi.org/10.3390/mca31020049
Submission received: 29 November 2025 / Revised: 10 February 2026 / Accepted: 9 March 2026 / Published: 16 March 2026

Abstract

Elastic cloud systems increasingly employ machine learning (ML) to automate resource scaling in response to variable workloads and stringent service-level objectives. However, current ML-based autoscalers are fragmented across platforms, objectives, and evaluation frameworks. This survey examines 60 primary studies published between 2015 and 2025, categorising them according to a five-dimensional taxonomy covering goal, decision logic, scaling mode, control scope, and deployment. It classifies supervised, unsupervised, and reinforcement learning approaches and analyses their integration into practical frameworks, including Kubernetes-based controllers and cloud provider services. The paper summarises the application of machine learning to workload prediction, proactive and hybrid horizontal–vertical scaling, and adaptive policy optimisation, and synthesises common evaluation practices, encompassing workloads, metrics, and benchmarks. The analysis identifies ongoing challenges: actuation delays and telemetry lag, the intricacies of hybrid scaling, coordination across multi-service and edge-cloud deployments, and the limited joint consideration of cost, SLO, and energy objectives. These gaps motivate further research on unified ML-driven orchestration, multi-agent and federated control, standardised benchmarks, and sustainability-aware autoscaling.

1. Introduction

Cloud computing serves as a foundation for contemporary digital infrastructure, offering flexible resource allocation and a pay-as-you-go pricing model that reduces the obstacles to implementing large-scale services [1]. Highly variable workloads complicate the maintenance of performance, reliability, and cost efficiency. Public cloud providers offer fundamental autoscaling mechanisms, such as Amazon EC2 Auto Scaling (https://aws.amazon.com/ec2/autoscaling/, accessed on 29 November 2025) and the Kubernetes Horizontal Pod Autoscaler (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/, accessed on 29 November 2025). However, these controllers are generally reactive and based on thresholds, placing the onus of ensuring end-to-end Quality of Service (QoS) and adherence to Service Level Agreements (SLAs) on application owners [2].
The transition from monolithic architectures to microservices and serverless paradigms has heightened the complexity of resource management [3,4,5]. Contemporary applications comprise numerous interrelated services and functions deployed across diverse execution environments, including virtual machines, containers, and serverless platforms. Resource demands exhibit variability and non-stationarity, while interference among co-located workloads complicates the prediction of the impact of scaling actions on end-to-end performance.
Traditional rule-based autoscaling policies struggle in these circumstances. Fixed thresholds and handcrafted rules frequently fail to manage unpredictable workload patterns, multi-objective trade-offs among quality of service, cost, and energy, and the scale of contemporary deployments [6]. These limitations have led to a growing interest in machine learning-based autoscaling, in which controllers utilise historical data and real-time telemetry to forecast demand, model Quality of Service (QoS), and develop adaptive scaling policies. Recent advancements include supervised prediction and reinforcement learning techniques for cloud autoscaling [6,7].
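To make the threshold-based, reactive logic concrete, the following minimal Python sketch reproduces the proportional replica rule documented for the Kubernetes Horizontal Pod Autoscaler, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); the replica bounds and the use of percentage utilisation as the metric are illustrative assumptions, not part of any specific platform's API.

```python
import math

def hpa_desired_replicas(current_replicas: int, current_util: float,
                         target_util: float, min_r: int = 1, max_r: int = 10) -> int:
    """HPA-style reactive rule: scale replicas proportionally to the ratio of
    observed to target utilisation (in percent), clamped to [min_r, max_r]."""
    desired = math.ceil(current_replicas * current_util / target_util)
    return max(min_r, min(max_r, desired))
```

For example, 4 replicas at 90% utilisation against a 60% target yields ceil(4 × 1.5) = 6 replicas; the clamp prevents the rule from overshooting the configured fleet bounds.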

1.1. Motivation and Significance

Research on ML-based autoscaling has grown rapidly in recent years (Figure 1), reflecting its critical role in optimizing cost and performance for enterprise-grade applications. Intelligent autoscaling is essential for sustaining application responsiveness under variable demand while minimizing resource waste. This survey addresses the need for a comprehensive synthesis of ML-driven strategies, bridging the gap between conventional heuristics and adaptive, data-driven solutions.
To contextualize the need for this survey, Table 1 compares recent peer-reviewed surveys on autoscaling techniques, highlighting their coverage and gaps. This synthesis demonstrates that prior works either focus narrowly on specific paradigms (e.g., VMs or IoT) or omit emerging aspects such as microservices and, to a lesser extent, serverless platforms, as well as multi-agent reinforcement learning, sustainability, and provider integration, underscoring the need for a comprehensive and updated review.

1.2. Objectives and Contributions

This survey aims to achieve two primary objectives: (i) to structure the landscape of machine-learning-based autoscaling for elastic cloud systems within a coherent conceptual framework, and (ii) to connect that framework to specific algorithms, system architectures, and industrial platforms.
The specific contributions are:
1.
Taxonomy of ML-based autoscaling. We propose a taxonomy for machine learning-based autoscalers across five dimensions: goal, decision logic, scaling mode (horizontal, vertical, or hybrid), control scope (e.g., VMs, containers, services), and deployment setting (cloud, edge, hybrid), and use it to classify existing approaches.
2.
Systematic classification of ML techniques. We systematically examine supervised, unsupervised, and reinforcement learning approaches to autoscaling, characterising each by workload types, input signals, control decisions, and optimisation objectives, and provide comparative summaries of representative methods.
3.
Analysis of frameworks and platform integrations. We analyze end-to-end autoscaling frameworks and their implementations of the control loop, and study how ML-based autoscalers are integrated with practical platforms, such as Kubernetes (HPA, VPA, KEDA, and custom controllers), and autoscaling services from major cloud providers.
4.
Synthesis of evaluation practice and cross-cutting challenges. We consolidate evaluation practices, including metrics, workloads, and benchmarks, and identify recurring challenges such as hybrid scaling, multi-service coordination, telemetry lag, and concept drift, cost–SLO–energy trade-offs, and limited reproducibility, leading to design guidelines and concrete directions for future research.

1.3. Problem Statement

Autoscaling aims to adjust computing resources so that an application meets its performance targets under varying workload, while keeping resource cost low. We consider time as a sequence of decision steps t = 1, 2, …. At each step, the autoscaler chooses a resource allocation R(t) based on the current or predicted workload.
Let
  • R(t): vector of allocated resources at time t (for example, CPU, memory, and instance count);
  • D(t): workload or demand at time t;
  • D̂(t): predicted workload or demand at time t;
  • C(R(t), D(t)): cost incurred at time t;
  • Q(R(t), D(t)): QoS metric at time t (for example, response time or SLA violation rate).
Over a horizon of T decision steps, the autoscaling problem can be written as a constrained optimization:
\[
\min_{R(1),\dots,R(T)} \; \sum_{t=1}^{T} C\big(R(t), D(t)\big) \quad \text{subject to} \quad Q\big(R(t), D(t)\big) \le Q_{\max}, \quad t = 1, \dots, T,
\]
where Q_max is a QoS threshold specified by the service level objectives.
In machine learning-based autoscaling, the allocation R(t) is produced by a learned policy π:
\[
R(t) = \pi\big(D(t), M(t)\big),
\]
where M(t) denotes monitoring data such as CPU, memory, and latency metrics, and π is obtained using supervised learning, reinforcement learning, or a hybrid method.
This formulation makes explicit the central trade-off: the autoscaler must choose R(t) so that cost is minimized, while QoS constraints are respected for all decision steps.
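The constrained formulation can be illustrated with a small one-step sketch: assuming a toy saturating QoS model and a flat per-replica price (both illustrative assumptions, not drawn from any surveyed system), a greedy policy simply picks the cheapest allocation R(t) whose QoS satisfies the constraint Q ≤ Q_max.

```python
def qos(replicas: int, demand: float) -> float:
    """Toy QoS model (assumption): response time grows sharply as
    per-replica load approaches saturation."""
    load = demand / replicas
    return 0.1 / max(1e-9, 1.0 - min(load, 0.999))

def cost(replicas: int) -> float:
    """Flat per-replica price per decision step (assumption)."""
    return 0.05 * replicas

def choose_allocation(demand: float, q_max: float, max_replicas: int = 32) -> int:
    """Greedy one-step solution of the constrained problem:
    the cheapest replica count whose QoS meets the threshold."""
    for r in range(1, max_replicas + 1):
        if qos(r, demand) <= q_max:
            return r
    return max_replicas  # infeasible demand: best effort at the cap
```

Real autoscalers face the multi-step version of this problem, where scaling delays and future demand couple the decisions across steps.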

1.4. Review Method

This study systematically reviews recent work on machine learning-based autoscaling for elastic cloud applications. The review procedure specifies (i) the scientific databases queried, (ii) the search limits (time window, language, and venue), and (iii) the inclusion and exclusion criteria used to select the final set of papers. The publication counts shown in Figure 1 correspond to the initial Google Scholar results obtained with this search strategy, before screening.

1.4.1. Search Engines

We queried the following databases: IEEE Xplore (https://ieeexplore.ieee.org/, 29 November 2025), SpringerLink (https://link.springer.com/, accessed on 29 November 2025), ACM Digital Library (https://dl.acm.org/, accessed on 29 November 2025), ScienceDirect (https://www.sciencedirect.com/, accessed on 29 November 2025), Scopus (https://www.scopus.com/, accessed on 29 November 2025), and Google Scholar (https://scholar.google.com/, accessed on 29 November 2025). These sources provide broad coverage of peer-reviewed work in cloud computing, machine learning, and systems research. In addition, we applied backward snowballing on the reference lists of key papers to identify further relevant studies [14].

1.4.2. Search Limits

An initial review of the literature on ML-based autoscaling was conducted to identify the main concepts and keywords, including autoscaling, cloud computing, resource management, workload, virtual machines, microservices, latency, service level agreements (SLAs), cost, and machine learning.
Based on this scan, we constructed queries that combine autoscaling terms with cloud-related terms, for example (“autoscaling” OR “auto-scaling” OR “vertical scaling” OR “horizontal scaling” OR “elastic”) AND “cloud computing”, and queries that target specific execution models, for example (“virtual machines” OR “microservices” OR “Kubernetes”) AND “auto scale*”. The search was limited to publications between 1 January 2015 and 1 November 2025 (the date of the bibliographic search). We considered only peer-reviewed journal articles and conference papers written in English.
A systematic review was conducted in accordance with the PRISMA guidelines (https://www.prisma-statement.org/, accessed on 29 November 2025) (see Figure 2) to ensure transparency and reproducibility. Initially, 995 research articles were identified through comprehensive searches across major scientific databases. After removing duplicates, 880 articles remained for screening. Titles and abstracts were assessed, resulting in 95 articles deemed relevant for full-text evaluation. Applying the predefined inclusion and exclusion criteria and restricting the focus to primary machine-learning-based autoscalers yielded 60 studies, which form the core dataset analysed in the remainder of this survey.

1.4.3. Inclusion Criteria

Publications were included if they satisfied both of the following:
  • They apply machine learning techniques to autoscaling cloud or elastic applications.
  • They explicitly define autoscaling goals or policies (for example, meeting QoS targets or minimising cost).

1.4.4. Exclusion Criteria

Publications were excluded if any of the following held:
  • They do not address autoscaling using machine learning techniques.
  • They do not concern cloud or elastic applications.
  • They are not peer reviewed (for example, theses, white papers, or technical reports).
  • They are not written in English.

1.5. Survey Organization

This survey is organised according to a structured taxonomy illustrated in Figure 3, with each section numbered to reflect a logical progression of ideas. Section 2 provides background on cloud architecture, infrastructure models, and autoscaling fundamentals, and introduces the autoscaling taxonomy. Section 3 outlines the different machine learning approaches used for autoscaling, categorising them by workload characterisation, ML techniques, and decision scope. Section 4 discusses frameworks and systems, including ML-driven autoscaling pipelines and integration with orchestration tools. Section 5 addresses key challenges and design trade-offs in ML-based autoscaling. Section 6 proposes future research directions, and Section 7 concludes the survey.
Table 2 summarises the mathematical notation introduced in Section 1.3 and used throughout the survey.

2. Taxonomy and Background

To systematically analyse autoscaling strategies, we adopt the taxonomy illustrated in Figure 4. It is organised around five questions: What is optimised? (goal: QoS, cost, utilisation, energy), How does it decide? (decision logic: reactive or proactive, rule-based or machine learning-based), How does it scale? (scaling actions: horizontal, vertical, or hybrid), What is controlled? (control scope: virtual machines, containers, functions, workflow stages), and Where does it run? (deployment: cloud data centre, edge, or hybrid). These five dimensions provide a vocabulary for describing autoscaling systems and are used in later sections to classify existing work.

2.1. Goal (What Is Optimised?)

The goal dimension captures what the autoscaler optimises when taking scaling decisions. In the formulation in Section 1.3, these goals appear in the cost function C(·) and the QoS constraint Q(·) ≤ Q_max. A first class of objectives concerns QoS metrics: the autoscaler aims to satisfy performance targets such as bounded response time, sufficient throughput, or SLA compliance, which corresponds to keeping Q(R(t), D(t)) below the threshold Q_max. A second objective is cost efficiency, where the controller seeks to minimise resource expenditure while meeting demand, consistent with the objective term ∑_t C(R(t), D(t)) in Equation (1). Many systems also target high resource utilisation, attempting to avoid both long periods of idleness and sustained saturation. Some explicitly optimise for energy efficiency or carbon footprint, for example, by consolidating workloads during periods of low demand.
In practice, these objectives are often combined into a multi-objective problem, where cost is traded off against QoS or energy. In learning-based methods, such trade-offs are typically encoded in the loss function (for supervised approaches) or in the reward function (for reinforcement learning-based autoscalers).

2.2. Decision (How Does It Decide?)

The decision dimension describes how the autoscaler decides when and what scaling actions to perform, covering both the timing of decisions and the logic used. From a timing perspective, mechanisms can be reactive or proactive. Reactive autoscalers trigger scaling in response to observed conditions or threshold breaches; for example, if CPU usage exceeds a specified threshold, a reactive policy adds instances. Proactive mechanisms instead anticipate future demand, typically via workload forecasting, and scale out or in ahead of time to avoid impending QoS violations. The decision logic itself can be rule-based or learning-based. Rule-based logic uses predefined policies such as thresholds, step rules, or time-based schedules, and is widely used in current platforms, including default cloud autoscalers and Kubernetes HPA. Learning-based logic uses machine learning to derive decisions from data: supervised methods train models to predict future load or required capacity, unsupervised methods detect patterns or anomalies in metrics and can trigger scaling in response to abnormal behaviour, and reinforcement learning learns a scaling policy by interacting with the system and receiving rewards that encode QoS and cost trade-offs.
Many systems combine these aspects, for example, using reactive thresholds for normal operation and adding predictive or reinforcement learning components for cases where simple rules perform poorly. Table 3 summarises typical decision mechanisms.
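The reactive/proactive timing distinction can be sketched in a few lines: a reactive policy sizes capacity from the last observation, while a proactive one sizes it from a forecast. Exponential smoothing here is an illustrative stand-in for the workload forecasters surveyed later; the per-replica capacity and headroom parameters are assumptions.

```python
import math

def forecast_next(history, alpha=0.5):
    """One-step exponential-smoothing forecast of demand
    (a simple proactive signal; alpha is a smoothing assumption)."""
    s = history[0]
    for x in history[1:]:
        s = alpha * x + (1 - alpha) * s
    return s

def required_replicas(history, capacity_per_replica, headroom=0.8, proactive=True):
    """Size capacity from the forecast (proactive) or from the
    last observed demand (reactive), keeping some headroom."""
    signal = forecast_next(history) if proactive else history[-1]
    return math.ceil(signal / (headroom * capacity_per_replica))
```

On a rising demand trace, the smoothed forecast lags the latest spike, so the proactive variant reacts more conservatively; conversely, on a periodic trace a good forecaster scales out before the peak arrives, which a purely reactive rule cannot do.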

2.3. Scaling (How Does It Scale?)

The scaling dimension describes how resources are adjusted once a decision has been made. Autoscalers typically apply horizontal, vertical, or hybrid scaling. In horizontal scaling (scale out/scale in), the controller changes the number of instances, such as virtual machines, containers, or function instances. This is the most common form of scaling in cloud systems and is well-suited for stateless, replicated services. In vertical scaling (scale up/scale down), the controller changes the resources assigned to an instance, for example, CPU cores or memory, thereby avoiding the overhead of creating and managing additional instances, but remaining bounded by the capacity of the underlying host; the Kubernetes Vertical Pod Autoscaler is a representative example. Hybrid scaling combines horizontal and vertical scaling, allowing the controller to choose between them based on current conditions or optimization criteria. For instance, this may involve first increasing resources on existing instances and only adding new instances once a limit is reached.
Horizontal scaling improves elasticity and fault tolerance, but it also increases coordination overhead. Vertical scaling is simple to manage but limited by hardware. Hybrid strategies seek to balance these effects.
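A minimal sketch of the "vertical first, horizontal after a cap" strategy described above: grow per-instance resources up to a host limit, and only then add instances. The per-instance CPU cap and the use of abstract CPU units are illustrative assumptions.

```python
import math

def hybrid_scale(n_inst: int, demand_cpu: int, max_cpu_per_inst: int = 8):
    """Hybrid policy sketch: return (cpu_per_instance, instance_count).
    Vertical scaling is preferred until the per-host cap is reached,
    after which the policy falls back to horizontal scaling."""
    per_inst = math.ceil(demand_cpu / n_inst)
    if per_inst <= max_cpu_per_inst:
        return per_inst, n_inst                   # vertical scaling suffices
    n = math.ceil(demand_cpu / max_cpu_per_inst)  # horizontal beyond the cap
    return max_cpu_per_inst, n
```

For 12 CPU units of demand on 2 instances, resizing to 6 CPUs each suffices; at 40 units the 8-CPU cap forces a scale-out to 5 instances.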

2.4. Control Scope (What Is Controlled?)

The control scope specifies which resources or components the autoscaler manages, and different service models and architectures lead to different scopes. At the infrastructure level (IaaS), the autoscaler scales virtual machines or physical hosts, a coarse-grained approach that is common for monolithic or legacy applications. At the platform level (PaaS) and in container orchestration platforms, the autoscaler scales containers or microservices, for example by changing the number of pod replicas for a service in Kubernetes. At the function level (FaaS), the autoscaler configures the scaling behaviour of serverless functions, such as adjusting concurrency limits or reserved capacity. In data processing pipelines or directed acyclic graph workflows, the control scope may be defined at the workflow or stage level, where the autoscaler targets specific stages or operators that become bottlenecks.
Fine-grained scopes (such as services, functions, or stages) allow for precise control but require detailed monitoring and coordination. Coarse-grained scopes (virtual machines) are simpler but less flexible. Some systems span multiple scopes, for example, scaling both the number of containers and the threads within each container.

2.5. Deployment (Where Does It Run?)

The deployment dimension specifies where the autoscaling logic runs and from which perspective it observes the system. In a centralised cloud setting, the autoscaler runs in a central controller with a global view of resources and workloads, often across regions or clusters; this facilitates optimisation of global objectives such as overall cost or utilisation, at the expense of higher decision latency for remote sites. At the edge or on premises, the autoscaler runs close to the application, for example in an edge cloud, cluster controller, or gateway, which reduces reaction time and suits latency-sensitive or bandwidth-constrained scenarios, but provides only limited global information. Hybrid or distributed deployments combine these approaches: multiple autoscaler instances cooperate, with local controllers making fast decisions based on local metrics while higher-level controllers coordinate policies across clusters, clouds, or regions. The chosen deployment reflects trade-offs between global visibility, responsiveness, and fault tolerance.
Discussion. The five dimensions in Figure 4 provide a taxonomy for cloud autoscaling systems. Together, they describe what is optimised, how and when scaling decisions are made, how resources are adjusted, what is controlled, and where the controller runs. We utilize this taxonomy in the following sections to categorize machine learning-based autoscaling approaches and to relate algorithms to specific systems and platforms.

3. ML-Based Autoscaling Approaches

Machine learning (ML) has become a cornerstone for intelligent autoscaling in cloud environments. This section reviews state-of-the-art approaches, organized by decision-making technique, execution mechanism, workload scope, deployment context, and optimization objectives. Tables referenced in this section summarize representative works for each category.

3.1. ML Techniques for Autoscaling

Autoscalers leverage three major ML paradigms: supervised learning for predictive scaling, unsupervised learning for pattern discovery and anomaly detection, and reinforcement learning (RL) for adaptive policy optimization. Hybrid solutions often combine these techniques to improve accuracy and responsiveness.

3.1.1. Supervised Learning

Supervised learning frames autoscaling as a prediction problem, mapping system metrics (e.g., CPU utilisation, request rate) to performance outcomes or resource requirements. In terms of the optimisation problem in Section 1.3, these models approximate components such as future demand D(t+1:t+H), QoS Q(R, D), or directly the required allocation R(t+1). A hand-crafted or heuristic policy then uses these predictions to choose R(t) so as to keep Q(R(t), D(t)) ≤ Q_max while reducing the cost term C(R(t), D(t)).
Statistical and Classical ML Methods. Early foundational work by Calheiros et al. [1] established ARIMA-based workload prediction for SaaS applications, demonstrating 91% accuracy on seasonal data and enabling proactive resource provisioning. Hu et al. [15] compared multiple prediction models including Random Forest for cloud elasticity mechanisms, finding that ensemble methods outperform traditional time series approaches. Liu et al. [16] proposed adaptive prediction using SVM and linear regression with workload pattern discrimination, enabling context-aware forecasting. Chen et al. [17] developed a self-adaptive prediction method combining ensemble models with fuzzy neural networks for cloud resource demand estimation.
Deep Learning Approaches. The adoption of deep neural networks has significantly advanced predictive autoscaling. Zhang et al. [18] proposed an efficient deep learning model for predicting cloud workloads in industrial informatics applications. Wajahat et al. [19] employ neural networks to build an online black-box performance model relating monitored metrics to response time, enabling proactive autoscaling that minimizes SLA violations. Kim et al. [20] proposed CloudInsight, an ensemble framework combining multiple predictors for robust workload forecasting across diverse cloud applications. Saxena and Singh [21] developed a proactive autoscaling framework using online multi-resource neural networks for energy-efficient VM allocation. More recent work by Xu et al. [22] introduced esDNN, an efficient supervised deep neural network for multivariate workload prediction that captures long-term variance patterns. Saxena et al. [23] provided comprehensive performance analysis of ML-centered workload prediction models, comparing deep learning, ensemble, and quantum neural network approaches. Similarly, Shahin [24] proposed an LSTM-based autoscaler for cloud resource scaling, demonstrating improved responsiveness under bursty workloads compared to static thresholds.
Microservices and Container Environments. The shift toward containerized microservices has driven specialized supervised approaches. Yu et al. [25] developed Microscaler using online Bayesian regression for latency-aware microservice scaling. Yan et al. [26] proposed Hansel, a Bi-LSTM based proactive autoscaler that predicts future workload demands using historical time-series data. Zhong et al. [11] applied LSTM networks for Kubernetes pod autoscaling, while Horn et al. [27] extended supervised approaches to multi-objective optimisation, employing ML-based performance modeling to jointly balance response time SLOs and resource efficiency. Pintye et al. [28] enhanced ML-based autoscaling through statistical feature selection for application-specific metric identification. Priyadarshana et al. [29] proposed a hybrid Prophet-LSTM model for Kubernetes autoscaling, achieving 65–90% accuracy improvement over single-model approaches. Rahman and Lama [30] developed machine learning regression models for predicting end-to-end tail latency of containerized microservices.
Supervised approaches require extensive profiling or historical traces, and retraining is common to maintain generalisation. Hybrid analytical–ML models also appear, where ML calibrates queueing-theory parameters for better realism. Table 4 lists key supervised techniques spanning 2015–2025, their target workloads, and objectives.
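A minimal version of the sliding-window supervised formulation these predictors share can be sketched with ordinary least squares standing in for the neural models above: predict the next demand value from the last w observations. The window length and the synthetic ramp used below are illustrative assumptions, not from any cited system.

```python
import numpy as np

def make_windows(series, w):
    """Turn a demand trace into (window, next-value) training pairs."""
    X = np.array([series[i:i + w] for i in range(len(series) - w)], dtype=float)
    y = np.array(series[w:], dtype=float)
    return X, y

class WindowRegressor:
    """Linear autoregressive predictor: D_hat(t+1) = w . [D(t-w+1..t)] + b."""
    def fit(self, series, w=3):
        X, y = make_windows(series, w)
        A = np.hstack([X, np.ones((len(X), 1))])   # append bias column
        self.coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        self.w = w
        return self

    def predict_next(self, series):
        x = np.append(np.asarray(series[-self.w:], dtype=float), 1.0)
        return float(x @ self.coef)
```

On a noiseless linear ramp, the least-squares fit is exact, so the one-step forecast continues the trend; the deep models surveyed above play the same role for non-linear, seasonal, and bursty traces.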

3.1.2. Unsupervised Learning

Unsupervised methods enhance autoscaling by detecting anomalies and clustering workload states without labelled data. In the formulation of Section 1.3, they operate on the joint space of workload and monitoring signals (D(t), M(t)), discovering structure that can be exploited when choosing R(t). These techniques have evolved significantly over the past decade, spanning workload clustering, anomaly detection, and hybrid metaheuristic approaches.
Workload Clustering. Early foundational work by Iqbal et al. [32] demonstrated unsupervised learning for dynamic resource provisioning in multi-tier web applications, using clustering to identify workload patterns from access logs and build allocation policies for each pattern. Subsequent advances include Chen et al. [33], who integrated K-means clustering with neural networks for improved workload prediction, and Nikravesh et al. [34], who developed an autonomic prediction suite using unsupervised clustering of resource usage profiles. More recent work by Daradkeh et al. [35] enhanced K-means with kernel density estimation for elastic cloud models, while Shahidinejad et al. [36] and Ghobaei-Arani and Shahidinejad [37] proposed hybrid approaches combining clustering with fuzzy logic and metaheuristic optimizers (genetic algorithms, gray wolf optimizer) for QoS-aware resource provisioning. Sridhar and Sathiya [38] advance this line with dynamic fuzzy c-means clustering for latency-aware scheduling, while Betti et al. [39] apply K-means clustering to horizontal autoscaling in hybrid cloud infrastructures.
Anomaly Detection. Unsupervised anomaly detection enables proactive scaling under abnormal conditions likely to violate the QoS constraint Q(R(t), D(t)) ≤ Q_max. Moghaddam et al. [40] introduced ACAS, employing Isolation Forests for cause-aware auto-scaling that identifies and responds to anomalous workload patterns. Zhang et al. [41] developed PerfInsight, a clustering-based system for detecting abnormal behavior in large-scale clouds. More sophisticated deep learning approaches have emerged, including He et al. [42], who proposed TopoMAD combining graph neural networks with LSTM for spatiotemporal anomaly detection, and Liu et al. [43], who integrated deep autoencoders with Gaussian mixture models for cloud security applications.
Feature Selection and Workload Characterization. Recent work has focused on improving the quality of unsupervised analysis through better feature engineering. Ali and Kecskemeti [44] developed SeQual, an unsupervised feature selection method using Silhouette coefficients to identify optimal attributes for workload clustering, followed by EFection [45], which automatically detects effective clustering dimensions using internal validation metrics.
These methods typically complement predictive or RL-based controllers by improving state representation and adaptability rather than directly optimising the cost term C(R(t), D(t)). Dimensionality reduction techniques (e.g., PCA) further simplify high-dimensional monitoring data for downstream models, improving stability and sample efficiency. Table 5 summarises representative unsupervised approaches spanning 2015–2025 and their QoS-driven objectives.
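The clustering step common to these systems can be sketched with plain Lloyd's K-means over workload feature vectors. The two-dimensional features and the naive first-k initialisation are illustrative assumptions; the cited systems use richer workload profiles and more careful seeding.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd's algorithm: group workload profiles (rows of X)
    into k clusters. Naive init (assumption): first k rows as centers."""
    centers = X[:k].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each profile to its nearest center (squared Euclidean)
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its assigned profiles
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Once workload states are grouped, an autoscaler can attach a per-cluster provisioning policy, which is essentially the pattern-to-allocation mapping used in the early clustering work above.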

3.1.3. Reinforcement Learning (RL)

RL has emerged as a dominant paradigm for autoscaling due to its ability to learn policies through interaction. The agent observes a system state s_t (for example, a summary of D(t), M(t), and the current allocation R(t)), executes a scaling action a_t (scale out/in/up/down), and receives a reward shaped by QoS and cost objectives. A common design, consistent with the optimisation in Equation (1), is
\[
r_t = -\,C\big(R(t), D(t)\big) - \lambda \max\Big\{0,\; Q\big(R(t), D(t)\big) - Q_{\max}\Big\},
\]
where λ > 0 penalises QoS violations. The goal is to learn a policy π_θ(a_t | s_t) that maximises the expected return \( \mathbb{E}\big[\sum_{t=1}^{T} \gamma^{t-1} r_t\big] \), providing a reinforcement-learning surrogate for the constrained optimisation problem.
Classical RL Approaches. Early works applied tabular Q-learning [48] and SARSA [49] to cloud autoscaling. Bahrpeyma et al. [50] introduced continuous Q-learning for dynamic resource provisioning aimed at minimizing energy consumption while preventing job rejection. Jamshidi et al. [51] proposed self-learning fuzzy Q-learning controllers for knowledge evolution in cloud elasticity management. Arabnejad et al. [52] developed a fuzzy Q-learning controller implemented in OpenStack that learns scaling rules at runtime without prior knowledge, later extending this to compare Fuzzy SARSA and Fuzzy Q-learning approaches [53]. Horovitz and Arian [54] demonstrated efficient SLA-aware scaling using standard Q-learning, while Nouri et al. [55] developed distributed Q-learning for autonomic decentralized elasticity.
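A tabular Q-learning loop of the kind used in these early works can be sketched on a toy simulator. The replica range, fixed demand, and reward weights below are illustrative assumptions; the reward mirrors the cost-plus-violation shaping described above, with a penalty whenever capacity falls short of demand.

```python
import random

ACTIONS = (-1, 0, +1)            # scale in / hold / scale out
R_MIN, R_MAX, DEMAND = 1, 5, 3   # toy replica range and fixed demand (assumptions)

def reward(r):
    """-cost - lambda * violation: per-replica cost plus an SLA penalty
    whenever fewer replicas than the demand requires are running."""
    return -0.1 * r - 1.0 * (r < DEMAND)

def train(steps=30000, alpha=0.3, gamma=0.9, eps=0.3, seed=0):
    """Tabular Q-learning over (replica-count, action) pairs with
    epsilon-greedy exploration on the deterministic toy environment."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(R_MIN, R_MAX + 1) for a in ACTIONS}
    s = R_MIN
    for _ in range(steps):
        a = rng.choice(ACTIONS) if rng.random() < eps \
            else max(ACTIONS, key=lambda x: Q[(s, x)])
        s2 = min(R_MAX, max(R_MIN, s + a))         # apply (clipped) scaling action
        target = reward(s2) + gamma * max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # standard Q-learning update
        s = s2
    return Q
```

On this toy problem the learned greedy policy converges to the intuitive optimum: scale out while under-provisioned, scale in while over-provisioned, and hold at exactly the demand level.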
Deep RL Methods. Recent studies employ deep RL algorithms to handle high-dimensional state spaces. Bitsakos et al. [56] introduced DERP, using Deep Q-Networks for elastic resource provisioning in NoSQL clusters. Zhang et al. [57] proposed A-SARSA, combining SARSA with ARIMA prediction for proactive container auto-scaling. Khaleq and Ra [58] provided a comprehensive comparison of Actor-Critic, DQN, SARSA, and Q-learning for microservice response time optimization. Rossi et al. [59] developed dynamic multi-metric threshold learning using deep Q-learning for adaptive scaling decisions. Xue et al. [60] proposed a meta reinforcement learning approach for predictive autoscaling that generalizes across diverse workloads. Hanafy et al. [61] proposed CarbonScaler, a carbon-aware autoscaling framework that integrates grid carbon intensity and electricity pricing into RL reward functions to optimize energy efficiency with minimal performance trade-offs.
Multi-Agent and Distributed RL. Multi-agent RL [62] enables application-agnostic horizontal scaling of microservices by accounting for inter-dependencies between components, preventing bottlenecks, and adapting to dynamic workloads more effectively than traditional non-adaptive solutions. Bai et al. [63] proposed DRPC, a distributed reinforcement learning approach based on Twin Delayed Deep Deterministic Policy Gradient (TD3) that decentralizes scaling decisions across nodes for improved scalability in large-scale microservice clusters. Prodanov et al. [64] developed a multi-agent RL-based in-place scaling engine for edge-cloud systems enabling dynamic resource adjustment without pod restarts.
Safety-Aware and Graph-Enhanced RL. Safety-aware RL approaches have emerged to mitigate exploration risks in production environments. Qiu et al. [65] introduced AWARE, employing meta-RL with safe exploration for production cloud systems. Park et al. [66] proposed a graph neural network-based SLO-aware proactive autoscaling framework that captures microservice dependencies for improved resource prediction. Santos et al. [67] developed Gwydion, an RL framework for complex containerized applications in Kubernetes that considers microservice inter-dependencies when scaling horizontally.
Energy-Efficient and Multi-Objective RL. Recent work addresses sustainability alongside performance. Yuan et al. [68] proposed GIRP, using multi-objective multi-task reinforcement learning based on deep deterministic policy gradient for energy-efficient QoS-oriented microservice provisioning, achieving 52% resource savings and a 43% reduction in power consumption. Hua et al. [69] introduced Humas, a heterogeneity- and upgrade-aware microservice autoscaling framework for large-scale data centers that handles rolling updates and resource heterogeneity through adaptive RL policies.
Complementing these, Qiu et al. [70] presented μ-Serve, a power-aware model-serving system that combines GPU frequency scaling with model partitioning and speculative scheduling to achieve up to 2.6× power savings without SLO violations. Zhang et al. [71] proposed MArk, a predictive autoscaling framework that leverages multi-tier provisioning (IaaS and FaaS) to reduce cost and maintain SLO compliance, indirectly contributing to energy efficiency. At the edge, Kim and Wu [72] developed AutoScale, an RL-based execution scaling engine that selects energy-efficient inference targets across mobile, edge, and cloud environments, adapting to stochastic runtime variance. Wang et al. [73] focused on Transformer-based inference, demonstrating that batch scheduling and GPU DVFS can yield up to 2.7× energy efficiency improvements, albeit with moderate latency trade-offs. Additionally, Cañete et al. [74] proposed a proactive energy-aware horizontal autoscaling framework for edge infrastructures, which considers both idle and dynamic energy consumption and achieves up to 92.5% energy reduction while maintaining zero failed requests.
These works collectively highlight the growing emphasis on energy-aware autoscaling and the need for coordinated strategies across hardware and system levels.
Table 6 summarises the single-objective RL techniques for cloud autoscaling spanning 2015–2025.
Reward design often incorporates multiple objectives, as seen in Xu et al. [77], which uses DQN-based reinforcement learning to balance response time SLOs and cost through multi-faceted scaling (horizontal, vertical, and brownout), and in Qiu et al. [78], which optimises fine-grained resource allocation using hierarchical RL. Table 7 details the multi-objective RL approaches, including their deployment contexts and optimisation goals.
Key Insights: RL-based autoscalers outperform static and predictive methods in dynamic environments but face challenges such as state-space explosion, reward shaping, and safe exploration. Hybrid designs and hierarchical RL are promising directions for future research.
As shown in Table 8, RL-based approaches dominate in dynamic microservice environments, offering significant SLA and cost improvements [62,65,76]. Supervised learning excels in predictive scaling for VM and container workloads [19,26,31], while unsupervised methods complement these by clustering workload states and detecting anomalies for stability [39,51].

3.2. Evaluation Metrics and Benchmarks

Evaluating an autoscaler requires considering both performance and efficiency. In the notation of Section 1.3, performance is captured by QoS measures derived from $Q(R(t), D(t))$, while efficiency relates to the cost term $C(R(t), D(t))$ and to how aggressively the policy changes $R(t)$ over time.
A primary performance metric is SLA compliance. Let $L_i$ denote the latency of request $i$ and $L_{\max}$ an SLA threshold. We use $\mathbf{1}\{\cdot\}$ to denote the indicator function, which equals 1 when its argument is true and 0 otherwise. The SLA violation rate over $N$ requests can be written as
$$\mathrm{SLA\_viol} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ L_i > L_{\max} \},$$
or, equivalently, authors often report a tail-latency quantile such as the 95th-percentile response time. An effective autoscaler keeps $\mathrm{SLA\_viol}$ small (or keeps tail latency below the desired bound).
Cost or resource usage is usually aggregated over a horizon of $T$ decision steps as
$$C_{\mathrm{tot}} = \sum_{t=1}^{T} C(R(t), D(t)),$$
where $C(R(t), D(t))$ encodes cloud billing, CPU-hours, or the number and type of instances. Results are frequently expressed relative to a baseline policy, for example through a normalised cost ratio $C_{\mathrm{rel}} = C_{\mathrm{tot}} / C_{\mathrm{tot}}^{\mathrm{baseline}}$.
Stability concerns how often the autoscaler changes the allocation. A simple measure is the number of scaling actions
$$A = \sum_{t=2}^{T} \mathbf{1}\{ R(t) \neq R(t-1) \},$$
or, more generally, the variance of provisioned capacity over time. For comparable QoS, a lower $A$ or lower variance indicates less thrashing and a more stable policy.
For learning-based methods, authors sometimes report a convergence or adaptation time rather than a single scalar metric. Convergence can be defined as the number of training episodes or wall-clock time required for the policy to reach a near-steady performance level, or the time needed to adapt to a significant workload shift. These quantities are typically discussed qualitatively or plotted as learning curves rather than formalised as a standard equation.
When autoscaling is driven by explicit predictions, an additional metric is the accuracy of the predictor. A common choice is the mean absolute error (MAE) of workload forecasts:
$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \bigl| \hat{D}(t) - D(t) \bigr|.$$
Although such accuracy metrics are reported, their practical relevance is ultimately judged by their impact on the QoS and cost measures above.
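As an illustration, the four metrics above can be computed directly from logged traces. The sketch below uses hypothetical trace data and variable names of our choosing; it assumes per-request latencies, a per-step cost proportional to replica count, the replica trajectory, and demand forecasts.

```python
# Illustrative computation of the evaluation metrics defined above
# from logged traces (data and cost model are hypothetical).

def sla_violation_rate(latencies, l_max):
    """Fraction of requests whose latency exceeds the SLA threshold."""
    return sum(1 for l in latencies if l > l_max) / len(latencies)

def total_cost(replicas, cost_per_replica_step=1.0):
    """Sum of per-step cost; here cost is proportional to replica count."""
    return sum(r * cost_per_replica_step for r in replicas)

def scaling_actions(replicas):
    """Number of decision steps at which the allocation changed."""
    return sum(1 for prev, cur in zip(replicas, replicas[1:]) if cur != prev)

def mae(forecast, demand):
    """Mean absolute error of the workload forecast."""
    return sum(abs(f - d) for f, d in zip(forecast, demand)) / len(demand)

# Example trace: latencies in ms, replicas per step, forecast vs demand.
latencies = [120, 250, 90, 410, 180]
replicas = [2, 2, 3, 3, 2]
forecast = [100, 110, 130]
demand = [95, 120, 125]

print(sla_violation_rate(latencies, l_max=300))  # 1 of 5 requests -> 0.2
print(total_cost(replicas))                      # 12.0 replica-steps
print(scaling_actions(replicas))                 # two changes: 2->3 and 3->2
print(mae(forecast, demand))                     # (5 + 10 + 5) / 3
```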
Most studies evaluate their autoscaler by comparison to a baseline such as static provisioning or a threshold-based controller (for example, AWS's default autoscaling configuration or Kubernetes' default Horizontal Pod Autoscaler). Typical results combine QoS and cost improvements. FIRM [78], an intelligent fine-grained resource management framework, reports up to a 16× reduction in SLO violations and a 62% decrease in requested CPU limits compared to default Kubernetes autoscaling. RL-based controllers in [58,82] reduce SLA violations by around 25% and lower cloud costs by approximately 18% relative to HPA. RUNWILD [83] achieves about 1.4× faster response time (328 ms versus 453 ms) on bursty workloads. In these comparisons, baseline policies often overshoot or react slowly to changing load, either violating SLAs or over-provisioning resources, whereas ML-based autoscalers aim to track the QoS target more closely while reducing $C_{\mathrm{tot}}$ and keeping the number of scaling actions $A$ manageable.
Studies also compare cumulative metrics such as total SLA violations or total cost over the duration of an experiment. Table 9 summarises the most common evaluation metrics used in the literature.
Benchmark Applications: Because experiments with real applications and large clusters are time-consuming and costly, many studies rely on a mix of simulation, benchmark applications, and trace-driven emulation. Simulation and emulation tools, such as CloudSim and AutoScaleSim, are widely used to quickly test scaling algorithms under controlled conditions [87]. AutoScaleSim, for example, provides built-in autoscaling models and configurable workloads, enabling comparative evaluation over long traces and large scales that would be difficult to reproduce on a physical testbed. However, pure simulation may miss platform-specific behaviors such as VM start-up latencies, noisy neighbor interference, and measurement noise; therefore, many works complement simulations with smaller-scale testbed experiments.
Benchmark applications and workloads range from synthetic suites to realistic multi-tier web systems. Classical web benchmarks, such as RUBiS (an auction site), DVD Store, and SPECweb-based e-commerce workloads, provide repeatable request patterns with diurnal cycles and bursts, and remain common choices for VM and container autoscaling studies. More recent work employs microservice benchmark suites such as DeathStarBench (social network) and TrainTicket, whose complex inter-service dependencies expose the challenges of scaling individual services in a distributed application [62,88]. Alongside these synthetic and benchmark workloads, several studies use real traffic traces, for example, Wikipedia access logs or traces collected from Microsoft cloud environments, to evaluate robustness under flash crowds and weekly demand cycles.
A smaller subset of papers deploys autoscalers on actual public cloud platforms, such as AWS or Azure, to capture platform-specific effects, including virtual machine start-up delays, billing granularity, and management overheads. Jamshidi et al. [51], for instance, evaluated a fuzzy Q-learning-based autoscaler on Azure virtual machines and demonstrated improved SLA compliance (e.g., lower response times) along with reduced resource provisioning compared to native Azure autoscaling.
Workload Scenarios: Beyond the choice of benchmark, most evaluations consider multiple workload scenarios to probe different aspects of autoscaler behaviour. Common scenarios include abrupt load spikes, which test how quickly the controller scales out while maintaining SLA compliance; sudden drops in load, which reveal whether the autoscaler can scale in without oscillations; and regular or diurnal patterns, which expose the potential of proactive strategies that exploit recurring behaviour. Studies often vary workload characteristics, such as CPU-bound versus I/O-bound workloads, to assess how sensitive a particular autoscaling strategy is to different resource bottlenecks. For each scenario, they typically report metrics such as time to scale, maximum latency observed, and the average number of servers or pods used.
Rossi et al. [76], for example, showed that an RL-based container scaler achieved approximately 25% lower 95th-percentile latency than a tuned threshold policy while using around 10% fewer CPU cores on average. At the same time, several authors have noted the lack of standardised evaluation setups. Tamiru et al. [86], for instance, conducted an experimental evaluation of the Kubernetes Cluster Autoscaler in real cloud environments, assessing its performance and cost implications under various configurations and workloads to highlight practical behaviors and limitations.
Overall, a rigorous evaluation of an autoscaler should include both performance (QoS) and cost aspects, exercised across a range of workload conditions and compared against well-tuned baselines. Table 10 summarises the main benchmarking applications used in the literature and their evaluation settings. Such a comprehensive evaluation increases confidence that observed improvements can be attributed to the autoscaling method itself rather than to artefacts of a particular experimental setup. In the next sections, as we examine frameworks and systems, we will see how these evaluation principles are applied in real-world environments and where gaps still remain.

4. Frameworks and Systems

This section covers the practical side of implementing autoscaling: the frameworks and architectures used to integrate ML into the autoscaling loop, how these solutions tie into existing orchestration tools, and how autoscaling strategies are evaluated. It addresses questions like: How is the autoscaling logic structured (e.g., centralized controller, MAPE-K loop)? How does it interface with platforms like Kubernetes or cloud provider APIs? What metrics and benchmarks are used to validate autoscaling efficacy?

4.1. ML-Driven Autoscaling Pipelines

Virtually all autoscalers, whether rule-based or ML-based, implement the classical MAPE-K control loop: Monitor, Analyze, Plan, Execute, and Knowledge. Figure 5 instantiates this loop for autoscaling. Monitor answers "what is observed?" by collecting metrics and events from the managed system; Analyze answers "why is change needed?" by detecting QoS risks and forecasting workload; Plan answers "how should it scale?" by selecting horizontal, vertical, or hybrid scaling actions; Execute answers "where is the action applied?" by invoking cloud or Kubernetes APIs on specific services or tiers; and Knowledge answers "what is stored?" by keeping historical traces, trained models, policies, and SLA objectives that inform the other phases. Table 11 summarises representative works according to where they introduce ML in the MAPE-K loop.
Machine learning can appear in several phases of this loop:
  • Monitor. ML is used to filter and enrich raw metrics, for example anomaly or outlier detection on CPU, latency, or error rates before they are passed to the analysis step.
  • Analyze. This is where ML most often resides: time series models and neural networks forecast workload or QoS, and anomaly or root-cause analysis methods identify bottlenecks and impending SLA violations.
  • Plan. In ML-driven autoscalers the planning logic itself can be learned. Reinforcement learning, model predictive control, or optimisation heuristics choose the scaling action (for example, the number of instances or the amount of extra CPU) based on predictions and the current state.
  • Execute. Execution typically performs the concrete actions (calling cloud or Kubernetes APIs, updating replica counts or resource limits). ML is rarely used here, apart from occasional coordination of multiple actions.
  • Knowledge. ML models, learned policies, trace databases, and SLA goals are stored and updated in the knowledge base. Supervised and RL methods use this data for training and experience replay, and the stored objectives encode the trade-off between QoS, cost, and energy.
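The phase responsibilities above can be sketched as a minimal control loop. This is an illustrative skeleton of our own (class and method names are not from any surveyed framework), with the usual ML hooks in the Analyze and Plan phases:

```python
# Illustrative MAPE-K skeleton for an ML-driven autoscaler.
# The forecaster (Analyze) and policy (Plan) are the typical ML hooks.

class MapekAutoscaler:
    def __init__(self, forecaster, policy, min_replicas=1, max_replicas=10):
        self.forecaster = forecaster      # Analyze: workload prediction model
        self.policy = policy              # Plan: learned or rule-based policy
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.knowledge = []               # Knowledge: historical observations

    def monitor(self, metrics):
        """Monitor: record raw telemetry (CPU, latency, request rate)."""
        self.knowledge.append(metrics)
        return metrics

    def analyze(self, metrics):
        """Analyze: forecast near-future demand from recorded history."""
        return self.forecaster(self.knowledge)

    def plan(self, predicted_demand, current_replicas):
        """Plan: choose a target replica count within platform bounds."""
        target = self.policy(predicted_demand, current_replicas)
        return max(self.min_replicas, min(self.max_replicas, target))

    def execute(self, target, current_replicas):
        """Execute: here just return the target; a real controller
        would call the cloud or Kubernetes API."""
        return target

    def step(self, metrics, current_replicas):
        m = self.monitor(metrics)
        demand = self.analyze(m)
        target = self.plan(demand, current_replicas)
        return self.execute(target, current_replicas)

# Toy components: last-value forecaster, capacity-based policy
# (assumes each replica serves ~100 requests per second).
forecaster = lambda history: history[-1]["rps"]
policy = lambda demand, replicas: (demand + 99) // 100  # ceiling division

scaler = MapekAutoscaler(forecaster, policy)
print(scaler.step({"rps": 450}, current_replicas=3))  # prints 5
```

Swapping the toy components for a time-series predictor and an RL policy recovers the frameworks discussed next, without changing the loop structure.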
Several ML-driven autoscaling frameworks instantiate the MAPE-K loop in different ways. Saxena and Singh [21] combine an Analyse phase based on an online multi-resource neural network for workload forecasting with a Plan phase that applies evolutionary optimisation to determine energy-efficient VM allocations, and then executes proactive autoscaling and placement decisions. MARLISE [64] implements a multi-agent deep reinforcement learning architecture in which independent agents observe local metrics of individual microservices and make vertical scaling decisions to satisfy performance constraints across distributed edge-cloud deployments. FIRM [78] adopts hierarchical reinforcement learning, with service-level agents scaling individual microservices and a cluster-level agent provisioning nodes based on tracing information about inter-service dependencies. AWARE [65] integrates RL-based autoscaling with the Kubernetes scheduler, using offline-trained policies together with runtime safety logic to enforce constraints on scaling decisions.
Table 12 summarises these frameworks, highlighting their core ML techniques and integration approaches.

4.2. Integration with Orchestration Tools

Autoscaling algorithms must interface with real cloud environments and orchestrators. Integration challenges include: how to obtain metrics (from monitoring systems), how to command scaling actions (through APIs or controllers), and how to align with existing features (like load balancers, cooldown settings, etc.). We discuss a few key integration points.

4.2.1. Kubernetes Integration

Kubernetes is widely used for deploying microservices in containers.
  • HPA & VPA: Autoscaling in Kubernetes is typically done via the Horizontal Pod Autoscaler (HPA) for horizontal scaling and the Vertical Pod Autoscaler (VPA) for vertical adjustments. The HPA runs as a controller in the cluster, periodically checking metrics (through the Metrics API). Research prototypes that use custom ML logic often implement their own controller to replace or augment HPA. For example, Wu et al. (2019) [81] developed a custom autoscaler that uses deep reinforcement learning to adjust replica counts; they integrated it by watching the same metrics and then setting the Deployment’s replica field (essentially doing HPA’s job with their logic).
  • KEDA: Another method is to use KEDA (Kubernetes Event-Driven Autoscaling), which supports external metrics and event-driven triggers. KEDA is flexible: one can plug in a predictive model as an external metric source (e.g., a custom metrics adapter that provides "predicted load 5 min ahead") and then let HPA scale on that metric as if it were any other input. This approach was used by Saxena and Singh (2021): they proposed a proactive framework using an online multi-resource neural network predictor for demand forecasting, combined with clustering for VM autoscaling decisions in cloud data centers [21].
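As a concrete illustration of the custom-controller pattern above, the sketch below uses the official Kubernetes Python client to set a Deployment's replica count directly, as a research autoscaler replacing HPA would. The deployment name, namespace, and decision logic are placeholders of ours, not from any surveyed prototype.

```python
# Minimal sketch of a custom scaling actuator using the official
# Kubernetes Python client (pip install kubernetes). Deployment name
# and namespace are placeholders; real controllers add error handling,
# cooldowns, and richer (e.g., learned) decision logic.

def decide_replicas(predicted_rps, rps_per_replica=100, lo=1, hi=20):
    """Trivial Plan step: capacity-based target, clamped to bounds."""
    target = -(-predicted_rps // rps_per_replica)  # ceiling division
    return max(lo, min(hi, target))

def set_replicas(name, namespace, replicas):
    """Patch the Deployment's scale subresource, as HPA itself does.
    The client import is kept local so the decision logic above can be
    exercised without a cluster."""
    from kubernetes import client, config
    config.load_kube_config()  # or load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name, namespace, {"spec": {"replicas": replicas}}
    )

# Usage against a live cluster (placeholder names):
#   set_replicas("frontend", "default", decide_replicas(750))
```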
An integration challenge with Kubernetes is that changes in one resource type might entail restarts. The following list summarizes a few of the anti-patterns:
  • Kubernetes cannot vertically resize a running Pod without restarting it (the VPA typically evicts and recreates Pods with new resources). Most academic works focusing on horizontal scaling therefore avoid modifying vertical resource limits during experiments, so that the application keeps running continuously, and only a few target the VPA directly. For example, Pham and Kim (2024) propose an Elastic Federated Learning framework [93] that integrates the Kubernetes Vertical Pod Autoscaler (VPA) in a KubeEdge-based edge environment to dynamically adjust pod resources (CPU, RAM) based on historical and real-time usage data, enabling efficient handling of heterogeneous FL workloads while accelerating model convergence and preserving training progress.
  • Another aspect is cluster-level scaling. If an autoscaler rapidly increases Pods beyond current cluster capacity, it should also trigger cluster autoscaling (adding worker VMs via the Kubernetes Cluster Autoscaler) or risk unschedulable Pods. Some works integrate cluster scaling explicitly—e.g., Qiu et al. (2020) considered both container scaling and node provisioning in their FIRM framework, using a hierarchical RL (service-level agent for containers, top-level agent for adding nodes) [78]. As illustrated in Figure 6, the choice of scaling target (pod/node) dramatically affects inter-pod latency.
  • Actuation delays and platform constraints are also important. Orchestrators often have their own logic (cooldowns, max scaling speed). A custom autoscaler usually must be tuned with these in mind or disable them. For research, authors often turn off such features to evaluate their algorithm in isolation. Nguyen et al. (2020) [94] provide a comprehensive analysis of Kubernetes HPA operational behaviors, including the effects of metric scraping periods on scaling responsiveness and the default 5-minute downscale delay designed to prevent thrashing from continuous scaling actions.
  • Integration also involves getting the right data. Using a service mesh (Istio, Linkerd) or distributed tracing can provide detailed metrics and insight into inter-service dependencies. For example, Istio telemetry can report per-service request rates and latencies; an autoscaler might use that to perform critical path analysis (like FIRM’s SVM to find the current latency bottleneck service [78]). Some advanced frameworks use telemetry from systems like Jaeger or Zipkin—e.g., to feed a graph neural network that predicts how a surge in Service A will affect Service B and C down the line [88]. In practice, one might keep a cooldown to prevent thrashing (e.g., “don’t scale again for 2 min after a scale action”).

4.2.2. Cloud Provider Auto Scaling

On IaaS, providers such as AWS offer autoscaling primarily through EC2 Auto Scaling Groups (ASGs) for VM instances and through service-specific autoscalers (e.g., for Lambda serverless or DynamoDB), which allow users to set scaling policies (simple thresholds or target tracking). Traditionally, AWS scaling used user-defined thresholds or simple target tracking on metrics (CPU, queue length, etc.). Around 2018, AWS introduced predictive scaling features that leverage ML.
Integration and Techniques: AWS's approach integrates ML largely in a centralized, proactive manner. The predictive scaling in EC2 ASGs uses time-series ML models to forecast future demand (e.g., using algorithms similar to Amazon's DeepAR or Prophet). This ML policy runs as part of the cloud control plane: it analyzes historical CloudWatch metrics and schedules scale-out actions ahead of anticipated traffic peaks, rather than waiting for a threshold breach. Academic work on AWS autoscaling has also pursued ML-driven improvements. Thota (2022) [95] developed an "Intelligent Autoscaling in AWS" framework using supervised ML (regression and LSTM models) to predict EC2 workload, achieving 92% prediction accuracy and significantly reducing scaling latency and cost compared to static policies.
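The predictive pattern described above, forecasting from history and scheduling scale-out ahead of the peak rather than reacting to a threshold breach, can be sketched provider-agnostically. The seasonal-naive forecaster and one-step lead time below are illustrative assumptions, not AWS's actual models.

```python
# Illustrative proactive scaling sketch: forecast demand from a
# repeating (e.g., diurnal) history and provision capacity early.
# The seasonal-naive forecaster stands in for the DeepAR/Prophet-like
# time-series models used by real predictive-scaling services.

def seasonal_naive_forecast(history, period, horizon):
    """Predict the next `horizon` steps by repeating the last period."""
    last_period = history[-period:]
    return [last_period[t % period] for t in range(horizon)]

def schedule_capacity(forecast, capacity_per_instance, lead_time):
    """Plan instance counts, pulled earlier by `lead_time` steps so
    capacity is ready before the predicted demand arrives."""
    need = [-(-d // capacity_per_instance) for d in forecast]  # ceiling
    # Provision at step t the maximum need within the actuation window.
    return [max(need[t:t + lead_time + 1]) for t in range(len(need))]

history = [100, 120, 300, 280, 110, 90] * 3   # three repeats of a 6-step "day"
forecast = seasonal_naive_forecast(history, period=6, horizon=6)
plan = schedule_capacity(forecast, capacity_per_instance=100, lead_time=1)
print(forecast)  # [100, 120, 300, 280, 110, 90]
print(plan)      # [2, 3, 3, 3, 2, 1]: scale-out lands one step before the peak
```

A reactive threshold policy on the same trace would only add the third instance after demand hit 300, paying the boot delay during the peak itself.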
Microsoft Azure supports autoscaling through Azure Monitor Autoscale for VMs (VM Scale Sets) and App Services, where users set reactive rules or schedules. Until recently, Azure's built-in autoscaling was rule-based, similar to AWS's. However, Microsoft has explored ML for autoscaling in specific contexts (often internally or in research) [96].
Table 13 summarizes key examples of ML-based autoscaling across AWS, Azure, and Kubernetes, including the integration method and a representative work for each.

5. Discussion and Challenges

This section synthesises the survey findings into a set of practical question-and-answer style discussions. Each subsection addresses a recurring design question in machine learning-based autoscaling, links it to the taxonomy dimensions (Goal, Decision, Scaling, Control Scope, Deployment), and highlights open challenges. The questions are chosen to reflect the main trade-offs practitioners and researchers face when designing or evaluating autoscalers.

5.1. When Does a Learned Scaler Beat a Tuned Threshold?

Taxonomy nodes: Goal (QoS, cost, energy), Decision (rule-based vs learning-based), Control Scope (per-service vs application-wide).
Threshold-based autoscalers (for example, Kubernetes HPA or AWS Auto Scaling) are widely deployed due to their simplicity and transparency [94]. A tuned threshold policy can perform well when workloads are relatively stable, bottlenecks are well understood, and objectives are limited to a small set of metrics (such as average CPU utilisation or mean latency). In these regimes, the marginal benefit of machine learning may not justify added complexity.
Learned autoscalers become advantageous when at least one of the following conditions holds:
  • Workloads exhibit complex or non-stationary patterns (for example, bursty traffic, diurnal cycles, or workload mixes that change over time). For instance, Calheiros et al. [1] demonstrated that an ARIMA-based predictor achieved 91% accuracy on seasonal workloads, enabling proactive resource provisioning that would be difficult to replicate with static thresholds. Similarly, Shahin [24] proposed an LSTM-based autoscaler and empirically showed that it outperformed traditional threshold-based methods under sudden workload changes. In production environments, Qiu et al. [65] reported that their reinforcement learning–based AWARE framework adapted to new workloads 5.5× faster than transfer-learning baselines and reduced SLO violations by a factor of 16.9×, while improving CPU and memory utilization by 47.5% and 39.2%, respectively.
  • Objectives are multi-dimensional, combining QoS, cost, and possibly energy or carbon constraints. Chen et al. [6] observed that rule-based policies struggle with such multi-objective trade-offs, often requiring manual tuning and lacking adaptability. Horn et al. [27] addressed this by employing ML-based performance modeling to jointly optimize response time SLOs and resource efficiency in Kubernetes environments. Saxena et al. [23] further demonstrated that a neural network–driven autoscaler could achieve energy-efficient VM allocation while maintaining SLA compliance, highlighting the flexibility of learned policies in balancing competing objectives.
  • There are strong non-linear interactions between resources (CPU, memory, I/O) and end-to-end QoS that are difficult to encode as fixed rules. Wajahat et al. [19] developed MLscale, a neural network–based black-box performance model that accurately captured the non-linear relationship between resource metrics and response time. This enabled proactive autoscaling that minimized SLA violations more effectively than static heuristics. Similarly, Rossi et al. [76] showed that a reinforcement learning–based hybrid autoscaler could dynamically choose between horizontal and vertical scaling actions to address shifting bottlenecks, outperforming fixed strategies in both latency and resource efficiency.
These findings underscore the importance of aligning autoscaler design with workload characteristics and optimization goals. While threshold-based controllers remain a strong baseline for simple, stable workloads, learned autoscalers provide measurable benefits in dynamic, multi-objective, and non-linear environments. As recommended by [86], empirical evaluations should include well-tuned threshold baselines to ensure fair comparisons. The studies cited above not only characterize the conditions under which learned autoscalers excel but also provide direct empirical evidence of their superiority in such scenarios.
However, these gains are not universal. Learned scalers can underperform tuned baselines when:
  • Training data are sparse or unrepresentative.
  • Telemetry is noisy or delayed (see Section 5.2).
  • The scaling granularity is coarse and actuation delays dominate, limiting the benefit of fine-grained policies.
Guideline: Learned scalers are most beneficial for complex, multi-objective, and non-stationary workloads. For simple, stable workloads with strict safety constraints, well-tuned threshold controllers remain a strong baseline. Empirical comparisons should always include tuned threshold policies as reference points, using consistent workloads and metrics (for example, tail-latency shortfall, cost per request, oscillation index).
Open challenges: A systematic characterization of the conditions under which learned autoscalers consistently outperform tuned thresholds remains absent. Existing studies often benchmark against suboptimal or inadequately tuned baselines [86]. Standardized evaluation protocols are needed to rigorously quantify the advantages of learning-based approaches over simpler controllers and identify scenarios where each excels.

5.2. How Do Actuation Delays and Telemetry Lag Change the Winner?

Taxonomy nodes: Decision (reactive vs proactive), Deployment (control-plane and data-plane delays), Monitor/Knowledge (telemetry fidelity).
Actuation delays (e.g., VM boot times, pod startup latency) and telemetry lag (e.g., metric collection and aggregation delays) are critical factors that influence the effectiveness of autoscaling strategies. These delays can significantly impact the responsiveness and stability of autoscalers, particularly under dynamic workloads.
Threshold-based autoscalers, which rely on reactive logic, are especially sensitive to such delays. As observed by [94], Kubernetes’ Horizontal Pod Autoscaler (HPA) includes a default five-minute downscale delay to prevent oscillations, which can hinder timely responses to workload changes. Their study highlights how metric scraping intervals and cooldown settings affect scaling responsiveness, often leading to over- or under-provisioning during rapid demand shifts.
In contrast, predictive and learning-based autoscalers are better equipped to mitigate the effects of these delays. For instance, Calheiros et al. [1] demonstrated that ARIMA-based workload forecasting enabled proactive scaling, reducing the impact of delayed actuation by anticipating demand surges. Similarly, Shahin [24] proposed an LSTM-based autoscaler that dynamically adjusted resource allocations based on predicted workload, outperforming threshold-based methods in scenarios with sudden load changes.
Reinforcement learning (RL) approaches have also shown promise in delay-sensitive environments. For instance, Qiu et al. [65] introduced AWARE, a meta-RL framework that incorporates safe exploration and bootstrapping to adapt quickly to new workloads. Their experiments in production cloud systems revealed that AWARE achieved 5.5× faster adaptation and 16.9× fewer SLO violations compared to baseline methods, even in the presence of actuation and telemetry delays. These results underscore the robustness of RL-based autoscalers in environments where delays are non-negligible.
Moreover, Rossi et al. [76] explored hybrid scaling strategies using RL to balance horizontal and vertical scaling decisions. Their approach accounted for the latency associated with different scaling actions, enabling more stable and efficient resource provisioning under varying delay conditions.
These findings suggest that autoscaler design should explicitly account for system delays. Predictive models and delay-aware policies—such as those using workload forecasting or reinforcement learning—can initiate scaling actions ahead of time, reducing the risk of SLA violations. Conservative scale-in strategies, as employed by several ML-based controllers, further help mitigate oscillations caused by delayed feedback loops.
In summary, while threshold-based autoscalers may suffice in low-delay environments, learned autoscalers demonstrate superior performance in delay-prone settings. By anticipating future demand and adapting to system dynamics, they offer a more resilient and efficient approach to autoscaling under real-world constraints.
Guideline: Autoscaler design should be explicitly conditioned on measured actuation and telemetry delays. For environments with long node or pod start times, predictive or over-provisioning strategies are often necessary to avoid systematic SLO violations. Delay-aware policies—for example, those that initiate scale-out based on forecasts and use conservative scale-in rules until the system has stabilised—tend to reduce oscillations. Telemetry pipelines should be configured to minimise lag while maintaining sufficient smoothing to filter transient noise [40].
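A delay-aware policy of the kind described in the guideline can be sketched as follows. The thresholds, stabilisation window, and capacity model are illustrative assumptions of ours: scale-out fires immediately on the larger of measurement and forecast, while scale-in requires measured load to stay low for a full stabilisation window.

```python
# Illustrative delay-aware scaling policy: proactive scale-out on the
# forecast, conservative scale-in gated by a stabilisation window.
from collections import deque

class DelayAwarePolicy:
    def __init__(self, rps_per_replica=100, stabilisation_steps=3,
                 lo=1, hi=20):
        self.rps_per_replica = rps_per_replica
        self.lo, self.hi = lo, hi
        # Recent measured demand used to gate scale-in decisions.
        self.recent = deque(maxlen=stabilisation_steps)

    def decide(self, measured_rps, forecast_rps, replicas):
        self.recent.append(measured_rps)
        # Scale out on the larger of forecast and measurement (ceiling div).
        target = -(-max(measured_rps, forecast_rps) // self.rps_per_replica)
        target = max(self.lo, min(self.hi, target))
        if target > replicas:
            return target            # scale out immediately
        window_full = len(self.recent) == self.recent.maxlen
        if target < replicas and window_full and \
                max(self.recent) <= target * self.rps_per_replica:
            return target            # scale in only after stable low load
        return replicas              # otherwise hold

policy = DelayAwarePolicy()
r = 5
for measured, forecast in [(480, 520), (450, 400),
                           (150, 140), (150, 140), (150, 140)]:
    r = policy.decide(measured, forecast, r)
print(r)  # prints 2: scale-in happens only after load stayed low all window
```

The asymmetry (eager scale-out, hedged scale-in) mirrors the conservative scale-in strategies noted above and trades some over-provisioning for fewer oscillations under telemetry lag.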
Open challenges: Few studies systematically stratify results by delay regimes (for example, short, medium, long boot and metric latencies). There is limited understanding of how different algorithms degrade as delays increase. Developing delay-aware autoscaling frameworks that reason explicitly about end-to-end latencies in the control loop, and that adapt their policies accordingly, remains an open research area.

5.3. Is Diagonal (Hybrid) Scaling Worth the Complexity?

Taxonomy nodes: Scaling (horizontal, vertical, hybrid), Control Scope (resource types and tiers).
Diagonal—or hybrid—scaling combines horizontal scaling (adding/removing instances) with vertical scaling (adjusting resources per instance) to leverage the strengths of both strategies. While horizontal scaling enhances elasticity and fault tolerance, it introduces coordination overhead and potential latency due to instance startup times. Vertical scaling, on the other hand, offers faster adjustments without the overhead of provisioning new instances but is constrained by hardware limits and may require restarts in containerized environments like Kubernetes [94].
Empirical studies have shown that hybrid scaling can outperform pure horizontal or vertical strategies, particularly under workloads with shifting bottlenecks or heterogeneous resource demands. For example, Santos et al. [67] proposed a reinforcement learning–based autoscaler capable of dynamically selecting between horizontal and vertical scaling actions. Their approach demonstrated improved response time and resource efficiency by adapting to workload characteristics in real time. This supports the claim that hybrid scaling is beneficial when resource bottlenecks vary across CPU, memory, or I/O dimensions.
Similarly, Horn et al. [27] implemented a multi-objective hybrid autoscaler for Kubernetes microservices, using ML-based performance modeling to balance response time SLOs and resource utilization. Their results showed that hybrid strategies could reduce SLA violations and improve efficiency compared to single-mode scaling policies. These findings highlight the value of hybrid scaling in environments where workloads exhibit diverse and evolving performance constraints.
However, the added complexity of hybrid scaling is not without trade-offs. In Kubernetes, vertical scaling often requires pod eviction and restart, which can disrupt service availability [94]. This operational overhead must be carefully managed to avoid introducing instability. Moreover, designing policies that coordinate horizontal and vertical actions without causing oscillations or resource contention remains a challenge.
Despite these concerns, recent advances in reinforcement learning have made hybrid scaling more tractable. For example, Qiu et al. [65] demonstrated that their AWARE framework could safely and efficiently manage scaling decisions in production environments, even under complex conditions involving multiple scaling dimensions. Their meta-RL approach enabled rapid adaptation and robust performance, suggesting that the barriers to hybrid scaling can be mitigated with intelligent control strategies.
Guideline: Hybrid scaling is most effective in scenarios where both horizontal and vertical scaling primitives are available and where workloads present heterogeneous or shifting resource demands. In such cases, vertical scaling can be used for fine-grained, short-term adjustments, while horizontal scaling handles sustained load growth. Platform-specific constraints (for example, pod eviction during vertical resizes) must be considered when designing hybrid policies [93]. Where vertical scaling is slow, disruptive, or heavily constrained, the additional complexity of hybrid control may not be justified.
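A minimal decision rule of this kind can be sketched as follows; the thresholds and the restart-cost check are hypothetical, not drawn from any particular surveyed autoscaler:

```python
# Illustrative hybrid (diagonal) policy: prefer a fast in-place vertical
# resize for short-lived spikes, and fall back to horizontal scale-out
# for sustained growth, when per-instance limits are reached, or when a
# resize would force a disruptive restart.
def hybrid_action(cpu_demand: float, cpu_limit: float, cpu_max: float,
                  sustained: bool, resize_needs_restart: bool) -> str:
    if cpu_demand <= cpu_limit:
        return "hold"
    if (not sustained and cpu_demand <= cpu_max
            and not resize_needs_restart):
        # Short spike within hardware headroom: a vertical resize is
        # faster than booting a new instance.
        return "vertical:raise-limit"
    # Sustained load, hardware ceiling reached, or disruptive restart:
    # add instances instead.
    return "horizontal:scale-out"
```

Real controllers would additionally smooth the demand signal and rate-limit actions, but the branch structure captures the guideline's division of labour between the two scaling modes.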
Open challenges: There is limited systematic evidence on when diagonal scaling meaningfully outperforms well-tuned pure strategies. Multi-resource optimisation (CPU, memory, I/O, network) under hybrid policies remains underexplored. Future work should focus on developing hybrid autoscalers that are restart-aware, capable of multi-resource optimization, and resilient to platform-specific constraints.

5.4. Centralised vs. Decentralised Control: Who Scales Better as Service Count Grows?

Taxonomy nodes: Control Scope (per-service vs application-wide), Deployment (centralised vs distributed), Goal (global vs local objectives).
As cloud-native applications increasingly adopt microservice architectures, the number of independently deployable components within a system can grow substantially. This proliferation raises critical questions about the scalability and effectiveness of autoscaling control strategies—particularly whether centralized or decentralized control is more suitable as service complexity increases.
Centralized autoscalers maintain a global view of the system, enabling coordinated scaling decisions that optimize for end-to-end objectives such as overall cost, latency, or energy efficiency. This approach is particularly effective in environments with tightly coupled services or shared bottlenecks. For example, Qiu et al. [65] demonstrated that their AWARE framework, which integrates reinforcement learning with centralized safety logic, achieved substantial improvements in SLO compliance and resource utilization in production cloud systems. By leveraging a global perspective, AWARE was able to coordinate scaling actions across services and avoid conflicting decisions that could arise in isolated controllers.
However, centralized control introduces scalability challenges. As the number of services increases, the overhead of collecting telemetry, computing global policies, and enforcing coordinated actions can become a bottleneck. This is especially problematic in latency-sensitive or geographically distributed deployments, where decision latency and fault tolerance are critical concerns.
Decentralized control strategies, including per-service autoscalers and multi-agent reinforcement learning (MARL), offer a promising alternative. These approaches assign control responsibilities to individual services or localized agents, enabling faster, context-aware decisions with reduced coordination overhead. For example, Prodanov et al. [64] proposed a multi-agent RL–based in-place scaling engine for edge–cloud systems, where each microservice agent independently adjusts its resource allocation. Their results showed that decentralized control improved responsiveness and scalability, particularly in edge environments with limited global visibility.
Similarly, Fodor et al. [62] introduced a MARL framework for application-agnostic microservice scaling. Their approach allowed agents to learn scaling policies based on local observations while implicitly coordinating through shared environmental feedback. This design enabled the system to adapt to dynamic workloads and inter-service dependencies without centralized orchestration.
Hybrid models that combine local autonomy with lightweight global coordination offer a middle ground. For instance, the FIRM framework [78] employs hierarchical reinforcement learning, where service-level agents manage individual microservices and a cluster-level agent oversees node provisioning. This structure balances the benefits of local responsiveness with the ability to enforce global objectives, such as minimizing overall cost or maintaining system-wide SLOs.
Guideline: For applications with relatively few, strongly coupled services and shared bottlenecks (for example, a centralised database), centralised or hierarchical control is often preferable, as it can manage global trade-offs and capacity planning. For large-scale microservice or edge deployments with many loosely coupled services, decentralised controllers or MARL agents are more scalable, provided that they are augmented with lightweight global constraints or monitoring to detect pathological behaviours. The choice should be guided by the service graph structure, critical-path volatility, and the cost of coordination.
Open challenges: There is limited empirical evidence on how centralised and decentralised controllers perform as the number of services increases. Open questions include how to design reward structures and coordination mechanisms in MARL so that local decisions align with global objectives, how to manage telemetry visibility in large service graphs, and how to provide debugging and observability tools for distributed autoscaling policies [62,65].

5.5. What Is the Cost of Safety and Multi-Objective Guarantees?

Taxonomy nodes: Goal (cost, SLO, energy/carbon), Decision (constrained vs unconstrained learning), Deployment (safety enforcement layers).
In real-world deployments, autoscaling systems must not only optimize for performance and cost but also ensure safety, reliability, and compliance with service-level objectives (SLOs). These requirements introduce constraints that can limit the flexibility of scaling decisions and increase the complexity of autoscaler design.
Safety mechanisms in learning-based autoscalers typically include internal constraints—such as safe exploration in reinforcement learning (RL)—and external guardrails like resource caps, rate limits, or fallback policies. These mechanisms are essential to prevent performance degradation or instability during policy learning and adaptation. However, they often incur additional costs or reduce resource utilization efficiency.
Recent work has begun to quantify these trade-offs. For example, Qiu et al. [65] proposed AWARE, a meta-RL framework that integrates safety logic into the autoscaling loop. Their system combines offline-trained policies with runtime safety checks to ensure stable operation in production cloud systems. Empirical results show that AWARE achieves 16.9× fewer SLO violations and significantly improves CPU and memory utilization (by 47.5% and 39.2%, respectively) compared to baseline methods. These findings demonstrate that safety-aware learning does not necessarily compromise efficiency; rather, it can enhance both reliability and resource usage when properly integrated.
Multi-objective optimization further complicates autoscaler design. Most real-world applications require balancing multiple goals—such as minimizing cost, ensuring SLO compliance, and reducing energy consumption. Traditional threshold-based policies are ill-suited for such scenarios, as they typically target a single metric and lack the flexibility to adapt to changing priorities. For example, Chen et al. [6] highlighted this limitation, noting that rule-based autoscalers struggle with multi-objective trade-offs and often require manual tuning to maintain acceptable performance across competing goals.
Learning-based approaches, particularly RL, offer a natural framework for encoding and optimizing multiple objectives. For example, Yuan et al. [68] introduced GIRP, a multi-objective, multi-task RL system that jointly optimizes energy efficiency and latency in microservice environments. Their approach achieved 52% resource savings and a 43% reduction in power consumption while maintaining QoS. Similarly, Horn et al. [27] demonstrated that ML-based performance modeling could effectively balance response time SLOs and resource efficiency in Kubernetes clusters.
Guideline: The impact of safety mechanisms and multi-objective optimisation should be quantified explicitly. Studies should compare policies with and without guardrails, reporting both SLO compliance gains and cost overheads. Likewise, SLO–cost–energy trade-offs should be reported as families of operating points (for example, Pareto fronts) rather than single values, enabling stakeholders to choose policies that align with their priorities. In practice, many operators prioritise SLO satisfaction as a hard constraint and then seek to minimise cost and energy within the feasible region.
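The reporting style suggested above can be made concrete with a small sketch: given candidate policies evaluated as (SLO-violation rate, cost) points, keep the Pareto-efficient set, then pick the cheapest policy that satisfies a hard SLO constraint. The data values are invented for illustration:

```python
# Sketch of Pareto-front reporting for autoscaling policies. Each point
# is (slo_violation_rate, cost_per_hour); both objectives are minimised.
def pareto_front(points):
    """Keep points not dominated in both objectives by any other point."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)


# Four hypothetical policy evaluations:
policies = [(0.001, 9.0), (0.01, 4.0), (0.01, 6.0), (0.05, 2.0)]
front = pareto_front(policies)            # (0.01, 6.0) is dominated
# Treat SLO satisfaction as a hard constraint (<= 1% violations),
# then minimise cost within the feasible region:
feasible = [p for p in front if p[0] <= 0.01]
best = min(feasible, key=lambda p: p[1])  # -> (0.01, 4.0)
```

Publishing the full `front` rather than only `best` lets operators with different priorities choose a different operating point without re-running the study.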
Open challenges: Despite these advances, the cost of safety and multi-objective guarantees remains underexplored. Few studies provide systematic quantification of the trade-offs between safety, cost, and performance. Moreover, most RL-based autoscalers encode multiple objectives into a single weighted reward function, which can obscure the underlying trade-offs and limit transparency. There is a need for autoscalers that can dynamically navigate Pareto frontiers—adjusting their policies in response to shifting priorities, such as peak vs. off-peak hours or changes in energy pricing. Integrating dynamic pricing, SLA penalties, and carbon targets into learning-based controllers in a robust and transparent way remains an open problem [61,86]. In addition, formal guarantees for safe RL in autoscaling contexts, and tools for explaining multi-objective trade-offs to operators, are still in their infancy.

5.6. What Are the Challenges in Achieving Energy-Efficient Autoscaling?

Taxonomy nodes: Goal (energy/carbon, cost, SLO), Decision (multi-objective optimization, predictive vs reactive control), Deployment (hardware heterogeneity, cluster-level vs device-level scaling).
As sustainability becomes a first-class concern in cloud and edge computing, autoscaling systems must evolve to optimize not only for performance and cost but also for energy efficiency. This introduces new challenges across the autoscaling stack, from hardware-level power management to system-level resource provisioning.
Recent work has explored energy-aware autoscaling at various levels of the stack. For example, μ-Serve [70] demonstrates that fine-grained GPU frequency scaling, combined with power-aware model partitioning and speculative scheduling, can reduce power consumption by up to 2.6× without violating SLOs. However, μ-Serve assumes a homogeneous GPU cluster and does not address horizontal scaling or heterogeneous hardware environments. In contrast, MArk [71] focuses on cost-effective inference serving by combining predictive autoscaling with multi-tier provisioning (IaaS and FaaS), but does not explicitly optimize for energy. At the edge, AutoScale [72] uses reinforcement learning to select energy-efficient execution targets across mobile, edge, and cloud resources, adapting to stochastic runtime variance. These systems highlight the need for coordinated energy-aware decisions across multiple layers of the autoscaling hierarchy.
Table 14 provides a comparative overview of recent systems addressing energy-efficient autoscaling, highlighting key differences in optimization techniques, workload environments, SLO guarantees, and scalability considerations.
Guideline: Energy efficiency should be treated as a primary optimization objective alongside cost and SLOs. Studies should report energy–performance trade-offs explicitly, using metrics such as energy per inference or power-delay product. Where possible, evaluations should include heterogeneous hardware configurations and quantify the impact of autoscaling decisions on both dynamic and idle power consumption. Integration of device-level power management (e.g., DVFS) with cluster-level autoscaling and workload placement strategies is essential for maximizing energy savings.
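The two metrics named above are straightforward to compute from standard telemetry; the following sketch assumes a hypothetical log of per-request (power, latency) samples:

```python
# Illustrative computation of the energy metrics suggested above, from
# a hypothetical log of (power_watts, latency_s) samples per request.
def energy_per_inference(samples):
    """Mean energy (J) per request: power (W) x service time (s)."""
    return sum(p * t for p, t in samples) / len(samples)


def power_delay_product(avg_power_w: float, avg_latency_s: float) -> float:
    """Joint energy-performance figure of merit (lower is better)."""
    return avg_power_w * avg_latency_s


samples = [(150.0, 0.02), (180.0, 0.03)]
epi = energy_per_inference(samples)  # (3.0 + 5.4) / 2, approx. 4.2 J
```

Reporting such per-request figures alongside aggregate cost makes the effect of a scaling decision on dynamic power directly comparable across systems.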
Open challenges: Despite promising results, energy-efficient autoscaling remains underexplored. Most existing systems optimize for energy within narrow scopes—either at the device level (e.g., μ-Serve [70]) or at the system level (e.g., MArk [71])—but few integrate both. Handling hardware heterogeneity, especially in GPU clusters with varying capabilities and power profiles, remains an open problem. Moreover, many systems assume static workloads or require offline profiling, limiting adaptability. There is a need for autoscalers that can dynamically balance energy, cost, and SLOs in real time, potentially by navigating Pareto frontiers or incorporating external signals such as carbon intensity or energy pricing. Finally, transparent reporting of energy trade-offs and formal guarantees for energy-aware learning-based controllers are still in their infancy.
As summarised in Table 15, these five questions collectively span all five taxonomy dimensions and expose the main challenges that future work on ML-based autoscaling should address.

6. Future Directions

Machine learning-based autoscaling is still an evolving field. Building on the challenges identified in Section 5, this section outlines several directions for future research that can lead to more robust, delay-aware, and holistically optimised autoscaling solutions.

6.1. Unified ML-Orchestration Frameworks

Current autoscalers frequently optimise individual components or metrics in isolation. A key direction is the design of unified orchestration frameworks that integrate monitoring, prediction, and scaling decisions across multiple tiers (for example, front-end services, application servers, databases, and message queues) in real time. Such frameworks would be able to anticipate cross-tier effects—for instance, recognising when a surge in front-end load is likely to shift the bottleneck to the database—and coordinate scaling actions accordingly.
Interdisciplinary methods that combine control theory, time-series forecasting, and reinforcement learning can be embedded within extended MAPE-K loops for multi-tier, multi-resource contexts [11,78]. Promising directions include middleware or control planes that host ML models for workload prediction and performance modelling, driving integrated scaling decisions across clusters and regions. Industrial systems such as Google Autopilot [98] and IBM AWARE [65] illustrate initial steps towards this style of ML-driven orchestration, but their internal designs and evaluation methodologies remain only partially documented.

6.2. Edge Intelligence and Decentralised Scaling

With the growth of IoT, 5G, and latency-sensitive applications, autoscaling increasingly spans hybrid cloud–edge environments. Traditional autoscalers assume centralised control within a single cloud region, but emerging scenarios require decisions about where to scale (edge or cloud) in addition to how much. This raises questions around partitioning control logic, coping with partial observability, and managing heterogeneous resource types and constraints at the edge.
Future work should explore multi-agent reinforcement learning architectures for edge-cloud systems, where independent agents at edge nodes perform in-place scaling for latency-critical microservices close to users, while implicit coordination through shared dynamics optimizes aggregate workloads and resource efficiency in cloud backends [62,64]. Telemetry delays, bandwidth limits, and intermittent connectivity between edge and cloud introduce additional complexity. Techniques such as federated learning can support collaborative policy updates across sites without sharing raw data, improving privacy and reducing communication overhead [99]. Robustness under heterogeneous delay profiles and failure modes will be a central concern.

6.3. Multi-Agent and Federated Learning for Autoscaling

As service counts grow, monolithic scaling agents become difficult to design, train, and interpret. Multi-agent reinforcement learning (MARL) offers a way to assign agents to individual services or tiers, allowing policies to scale with application size while retaining local responsiveness [62,64]. Emerging approaches also explore federated learning for predictive models in proactive autoscaling or resource orchestration in distributed edge-cloud environments, enabling collaborative training while preserving data locality and privacy.
However, MARL and federated approaches bring new challenges in reward design, credit assignment, and stability, especially under non-stationary workloads and changing objectives. Future research should investigate coordination mechanisms that prevent local agents from undermining global SLOs, and adaptation strategies that allow global objectives (for example, cost vs latency priority) to be adjusted without retraining from scratch. Understanding how these techniques behave in the presence of actuation delays, telemetry lag, and partial observability remains particularly important.
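The federated setting described here can be illustrated with a minimal federated-averaging sketch: each site trains a local workload-forecast model, and only model weights (not raw telemetry) are aggregated. This is a generic illustration of the technique, not the design of any surveyed system:

```python
# Minimal federated averaging over per-site model parameter vectors,
# weighted by local dataset size (the FedAvg aggregation rule). Raw
# telemetry never leaves the site; only weights are exchanged.
def federated_average(site_weights, site_sizes):
    """Size-weighted average of equally shaped parameter vectors."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(dim)]


# Two edge sites with different amounts of local data:
global_w = federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300])
# -> [2.5, 3.5]: the larger site dominates the aggregate.
```

In an autoscaling context the averaged vector would parameterise a shared forecast or policy model, with each round alternating local training and global aggregation.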

6.4. Standardised Benchmarks and Evaluation Methodologies

The discussion in Section 5 highlighted that evaluation practices for autoscaling are highly heterogeneous. Workload diversity, ad hoc baselines, and limited reporting hinder fair comparison across studies. Standardised benchmarks and reproducible methodologies are therefore a key enabler for progress.
Simulation frameworks such as AutoScaleSim provide controlled environments with built-in models of autoscaling algorithms [87]. Microservice benchmarks such as DeathStarBench and TrainTicket are increasingly used as realistic testbeds [88]. Building on these, future work should establish benchmark suites that combine:
  • Representative application workloads (multi-tier and microservice);
  • Trace-based and synthetic load patterns (including diurnal, bursty, and non-stationary scenarios);
  • Standardised metrics (for example, tail-latency SLO shortfall, cost per request, oscillation indices, and where relevant energy or carbon metrics);
  • Containerised experiment packages and open artefact repositories for reproducibility [86].
Community-maintained baselines and reporting guidelines would help ensure that improvements are robust and not artefacts of particular workloads or configurations.
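Two of the standardised metrics listed above are simple enough to pin down in code; the definitions below are hypothetical proposals for how such metrics could be computed from an experiment trace, not established community standards:

```python
# Candidate implementations of two standardised autoscaling metrics,
# computed from an experiment trace. Definitions are illustrative.
def slo_shortfall_p99(latencies_ms, slo_ms: float) -> float:
    """How far the observed p99 latency exceeds the SLO (0 if met)."""
    xs = sorted(latencies_ms)
    p99 = xs[min(len(xs) - 1, int(0.99 * len(xs)))]
    return max(0.0, p99 - slo_ms)


def oscillation_index(replica_counts) -> int:
    """Number of scaling-direction reversals in a replica-count trace."""
    deltas = [b - a for a, b in zip(replica_counts, replica_counts[1:])
              if b != a]
    return sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)


trace = [2, 3, 4, 3, 4, 4, 3]
oi = oscillation_index(trace)  # up, up, down, up, down -> 3 reversals
```

Agreeing on even such simple formulae, and shipping them with benchmark suites, would remove one source of incomparability between studies.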

6.5. Sustainability and Green Autoscaling

Energy efficiency and carbon footprint reduction are becoming first-class objectives in cloud operations. Most current autoscalers treat cost and QoS as primary goals, with energy and carbon considered only indirectly. Carbon-aware autoscaling strategies that incorporate dynamic energy prices and grid carbon intensity into RL reward functions have demonstrated substantial energy savings with limited impact on performance [61].
Future work should explore integrating device-level energy optimizations, such as μ-Serve’s GPU DVFS, with cluster-level autoscaling and workload placement strategies, particularly in heterogeneous and multi-tenant environments [70,71,72], potentially alongside cost and SLO constraints. This will require models of the relationship between resource allocations, workload placement, and energy use, as well as interfaces that expose trade-offs to operators in an interpretable way. Exploiting temporal flexibility (for example, shifting or deferring non-critical workloads) and spatial flexibility (for example, moving workloads between regions with different carbon intensity) are promising directions, but will need to be reconciled with application-level SLOs and data-locality requirements.
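A carbon-aware RL reward of the kind cited above might, as one hedged illustration, trade an SLO penalty against energy weighted by real-time grid carbon intensity; the weight, scaling, and signal names here are assumptions rather than the formulation of any surveyed system:

```python
# Illustrative carbon-aware reward: penalise SLO shortfall heavily and
# emissions (energy x grid carbon intensity) proportionally. The weight
# and units are assumptions chosen for readability.
def carbon_aware_reward(latency_ms: float, slo_ms: float,
                        energy_kwh: float, carbon_g_per_kwh: float,
                        slo_weight: float = 10.0) -> float:
    # Relative SLO shortfall: 0 when the SLO is met.
    slo_penalty = max(0.0, latency_ms - slo_ms) / slo_ms
    # Emissions for this decision interval, in kg CO2-equivalent.
    carbon_kg = energy_kwh * carbon_g_per_kwh / 1000.0
    return -(slo_weight * slo_penalty + carbon_kg)
```

Because `carbon_g_per_kwh` varies over time, the same scaling action earns different rewards at different hours, which is precisely the signal that lets a learned policy shift flexible load toward low-carbon periods.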
As summarised in Table 16, these directions build directly on the challenges identified in Section 5 and provide a roadmap for advancing ML-based autoscaling.

6.6. Threats to Validity and Limitations

While this survey follows a systematic review protocol, several limitations should be acknowledged:
  • Coverage Bias: The search was restricted to selected scholarly databases (IEEE Xplore, ACM DL, SpringerLink, ScienceDirect, Scopus) and may exclude relevant works published in other venues or grey literature.
  • Time Window: The review considered studies published between 2015 and 2025, which omits earlier foundational work and very recent developments beyond the cutoff date.
  • Terminology Bias: The search strategy relied on keywords such as autoscaling, elastic scaling, and related terms. Studies addressing similar concepts under different terminology (e.g., adaptive resource management, dynamic provisioning) may have been missed.
  • Absence of Formal Quality Assessment: No formal risk-of-bias or methodological quality scoring was applied to the included studies. Consequently, the synthesis does not differentiate between high- and low-quality evidence.
These limitations highlight the need for cautious interpretation of findings and suggest opportunities for future surveys to adopt broader coverage, extended time frames, and formal quality appraisal frameworks.

7. Conclusions

This survey provides a structured review of machine learning-based autoscaling for elastic cloud systems from 2015 to 2025. We introduced a five-dimensional taxonomy (goal, decision logic, scaling mode, control scope, and deployment) and used it to classify 60 autoscaling approaches across supervised, unsupervised, and reinforcement learning. Beyond the algorithms, we analysed how these techniques are integrated into practical frameworks, such as Kubernetes-based controllers and major cloud provider services, focusing on their roles in workload prediction, proactive scaling, and adaptive policy optimisation. We also examined evaluation practices—workloads, metrics, and benchmarks—to distinguish areas with robust evidence from those that remain ad hoc.
Across this landscape, several recurring challenges emerged. Actuation delays and telemetry lag hinder the use of fast-reacting ML policies; hybrid horizontal–vertical scaling creates complex search spaces; and coordinating decisions across multi-service, microservice, and edge–cloud deployments is still largely unaddressed. Cost, service-level objectives, and energy goals are often treated in isolation or through oversimplified trade-offs, while reproducible evaluation is impeded by fragmented tooling and domain-specific benchmarks.
These gaps motivate future research, as detailed in Section 6, including unified ML-driven orchestration across the autoscaling stack, multi-agent and federated control for distributed services, standardised and openly accessible benchmarks, and sustainability-aware optimisation that treats energy and carbon as first-class objectives. By consolidating existing knowledge, organising it through a common taxonomy, and clarifying the limitations of current practice, this survey aims to provide a practical foundation for designing the next generation of autoscalers—systems that are performant under realistic workloads and also predictable, operable, and responsible in their resource utilisation.

Author Contributions

Conceptualization, V.S.M.; methodology, V.S.M. and S.S.; validation, V.S.M., V.K. and S.S.; formal analysis, V.S.M., V.K. and S.S.; investigation, V.S.M.; resources, V.S.M.; data curation, V.S.M.; writing—original draft preparation, V.S.M.; writing—review and editing, V.S.M., V.K. and S.S.; visualization, V.S.M. and S.S.; supervision, S.S.; project administration, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. This article is a systematic review of previously published research, and all data supporting the findings are available in the cited references.

Conflicts of Interest

The authors declare no conflict of interest. This research was conducted independently and does not reflect the views of Microsoft. No Microsoft funding, data, tools, or proprietary resources were used. The authors declare no commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

AI: Artificial Intelligence
API: Application Programming Interface
ARIMA: AutoRegressive Integrated Moving Average
AWS: Amazon Web Services
CPU: Central Processing Unit
DRL: Deep Reinforcement Learning
FaaS: Function-as-a-Service
GCP: Google Cloud Platform
HPA: Horizontal Pod Autoscaler
IaaS: Infrastructure-as-a-Service
IoT: Internet of Things
KEDA: Kubernetes Event-Driven Autoscaler
ML: Machine Learning
MAPE-K: Monitor, Analyze, Plan, Execute—Knowledge (Autonomic Loop)
MARL: Multi-Agent Reinforcement Learning
PaaS: Platform-as-a-Service
PCA: Principal Component Analysis
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
QoS: Quality of Service
RL: Reinforcement Learning
SARSA: State–Action–Reward–State–Action
SLA: Service Level Agreement
SLO: Service Level Objective
VPA: Vertical Pod Autoscaler
VM: Virtual Machine
K8s: Kubernetes
RAM: Random Access Memory
DVFS: Dynamic Voltage and Frequency Scaling

References

  1. Calheiros, R.N.; Masoumi, E.; Ranjan, R.; Buyya, R. Workload prediction using ARIMA model and its impact on cloud applications’ QoS. IEEE Trans. Cloud Comput. 2015, 3, 449–458. [Google Scholar] [CrossRef]
  2. Alharthi, S.; Alshamsi, A.; Alseiari, A.; Alwarafy, A. Auto-scaling techniques in cloud computing: Issues and research directions. Sensors 2024, 24, 5551. [Google Scholar] [CrossRef]
  3. Dragoni, N.; Giallorenzo, S.; Lluch Lafuente, A.; Mazzara, M.; Montesi, F.; Mustafin, R.; Safina, L. Microservices: Yesterday, Today, and Tomorrow. In Present and Ulterior Software Engineering; Mazzara, M., Meyer, B., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 195–216. [Google Scholar] [CrossRef]
  4. Baldini, I.; Castro, P.; Chang, K.; Cheng, P.; Fink, S.; Ishakian, V.; Mitchell, N.; Muthusamy, V.; Rabbah, R.; Slominski, A.; et al. Serverless computing: Current trends and open problems. In Research Advances in Cloud Computing; Springer: Singapore, 2017; pp. 1–20. [Google Scholar] [CrossRef]
  5. Li, Y.; Lin, Y.; Wang, Y.; Ye, K.; Xu, C. Serverless computing: State-of-the-art, challenges and opportunities. IEEE Trans. Serv. Comput. 2022, 16, 1522–1539. [Google Scholar] [CrossRef]
  6. Chen, T.; Bahsoon, R.; Yao, X. A survey and taxonomy of self-aware and self-adaptive cloud autoscaling systems. ACM Comput. Surv. (CSUR) 2018, 51, 61. [Google Scholar] [CrossRef]
  7. Garí, Y.; Monge, D.A.; Pacini, E.; Mateos, C.; Garino, C.G. Reinforcement learning-based application autoscaling in the cloud: A survey. Eng. Appl. Artif. Intell. 2021, 102, 104288. [Google Scholar] [CrossRef]
  8. Al Qassem, L.M.; Stouraitis, T.; Damiani, E.; Elfadel, I.M. Containerized Microservices: A Survey of Resource Management Frameworks. IEEE Trans. Netw. Serv. Manag. 2024, 21, 3775–3796. [Google Scholar] [CrossRef]
  9. Dogani, J.; Namvar, R.; Khunjush, F. Auto-scaling techniques in container-based cloud and edge/fog computing: Taxonomy and survey. Comput. Commun. 2023, 200, 120–150. [Google Scholar] [CrossRef]
  10. Tran, M.N.; Vu, D.D.; Kim, Y. A Survey of Autoscaling in Kubernetes. In Proceedings of the 2022 Thirteenth International Conference on Ubiquitous and Future Networks (ICUFN), Barcelona, Spain, 5–8 July 2022; pp. 263–265. [Google Scholar] [CrossRef]
  11. Zhong, Z.; Xu, M.; Rodriguez, M.A.; Xu, C.; Buyya, R. Machine Learning-based Orchestration of Containers: A Taxonomy and Future Directions. ACM Comput. Surv. 2022, 54, 217. [Google Scholar] [CrossRef]
  12. Verma, S.; Bala, A. Auto-scaling techniques for IoT-based cloud applications: A review. Clust. Comput. 2021, 24, 2425–2459. [Google Scholar] [CrossRef]
  13. Qu, C.; Calheiros, R.N.; Buyya, R. Auto-scaling Web Applications in Clouds: A Taxonomy and Survey. ACM Comput. Surv. 2018, 51, 73. [Google Scholar] [CrossRef]
  14. Wohlin, C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the EASE ’14: 18th International Conference on Evaluation and Assessment in Software Engineering, New York, NY, USA, 13–14 May 2014. [Google Scholar] [CrossRef]
  15. Hu, Y.; Deng, B.; Peng, F.; Wang, D. Workload Prediction for Cloud Computing Elasticity Mechanism. In Proceedings of the 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), Chengdu, China, 5–7 July 2016; IEEE: New York, NY, USA, 2016; pp. 244–249. [Google Scholar] [CrossRef]
  16. Liu, C.; Liu, C.; Shang, Y.; Chen, S.; Cheng, B.; Chen, J. An adaptive prediction approach based on workload pattern discrimination in the cloud. J. Netw. Comput. Appl. 2017, 80, 35–44. [Google Scholar] [CrossRef]
  17. Chen, Z.; Zhu, Y.; Di, Y.; Feng, S. Self-Adaptive Prediction of Cloud Resource Demands Using Ensemble Model and Subtractive-Fuzzy Clustering Based Fuzzy Neural Network. Comput. Intell. Neurosci. 2015, 2015, 919805. [Google Scholar] [CrossRef]
  18. Zhang, Q.; Yang, L.T.; Yan, Z.; Chen, Z.; Li, P. An Efficient Deep Learning Model to Predict Cloud Workload for Industry Informatics. IEEE Trans. Ind. Inform. 2018, 14, 3170–3178. [Google Scholar] [CrossRef]
  19. Wajahat, M.; Karve, A.; Kochut, A.; Gandhi, A. MLscale: A machine learning based application-agnostic autoscaler. Sustain. Comput. Inform. Syst. 2019, 22, 287–299. [Google Scholar] [CrossRef]
  20. Kim, I.K.; Wang, W.; Qi, Y.; Humphrey, M. Forecasting Cloud Application Workloads with CloudInsight for Predictive Resource Management. IEEE Trans. Cloud Comput. 2020, 10, 1848–1863. [Google Scholar] [CrossRef]
  21. Saxena, D.; Singh, A.K. A Proactive Autoscaling and Energy-Efficient VM Allocation Framework Using Online Multi-Resource Neural Network for Cloud Data Center. Neurocomputing 2021, 426, 248–264. [Google Scholar] [CrossRef]
  22. Xu, M.; Song, C.; Wu, H.; Gill, S.S.; Ye, K.; Xu, C. esDNN: Deep Neural Network Based Multivariate Workload Prediction in Cloud Computing Environments. ACM Trans. Internet Technol. 2022, 22, 75. [Google Scholar] [CrossRef]
  23. Saxena, D.; Kumar, J.; Singh, A.K.; Schmid, S. Performance Analysis of Machine Learning Centered Workload Prediction Models for Cloud. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 1313–1330. [Google Scholar] [CrossRef]
  24. Shahin, A.A. Automatic Cloud Resource Scaling Algorithm based on Long Short-Term Memory Recurrent Neural Network. Int. J. Adv. Comput. Sci. Appl. 2016, 7. [Google Scholar] [CrossRef]
  25. Yu, G.; Chen, P.; Zheng, Z. Microscaler: Automatic Scaling for Microservices with an Online Learning Approach. In Proceedings of the 2019 IEEE International Conference on Web Services (ICWS), Milan, Italy, 8–13 July 2019; pp. 68–75. [Google Scholar] [CrossRef]
  26. Yan, M.; Liang, X.; Lu, Z.; Wu, J.; Zhang, W. HANSEL: Adaptive horizontal scaling of microservices using Bi-LSTM. Appl. Soft Comput. 2021, 105, 107216. [Google Scholar] [CrossRef]
  27. Horn, A.; Fard, H.M.; Wolf, F. Multi-objective hybrid autoscaling of microservices in Kubernetes clusters. In European Conference on Parallel Processing; Springer: Cham, Switzerland, 2022; pp. 233–250. [Google Scholar] [CrossRef]
  28. Pintye, I.; Kovács, J.; Lovas, R. Enhancing Machine Learning-Based Autoscaling for Cloud Resource Orchestration. J. Grid Comput. 2024, 22, 67. [Google Scholar] [CrossRef]
  29. Guruge, P.B.; Priyadarshana, Y.H.P.P. Time Series Forecasting-Based Kubernetes Autoscaling Using Facebook Prophet and Long Short-Term Memory. Front. Comput. Sci. 2025, 7, 1509165. [Google Scholar] [CrossRef]
  30. Rahman, J.; Lama, P. Predicting the End-to-End Tail Latency of Containerized Microservices in the Cloud. In Proceedings of the 2019 IEEE International Conference on Cloud Engineering (IC2E), Prague, Czech Republic, 24–27 June 2019; pp. 200–210. [Google Scholar] [CrossRef]
  31. Jeong, B.; Baek, S.; Park, S.; Jeon, J.; Jeong, Y.S. Stable and efficient resource management using deep neural network on cloud computing. Neurocomputing 2023, 521, 99–112. [Google Scholar] [CrossRef]
  32. Iqbal, W.; Dailey, M.N.; Carrera, D. Unsupervised Learning of Dynamic Resource Provisioning Policies for Cloud-Hosted Multitier Web Applications. IEEE Syst. J. 2015, 10, 1435–1446. [Google Scholar] [CrossRef]
  33. Yu, Y.; Jindal, V.; Yen, I.-L.; Bastani, F. Integrating Clustering and Learning for Improved Workload Prediction in the Cloud. In Proceedings of the 2016 IEEE 9th International Conference on Cloud Computing (CLOUD), San Francisco, CA, USA, 27 June–2 July 2016; pp. 876–879. [Google Scholar] [CrossRef]
  34. Nikravesh, A.Y.; Ajila, S.A.; Lung, C.H. An autonomic prediction suite for cloud resource provisioning. J. Cloud Comput. 2017, 6, 3. [Google Scholar] [CrossRef]
  35. Daradkeh, T.; Agarwal, A.; Goel, N.; Kozlowski, A.J. Dynamic K-Means Clustering of Workload and Cloud Resource Configuration for Cloud Elastic Model. IEEE Access 2020, 8, 219430–219445. [Google Scholar] [CrossRef]
  36. Shahidinejad, A.; Ghobaei-Arani, M.; Masdari, M. Resource provisioning using workload clustering in cloud computing environment: A hybrid approach. Clust. Comput. 2021, 24, 319–342. [Google Scholar] [CrossRef]
  37. Ghobaei-Arani, M.; Shahidinejad, A. An Efficient Resource Provisioning Approach for Analyzing Cloud Workloads: A Metaheuristic-Based Clustering Approach. J. Supercomput. 2021, 77, 711–750. [Google Scholar] [CrossRef]
  38. Sridhar, P.; Sathiya, R.R. Cloud Workload Forecasting via Latency-Aware Time Series Clustering-Based Scheduling Technique. Concurr. Comput. Pract. Exp. 2025, 37, e70151. [Google Scholar] [CrossRef]
  39. Betti, P.; Thushantha, L.; Khan, Z.; Munir, K. Horizontal Autoscaling of Virtual Machines in Hybrid Cloud Infrastructures: Current Status, Challenges, and Opportunities. Encyclopedia 2025, 5, 37. [Google Scholar] [CrossRef]
  40. Moghaddam, S.K.; Buyya, R.; Ramamohanarao, K. ACAS: An anomaly-based cause aware auto-scaling framework for clouds. J. Parallel Distrib. Comput. 2019, 126, 107–120. [Google Scholar] [CrossRef]
  41. Zhang, X.; Meng, F.; Xu, J. PerfInsight: A Robust Clustering-Based Abnormal Behavior Detection System for Large-Scale Cloud. In Proceedings of the 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), San Francisco, CA, USA, 2–7 July 2018; IEEE: New York, NY, USA, 2018; pp. 896–899. [Google Scholar] [CrossRef]
  42. He, Z.; Chen, P.; Li, X.; Wang, Y.; Yu, G.; Chen, C.; Li, X.; Zheng, Z. A Spatiotemporal Deep Learning Approach for Unsupervised Anomaly Detection in Cloud Systems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 1705–1719. [Google Scholar] [CrossRef] [PubMed]
  43. Liu, X.; Zhu, S.; Yang, F.; Liang, S.; Zhao, Z. Research on Unsupervised Anomaly Data Detection Method Based on Improved Automatic Encoder and Gaussian Mixture Model. J. Cloud Comput. 2022, 11, 58. [Google Scholar] [CrossRef]
  44. Ali, S.M.; Kecskemeti, G. SeQual: An Unsupervised Feature Selection Method for Cloud Workload Traces. J. Supercomput. 2023, 79, 15079–15097. [Google Scholar] [CrossRef]
  45. Ali, S.M.; Kecskemeti, G. EFection: Effectiveness Detection Technique for Clustering Cloud Workload Traces. Int. J. Comput. Intell. Syst. 2024, 17, 198. [Google Scholar] [CrossRef]
  46. Wang, Y.; Wang, H.; Wen, Y. Elastic Resource Provisioning Using Data Clustering in Cloud Service Platform. IEEE Access 2020, 8, 108436–108447. [Google Scholar] [CrossRef]
  47. Rahmanian, A.A.; Ghobaei-Arani, M.; Tofighy, S. A learning automata-based ensemble resource usage prediction algorithm for cloud computing environment. Future Gener. Comput. Syst. 2018, 79, 54–71. [Google Scholar] [CrossRef]
  48. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  49. Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Technical Report TR 166; Cambridge University Engineering Department: Cambridge, UK, 1994. [Google Scholar]
  50. Bahrpeyma, F.; Haghighi, H.; Zakerolhosseini, A. An adaptive RL based approach for dynamic resource provisioning in Cloud virtualized data centers. Computing 2015, 97, 1209–1234. [Google Scholar] [CrossRef]
  51. Jamshidi, P.; Sharifloo, A.M.; Pahl, C.; Metzger, A.; Estrada, G. Self-Learning Cloud Controllers: Fuzzy Q-Learning for Knowledge Evolution. In Proceedings of the 2015 International Conference on Cloud and Autonomic Computing, Boston, MA, USA, 21–25 September 2015; pp. 208–211. [Google Scholar] [CrossRef]
  52. Arabnejad, H.; Jamshidi, P.; Estrada, G.; El Ioini, N.; Pahl, C. An Auto-Scaling Cloud Controller Using Fuzzy Q-Learning—Implementation in OpenStack. In Service-Oriented and Cloud Computing; Aiello, M., Johnsen, E.B., Dustdar, S., Georgievski, I., Eds.; Springer: Cham, Switzerland, 2016; pp. 152–167. [Google Scholar] [CrossRef]
  53. Arabnejad, H.; Pahl, C.; Jamshidi, P.; Estrada, G. A Comparison of Reinforcement Learning Techniques for Fuzzy Cloud Auto-Scaling. In Proceedings of the 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Madrid, Spain, 14–17 May 2017; IEEE: New York, NY, USA, 2017; pp. 64–73. [Google Scholar] [CrossRef]
  54. Horovitz, S.; Arian, Y. Efficient Cloud Auto-Scaling with SLA Objective Using Q-Learning. In Proceedings of the 2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud), Barcelona, Spain, 6–8 August 2018; IEEE: New York, NY, USA, 2018; pp. 85–92. [Google Scholar] [CrossRef]
  55. Nouri, S.M.R.; Li, H.; Venugopal, S.; Guo, W.; He, M.; Tian, W. Autonomic Decentralized Elasticity Based on a Reinforcement Learning Controller for Cloud Applications. Future Gener. Comput. Syst. 2019, 94, 765–780. [Google Scholar] [CrossRef]
  56. Bitsakos, C.; Konstantinou, I.; Koziris, N. DERP: A deep reinforcement learning cloud system for elastic resource provisioning. In Proceedings of the International Conference on Cloud Computing Technology and Science (CloudCom), Nicosia, Cyprus, 10–13 December 2018; IEEE Computer Society: New York, NY, USA, 2018; pp. 21–29. [Google Scholar] [CrossRef]
  57. Zhang, S.; Wu, T.; Pan, M.; Zhang, C.; Yu, Y. A-SARSA: A Predictive Container Auto-Scaling Algorithm Based on Reinforcement Learning. In Proceedings of the 2020 IEEE International Conference on Web Services (ICWS), Beijing, China, 19–23 October 2020; IEEE: New York, NY, USA, 2020; pp. 489–497. [Google Scholar] [CrossRef]
  58. Khaleq, A.A.; Ra, I. Intelligent Autoscaling of Microservices in the Cloud for Real-Time Applications. IEEE Access 2021, 9, 35464–35476. [Google Scholar] [CrossRef]
  59. Rossi, F.; Cardellini, V.; Presti, F.L.; Nardelli, M. Dynamic Multi-Metric Thresholds for Scaling Applications Using Reinforcement Learning. IEEE Trans. Cloud Comput. 2023, 11, 1807–1821. [Google Scholar] [CrossRef]
  60. Xue, S.; Qu, C.; Shi, X.; Liao, C.; Zhu, S.; Tan, X.; Ma, L.; Wang, S.; Wang, S.; Hu, Y.; et al. A Meta Reinforcement Learning Approach for Predictive Autoscaling in the Cloud. In Proceedings of the KDD ’22: 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 14–18 August 2022; pp. 4290–4299. [Google Scholar] [CrossRef]
  61. Hanafy, W.A.; Liang, Q.; Bashir, N.; Irwin, D.; Shenoy, P. CarbonScaler: Leveraging Cloud Workload Elasticity for Optimizing Carbon-Efficiency. Proc. ACM Meas. Anal. Comput. Syst. 2023, 7, 57. [Google Scholar] [CrossRef]
  62. Fodor, B.; Jakub, Á.; Szűcs, G.; Sonkoly, B. A Multi-Agent Deep-Reinforcement Learning Approach for Application-Agnostic Microservice Scaling. In Proceedings of the 2023 IEEE Virtual Conference on Communications (VCC), New York, NY, USA, 28–30 November 2023; pp. 139–144. [Google Scholar] [CrossRef]
  63. Bai, H.; Xu, M.; Ye, K.; Buyya, R.; Xu, C. DRPC: Distributed Reinforcement Learning Approach for Scalable Resource Provisioning in Container-Based Clusters. IEEE Trans. Serv. Comput. 2024, 17, 2433–2446. [Google Scholar] [CrossRef]
  64. Prodanov, J.; Bertalanič, B.; Fortuna, C.; Chou, S.-K.; Jurič, M.B.; Sanchez-Iborra, R.; Hribar, J. Multi-Agent Reinforcement Learning-Based In-Place Scaling Engine for Edge-Cloud Systems. In Proceedings of the 2025 IEEE 18th International Conference on Cloud Computing (CLOUD), Helsinki, Finland, 7–12 July 2025; pp. 32–42. [Google Scholar] [CrossRef]
  65. Qiu, H.; Mao, W.; Wang, C.; Franke, H.; Youssef, A.; Kalbarczyk, Z.T.; Başar, T.; Iyer, R.K. AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23), Boston, MA, USA, 10–12 July 2023; pp. 387–402. [Google Scholar]
  66. Park, J.; Choi, B.; Lee, C.; Han, D. Graph Neural Network-Based SLO-Aware Proactive Resource Autoscaling Framework for Microservices. IEEE/ACM Trans. Netw. 2024, 32, 1325–1340. [Google Scholar] [CrossRef]
  67. Santos, J.; Reppas, E.; Wauters, T.; Volckaert, B.; De Turck, F. Gwydion: Efficient auto-scaling for complex containerized applications in Kubernetes through Reinforcement Learning. J. Netw. Comput. Appl. 2025, 234, 104067. [Google Scholar] [CrossRef]
  68. Yuan, H.; Wang, T.; Fu, M.; Shi, Y. GIRP: Energy-Efficient QoS-Oriented Microservice Resource Provisioning via Multi-Objective Multi-Task Reinforcement Learning. IEEE Trans. Mob. Comput. 2025, 24, 5793–5807. [Google Scholar] [CrossRef]
  69. Hua, Q.; Yang, D.; Qian, S.; Cao, J.; Xue, G.; Li, M. Humas: A Heterogeneity- and Upgrade-Aware Microservice Auto-Scaling Framework in Large-Scale Data Centers. IEEE Trans. Comput. 2025, 74, 968–982. [Google Scholar] [CrossRef]
  70. Qiu, H.; Mao, W.; Patke, A.; Cui, S.; Jha, S.; Wang, C.; Franke, H.; Kalbarczyk, Z.; Başar, T.; Iyer, R.K. Power-aware Deep Learning Model Serving with μ-Serve. In Proceedings of the 2024 USENIX Annual Technical Conference (USENIX ATC 24), Santa Clara, CA, USA, 10–12 July 2024; pp. 75–93. [Google Scholar]
  71. Zhang, C.; Yu, M.; Wang, W.; Yan, F. MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving. In Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC 19), Renton, WA, USA, 10–12 July 2019; pp. 1049–1062. [Google Scholar]
  72. Kim, Y.G.; Wu, C.J. AutoScale: Energy Efficiency Optimization for Stochastic Edge Inference Using Reinforcement Learning. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020; pp. 1082–1096. [Google Scholar] [CrossRef]
  73. Wang, Y.; Wang, Q.; Chu, X. Energy-efficient Inference Service of Transformer-based Deep Learning Models on GPUs. In Proceedings of the 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), Rhodes, Greece, 2–6 November 2020; pp. 323–331. [Google Scholar] [CrossRef]
  74. Cañete, A.; Djemame, K.; Amor, M.; Fuentes, L.; Aljulayfi, A. A proactive energy-aware auto-scaling solution for edge-based infrastructures. In Proceedings of the 2022 IEEE/ACM 15th International Conference on Utility and Cloud Computing (UCC), Vancouver, WA, USA, 6–9 December 2022; pp. 240–247. [Google Scholar] [CrossRef]
  75. Benifa, J.V.B.; Dejey, D. RLPAS: Reinforcement Learning-Based Proactive Auto-Scaler for Resource Provisioning in Cloud Environment. Mob. Netw. Appl. 2019, 24, 1348–1363. [Google Scholar] [CrossRef]
  76. Rossi, F.; Nardelli, M.; Cardellini, V. Horizontal and Vertical Scaling of Container-Based Applications Using Reinforcement Learning. In Proceedings of the 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), Milan, Italy, 8–13 July 2019; IEEE: New York, NY, USA, 2019; pp. 329–338. [Google Scholar] [CrossRef]
  77. Xu, M.; Song, C.; Ilager, S.; Gill, S.S.; Zhao, J.; Ye, K.; Xu, C. CoScal: Multifaceted Scaling of Microservices with Reinforcement Learning. IEEE Trans. Netw. Serv. Manag. 2022, 19, 3995–4009. [Google Scholar] [CrossRef]
  78. Qiu, H.; Banerjee, S.S.; Jha, S.; Kalbarczyk, Z.T.; Iyer, R.K. FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Online, 4–6 November 2020; pp. 805–825. [Google Scholar]
  79. Wang, Z.; Zhu, S.; Li, J.; Jiang, W.; Ramakrishnan, K.K.; Yan, M. DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems. IEEE/ACM Trans. Netw. 2024, 32, 3267–3282. [Google Scholar] [CrossRef]
  80. Mangalampalli, S.; Karri, G.R.; Kumar, M.; Khalaf, O.I.; Romero, C.A.T.; Sahib, G.M.A. DRLBTSA: Deep reinforcement learning based task-scheduling algorithm in cloud computing. Multimed. Tools Appl. 2024, 83, 8359–8387. [Google Scholar] [CrossRef]
  81. Wei, Y.; Kudenko, D.; Deng, S.; Wu, L.; Fu, X.; Liu, X.; Wu, X.; Meng, X. A Reinforcement Learning Based Auto-Scaling Approach for SaaS Providers in Dynamic Cloud Environments. Math. Probl. Eng. 2019, 2019, 5080647. [Google Scholar] [CrossRef]
  82. Khaleq, A.A.; Ra, I. Development of QoS-aware agents with reinforcement learning for autoscaling of microservices on the cloud. In Proceedings of the 2021 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C), Washington, DC, USA, 27 September–1 October 2021; IEEE: New York, NY, USA, 2021; pp. 13–19. [Google Scholar] [CrossRef]
  83. Choochotkaew, S.; Chiba, T.; Trent, S.; Amaral, M. Run Wild: Resource Management System with Generalized Modeling for Microservices on Cloud. In Proceedings of the 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), Chicago, IL, USA, 5–10 September 2021; IEEE: New York, NY, USA, 2021; pp. 609–618. [Google Scholar] [CrossRef]
  84. Golshani, E.; Ashtiani, M. Proactive auto-scaling for cloud environments using temporal convolutional neural networks. J. Parallel Distrib. Comput. 2021, 154, 119–141. [Google Scholar] [CrossRef]
  85. Zhang, Y.; Hua, W.; Zhou, Z.; Suh, G.E.; Delimitrou, C. Sinan: ML-based and QoS-aware resource management for cloud microservices. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Virtual, 19–23 April 2021; pp. 167–181. [Google Scholar] [CrossRef]
  86. Tamiru, M.A.; Tordsson, J.; Elmroth, E.; Pierre, G. An Experimental Evaluation of the Kubernetes Cluster Autoscaler in the Cloud. In Proceedings of the 2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Bangkok, Thailand, 14–17 December 2020; pp. 17–24. [Google Scholar] [CrossRef]
  87. Aslanpour, M.S.; Toosi, A.N.; Taheri, J.; Gaire, R. AutoScaleSim: A simulation toolkit for auto-scaling Web applications in clouds. Simul. Model. Pract. Theory 2021, 108, 102245. [Google Scholar] [CrossRef]
  88. Nguyen, H.X.; Zhu, S.; Liu, M. Graph-PHPA: Graph-based Proactive Horizontal Pod Autoscaling for Microservices using LSTM-GNN. In Proceedings of the 2022 IEEE 11th International Conference on Cloud Networking (CloudNet), Paris, France, 7–10 November 2022; pp. 237–241. [Google Scholar] [CrossRef]
  89. Esposito, M.; Bakhtin, A.; Ahmad, N.; Robredo, M.; Su, R.; Lenarduzzi, V.; Taibi, D. Autonomic Microservice Management via Agentic AI and MAPE-K Integration. In Proceedings of the 19th European Conference on Software Architecture (ECSA 2025), Limassol, Cyprus, 15–19 September 2025; Bianculli, D., Sartaj, H., Andrikopoulos, V., Pautasso, C., Mikkonen, T., Perez, J., Bureš, T., De Sanctis, M., Muccini, H., Navarro, E., et al., Eds.; Springer: Cham, Switzerland, 2025; Volume 15982, pp. 105–118. [Google Scholar] [CrossRef]
  90. Kumar, B.; Verma, A.; Verma, P. A multivariate transformer-based monitor-analyze-plan-execute (MAPE) autoscaling framework for dynamic resource allocation in cloud environment. Computing 2025, 107, 69. [Google Scholar] [CrossRef]
  91. Karol Santos Nunes, J.P.; Nejati, S.; Sabetzadeh, M.; Nakagawa, E.Y. Self-adaptive, Requirements-driven Autoscaling of Microservices. In Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS ’24), Lisbon, Portugal, 15–16 April 2024; pp. 168–174. [Google Scholar] [CrossRef]
  92. Jamshidi, P.; Pahl, C.; Mendonça, N.C. Managing Uncertainty in Autonomic Cloud Elasticity Controllers. IEEE Cloud Comput. 2016, 3, 50–60. [Google Scholar] [CrossRef]
  93. Pham, K.Q.; Kim, T. Elastic Federated Learning with Kubernetes Vertical Pod Autoscaler for edge computing. Future Gener. Comput. Syst. 2024, 158, 501–515. [Google Scholar] [CrossRef]
  94. Nguyen, T.T.; Yeom, Y.J.; Kim, T.; Park, D.H.; Kim, S. Horizontal Pod Autoscaling in Kubernetes for Elastic Container Orchestration. Sensors 2020, 20, 4621. [Google Scholar] [CrossRef]
  95. Thota, R.C. Intelligent Auto-Scaling in AWS: Machine Learning Approaches for Predictive Resource Allocation. Int. J. Sci. Res. Manag. (IJSRM) 2022, 10, 999–1005. [Google Scholar] [CrossRef]
  96. Poppe, O.; Guo, Q.; Lang, W.; Arora, P.; Oslake, M.; Xu, S.; Kalhan, A. Moneyball: Proactive auto-scaling in Microsoft Azure SQL database serverless. Proc. VLDB Endow. 2022, 15, 1279–1287. [Google Scholar] [CrossRef]
  97. Guo, Y.; Ge, J.; Guo, P.; Chai, Y.; Li, T.; Shi, M.; Tu, Y.; Ouyang, J. PASS: Predictive Auto-Scaling System for Large-scale Enterprise Web Applications. In Proceedings of the ACM Web Conference (WWW ’24), Singapore, 13–17 May 2024; pp. 2747–2758. [Google Scholar] [CrossRef]
  98. Rzadca, K.; Findeisen, P.; Swiderski, J.; Zych, P.; Broniek, P.; Kusmierek, J.; Nowak, P.; Strack, B.; Witusowski, P.; Hand, S.; et al. Autopilot: Workload autoscaling at Google. In Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys ’20), Heraklion, Greece, 27–30 April 2020. [Google Scholar] [CrossRef]
  99. Bao, G.; Guo, P. Federated learning in cloud-edge collaborative architecture: Key technologies, applications and challenges. J. Cloud Comput. 2022, 11, 94. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Number of autoscaling publications (2015–2025) retrieved from Google Scholar using the search strategy described in Section 1.4, prior to applying inclusion and exclusion criteria.
Figure 2. PRISMA flow diagram for study selection.
Figure 3. Overview of the survey structure. The paper progresses from foundations (Section 1 and Section 2), including the autoscaling taxonomy, to ML-based autoscaling approaches (Section 3), frameworks and systems (Section 4), and finally discussion of practice, challenges, and future directions (Section 5 and Section 6), before concluding in Section 7.
Figure 4. Cloud autoscaling systems organised around five key questions: goal, decision, scaling, control scope, and deployment.
Figure 5. Autoscaling control loop organised as MAPE-K. Monitor observes system metrics, Analyze explains why change may be needed, Plan decides how to scale, Execute applies actions to the managed system, and Knowledge stores traces, models, and objectives used across the loop.
Figure 6. Comparison of pod-to-pod communication latency paths in Kubernetes: pod-scaling path (blue) vs. node-scaling triggered through node network stacks (red).
Table 1. Gap Analysis of Existing Autoscaling Surveys (Reverse Chronological Order).
| Survey Paper | Year | What Is Covered | What Is Not Covered |
| --- | --- | --- | --- |
| Containerized Microservices: A Survey of Resource Management Frameworks [8] | 2024 | Frameworks for container/microservice resource allocation and autoscaling; resource models; hardware-aware scaling; SLA and cost considerations. | No serverless; minimal ML/RL depth; no multi-agent RL; no benchmarks/evaluation frameworks; limited provider integration; sustainability barely mentioned. |
| Auto-Scaling Techniques in Cloud Computing: Issues and Research Directions [2] | 2024 | Broad taxonomy; ML, RL, fuzzy logic, time-series; reactive/proactive; QoS, cost, energy; AWS/Azure comparison. | Limited microservices/serverless focus; no benchmarks; sustainability lightly touched; no multi-agent RL. |
| Auto-scaling techniques in container-based cloud and edge/fog computing: Taxonomy and survey [9] | 2023 | Container autoscaling in cloud-edge/fog; taxonomy; latency & resource efficiency; predictive heuristics. | No serverless; shallow ML coverage; no multi-agent RL; no benchmarks; minimal provider details; sustainability absent. |
| A Survey of Autoscaling in Kubernetes [10] | 2022 | HPA, VPA, custom metrics; pod/container scaling mechanisms. | No formal taxonomy; almost no ML; no benchmarks; no provider integration; sustainability absent; no multi-agent RL. |
| Machine Learning-based Orchestration of Containers: Taxonomy and Future Directions [11] | 2022 | ML orchestration for containers; supervised, RL, DL; taxonomy; QoS and resource utilization focus. | No serverless; no benchmarks; no multi-agent RL; sustainability absent; limited provider details. |
| Auto-scaling techniques for IoT-based cloud applications: A review [12] | 2021 | IoT-specific autoscaling; fuzzy logic and basic RL; taxonomy; latency/QoS focus. | No microservices/serverless; limited ML depth; no benchmarks; no provider integration; sustainability absent. |
| Auto-scaling Web Applications in Clouds: A Taxonomy and Survey [13] | 2018 | Classic reactive/proactive taxonomy; VM-level web apps; QoS and cost; elasticity challenges. | No microservices/serverless; minimal ML; outdated scope; no benchmarks; sustainability absent; no multi-agent RL. |
Table 2. Mathematical notation.
| Symbol | Description |
| --- | --- |
| t | Decision time step, t = 1, 2, …, T |
| R(t) | Resource allocation vector at time t |
| D(t) | Workload/demand at time t |
| D̂(t) | Predicted workload at time t |
| C(R(t), D(t)) | Cost function at time t |
| Q(R(t), D(t)) | QoS metric at time t |
| Q_max | QoS threshold (SLO bound) |
| π_θ | Learned policy parameterised by θ |
| r_t | Reward at time t |
| γ | Discount factor, γ ∈ [0, 1) |
| 1{·} | Indicator function |
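As one illustrative way to combine these symbols (a pattern common across the surveyed studies rather than a single canonical formulation), autoscaling can be posed as cumulative cost minimisation subject to the SLO bound; RL approaches typically fold the constraint into the reward through a violation penalty, where the weight λ below is our notation for exposition and is not part of Table 2:

```latex
% Constrained formulation: minimise cumulative cost under the SLO bound
\min_{R(1),\dots,R(T)} \; \sum_{t=1}^{T} C\bigl(R(t), D(t)\bigr)
\quad \text{s.t.} \quad Q\bigl(R(t), D(t)\bigr) \le Q_{\max}, \qquad t = 1,\dots,T.

% RL relaxation: penalise SLO violations in the reward and learn \pi_\theta
r_t = -\,C\bigl(R(t), D(t)\bigr) \;-\; \lambda\,\mathbf{1}\bigl\{\, Q\bigl(R(t), D(t)\bigr) > Q_{\max} \,\bigr\},
\qquad
\max_{\theta} \; \mathbb{E}_{\pi_\theta}\!\Bigl[\, \textstyle\sum_{t=1}^{T} \gamma^{\,t-1} r_t \,\Bigr].
```

Proactive methods additionally replace D(t) with the forecast D̂(t) when planning the next allocation.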
Table 3. Autoscaler decision mechanisms and their characteristics.
| Category | Example Methods | Description | Timing |
| --- | --- | --- | --- |
| Rule based | Threshold rules, step policies, schedules | Manually defined rules on metrics such as CPU, memory, or latency; simple and widely deployed in cloud and Kubernetes autoscalers. | Reactive |
| Supervised ML | Regression, time series forecasting (ARIMA, LSTM, Prophet) | Predicts future demand or performance; the autoscaler adjusts resources based on forecasted values. | Proactive |
| Unsupervised ML | Clustering, anomaly detection | Finds patterns or outliers in workload metrics without labels; can trigger scaling on abnormal conditions or specialise policies for clusters of workloads. | Reactive or proactive |
| Reinforcement learning | Q-learning, DQN, actor-critic, PPO | Learns scaling policies from trial and error using reward signals that combine QoS and cost; can handle hybrid (horizontal and vertical) actions and long-term effects. | Reactive and proactive (hybrid) |
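The reactive and proactive timing modes in Table 3 differ mainly in which signal drives the decision: a currently observed metric versus a forecast D̂(t). A minimal Python sketch of the two decision rules (function names, thresholds, and parameters are illustrative, not taken from any surveyed system):

```python
import math

def reactive_decision(cpu_util, replicas, low=0.3, high=0.7):
    """Rule-based reactive scaling: step up or down on fixed CPU thresholds."""
    if cpu_util > high:
        return replicas + 1              # scale out under pressure
    if cpu_util < low and replicas > 1:
        return replicas - 1              # scale in when underutilised
    return replicas                      # otherwise hold steady

def proactive_decision(predicted_demand, capacity_per_replica):
    """Forecast-driven proactive scaling: size the fleet to predicted demand.

    `predicted_demand` stands in for the forecast D-hat(t), e.g. the output
    of an ARIMA or LSTM predictor from the supervised approaches below.
    """
    return max(1, math.ceil(predicted_demand / capacity_per_replica))

print(reactive_decision(0.85, 3))        # above threshold -> 4 replicas
print(proactive_decision(950, 100))      # ceil(9.5) -> 10 replicas
```

The hybrid RL mode in Table 3 effectively learns both thresholds and forecast usage jointly, instead of hand-tuning `low`, `high`, or the capacity estimate.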
Table 4. Research Papers Using Supervised Machine Learning Techniques for Autoscaling (2015–2025).
| Year | Paper Title | ML Technique | Workload Type | Autoscaling Objective | Deployment | Training Mode |
| --- | --- | --- | --- | --- | --- | --- |
| 2025 | Time Series Forecasting-Based Kubernetes Autoscaling Using Facebook Prophet and LSTM [29] | Prophet + LSTM hybrid | Containers (Kubernetes) | HTTP request prediction, SLA compliance | Cloud | Offline |
| 2024 | Enhancing Machine Learning-Based Autoscaling for Cloud Resource Orchestration [28] | Statistical feature selection + ML | Cloud services (IaaS) | QoS-aware resource management | Cloud | Offline |
| 2023 | Performance Analysis of Machine Learning Centered Workload Prediction Models for Cloud [23] | Comparative ML models (LSTM, CNN, ensemble) | Cloud services (IaaS) | Workload prediction accuracy | Cloud | Offline |
| 2023 | Stable and Efficient Resource Management Using Deep Neural Network on Cloud Computing [31] | Deep Neural Network | Containers (Kubernetes pods) | Resource utilization, overload prevention | Cloud | Offline |
| 2022 | Machine Learning-Based Adaptive Auto-scaling Policy for Resource Orchestration in Kubernetes Clusters [11] | LSTM Recurrent Neural Network | Containers (Kubernetes pods) | Resource utilization (performance) | Cloud | Offline |
| 2022 | Multi-objective Hybrid Autoscaling of Microservices in Kubernetes Clusters [27] | ML-based performance modeling | Microservices (Kubernetes) | Response time SLO + resource efficiency | Cloud | Offline |
| 2022 | esDNN: Deep Neural Network Based Multivariate Workload Prediction in Cloud Computing Environments [22] | Deep Neural Network (GRU-based) | Cloud services (VMs) | Workload prediction, auto-scaling | Cloud | Offline |
| 2021 | Hansel: A Bi-LSTM Based Proactive Auto-Scaler [26] | Bi-LSTM | Cloud services (VMs) | Workload prediction, SLA compliance | Cloud | Offline |
| 2021 | A Proactive Autoscaling and Energy-Efficient VM Allocation Framework Using Online Multi-Resource Neural Network [21] | Multi-resource Neural Network | Virtual Machines (VMs) | Energy efficiency, proactive scaling | Cloud | Online |
| 2020 | Forecasting Cloud Application Workloads with CloudInsight for Predictive Resource Management [20] | Ensemble model (multiple predictors) | Cloud applications | Cost efficiency, SLA compliance | Cloud | Offline |
| 2019 | MLscale: A Machine Learning-Based Application-Agnostic Autoscaler [19] | Neural network with regression | Cloud resources (VMs) | Performance metrics (response time) | Cloud | Online |
| 2019 | Predicting the End-to-End Tail Latency of Containerized Microservices in the Cloud [30] | Machine learning regression | Containers, microservices | Tail latency | Cloud | Offline |
| 2019 | Microscaler: Automatic Scaling for Microservices with an Online Learning Approach [25] | Online Bayesian regression | Microservices | Latency SLO | Cloud | Online |
| 2018 | An Efficient Deep Learning Model to Predict Cloud Workload for Industry Informatics [18] | Deep Neural Network | Cloud services (IaaS) | Workload prediction, industrial applications | Cloud | Offline |
| 2017 | An Adaptive Prediction Approach Based on Workload Pattern Discrimination in the Cloud [16] | SVM, Linear Regression | Cloud tasks (IaaS, VMs) | Workload (throughput, latency) | Cloud | Offline |
| 2016 | Workload Prediction for Cloud Computing Elasticity Mechanism [15] | Random Forest, ARIMA, SVM | Cloud services (IaaS) | Elastic scaling, prediction accuracy | Cloud | Offline |
| 2015 | Workload Prediction Using ARIMA Model and Its Impact on Cloud Applications’ QoS [1] | ARIMA | Web applications (SaaS) | QoS, proactive provisioning | Cloud | Offline |
| 2015 | Self-Adaptive Prediction of Cloud Resource Demands Using Ensemble Model and Subtractive-Fuzzy Clustering Based Fuzzy Neural Network [17] | Ensemble + Fuzzy Neural Network | Cloud services (IaaS) | Resource demand prediction | Cloud | Offline |
Table 5. Survey of Unsupervised Machine Learning Techniques for Cloud Autoscaling (2015–2025).
YearPaper TitleML TechniqueWorkload TypeAutoscaling ObjectiveDeploymentTraining Mode
2025Cloud Workload Forecasting via Latency-Aware Time Series Clustering-Based Scheduling Technique [38]Dynamic fuzzy c-means clusteringCloud services (IaaS)Latency-aware scheduling, resource optimizationCloudOffline
2025Horizontal Autoscaling of Virtual Machines in Hybrid Cloud Infrastructures [39]K-means clusteringVirtual machines (IaaS)Response time, throughput, SLA complianceHybridOffline
2024EFection: Effectiveness Detection Technique for Clustering Cloud Workload Traces [45]Adaptive clustering with internal validationCloud services (IaaS)Workload classification, resource optimizationCloudOffline
| Year | Paper Title | ML Technique | Workload Type | Autoscaling Objective | Deployment | Learning Mode |
|---|---|---|---|---|---|---|
| 2023 | A Spatiotemporal Deep Learning Approach for Unsupervised Anomaly Detection in Cloud Systems [42] | Graph neural network + LSTM (TopoMAD) | Cloud systems (VMs/containers) | Performance anomaly detection, system reliability | Cloud | Offline |
| 2023 | SeQual: An Unsupervised Feature Selection Method for Cloud Workload Traces [44] | Silhouette-based feature selection + clustering | Cloud services (IaaS) | Workload characterization, user identification | Cloud | Offline |
| 2022 | Research on Unsupervised Anomaly Data Detection Method Based on Improved Autoencoder and Gaussian Mixture Model [43] | Deep autoencoder + GMM (MemAe-gmm-ma) | Cloud services (IaaS) | Anomaly detection, cloud security | Cloud | Offline |
| 2021 | Resource Provisioning Using Workload Clustering in Cloud Computing Environment: A Hybrid Approach [36] | QoS-based K-means + fuzzy logic | Cloud services (VMs) | SLA compliance, resource optimization | Cloud | Offline |
| 2021 | An Efficient Resource Provisioning Approach for Analyzing Cloud Workloads: A Metaheuristic-Based Clustering Approach [37] | GA + Fuzzy C-means + Gray Wolf Optimizer | Cloud services (IaaS) | QoS-aware resource provisioning | Cloud | Offline |
| 2020 | Dynamic K-Means Clustering of Workload and Cloud Resource Configuration for Cloud Elastic Model [35] | Enhanced K-means with kernel density estimation | Cloud services (IaaS) | Elastic scaling, workload-resource mapping | Cloud | Offline |
| 2020 | Elastic Resource Provisioning Using Data Clustering in Cloud Service Platform [46] | Clustering ensemble method | Cloud services (IaaS) | Dynamic resource provisioning, task scheduling | Cloud | Online |
| 2019 | ACAS: An Anomaly-Based Cause Aware Auto-Scaling Framework for Clouds [40] | Isolation Forest (anomaly detection) | Virtual machines (VMs) | SLA-aware scaling | Cloud | Online |
| 2018 | A Learning Automata-Based Ensemble Resource Usage Prediction Algorithm [47] | Ensemble prediction with clustering | Cloud services (IaaS) | Resource usage prediction accuracy | Cloud | Online |
| 2018 | PerfInsight: A Robust Clustering-Based Abnormal Behavior Detection System for Large-Scale Cloud [41] | Clustering-based anomaly detection | Cloud services (VMs) | Abnormal behavior detection, system reliability | Cloud | Online |
| 2017 | An Autonomic Prediction Suite for Cloud Resource Provisioning [34] | Unsupervised clustering of resource usage profiles | Cloud services (IaaS) | Improved provisioning accuracy, SLA compliance | Cloud | Offline |
| 2016 | Integrating Clustering and Learning for Improved Workload Prediction in the Cloud [33] | K-means clustering + neural network | Cloud services (IaaS) | Workload prediction, resource provisioning | Cloud | Offline |
| 2015 | Unsupervised Learning of Dynamic Resource Provisioning Policies for Cloud-Hosted Multitier Web Applications [32] | Unsupervised clustering + online learning | Web applications (multi-tier) | Dynamic provisioning, SLO compliance | Cloud | Online |
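A pattern shared by several of the clustering-based works above (e.g., [33,35,36]) is to cluster historical utilisation profiles offline and then map each new workload to the resource configuration of its nearest cluster. The sketch below illustrates that idea with a two-cluster 1-D k-means over average CPU utilisation; the `small`/`large` profile names and the two-cluster choice are illustrative assumptions, not details of any surveyed system.

```python
def kmeans_1d(values, iters=20):
    """Two-cluster 1-D k-means over utilisation samples (illustrative)."""
    # initialise the two centroids at the extremes of the data
    centroids = [min(values), max(values)]
    for _ in range(iters):
        clusters = ([], [])
        for v in values:
            # assign each sample to the nearest centroid
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            clusters[nearest].append(v)
        # recompute centroids as cluster means (keep old centroid if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

def assign_profile(avg_cpu, centroids, profiles=("small", "large")):
    """Nearest-centroid lookup: map a new workload to its cluster's profile."""
    idx = min(range(len(centroids)), key=lambda i: abs(avg_cpu - centroids[i]))
    return profiles[idx]
```

A provisioner built this way would periodically re-run the clustering on fresh traces and keep only the centroid-to-profile mapping online, which is why most entries in the table are marked Offline.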
Table 6. Single-Objective RL Techniques for Cloud Autoscaling (2015–2025).

| Year | Paper Title | RL Technique | Workload Type | Autoscaling Objective | Deployment | RL Method |
|---|---|---|---|---|---|---|
| 2025 | Multi-Agent RL-Based In-Place Scaling Engine for Edge-Cloud [64] | Multi-agent deep RL | Edge-cloud microservices | In-place scaling latency reduction | Edge-cloud | Off-policy |
| 2025 | Gwydion: Efficient Auto-Scaling for Complex Containerized Applications [67] | RL (OpenAI Gym-based) | Microservices (Kubernetes) | Latency-aware horizontal scaling | Cloud | Off-policy |
| 2023 | AWARE: RL-Based Autoscaling in Production Cloud Systems [65] | Meta-RL with safe exploration | Mixed workloads | Minimize SLO violations | Cloud | Off-policy |
| 2023 | Multi-Agent Deep-RL for Application-Agnostic Microservice Scaling [62] | Multi-agent deep RL (MADDPG) | Microservices | Application-agnostic horizontal scaling | Cloud | Off-policy |
| 2023 | Dynamic Multi-Metric Thresholds for Scaling Using RL [59] | Deep Q-learning | Cloud applications | Adaptive scaling thresholds | Cloud | Off-policy |
| 2022 | A Meta RL Approach for Predictive Autoscaling [60] | Meta-RL (PPO-based) | Cloud services | Predictive scaling generalization | Cloud | On-policy |
| 2021 | Intelligent Autoscaling of Microservices for Real-Time Applications [58] | Actor-Critic, DQN, SARSA, Q-learning | Microservices | Response time optimization | Cloud | Mixed |
| 2020 | A-SARSA: Predictive Container Auto-Scaling Based on RL [57] | SARSA + ARIMA prediction | Containers | SLA violation reduction | Cloud | On-policy |
| 2019 | RLPAS: RL-Based Proactive Auto-Scaler [75] | Parallel SARSA | VMs | SLA violation reduction | Cloud | On-policy |
| 2019 | Autonomic Decentralized Elasticity Based on RL Controller [55] | Distributed Q-learning | Web apps | SLA violation reduction | Cloud | Off-policy |
| 2019 | Horizontal and Vertical Scaling Using RL [76] | Model-based RL | Containers | Response time optimization | Cloud | Off-policy |
| 2018 | Efficient Cloud Auto-Scaling with SLA Using Q-Learning [54] | Q-learning | Web apps | SLA-aware scaling | Cloud | Off-policy |
| 2018 | DERP: Deep RL for Elastic Resource Provisioning [56] | Deep Q-Network | NoSQL cluster | Throughput optimization | Cloud | Off-policy |
| 2017 | Comparison of RL Techniques for Fuzzy Cloud Auto-Scaling [53] | Fuzzy SARSA, fuzzy Q-learning | OpenStack VMs | SLA compliance | Cloud | Mixed |
| 2016 | Auto-Scaling Cloud Controller Using Fuzzy Q-Learning [52] | Fuzzy Q-learning | OpenStack VMs | Response time optimization | Cloud | Off-policy |
| 2015 | Adaptive RL-Based Approach for Dynamic Resource Provisioning [50] | Continuous Q-learning | VMs | Energy minimization | Cloud | Off-policy |
| 2015 | Self-Learning Cloud Controllers: Fuzzy Q-Learning [51] | Fuzzy Q-learning | Cloud VMs | Knowledge evolution | Cloud | Off-policy |
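Most single-objective entries in Table 6 share the same control loop: observe a discretised system state, pick a scaling action, receive a reward that penalises SLO violations and over-provisioning, and update a value table. The sketch below is a minimal tabular Q-learning autoscaler under stated assumptions (a constant load of 250 requests/s, a per-replica capacity of 100 requests/s, a cost weight of 0.3, and the replica count as the entire state); real controllers such as [52,54] use richer states and reward signals.

```python
import random

ACTIONS = (-1, 0, 1)  # scale in, hold, scale out

def reward(load, replicas, capacity=100, cost_weight=0.3):
    # penalise unserved demand (an SLO proxy) plus a linear resource cost
    unserved = max(0, load - replicas * capacity) / capacity
    return -(unserved + cost_weight * replicas)

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # standard one-step Q-learning backup on a dict-based table
    best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

def train(episodes=300, steps=10, load=250, seed=0):
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        replicas = rng.randint(1, 6)  # state = current replica count
        for _ in range(steps):
            # epsilon-greedy action selection
            if rng.random() < 0.2:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q.get((replicas, b), 0.0))
            nxt = min(6, max(1, replicas + a))
            q_update(Q, replicas, a, reward(load, nxt), nxt)
            replicas = nxt
    return Q

def policy(Q, replicas):
    """Greedy scaling decision for a given replica count."""
    return max(ACTIONS, key=lambda a: Q.get((replicas, a), 0.0))
```

With these numbers the learned policy settles on three replicas (just enough capacity for the load), scaling out below that point and in above it; the on-policy variants in the table (SARSA-based) differ only in using the action actually taken for the backup.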
Table 7. Multi-Objective RL Techniques for Cloud Autoscaling (2015–2025).

| Year | Paper Title | RL Technique | Workload Type | Autoscaling Objectives | Deployment | RL Method |
|---|---|---|---|---|---|---|
| 2025 | GIRP: Energy-Efficient QoS-Oriented Microservice Resource Provisioning [68] | Multi-objective multi-task DDPG | Microservices | Energy efficiency + latency minimization | Cloud | Off-policy |
| 2025 | Humas: Heterogeneity- and Upgrade-Aware Microservice Auto-Scaling [69] | Adaptive RL | Microservices (large-scale) | Resource heterogeneity + rolling updates | Cloud | Off-policy |
| 2024 | DRPC: Distributed RL for Scalable Resource Provisioning [63] | TD3 (distributed) | Containers (Kubernetes) | QoS + resource utilization | Cloud | Off-policy |
| 2024 | DeepScaling: Autoscaling Microservices With Stable CPU [79] | Deep learning-based | Microservices (production) | CPU stability + SLA compliance | Cloud | Off-policy |
| 2024 | GNN-Based SLO-Aware Proactive Resource Autoscaling [66] | GNN + RL | Microservices | SLO compliance + resource efficiency | Cloud | Off-policy |
| 2024 | DRLBTSA: Deep RL-Based Task-Scheduling Algorithm [80] | Deep Q-Network (DQN) | Cloud tasks (heterogeneous) | Makespan + SLA violations + energy | Cloud | Off-policy |
| 2022 | CoScal: Multifaceted Scaling of Microservices with RL [77] | DQN-based multi-faceted scaling | Microservices | Response time SLO + cost optimization | Cloud | Off-policy |
| 2020 | FIRM: Fine-Grained Intelligent Resource Management [78] | DDPG + SVM-guided RL | Microservices | SLO violation reduction + fine-grained control | Cloud | Off-policy |
| 2019 | RL-Based AutoScaling Approach for SaaS Providers [81] | Q-learning | SaaS apps on VMs | Cost minimization + SLA satisfaction | Cloud | Off-policy |
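The systems in Table 7 typically scalarise several objectives (SLO compliance, cost, energy) into a single reward before applying a standard RL algorithm. A minimal sketch of such a reward is shown below; the weights and the use of replica count and watts as cost and energy proxies are illustrative assumptions, not values from any surveyed paper.

```python
def multi_objective_reward(p99_ms, slo_ms, replicas, watts_per_replica,
                           w_slo=1.0, w_cost=0.01, w_energy=0.001):
    """Scalarised reward: weighted SLO-violation, cost, and energy penalties."""
    slo_term = max(0.0, p99_ms - slo_ms) / slo_ms    # relative SLO violation
    cost_term = replicas                             # instance count as cost proxy
    energy_term = replicas * watts_per_replica       # rough power draw
    return -(w_slo * slo_term + w_cost * cost_term + w_energy * energy_term)
```

Fixed weights pin the controller to one point on the Pareto frontier; the discussion of safety and multi-objective guarantees later in the paper (Table 15) argues for reporting the frontier itself rather than a single operating point.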
Table 8. Comparison of ML Paradigms for Autoscaling.

| Method | Typical Workload Types | Evaluation Environments | Typical Gains Reported |
|---|---|---|---|
| Reinforcement learning (RL) | Microservices, containerized apps | Kubernetes clusters, cloud testbeds, production systems | SLA violation reduction (up to 30%), cost savings (15–25%), energy reduction (up to 43%), improved adaptability under dynamic workloads [62,63,65,68,76] |
| Supervised learning | VM-based workloads, multi-tier web apps, microservices | Simulation tools (CloudSim, AutoScaleSim), controlled testbeds | Accurate workload prediction; proactive scaling reduces latency spikes by 10–20% [19,26,29,31] |
| Unsupervised learning | Hybrid cloud workloads (primarily VMs); containers/microservices | Simulation + emulation; testbeds discussed in reviewed works | Reduced oscillations and improved SLA compliance via proactive methods; clustering for workload grouping enables tailored scaling decisions [38,39,45] |
Table 9. Evaluation Metrics Reported in ML-Based Autoscaling Literature (2015–2025).

| Year | Reference | SLA Compliance | Cost/Resource Usage | Stability | Convergence Time | Prediction Accuracy |
|---|---|---|---|---|---|---|
| 2023 | Jeong et al. [31] | SLA + energy-aware scaling | Cost and carbon footprint | Not reported | RL adaptation time | N/A |
| 2022 | Xue et al. [60] | SLO violation reduction | 50% resource savings | Resource control stability 0.91–0.95 | Fast meta-RL adaptation | Workload RMSE 112.59 |
| 2021 | Golshani & Ashtiani [84] | SLA compliance under bursty traffic | Cost normalized to baseline | Stability index | Adaptation time | Forecast RMSE |
| 2021 | Zhang et al. [85] | SLO violation reduction | Resource efficiency (CPU limits) | Not reported | Not reported | N/A |
| 2021 | Yan et al. [26] | SLA compliance | Cost vs. baseline | Not reported | Not reported | MAE/RMSE (workload prediction) |
| 2021 | Shahidinejad et al. [36] | SLA compliance via QoS-based clustering | Cost savings vs. baseline | Scaling decisions count | Not reported | N/A |
| 2020 | Tamiru et al. [86] | SPEC Cloud metrics (under-/over-provisioning) | Monetary cost comparison | Instability of elasticity | Not reported | N/A |
| 2019 | Wajahat et al. [19] | 79% reduction in SLA violations (vs. reactive baselines) | 23% lower resource costs (vs. reactive baselines) | Not reported | Not reported | RMSE/MAPE on response time prediction (specific values vary by workload; e.g., MAPE of 5–15%) |
| 2019 | Rossi et al. [59] | 95th-percentile latency vs. HPA | Avg. CPU cores (10% savings) | Variance in replica count | RL convergence (episodes) | N/A |
| 2019 | Benifa and Dejey [75] | SLA violation penalty in reward | VM-hours vs. baseline | Scaling frequency | Training episodes to convergence | N/A |
| 2017 | Arabnejad et al. [53] | SLA compliance | Cost reduction vs. baseline | Not reported | Not reported | N/A |
| 2015 | Calheiros et al. [1] | QoS impact analysis | Resource utilization | Prediction accuracy (RMSE/MAE) | Scaling overhead | Response time |
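The accuracy and compliance metrics recurring in Table 9 (RMSE, MAPE, SLA violation rate) are simple trace statistics; for concreteness, minimal reference implementations are given below.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of a workload forecast."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error (actual values must be non-zero)."""
    n = len(actual)
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n

def sla_violation_rate(latencies_ms, slo_ms):
    """Fraction of requests whose latency exceeded the SLO threshold."""
    return sum(1 for l in latencies_ms if l > slo_ms) / len(latencies_ms)
```

Note that MAPE is undefined when the actual series contains zeros (common in bursty workload traces), which is one reason several of the surveyed papers report RMSE alongside or instead of it.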
Table 10. Benchmarking Applications and Evaluation Settings in Autoscaling Literature (2015–2025).

| Year | Reference | Benchmark Application | Real Cloud Platform | Interaction Mode | Workload Characteristics | Autoscaler Type Tested | Open-Source Availability |
|---|---|---|---|---|---|---|---|
| 2023 | Jeong et al. [31] | Sustainability-focused workloads | Yes (AWS) | Real deployment | Mixed CPU/energy profiles | RL-based | No |
| 2023 | Betti et al. [62] | Custom microservice benchmark (complex inter-service dependencies) | No | Simulation/testbed (Kubernetes-like environment) | Complex inter-service graph | Multi-agent deep RL (MADDPG) | Yes |
| 2021 | Zhang et al. [85] | DeathStarBench (social network) | No | Kubernetes testbed | Microservices, diurnal patterns | ML-based (Sinan) | Yes |
| 2020 | Tamiru et al. [86] | No (evaluation of configurations) | No | Real cloud testbed (GKE) | Mixed (representative applications, e.g., web serving, batch processing) | Experimental evaluation of Kubernetes Cluster Autoscaler | Yes |
| 2019 | Rossi et al. [59] | RUBiS, DVD Store | No | Kubernetes testbed | Diurnal cycles, bursty traffic | RL-based | Yes |
| 2019 | Gao et al. [88] | DeathStarBench (social network) | No | Testbed | Microservices, inter-service dependencies | GNN-based autoscaler | Yes |
| 2017 | Arabnejad et al. [53] | Synthetic workloads | No (OpenStack testbed) | Real deployment | Traffic variability, sudden spikes | Fuzzy Q-learning | No |
Table 11. ML Usage Across MAPE-K Phases.

| Phase | ML Focus | Example Use/Works |
|---|---|---|
| Monitor | Anomaly detection; metric imputation | Proactive anomaly sensing via agentic AI [89] |
| Analyze | Forecasting; bottleneck and root-cause analysis | Multivariate forecasting for proactive analysis [90]; SLO-driven and uncertainty-aware analysis using fuzzy logic [91,92] |
| Plan | Action selection; model predictive control (MPC) | Neural network prediction + evolutionary optimization for capacity planning [21]; deep RL for scaling decisions [81]; fuzzy + RL for adaptive autoscaling [92] |
| Execute | Action ordering/coordination | Agentic AI coordinating execution with human-in-the-loop safeguards [89] |
| Knowledge | Storage of models and histories | SLO-oriented knowledge base for adaptation [91]; evolving knowledge for uncertainty-aware autoscaling [92] |
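The phase separation in Table 11 maps naturally onto code: a loop that monitors metrics into a knowledge store, analyses them with a forecaster, and plans a replica count, leaving execution to the platform. The skeleton below is a generic illustration with pluggable callables, not the interface of any surveyed framework.

```python
class MapeKAutoscaler:
    """Minimal MAPE-K skeleton: Monitor -> Analyze -> Plan, with a shared
    Knowledge store; Execute is left to the caller (e.g., a platform API)."""

    def __init__(self, forecast, plan):
        self.knowledge = {"history": []}  # Knowledge: shared metric history
        self.forecast = forecast          # Analyze: workload prediction
        self.plan = plan                  # Plan: map forecast -> replica count

    def step(self, metric):
        self.knowledge["history"].append(metric)              # Monitor
        predicted = self.forecast(self.knowledge["history"])  # Analyze
        return self.plan(predicted)                           # Plan
```

Swapping the `forecast` callable for an LSTM, or the `plan` callable for an RL policy, reproduces the placement of ML techniques in the Analyze and Plan rows of the table without changing the loop itself.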
Table 12. Summary of ML-Driven Autoscaling Frameworks.

| Framework | ML Technique | Integration Approach |
|---|---|---|
| Saxena & Singh [21] | Neural network forecasting + evolutionary optimization | Proactive VM autoscaling and energy-efficient placement; evaluated on the Google Cluster Dataset |
| MARLISE [64] | Multi-agent deep RL (DQN/PPO) | Independent agents for vertical scaling of individual microservices; implicit coordination via a shared environment in edge-cloud deployments |
| FIRM [78] | Hierarchical RL | Service-level and cluster-level coordination; uses tracing for dependency-aware scaling |
| AWARE [65] | RL + safety logic | Integrated with the Kubernetes scheduler; offline-trained RL with runtime safety checks |
Table 13. ML-Based Autoscaling Across Major Cloud Platforms (2015–2025).

| Platform | Integration Method | Example Work (Reference) |
|---|---|---|
| AWS (EC2 + ALB) | PASS: predictive auto-scaling for large-scale web applications (forecast-driven horizontal scaling with warmup-aware provisioning and QoS guarantees) | Li et al. (2024): deployed system at Tencent using deep learning workload prediction and risk-aware scaling decisions [97] |
| Azure (VMs/Scale Sets) | ML-enhanced autoscale rules | Arabnejad et al. (2017): fuzzy Q-learning controller deployed on Azure/OpenStack for cost-SLA-optimal VM scaling [52] |
| Azure (PaaS/Serverless) | Platform-managed predictive autoscaling | Poppe et al. (2022): "Moneyball" predictive scaler for Azure SQL Database (serverless tier); eliminates reactivation latency [96] |
| Kubernetes (HPA) | External predictive metrics fed into HPA | Yu et al. (2019): Microscaler uses online Bayesian regression to forecast load and drive proactive HPA decisions [25] |
| Kubernetes (custom controller) | Fully custom autoscaler replacing HPA | Wu et al. (2019): deep RL agent as a drop-in HPA replacement for microservices [81]; Rossi et al. (2019): Q-learning-based custom controller |
| Kubernetes (hybrid) | Combined horizontal + vertical scaling via RL | Pan et al. (2022): RL agent selects scale-out/in or resource resize actions [67]; Qiu et al. (2023): AWARE meta-RL framework dynamically adjusts replicas and resources (16× fewer SLA violations than default) [65] |
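For context on the Kubernetes rows above: the stock HPA computes desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and skips scaling when the metric-to-target ratio is within a tolerance band (0.1 by default). Predictive autoscalers in the "external metrics" style of [25] can reuse this rule unchanged by substituting a forecast metric for the current one. A sketch of the rule:

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         tolerance=0.1, min_replicas=1, max_replicas=10):
    """Kubernetes-HPA-style rule: desired = ceil(current * metric / target),
    with a tolerance band around the target to suppress oscillation."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    # clamp to the configured replica bounds
    return min(max_replicas, max(min_replicas, desired))
```

The `min_replicas`/`max_replicas` defaults here are placeholders; in a real HPA spec they come from the resource definition, and feeding a predicted metric into `current_metric` is what turns this reactive rule into a proactive one.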
Table 14. Comparison of related works on energy-efficient autoscaling.

| System (Year) | Energy Efficiency Technique | Workload & Environment | SLO Guarantees | Scalability & Context |
|---|---|---|---|---|
| μ-Serve (2024) [70] | Fine-grained GPU DVFS; model partitioning; speculative scheduling | Deep learning inference (CNNs, Transformers) in homogeneous GPU clusters | Yes: strict SLO preservation via MIAD frequency control | Evaluated on an 8-node (16-GPU) cluster; no horizontal scaling or heterogeneity support |
| A proactive energy-aware auto-scaling solution for edge-based infrastructures (2022) [74] | Proactive horizontal autoscaling; energy-aware node selection | Edge computing with heterogeneous nodes | Yes: predictive scaling with 0% failed requests | Simulated up to 500 nodes; focuses on idle/dynamic energy; no device-level DVFS |
| AutoScale (2020) [72] | RL-based execution target selection (edge/cloud/mobile) | Edge inference (mobile devices, cloud offloading) | Yes: latency and accuracy included in the RL reward | Per-device RL agents; handles heterogeneity; not cluster-scale |
| Energy-efficient Inference Service of Transformer-based Deep Learning Models on GPUs (2020) [73] | Batch scheduling + GPU DVFS for Transformer inference | Transformer-based NLP inference on GPUs | Partial: allows latency increase for energy savings | Single-node focus; tested on multiple GPU types; no autoscaling or cluster-level control |
| MArk (2019) [71] | Predictive autoscaling; multi-tier provisioning (IaaS + FaaS); batching | ML inference on AWS (image, language models) | Yes: predictive scaling + serverless fallback for SLO compliance | Scales across cloud instances; cost-focused; no energy metrics; supports heterogeneous provisioning |
Table 15. Summary of key discussion questions and challenges in ML-based autoscaling.

| Question | Primary Taxonomy Nodes | Key Approaches | Guideline (Q&A Answer) | Open Challenges |
|---|---|---|---|---|
| When does a learned scaler beat a tuned threshold? | Goal, Decision | Threshold policies; supervised prediction; RL-based autoscalers | Learned scalers are most beneficial for complex, non-stationary, and multi-objective workloads; tuned thresholds remain strong baselines for simple, stable cases | Systematic characterisation of regimes where ML reliably outperforms tuned thresholds; standardised baselines and reporting |
| How do actuation delays and telemetry lag change the winner? | Decision, Deployment, Monitor/Knowledge | Reactive vs predictive control; delay-aware policies | Design autoscalers with explicit knowledge of actuation and telemetry delays; use forecasting and conservative scale-in under long delays | Stratified evaluations across delay regimes; explicit delay modelling in learning algorithms and control loops |
| Is diagonal (hybrid) scaling worth the complexity? | Scaling, Control Scope | Horizontal, vertical, hybrid (diagonal) scaling; RL-based hybrid controllers | Use hybrid scaling when both primitives are available and workloads have shifting bottlenecks; otherwise, well-tuned pure strategies may suffice | Multi-resource optimisation under hybrid policies; stability and restart-aware designs; general evidence on benefit vs complexity |
| Centralised vs decentralised control: who scales better as service count grows? | Control Scope, Deployment | Centralised controllers; per-service policies; MARL; hierarchical models | Apply centralised or hierarchical control for tightly coupled services; use decentralised control for large, loosely coupled microservice or edge deployments with light global coordination | Scaling behaviour with service count; MARL reward design and coordination; observability and debugging in distributed autoscaling |
| What is the cost of safety and multi-objective guarantees? | Goal, Decision, Deployment | Safe/constrained RL; external guardrails; multi-objective optimisation; carbon-aware policies | Quantify SLO, cost, and energy trade-offs; treat safety as a first-class objective and report Pareto frontiers rather than single operating points | Quantifying the "cost of safety"; dynamic movement along Pareto frontiers; integrating pricing, penalties, and carbon targets into robust ML controllers |
| What are the challenges in achieving energy-efficient autoscaling? | Goal, Decision, Deployment | Device-level DVFS; predictive autoscaling; energy-aware scheduling; RL-based execution scaling | Combine device-level power management with cluster-level autoscaling; report energy-performance trade-offs explicitly; evaluate across heterogeneous hardware and dynamic workloads | Integration of vertical and horizontal scaling; handling hardware heterogeneity; dynamic energy-aware control under workload and system variability; transparent reporting of energy trade-offs and carbon impact |
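As the delay question in Table 15 suggests, a common delay-aware heuristic is asymmetric: scale out on a single high sample (waiting is costly when actuation is slow), but scale in only after sustained low utilisation. The sketch below illustrates this with purely illustrative thresholds (80% and 30% utilisation, a five-sample scale-in window); it is a hedging rule, not a policy from any surveyed system.

```python
def delay_aware_decision(history, high=0.8, low=0.3, scale_in_window=5):
    """Asymmetric policy for slow actuation: scale out on one high sample,
    scale in only after `scale_in_window` consecutive low samples."""
    if not history:
        return 0
    if history[-1] >= high:
        return 1   # scale out immediately: delay makes waiting costly
    recent = history[-scale_in_window:]
    if len(recent) == scale_in_window and all(u <= low for u in recent):
        return -1  # sustained low load: safe to release capacity
    return 0       # otherwise hold
```

The learned equivalents discussed in the table fold the same asymmetry into the reward or the state (e.g., by including in-flight scaling actions), rather than hard-coding windows and thresholds.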
Table 16. Summary of future research directions in ML-based autoscaling.

| Direction | Focus Area | Main Open Issues |
|---|---|---|
| Unified ML-Orchestration Frameworks | Integrated monitoring, prediction, and scaling across tiers; extended MAPE-K loops | Coordination of multi-tier scaling actions under shared bottlenecks and delays; design of ML-driven control planes with clear interfaces and guarantees |
| Edge Intelligence and Decentralised Scaling | Hierarchical control for edge-cloud; decentralised and federated updates | Robust operation under heterogeneous delays, bandwidth limits, and partial observability; partitioning of control between edge and cloud |
| Multi-Agent and Federated Learning | MARL for service-level scaling; federated learning for cross-cluster policy sharing | Reward design and coordination that align local agents with global objectives; stability and adaptability under non-stationary workloads |
| Standardised Benchmarks and Evaluation | Common workloads, metrics, and reproducible simulation and deployment environments | Community-agreed benchmark suites, baselines, and reporting guidelines; long-term maintenance of artefacts and reference implementations |
| Sustainability and Green Autoscaling | Carbon-aware autoscaling; multi-objective optimisation including energy and emissions | Joint optimisation of cost, performance, and carbon footprint; modelling and exploiting temporal/spatial flexibility under realistic constraints |

Machiraju, V.S.; Kumar, V.; Sharma, S. ML-Based Autoscaling for Elastic Cloud Applications: Taxonomy, Frameworks, and Evaluation. Math. Comput. Appl. 2026, 31, 49. https://doi.org/10.3390/mca31020049