1. Introduction
With the rapid development of technology in the last decade, cloud computing has been gaining popularity in not only industry but also the academic environment [
1]. A core element in cloud computing is virtualization technology, which helps cloud providers run multiple operating systems and applications on the same computing device, such as servers. Traditionally, hypervisor-based virtualization is a technique used to create virtual machines (VMs) with specific resources (e.g., CPU cores, RAM, and network bandwidth) and a guest operating system [
2]. Some widely used hypervisors include ESX, Xen, and Hyper-V. Container-based virtualization is a lightweight alternative to VMs [
3], which can help decrease the start-up time and consume fewer resources than VMs [
1]. Some well-known examples of this technology include LXC, Docker, Kubernetes, and Kata.
The main feature of cloud computing is elasticity, which allows application owners to manage unpredictable workloads, which are inherent in internet-based services [
4], by provisioning or de-provisioning resources based on the demand to improve performance while reducing costs [
5]. Autoscaling refers to a process that dynamically acquires and releases resources and can be categorized into two types: reactive and proactive. Reactive approaches perform scaling actions by analyzing the current state of the system against predefined rules or thresholds. Proactive approaches analyze historical data, predict the future, and perform scaling decisions accordingly. Selecting an appropriate autoscaling method affects quality parameters, such as CPU utilization, memory, and response time, because over-provisioning leads to resource wastage and is expensive due to pay-per-use pricing, while under-provisioning degrades the system performance. Many works [
6,
7,
8] related to hypervisor-based autoscaling solutions have been proposed, whereas works on container-based solutions are still at an early stage [
9].
Many existing reactive autoscaling solutions use rules with thresholds that are based on infrastructure-level metrics, such as CPU utilization and memory [
10,
11,
12]. Although these solutions are easy to implement, choosing the right values for thresholds is difficult because the workload continuously fluctuates depending on the user behavior. For proactive solutions, many works using time series data analysis have been proposed [
6,
7,
13,
14,
15], whereas only a few have used machine learning algorithms for designing autoscalers [
8,
16,
17].
Artificial intelligence technology, especially deep learning, has been implemented in many fields including computer vision, speech recognition, biomedical signal analysis, etc. [
18]. It also shows promise for architecting general autoscalers [
19]. This paper aims to design a proactive autoscaling framework for Kubernetes that can provide fast and accurate scaling decisions to provision resources ahead of time. The proactive autoscaling framework uses a time-series analysis method to predict future resource demand. There are various AI approaches for time-series analysis, such as the recurrent neural network (RNN), artificial neural network (ANN), and long short-term memory (LSTM). In contrast to these approaches, the bidirectional long short-term memory (Bi-LSTM) is an extension of LSTM that is better suited to time-series data because it preserves both past and future information. Thus, Bi-LSTM may achieve better prediction accuracy than RNN and LSTM. More technical details on how Bi-LSTM works are discussed in
Section 4. The major contributions of this paper are as follows:
First, a bidirectional long short-term memory (Bi-LSTM)-based [
20] proactive autoscaling system architecture designed for Kubernetes is proposed, and it is shown that the proposed autoscaling system achieves a higher accuracy in resource estimation than the default horizontal pod autoscaler of Kubernetes.
Second, a Bi-LSTM prediction model for time-series analysis is implemented, and it is shown that it can outperform existing proactive autoscaling time-series prediction models.
Third, a resource removal strategy (RRS) is proposed to remove only part of the surplus resources when the workload decreases, so that the system can better handle a burst of workload in the near future.
The results of experiments conducted using different real-trace workload datasets indicate that Bi-LSTM achieves better accuracy than not only the LSTM model but also the well-known state-of-the-art statistical ARIMA model in terms of short- and long-term forecasting. In addition, its prediction speed is 530 to 600 times faster than that of the ARIMA model and almost equal to that of the LSTM model when examined with different workloads. Furthermore, compared to the LSTM model, the Bi-LSTM model performs better in terms of autoscaler metrics such as provision accuracy and elastic speedup.
The remainder of this paper is organized as follows. In
Section 2, we review the recent autoscaling studies conducted using time-series analysis. In
Section 3 and
Section 4, the system design and concept of the proactive custom autoscaler are introduced. The prediction model is discussed in
Section 5.
Section 6 presents the experiment conducted for evaluating the proposed prediction model followed by the discussion in
Section 7. We conclude the paper in
Section 8.
2. Related Works
In this section, we review the recent studies related to autoscaling applications conducted using a time-series analysis technique in the cloud system. Time-series analysis has applications in many areas, such as weather forecasting, earthquake prediction, and mathematical finance. It uses previous and current observed values to predict future values [
21]. In this study, it is used to predict future workload or required resources [
22]. An example is the number of requests arriving at the system within one-minute intervals.
Table 1 presents a review of important autoscaling methods.
Al-Dhuraibi et al. [
11] presented an architecture called ELASTICDOCKER, which uses a reactive approach based on threshold-based rules to perform scaling actions. ELASTICDOCKER vertically scales both memory and virtual CPU cores according to the workload. The disadvantage of vertical scaling is that it is limited by the capacity of the hosting machine. To address this problem, ELASTICDOCKER performs live migration of the container when the hosting machine does not have enough resources. The experimental results showed that this approach helps reduce cost and improve the quality of experience for end-users.
In Kubernetes (K8S), the Horizontal Pod Autoscaler (HPA) [
12] is a control loop that reactively scales the number of pods based on CPU utilization, regardless of how the workload or application is behaving. It maintains the average CPU utilization across all pods at a desired value by increasing or decreasing the number of pods, for example, $\text{desiredReplicas} = \lceil \text{currentReplicas} \times (\text{currentMetricValue} / \text{desiredMetricValue}) \rceil$.
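As a worked example of this rule (the numbers are invented for illustration), the calculation maps directly to a one-line function:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         desired_metric: float) -> int:
    """Kubernetes HPA rule: ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * current_metric / desired_metric)

# Example: 4 pods averaging 80% CPU with a 50% target are scaled out to 7 pods.
print(hpa_desired_replicas(4, 80.0, 50.0))  # 7
```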
The traditional approaches for time-series analysis and prediction include statistical methods such as autoregression (AR), moving average (MA), autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA).
Zhang et al. [
6] proposed an architecture using a reactive workload component for hybrid clouds, which classifies the incoming workload into the base workload and trespassing workload. The first workload is handled using ARIMA, while the second is handled using a public cloud.
However, the limitation of the reactive approach is that it can only react to a change in workload after it actually happens. Thus, the system has to spend a certain amount of time reconfiguring itself to adapt to the new workload. In contrast, the proactive approach analyzes historical data and forecasts the future resources required for the scaling action. We review some recent studies using time-series analysis for the proactive approach below.
Li and Xia [
13] proposed an autoscaling platform, called the cloud resource prediction and provisioning scheme (RPPS), which can scale web applications in hybrid clouds. RPPS uses ARMA to analyze the past resource usage (CPU utilization) and forecast the future usage to provide the required resources. For the evaluation, they compared RPPS with the reactive approach of Kubernetes HPA [
12] and showed that their approach achieved better results under varying workloads.
Rodrigo et al. [
7] proposed a cloud workload prediction module using ARIMA, which can dynamically provision VMs for serving the predicted requests. For the performance evaluation, they used real traces of HTTP web server requests from Wikimedia Foundation [
26]. This model provides 91% accuracy for seasonal data; however, it is not suited for non-seasonal workloads. In addition, the authors did not compare this model with any other approach.
Ciptaningtyas et al. [
14] also used ARIMA to predict the amount of future workload requested, as it can achieve higher accuracy for short-term forecasting. They evaluated four ARIMA models, all with a degree of differencing (d) of 1 and a moving-average order (q) of 0, while varying the lag order (p) from 1 to 4. The results showed that the model with a lag order of 4 had the lowest prediction error for the incoming requests compared to the other models.
Messias et al. [
15] proposed a prediction model using genetic algorithms (GAs) to combine many statistical methods. They used three logs extracted from real web servers to evaluate their model. The results showed that the proposed model yields the best results. However, with the NASA web server logs [
27], the ARIMA model achieved the best prediction accuracy. The ARIMA model's configuration was chosen using the auto.arima() function of the R package proposed in [
28], which automatically selects the values of (p, d, q).
The prediction model proposed in [
23] used an artificial neural network (ANN) to forecast the task duration and resource utilization. To prepare the ANN dataset, a crawler was implemented to gather the number of files and the size of repositories from GitHub, along with the task length. The ANN model was then trained offline. The proposed model incurred less than 20% prediction error compared to a simple linear prediction model. However, as it was trained offline, this model is not suitable for real applications.
Prachimutita et al. in [
8] introduced a new autoscaling framework using artificial neural network (ANN) and recurrent neural network (RNN) models to predict future workload. Then, based on the forecasted workload, they computed the RAM and CPU cores needed to keep services working under the Service Level Agreement (SLA). The performance was evaluated using access logs from the 1998 World Cup website [
26] for multi-step-ahead forecasting. The results showed that the accuracy of the ARIMA model became worse as more steps ahead were predicted. Furthermore, the LSTM model achieved better accuracy than the MLP model.
Imdoukh et al. in [
16] also proposed a machine learning-based autoscaling architecture. The resource estimator uses an LSTM model, and an experiment was conducted on the 1998 World Cup website dataset. They compared the obtained results with those obtained by not only the ARIMA model but also the ANN model. The results showed that the proposed LSTM model has a slightly higher prediction error than the ARIMA model in one-step forecasting, but its prediction speed is 530 to 600 times faster than that of ARIMA.
Tang et al. [
17] proposed a container load prediction model using the Bi-LSTM approach, which uses the container's past CPU utilization to predict the future load. The proposed model exhibited the lowest prediction error compared to the ARIMA and LSTM models. However, the authors did not mention how to configure the parameters of the proposed model. Moreover, the paper focuses only on predicting the future load and does not apply it to the autoscaling problem.
Ming Yan et al. [
24] proposed a hybrid elastic scaling method for Kubernetes by combining both reactive and proactive approaches. The proactive approach uses a Bi-LSTM model to learn the physical host and pod resource usage history (CPU utilization and memory usage) to predict the future workload. The Bi-LSTM prediction model is then combined with online reinforcement learning as the reactive model to make elastic scaling decisions. The experiments show that this approach can help the system meet microservice SLAs in edge computing environments. The Bi-LSTM model also had the smallest prediction error on the root mean square error (RMSE) metric compared to the ARIMA, LSTM, and RNN models. However, an oscillation mitigation solution was not addressed.
Laszlo Toka et al. [
25] proposed using an AI-based forecast method for a proactive scaling policy. The AI-based forecast method consists of AR, hierarchical temporal memory (HTM), and LSTM methods. Each model is used to learn and predict the rate of incoming web requests. They also proposed a backtesting plugin called HPA+ to automatically switch between the methods in the AI-based model and the HPA. If the performance of the AI-based model becomes poor, HPA+ switches to the HPA, and vice versa. The results show that HPA+ can decrease the number of rejected requests significantly at the cost of slightly higher resource usage.
Most reactive approaches use rule-based solutions. In these solutions, users have to predefine rules and set thresholds. Although such rules are easy to implement, choosing proper threshold values is difficult because the workload continuously fluctuates depending on user behavior. Moreover, reactive approaches can only react to workload changes after they happen and spend a certain amount of time reconfiguring the system to adapt to the new workload. In proactive approaches, statistical methods (AR, MA, ARMA, ARIMA, etc.) are utilized to predict the future workload and prepare scaling actions ahead of time. However, these techniques are relatively slow in matching dynamic workload demands. Due to the limitations of the reactive approach, we focus our work on the proactive approach only. The development of artificial intelligence, especially deep learning, has achieved tremendous success in various fields such as natural language processing, computer vision [
19], and disease classification [
29]. However, the use of deep learning in container autoscaling is still in its early stage. Thus, motivated by the huge success of deep learning, we propose a proactive custom autoscaler that uses the Bi-LSTM model to scale the number of pods for dynamic workload changes in Kubernetes automatically.
4. Proactive Custom Autoscaler
As mentioned earlier, the system architecture is based on the MAPE loop, and our main contributions are the Bi-LSTM forecasting service (the analysis phase) and the adaption manager service (the planning phase), which form the core of the autoscaler.
Figure 2 shows the architecture of the autoscaler with its four main phases. Next, we describe each phase of this autoscaler in detail.
4.1. Monitor Phase
In this phase, the monitoring server, described in
Section 3, continuously collects different types of data for the analysis and planning phases to determine appropriate scaling actions. In this paper, the data are collected from two sources: (1) the networking data of the load balancer, such as the number of HTTP requests per second, through the application metric collector, and (2) the pod information of all running nodes, such as the number of pod replicas, which is retrieved from the master node (via kube-apiserver).
After collecting the data, the monitoring server aggregates it and sends it to the Prometheus time-series database. By adding a time-series database, we can maintain all collected data as a historical record and use it to train the prediction model, improving its prediction accuracy. One possible realization of this phase is sketched below.
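The minimal sketch below assumes a Prometheus Pushgateway and the official Kubernetes and Prometheus Python clients, together with a placeholder deployment name; the paper does not prescribe these details, so this is only one possible realization:

```python
from kubernetes import client, config
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_monitoring_sample(requests_per_second: float,
                              deployment: str = "web-app",   # placeholder name
                              namespace: str = "default") -> None:
    """Aggregate one monitoring sample and push it to the time-series database."""
    # (2) Pod information retrieved from the master node via kube-apiserver.
    config.load_incluster_config()
    scale = client.AppsV1Api().read_namespaced_deployment_scale(deployment, namespace)
    replicas = scale.status.replicas

    # (1) Networking data (request rate) plus the replica count, pushed to Prometheus
    # through a Pushgateway (an assumption of this sketch).
    registry = CollectorRegistry()
    Gauge("http_requests_per_second", "Load balancer request rate",
          registry=registry).set(requests_per_second)
    Gauge("pod_replicas", "Current number of pod replicas",
          registry=registry).set(replicas)
    push_to_gateway("pushgateway.monitoring.svc:9091",
                    job="monitoring-server", registry=registry)
```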
4.2. Analysis Phase—Bi-LSTM Forecasting Service
In the analysis phase, the Bi-LSTM forecasting service periodically obtains the latest collected metric data with window size w through Prometheus's RESTful API. The trained model then uses these latest data to predict the next value in the sequence. In our implementation, the Bi-LSTM neural network model is used to predict the future workload of HTTP requests; thus, we discuss the Bi-LSTM neural network model in more detail.
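Before describing the model, the retrieval step itself can be illustrated with a short sketch; the Prometheus endpoint, metric query, and window size below are placeholders rather than values taken from our deployment:

```python
import time

import numpy as np
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"   # assumed endpoint
QUERY = "sum(rate(nginx_http_requests_total[1m]))"          # hypothetical metric
WINDOW_SIZE = 30  # w: number of past one-minute observations fed to the model

def fetch_latest_window(window_size: int = WINDOW_SIZE) -> np.ndarray:
    """Read the last `window_size` per-minute request rates from Prometheus."""
    end = time.time()
    start = end - window_size * 60
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": QUERY, "start": start, "end": end, "step": "60s"},
    )
    resp.raise_for_status()
    samples = resp.json()["data"]["result"][0]["values"]  # [[timestamp, "value"], ...]
    return np.array([float(value) for _, value in samples[-window_size:]])

# The trained Bi-LSTM model (Section 5) then predicts the next value:
# next_load = model.predict(fetch_latest_window().reshape(1, WINDOW_SIZE, 1))
```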
First, we discuss the standard (forward, or unidirectional) long short-term memory (LSTM). LSTM is a member of the deep RNN family, which is used in many fields, including sentiment analysis, speech recognition, and time-series analysis. It is a modified version of the RNN architecture, first introduced by Hochreiter and Schmidhuber [
34] in 1996 and popularized by many studies thereafter. Unlike feed-forward neural networks, RNNs take the current input and the previous time-step output as input to build the network [
35]. They are networks with loops in them, which allow historical data to be maintained. However, traditional RNNs struggle to learn long-term dependencies because of the vanishing gradient problem: in RNNs, the gradient can shrink or grow exponentially over time [
36]. To address this problem, LSTM, a special type of RNN, was proposed [
37]. However, LSTM can only process information in one direction, which is a limitation because it ignores future context; therefore, LSTM can only capture partial information.
The regular RNN or LSTM traverses the input sequence in the forward direction of the original time series. The Bi-LSTM [
20] is an extension of the LSTM model in which two LSTMs are applied to the input data. The first LSTM is trained on the input sequence in its original (forward) direction. The second LSTM is fed the reversed form of the input sequence (backward). The Bi-LSTM architecture thus consists of two LSTM networks stacked on top of each other, processing the input in two directions: one from past to future and one from future to past. The difference from a forward LSTM is that running backward preserves information from the future. By combining the two hidden states, one can preserve information from both the past and the future. Applying the LSTM twice improves the learning of long-term dependencies and consequently the accuracy of the model [
38].
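For concreteness, such a model can be defined in a few lines with Keras; the layer width and window length below are illustrative placeholders rather than the configuration evaluated in Section 5:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

WINDOW_SIZE = 30  # assumed number of past observations per input sample

# One Bi-LSTM layer: a forward and a backward LSTM over the same input window,
# followed by a dense layer that outputs the next predicted workload value.
model = Sequential([
    Bidirectional(LSTM(64), input_shape=(WINDOW_SIZE, 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# model.fit(X_train, y_train, epochs=50, batch_size=32)
# where X_train has shape (samples, WINDOW_SIZE, 1) and y_train (samples, 1).
```

The Bidirectional wrapper maintains the forward and backward LSTMs and merges their hidden states, corresponding to the mechanism formalized below.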
The Bi-LSTM processes data in both directions with two separate hidden LSTM layers to capture more information. The Bi-LSTM equations are expressed as (1)–(3) below:

$$\overrightarrow{h}_t = f\left(W_{x\overrightarrow{h}}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right) \tag{1}$$

$$\overleftarrow{h}_t = f\left(W_{x\overleftarrow{h}}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\right) \tag{2}$$

$$y_t = W_{\overrightarrow{h}y}\, \overrightarrow{h}_t + W_{\overleftarrow{h}y}\, \overleftarrow{h}_t + b_y \tag{3}$$

where $W_{x\overrightarrow{h}}$ and $W_{x\overleftarrow{h}}$, $W_{\overrightarrow{h}\overrightarrow{h}}$ and $W_{\overleftarrow{h}\overleftarrow{h}}$, and $W_{\overrightarrow{h}y}$ and $W_{\overleftarrow{h}y}$ are different weight matrices; $f$ denotes the activation function, for example, the tanh or logistic sigmoid; and $b_{\overrightarrow{h}}$, $b_{\overleftarrow{h}}$, and $b_y$ are the forward, backward, and output layer biases, respectively.

Given an input sequence $x = (x_1, \ldots, x_T)$, a Bi-LSTM computes the forward hidden sequence, $\overrightarrow{h} = (\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T)$, the backward hidden sequence, $\overleftarrow{h} = (\overleftarrow{h}_1, \ldots, \overleftarrow{h}_T)$, and the output sequence, $y = (y_1, \ldots, y_T)$, by iterating the forward layer from $t = 1$ to $T$ and the backward layer from $t = T$ to $1$ and then updating the output layer.

For the $t$-th time step, the two parallel hidden layers in the forward and backward directions both process $x_t$ as input. The hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the outputs of the two hidden layers in the forward and backward directions at time step $t$, respectively. The output value $y_t$ then depends on the sum of the hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, as described in Equation (3).
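A compact NumPy sketch of Equations (1)–(3) is given below, with a simple tanh recurrence standing in for the full LSTM cell and random weights used purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 1, 4            # sequence length, input size, hidden size
x = rng.normal(size=(T, d_in))    # input sequence x_1 .. x_T

# Weight matrices and biases of Equations (1)-(3), random purely for illustration.
W_xhf, W_hhf, b_hf = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
W_xhb, W_hhb, b_hb = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
W_hyf, W_hyb, b_y = rng.normal(size=(1, d_h)), rng.normal(size=(1, d_h)), np.zeros(1)

h_f = np.zeros((T, d_h))  # forward hidden sequence
h_b = np.zeros((T, d_h))  # backward hidden sequence

# Forward layer: iterate t = 1 .. T (Equation (1)).
for t in range(T):
    prev = h_f[t - 1] if t > 0 else np.zeros(d_h)
    h_f[t] = np.tanh(W_xhf @ x[t] + W_hhf @ prev + b_hf)

# Backward layer: iterate t = T .. 1 (Equation (2)).
for t in reversed(range(T)):
    nxt = h_b[t + 1] if t < T - 1 else np.zeros(d_h)
    h_b[t] = np.tanh(W_xhb @ x[t] + W_hhb @ nxt + b_hb)

# Output layer: combine the forward and backward hidden states (Equation (3)).
y = np.array([W_hyf @ h_f[t] + W_hyb @ h_b[t] + b_y for t in range(T)])
print(y.shape)  # (5, 1)
```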
4.3. Planning Phase—Adaption Manager Service
In this phase, the Adaption Manager Service calculates the number of pods required for provisioning or de-provisioning based on the predicted workload of HTTP requests from the previous phase. It is designed to scale pods in and out to meet the future workload.
Algorithm 1 explains our scaling strategy after the predictor provides the predicted number of HTTP requests in the next interval, $W_{t+1}$. Oscillation is a problem in which the scaler performs opposite actions frequently within a short period [
9], which wastes resources and cost. To address this problem, we set the cool-down time (CDT) to 60 s (1 min) after each scaling decision, which is fine-grained for the container scenario. We denote the maximum workload that a pod can process in a minute as $R_{max}$ and the minimum number of pods that should be maintained as $P_{min}$; the system cannot scale down the number of pods below this value. During the system's operation, for each minute, we calculate the number of pods required in the next time step, $P_{required}$, and compare it with the number of current pods, $P_{current}$. If the required pods exceed the current pods, the scale-out command is triggered and the scaler increases the number of pods to meet the resource demand in the near future. Otherwise, if the number of required pods is lower than the number of current pods, the scaler removes a number of surplus pods, $P_{surplus}$, following the resource removal strategy (RRS), to stabilize the system while still reacting quickly if a burst of workload occurs in the next interval. First, we apply the max function to choose the higher value between $P_{required}$ and $P_{min}$ and use it to update $P_{required}$; this must be done because the system cannot scale down below $P_{min}$. Second, the number of surplus pods, following the RRS, is calculated as per Equation (4):

$$P_{surplus} = \left\lfloor (P_{current} - P_{required}) \times RRS \right\rfloor \tag{4}$$
Algorithm 1: Adaption Manager Service algorithm.
Then, we update $P_{required}$ by subtracting the surplus pods obtained in the previous step from the number of current pods. By doing this, we remove only a part of the surplus resources (pods), not all of them. Finally, we execute the scale-in command with the final updated value. Using the RRS not only helps the system stabilize under low workload but also lets it adapt faster to a burst of workload, because less time is spent creating and updating pods. In our experiments, we set RRS to 0.60 (60%) and $P_{min}$ to 10.
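A minimal Python sketch of this scaling strategy is shown below; the constants, the rounding choice, and the helper calls in the comments are illustrative assumptions rather than the exact Algorithm 1:

```python
import math

R_MAX = 100      # assumed maximum HTTP requests per minute that one pod can serve
P_MIN = 10       # minimum number of pods to keep running (P_min)
RRS = 0.60       # resource removal strategy ratio (60%)
CDT = 60         # cool-down time in seconds after each scaling decision

def adaption_manager_step(predicted_requests: float, current_pods: int) -> int:
    """Return the next replica count for the predicted workload (sketch of Algorithm 1)."""
    required_pods = math.ceil(predicted_requests / R_MAX)
    if required_pods > current_pods:
        # Scale out to meet the predicted demand in the near future.
        return required_pods
    if required_pods < current_pods:
        # Never plan below the minimum pod count.
        required_pods = max(required_pods, P_MIN)
        # Remove only a fraction (RRS) of the surplus pods (Equation (4));
        # rounding down is an assumption of this sketch.
        surplus_pods = math.floor((current_pods - required_pods) * RRS)
        return max(current_pods - surplus_pods, P_MIN)
    return current_pods

# Control loop (pseudo-usage): one decision per minute, then wait CDT seconds
# to avoid oscillation. predict_next_load(), get_current_pods(), and
# scale_deployment() are hypothetical helpers.
# while True:
#     scale_deployment(adaption_manager_step(predict_next_load(), get_current_pods()))
#     time.sleep(CDT)
```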
4.4. Execution Phase
In this last phase of the MAPE loop, the Kubernetes engine (kube-apiserver) receives the command from the Adaption Manager Service of the planning phase and changes the number of pod replicas accordingly.
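For example, when the autoscaler runs inside the cluster, the replica change could be applied through the official Kubernetes Python client (the deployment name and namespace are placeholders):

```python
from kubernetes import client, config

def scale_deployment(replicas: int,
                     name: str = "web-app",        # placeholder deployment name
                     namespace: str = "default") -> None:
    """Ask kube-apiserver to change the number of pod replicas of a deployment."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Example: apply the replica count decided by the Adaption Manager Service.
# scale_deployment(16)
```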
7. Discussion
Through the experiments, it was observed that the Bi-LSTM is more suitable for estimating the resources in the Kubernetes cluster than the statistical ARIMA model and the forward (unidirectional) LSTM. More specifically, the Bi-LSTM prediction model was compared with the ARIMA prediction model in
Table 4. It is shown that the Bi-LSTM has higher prediction accuracy than ARIMA: it reduces the prediction error by 3.28% in one-step prediction and 6.64% in five-step prediction on the RMSE metric. Its prediction speed is 530 and 55 times faster than that of ARIMA in one-step and five-step predictions, respectively.
In addition, the Bi-LSTM is compared with LSTM in
Table 5 and
Table 6. It is shown that the Bi-LSTM achieves higher prediction accuracy than LSTM: it reduces the prediction error by 7.79% in one-step prediction and 20% in five-step prediction on the RMSE metric, while achieving a prediction speed comparable to that of LSTM in both one-step and five-step predictions. It is observed from
Table 6 that Bi-LSTM has larger prediction errors when decreasing resources and smaller prediction errors when increasing resources as compared to LSTM. Overall, Bi-LSTM outperforms LSTM in prediction accuracy, as shown in
Table 5. It can also be observed from the elastic speedup results in
Table 6 that the Bi-LSTM autoscaler has a higher autoscaling gain than the LSTM autoscaler in both one-step and five-step predictions.
The architecture proposed in this work is based on the existing components of Kubernetes. However, it can also be implemented on other container-based platforms, for example, using Docker Swarm for orchestration and other container engines such as OpenVZ or LXC. This is because the Bi-LSTM forecasting service and the Adaption Manager Service analyze and process the monitored data (networking data and pod or container information) to calculate appropriate scaling decisions and are independent not only of the container virtualization technology but also of the cloud provider. Furthermore, the Bi-LSTM forecasting service can be changed and updated to learn and predict more hardware metrics, such as CPU utilization and memory usage, for more accurate scaling decisions.