PSO-Based Ensemble Meta-Learning Approach for Cloud Virtual Machine Resource Usage Prediction

Abstract: To meet the increasing demand for its services, a cloud system should make optimum use of its available resources. Additionally, the high and low oscillations in cloud workload are another significant issue that necessitates consideration. A particle swarm optimization (PSO)-based ensemble meta-learning workload forecasting approach is proposed that uses base models and the PSO-optimized weights of their network inputs. The proposed model employs a blended ensemble learning strategy to merge three recurrent neural networks (RNNs), followed by a dense neural network layer. The CPU utilization of the GWA-T-12 and PlanetLab traces is used to assess the method's efficacy. In terms of RMSE, the approach is compared with its LSTM, GRU, and BiLSTM sub-models.


Introduction
Cloud service providers in cloud computing environments own vast quantities of varied resources that cloud customers can lease on a pay-as-you-go basis. Typically, cloud service providers aim to enhance resource usage to increase profits, whereas cloud customers aim to increase cost performance by renting adequate resources for their applications. However, cloud users' applications may experience varying workloads and have varying quality of service (QoS) needs on the resources offered by cloud providers, necessitating either greater or reduced resource demands over time [1,2]. Moreover, there is always a latency before a resource is actually usable, which can pose a problem for applications that need to scale their resource usage on the fly. It is therefore crucial for cloud providers and users alike to face the challenge of supplying adequate resources with efficient usage and QoS assurance [3]. Overprovisioning wastes resources, which increases energy usage and the operational expense of cloud providers, while underprovisioning can cause a reduction in quality of service and the loss of cloud customers.
This problem can be mitigated with accurate foresight of the future workload behavior of resources through careful monitoring. It can be addressed by effectively monitoring and logging how much time and energy are spent on different resources, including processor, memory, storage space, and network bandwidth. These indications of past usage can then be analyzed and put to use in predictive analysis. Such foresight is crucial for effectively supplying cloud resources. The cloud resource manager, for instance, can make well-informed decisions about matters such as virtual machine (VM) scaling and VM migration if it has a good idea of how much demand is expected in the future and how much of that demand can be forecast from past usage data. Accurate forecasts therefore help avoid QoS drops, maximize resource utilization, and reduce energy waste during data center operation. The main contributions of this work are as follows:
• First, we propose an ensemble learning-based workload prediction method based on recurrent neural network (RNN) variants and particle swarm optimization (PSO).
• Second, we provide a comparison of the proposed model with PSO-based LSTM, Bi-LSTM, and GRU prediction models for time series prediction.

• Third, we conducted a series of experiments to evaluate the performance of the proposed PSO-based ensemble prediction model on the GWA-T-12 and PlanetLab traces.
The remainder of the paper is organized as follows. The forecasting problem is described in Section 2. Section 3 briefly reviews recent major advancements in machine learning-based workload prediction methods. The PSO-based ensemble prediction model is covered in depth in Section 4. Section 5 describes the assessment criteria employed to gauge and compare the effectiveness of the proposed work, and Section 6 presents the findings of the experiments. Section 7 concludes the paper.


Problem Definition
In Figure 1, we see a typical cloud computing setup that includes a workload estimation module. Both the forecasting model's estimation data and the resource manager's history data are fed into the system as inputs. Using this information, the resource manager module can optimize the service provider's resource use and energy consumption by implementing appropriate scaling policies. The system's forecasting module is crucial because it can foresee changes in the demand for resources. Assume a cloud data center has both physical computers (P) and virtual machines (V), each of which has access to a set amount of hardware resources (CPU, memory, and network speed). Running |A| different programs in a virtual machine requires a wide range of system resources, and applications can be installed on a virtual machine only if it has sufficient memory, CPU speed, and storage space. The cloud system can supply virtual machine resources on demand with automatic scaling on the basis of trustworthy workload forecasts, thereby protecting quality of service and decreasing energy usage. A prediction model is created to discern patterns in workload variance and intrinsic attributes on the basis of workload traces, allowing future workload predictions for a virtual machine. The variables and notations are listed in Table 1.
Most current cloud-based software is made up of a set of interconnected, cooperating service parts that each fulfill a particular function [12][13][14]. These service parts are dispersed across many different types of virtual machines that work together to provide these services. As a result, we provide a workload prediction approach based on multiple virtual machine workload trace datasets rather than a single VM workload. A time series is a group of observations of clearly defined data points gathered over time from repeated measurement. Workload forecasting is treated here as a time series regression problem, and time series forecasting assumes a time series consists of a pattern and some random error [18]. Workload trace data with N variables of length T are represented as X = (x_1, x_2, x_3, ..., x_T)^T ∈ R^(T×N), where for each t ∈ {1, 2, 3, ..., T}, x_t ∈ R^N represents the observations of all variables at time t. The measurement x_t^n denotes the observation of the n-th variable of x_t. The variable considered in the workload trace is CPU usage, which is tracked and saved over time from multiple VMs of a cloud data center; that is, the measurements are the CPU usage of the VMs hosting applications. In other words, x_{1,t}, x_{2,t}, x_{3,t}, ..., x_{k,t} may be used to denote the related variables (the CPU usage of multiple VMs), and the prediction of x_{1,t+h} at the end of period t may be represented in the following form.
As shown in Equation (1), the extraction of sequence information from the workload traces of the virtual machines is required to predict the value at time x_{1,t+h}, where h is the desired horizon after the current time stamp, set manually. The proposed forecasting model's objective is to minimize the difference (error) between the predicted CPU usage value x̂_{1,t+h} and the actual CPU usage x_{1,t+h} at each time step.
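As a concrete illustration of this formulation, the sketch below (with hypothetical function and variable names, assuming numpy is available) slices a multivariate CPU trace X into supervised input windows and horizon-h targets for VM 1:

```python
import numpy as np

def make_windows(X, window, horizon):
    """Slice a multivariate series X (T x N) into supervised pairs.

    Each sample uses `window` past observations of all N VMs to
    predict the CPU usage of VM 1 `horizon` steps ahead.
    """
    inputs, targets = [], []
    for t in range(window, X.shape[0] - horizon + 1):
        inputs.append(X[t - window:t])          # past window, all VMs
        targets.append(X[t + horizon - 1, 0])   # target x_{1,t+h}
    return np.array(inputs), np.array(targets)

# 100 time steps of synthetic CPU usage for 4 VMs
X = np.random.rand(100, 4)
inp, tgt = make_windows(X, window=48, horizon=1)
print(inp.shape, tgt.shape)  # (52, 48, 4) (52,)
```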

Recent Key Contributions
Recently, cloud service providers have tackled the difficulties of providing quality of service (QoS) amid competition for available resources. In recent years, many studies on time series prediction for on-demand cloud resource prediction have been published. Statistical and artificial neural network methods have been adopted to solve this issue, and their efficacy has been the subject of numerous studies [3]. Multiple studies have made an effort to predict cloud-based CPU utilization. Workload in the cloud evolves over time and is connected across periods of time. The CPU load can therefore be anticipated by analyzing past CPU utilization patterns. Only forecasting models developed using machine learning algorithms, including deep learning approaches, are covered in this section, because their baselines and the proposed model share the same domain. A comprehensive study of resource management strategies is provided in [19][20][21][22].

Deep Learning-Based Approaches
With the help of a neural network and differential evolution with built-in adaptation, the prediction accuracy was improved [23]. The evolutionary method improves precision by probing the space of possible answers from different angles and employing a wide range of candidate solutions. This decreases the likelihood of becoming trapped in local optima compared to gradient-based machine learning. In [24], an improved prediction model based on a neural network was introduced to facilitate precise scaling operations. The suggested strategy classified the VMs based on how they were used before predicting how they would be used in the future; the system used a multilayer perceptron classification technique to accomplish this. Containerized cloud applications can take advantage of the Performance-aware Automatic scaler for Cloud Elasticity (PACE) architecture, which was released in [25] to automatically scale resources in accordance with fluctuating workload. Through the use of convolutional neural networks (CNNs) with K-means, a flexible scaling strategy that anticipates upcoming workload requirements has been developed. The authors of [26] looked at a VM prediction model that forecasts the usage of VM resources such as CPU and memory to identify whether a program is CPU- and/or memory-heavy. The research takes a Bayesian approach to predicting VM resource consumption during the workweek. It was recommended in [27] that a machine learning-based model be used to predict load and energy consumption, as well as to manage data center resources. Many machine learning techniques, such as the gated recurrent unit (GRU), ElasticNet (EN), ARD regression (ARDR), linear regression, and ridge regression, were explored, and the experiments show that GRU is superior to the other networks. Using a regressive predictive multilayer perceptron (MLP) model, [28] attempted to foresee VM power consumption in a cloud data center; the model achieves 91% accuracy in its forecasts.
The paper [29] proposes a preemptive deep learning-based strategy for automatically scaling Docker containers in accordance with fluctuating workload variations at runtime. A forecasting model using a recurrent long short-term memory (LSTM) neural network was proposed to foresee upcoming HTTP workloads. With respect to elastic acceleration and autoscaling associated with provisioning, the proposed model outperforms an artificial neural network. CPU and memory utilization, as well as energy consumption, were forecast for the next time slot using a multiobjective evolutionary algorithm in [30]. In [11], a method for predicting VM workload is given that makes use of a generative deep belief network (DBN) composed of multiple layers of generative restricted Boltzmann machines (RBMs) with a regression layer. The authors of [31] presented a solution for anticipating cloud data center workload; the prediction relied on an LSTM network. A similar LSTM-based prediction of cloud data center workload was proposed in [13]. Predicting a VM's upcoming workload using correlation data from its workload trace history was carried out in [32]. The study [1] suggests a BiLSTM deep learning model to accurately predict CPU usage to improve resource provisioning and reduce energy consumption of cloud data centers. In [33], a multivariate BiLSTM model-based autoscaling framework has been proposed to optimize resource provisioning of cloud data centers and, as per the experiments, the model outperforms other univariate models. GRU-CNN workload prediction has been proposed in [34] to predict incoming workload requests in a cloud data center. The findings suggest that deep learning approaches are superior to traditional methods for making long-term workload predictions.

Hybrid Approaches
Fitting a certain workload pattern is usually easy for a prediction model that uses a single forecasting model, but such a model struggles with real-world data when the pattern varies rapidly over time [35], leading to over- and underprovisioning of resources. Two online ensemble learning algorithms [36] were created to anticipate workloads; the models rapidly reflect changes in a cloud workload demand pattern. To deal with workloads that are constantly shifting, a novel cloud workload prediction framework [37] was created. It uses multiple predictors to build an ensemble model that can accurately anticipate real-world workloads. CloudInsight, a framework for predicting workloads, was created using a number of forecasters [38]. It integrates eight distinct forecasting algorithms/models from the fields of time series analysis, regression analysis, and machine learning in an effort to improve forecast accuracy. A cloud workload forecast which relies on a weighted wavelet support vector machine is proposed for predicting the server load sequence in a cloud data center [39]. Parameter optimization utilizing a particle swarm optimization (PSO)-based approach was developed in that study.
Several methods have been put forth and used to forecast the workloads of VMs in cloud computing centers, as shown in Table 2. It was found that the aforementioned works failed to model and predict the workload demand for multiple VMs. Because they were designed and trained for a single VM workload, they ignore the fact that VMs are interdependent and host related applications that use the resources of multiple VMs. In addition, the majority of the literature modeled the VM prediction problem using a single method. To achieve higher forecasting accuracy, an integration of different methods and approaches should be used to model and anticipate the workloads. To deal with the aforementioned workload challenges, an ensemble strategy is proposed in this work for cloud data center VM workload anticipation.

Cloud Virtual Machine Resource Usage Prediction Based on PSO Ensemble Model
The proposed PSO ensemble model consists of three primary steps: (1) modeling the base prediction models with common recurrent neural network models, i.e., GRU, LSTM, and BiLSTM, (2) optimizing the weights of the base models using a meta-heuristic optimization algorithm, i.e., particle swarm optimization (PSO), and (3) integrating the sub-models/base models. The ultimate forecast is generated by the meta-learning model, which is fed predictions from the individual networks.

Particle Swarm Optimization
The PSO algorithm is a meta-heuristic created by Kennedy, an American social psychologist. On the basis of Heppner's model for mimicking bird or fish swarms [42,43], Kennedy and Eberhart developed an algorithm in which particles fly through the solution space and converge on the optimum solution. Requiring no gradient information and only a few parameters, PSO is straightforward and simple to implement, making it appropriate for both engineering applications and scientific research. We encountered some non-linear problems in our study, and the now widely used PSO method was adopted to find the optimal solution.
Each solution in particle swarm optimization is treated as a particle, and each iteration evaluates the particle's current position using a fitness function. Additionally, a particle's velocity determines the distance and direction of motion at every step.
The velocity of a particle with position P_i, i = {1, 2, ..., N}, where N is the overall number of particles, is updated at each iteration according to the rules:

V_id^(k+1) = w·V_id^k + c_1·r_1·(p_id^k − X_id^k) + c_2·r_2·(p_gd^k − X_id^k)
X_id^(k+1) = X_id^k + V_id^(k+1)

In these formulas, k denotes the current iteration number; V_id^k and X_id^k stand for velocity and position, respectively; c_1 and c_2 are learning coefficients for the local best position p_id^k and the global best position p_gd^k of particle p_i; r_1 and r_2 are two random numbers in the range 0 to 1 that broaden the search; and w is the inertia weight that regulates searchability in the solution space. Compared to other methods, PSO is a good option because of its quick convergence, low computational cost, and global search ability. These PSO characteristics are crucial for reducing the complexity of the proposed ensemble model. It has been successfully applied in many applications: in [44], PSO was used to model offshore wind power forecasts; the paper [43] used PSO in a deep learning predictive model for GES mapping; PSO was used in an LSTM for ship motion prediction in [45]; and PSO was used as a parameter optimization algorithm to find the optimal combination of deep learning model parameters in [39].
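The update rules above can be sketched directly in code. The following minimal example (hypothetical names, numpy assumed) applies the velocity and position updates to minimize a simple sphere function:

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(X):
    """Toy fitness function: sum of squares per particle."""
    return (X ** 2).sum(axis=1)

def pso_step(X, V, p_best, g_best, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: update velocities, then positions."""
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = w * V + c1 * r1 * (p_best - X) + c2 * r2 * (g_best - X)
    return X + V, V

N, D = 20, 3                      # 20 particles in a 3-D search space
X = rng.uniform(-5, 5, (N, D))
V = np.zeros((N, D))
p_best, p_val = X.copy(), sphere(X)
g_best = p_best[p_val.argmin()]

for _ in range(200):
    X, V = pso_step(X, V, p_best, g_best)
    f = sphere(X)
    improved = f < p_val
    p_best[improved], p_val[improved] = X[improved], f[improved]
    g_best = p_best[p_val.argmin()]

print(p_val.min())  # best fitness after 200 iterations (near 0)
```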
In this paper, the base model weights are calculated using PSO. As the fitness measure, we calculate the sum of squared errors. Each particle's current fitness value is compared to its best past fitness value, and if the current value is lower, the particle's best value and position are updated. The current fitness value is also compared with that of the global best position; if the current value is lower, the global best position is updated with the current particle's value and position. Since the proposed ensemble model makes use of PSO, it is superior to alternatives that rely on slower learning algorithms [46]. As the experiments show, the proposed model thereby minimizes the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE).
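A sketch of this weight-optimization step, under the assumption that the base models' validation predictions are already available (simulated below with synthetic noise; all names are illustrative), could look like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictions of three base models on a validation series
y_true = np.sin(np.linspace(0, 6, 200))
preds = np.stack([y_true + rng.normal(0, s, 200) for s in (0.1, 0.2, 0.3)])

def sse(weights):
    """Fitness: sum of squared errors of the weighted combination."""
    combined = weights @ preds
    return ((combined - y_true) ** 2).sum()

# Minimal PSO over the three combination weights
N, D = 30, 3
X = rng.uniform(0, 1, (N, D))
V = np.zeros((N, D))
p_best, p_val = X.copy(), np.array([sse(x) for x in X])
g_best = p_best[p_val.argmin()]

for _ in range(100):
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = 0.7 * V + 1.5 * r1 * (p_best - X) + 1.5 * r2 * (g_best - X)
    X = X + V
    f = np.array([sse(x) for x in X])
    better = f < p_val
    p_best[better], p_val[better] = X[better], f[better]
    g_best = p_best[p_val.argmin()]

print(g_best.round(2))  # optimized combination weights
```

In this toy setup the optimized weights should outperform uniform averaging, since the least noisy base model deserves a larger share.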

Ensemble Learning Techniques
An ensemble approach includes various forecasting models, each of which is called a base forecaster or expert forecaster, to forecast the expected occurrence of an event. The estimations of the experts are combined using a new model to calculate the ensemble's final result. Figure 2 shows the theoretical layout of an ensemble-based forecasting approach. The objective of the ensemble learning approach is to boost prediction accuracy or overall performance by combining decisions from various base predictors into a new model. MaxVoting, averaging, extreme gradient boosting (XGB), weighted averaging, gradient boosting machine (GBM), stacking, blending, adaptive boosting (AdaBoost), bagging, and boosting [47] are just a few examples of ensemble learning methods. Different ensemble models have distinct traits and can be applied to address distinct challenges across a range of domains. A straightforward illustration of the ensemble learning approach is that a diverse group of individuals is more likely to reach wiser decisions than a single individual. The same rule holds for machine learning (ML) and deep learning (DL) models: a variety of ML or DL models is more likely to achieve improved results than a standalone model [49], because every model provides a distinctive advantage and the models can complement one another to make up for their weaknesses.
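The averaging and weighted-averaging strategies named above can be illustrated in a few lines (hypothetical forecast values):

```python
import numpy as np

# Hypothetical forecasts from three base predictors for five time steps
p1 = np.array([0.60, 0.62, 0.58, 0.61, 0.59])
p2 = np.array([0.55, 0.57, 0.54, 0.56, 0.55])
p3 = np.array([0.70, 0.71, 0.69, 0.72, 0.70])

# Simple averaging: every expert contributes equally
avg = (p1 + p2 + p3) / 3

# Weighted averaging: weights reflect each expert's assumed accuracy
w = np.array([0.5, 0.3, 0.2])
weighted = w[0] * p1 + w[1] * p2 + w[2] * p3

print(avg.round(3))
print(weighted.round(3))
```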

Long Short-Term Memory Neural Networks
Human beings do not constantly have to rethink their ideas; every word is understood in the context of the ones that came before it. Memory is therefore crucial for recognition, and classic neural networks lack this memory capability. LSTM, a special kind of RNN capable of recognizing past history, was proposed by [50]. Because LSTM has memory cells, it can handle a variety of time series problems that other machine learning models find challenging. Since it can store significant historical data in the memory cell state and ignore irrelevant data, LSTM is especially successful and well-suited to the time series domain. Its structure generally consists of three gates: an input gate, a forget gate, and an output gate. The LSTM unit handles this intricate job by modifying the input data kept in the cell state. The forget gate layer is the first gate; it determines which data are deleted and which are kept in the memory cell state. The forget gate layer formula is as follows:

f_t = σ(W_f · [h_(t−1), x_t] + b_f)

where f_t stands for the forget gate at time t, W_f for the forget neuron's weights, h_(t−1) for the result from the prior cell state at time t − 1, x_t for the input value at time t, b_f for the forget gate's bias, and σ for the neuron's sigmoid function.
The new information that the neuron needs to store is calculated by the input gate, which involves two steps. The values to be updated are first decided by the sigmoid function i_t. The tanh layer then produces a new candidate value C̃_t that is used to update the current cell state. The formulas are defined as:

i_t = σ(W_i · [h_(t−1), x_t] + b_i)
C̃_t = tanh(W_c · [h_(t−1), x_t] + b_c)

where i_t stands for the input gate at time t, W_i for the input weight, C̃_t for the candidate cell state at time t, b_i for the input gate bias, and b_c for the bias of the cell state.
The information to be output is decided by the output gate layer, which is the last gate. The equation is stated as:

o_t = σ(W_o · [h_(t−1), x_t] + b_o)

where W_o stands for the output neurons' weights, b_o for the output gate bias, and o_t for the output gate at time t [50]. The cell state is updated using the product of the forget gate output f_t and the cell state at the previous time step C_(t−1), plus the product of the input gate output i_t and the candidate value C̃_t:

C_t = f_t ⊙ C_(t−1) + i_t ⊙ C̃_t

The outcome of the output gate and the cell state at the current time step are then used to compute h_t, the hidden cell state (or final output) at the current time step:

h_t = o_t ⊙ tanh(C_t)

The LSTM model used in this study is fed information about the workload of four different VMs over the course of an hour. The input is set up with a time window of 48 past observations, and the input data comprise four VM workload traces. The specifics of the LSTM structures are covered in Section 4.5.
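The gate equations above can be traced in a small numpy sketch. The weights below are random placeholders, not trained values, and the function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step: W maps the concatenated [h_{t-1}, x_t] to each gate."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    C_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    C_t = f_t * C_prev + i_t * C_tilde        # cell state update
    h_t = o_t * np.tanh(C_t)                  # hidden state (final output)
    return h_t, C_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8                            # 4 VM traces per time step
W = {k: rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.random((48, n_in)):            # window of 48 past steps
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape)  # (8,)
```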

Bidirectional Long Short-Term Memory Neural Networks
A bidirectional LSTM (BiLSTM) model for sequence processing comprises two LSTMs, one of which receives input data in a forward manner while the other handles the reverse direction. Figure 3 shows how it is made up of bidirectional recurrent long short-term memory units that function as a forward and backward pair.

The foundation of the bidirectional LSTM network is the LSTM layer at its core. The fundamental principle of the bidirectional LSTM unit is to process the time series from the forward and reverse directions using two distinct hidden layers, in order to model the effects of past and upcoming information on the present hidden state, respectively. The forward LSTM unit's internal calculation procedure is described in Equations (5)-(9). The following formula illustrates how the reverse LSTM unit updates its hidden state primarily using the future information h⃖_(t+1):

h⃖_t = LSTM(x_t, h⃖_(t+1), C⃖_(t+1))
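The bidirectional principle can be sketched independently of the exact cell. The example below uses a plain tanh recurrent cell as a stand-in for the LSTM cell (to keep the sketch short) and concatenates the forward and backward hidden states per time step; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(xs, W_x, W_h):
    """Run a simple tanh recurrent cell over a sequence (LSTM stand-in)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.array(states)

n_in, n_hid, T = 4, 8, 48
xs = rng.random((T, n_in))
Wf_x, Wf_h = rng.normal(0, 0.1, (n_hid, n_in)), rng.normal(0, 0.1, (n_hid, n_hid))
Wb_x, Wb_h = rng.normal(0, 0.1, (n_hid, n_in)), rng.normal(0, 0.1, (n_hid, n_hid))

fwd = rnn_pass(xs, Wf_x, Wf_h)               # past -> future
bwd = rnn_pass(xs[::-1], Wb_x, Wb_h)[::-1]   # future -> past, realigned
h_bi = np.concatenate([fwd, bwd], axis=1)    # [h_forward_t ; h_backward_t]
print(h_bi.shape)  # (48, 16)
```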

Gated Recurrent Unit Neural Networks
Gated recurrent units (GRUs) were first presented in [51] to tackle the vanishing gradient issue of classical RNNs through the use of update gates and reset gates. Like LSTM, the GRU addresses long-term memory and backpropagation gradient issues using gating units that control the flow of data within the unit. A gated recurrent unit does not have its own memory cell, which is a major contrast with the LSTM layer. The GRU's update gate determines the amount of information from the past that must be transmitted to the future. The update gate equation is:

z_t = σ(W_z x_t + U_z h_(t−1))

where x_t stands for the input at time t, W_z for the weights of the update gate, h_(t−1) for the stored values at the previous time step, U_z for the weights applied to the past hidden state h_(t−1), and z_t for the update gate at time step t. The hidden state h_(t−1) transmitted from the previous node, which carries its information, is combined with the current input x_t. By combining x_t and h_(t−1), the GRU produces the output y_t of the current hidden node and passes the hidden state h_t to the next node.
The reset gate is the second main part of the GRU; it determines how much of the previous data needs to be forgotten. The reset gate equation is:

r_t = σ(W_r x_t + U_r h_(t−1))

The reset gate enters the computation of the candidate hidden state, which can be thought of as the new data at the current time, to determine how much of the old memory needs to be preserved:

h̃_t = tanh(W x_t + U(r_t ⊙ h_(t−1)))

For example, if r_t is 0, then only the data of the current input are included.
The internal process the GRU uses to update its memory is one of its strengths:

h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h̃_t

where h_(t−1) and h_t are the state memory variables at the past state t − 1 and current state t, respectively, and z_t represents the update gate state. The GRU, as opposed to LSTM, which calls for multiple gating signals, uses a single z_t to accomplish both forgetting and selective memory. As a result, the GRU may be more efficient than LSTM during the training phase.
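The three GRU equations can likewise be traced in a short numpy sketch (random placeholder parameters, illustrative names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    """One GRU step following the update/reset gate equations."""
    z_t = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev)            # update gate
    r_t = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev)            # reset gate
    h_tilde = np.tanh(P["W"] @ x_t + P["U"] @ (r_t * h_prev))  # candidate
    return (1 - z_t) * h_prev + z_t * h_tilde                  # memory update

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
P = {k: rng.normal(0, 0.1, (n_hid, n_in)) for k in ("Wz", "Wr", "W")}
P.update({k: rng.normal(0, 0.1, (n_hid, n_hid)) for k in ("Uz", "Ur", "U")})

h = np.zeros(n_hid)
for x_t in rng.random((48, n_in)):   # window of 48 past steps
    h = gru_step(x_t, h, P)
print(h.shape)  # (8,)
```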

Architecture Overview
Ensemble learning combines different machine learning and deep learning models to carry out forecasting or classification. Combining the forecasts from many neural networks [52] helps to decrease the variance of the forecasts and the generalization error. Figure 4 depicts the block diagram of the blended ensemble model utilized in this study, as well as an overview of the architecture. The proposed prediction model uses an ensemble method to boost forecast precision. The proposed PSO-based ensemble approach consists of three steps:
Step 1: The historical time series data of multiple VMs on CPU usage are preprocessed to generate matrix data with equal lengths.
Step 2: The matrix is used as the input of the BiLSTM, GRU, and LSTM models and optimized using PSO, by which intermediate prediction results can be obtained and used to generate a new data matrix for training the meta-model.
Step 3: Based on the new data matrix, the meta-model network is designed to generate the final prediction results.
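The three steps above can be sketched end to end. In this sketch the base models are stand-ins (noisy copies of a synthetic target) and the meta-model is a single linear layer fit by least squares rather than a trained dense network; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: preprocessed CPU-usage target series (synthetic placeholder)
y = np.sin(np.linspace(0, 8, 300)) * 0.5 + 0.5

# Step 2: intermediate predictions from three stand-in base models,
# stacked into the new data matrix that trains the meta-model
base_preds = np.column_stack(
    [y + rng.normal(0, s, y.size) for s in (0.05, 0.10, 0.15)]
)

# Step 3: meta-model (one dense layer, fit here by least squares)
A = np.column_stack([base_preds, np.ones(y.size)])   # add a bias column
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
final = A @ theta

meta_rmse = np.sqrt(((final - y) ** 2).mean())
best_base_rmse = min(
    np.sqrt(((base_preds[:, j] - y) ** 2).mean()) for j in range(3)
)
print(meta_rmse < best_base_rmse)  # True
```

In-sample, the fitted combination can never do worse than its best single base model, which is the core argument for the blended ensemble.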
There are four essential components: data preprocessing, which scales and organizes the dataset; expert, which contains base predictor models; optimizer, which optimizes the weights associated with each base prediction; and forecasting, which evaluates the accuracy of the model. Figure 4 shows that data are passed along from one stage or component to the next. Only if the accuracy indicated by the labels on the arrows (yes/no) is high will the next phase be executed.  Figure 5 depicts the whole workflow of the ensemble model. The algorithm's inputs were the CPU utilization data of four distinct virtual machines. After applying the Savitzky-Golay filter [53] to eliminate noise and smooth the data trend, min-max normalization was applied to scale the dataset to workload values in the range [0, 1]. Each expert and base model receives the supplied data. After applying PSO to optimize the weights of each expert model, we estimated the future workload data. To determine the final result of the predictive ensemble model, each expert's prediction is also fed into the metalearner, which produces the final prediction. The proposed PSO-based ensemble approach consists of three steps: Step 1: The historical time series data of multiple VMs on CPU usage are preprocessed to generate matrix data with equal lengths.
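The preprocessing stage described above (Savitzky-Golay smoothing, min-max scaling, and windowing into equal-length input matrices) can be sketched as follows; the filter window, polynomial order, and history length are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess_cpu_trace(trace, window=11, polyorder=3):
    """Smooth a raw CPU-utilization trace and scale it to [0, 1].

    Mirrors the pipeline in the text: a Savitzky-Golay filter removes
    noise, then min-max normalization rescales the series.
    """
    smoothed = savgol_filter(np.asarray(trace, dtype=float),
                             window_length=window, polyorder=polyorder)
    lo, hi = smoothed.min(), smoothed.max()
    return (smoothed - lo) / (hi - lo)

def make_windows(series, history=12):
    """Slice a series into (history window, next value) training pairs."""
    X = np.array([series[i:i + history]
                  for i in range(len(series) - history)])
    y = series[history:]
    return X, y
```

Applied to a five-minute CPU trace, `make_windows` produces the equal-length matrix rows that the sub-models consume.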

Ensemble Expert Learning with PSO Optimization
Throughout our investigations, we explored a wide range of configuration options. Empirical results guided the selection of all parameters, including epoch count, neuron count, and layer count. The first level of the blended ensemble model consists of three RNNs: the GRU, BiLSTM, and LSTM models. The dataset was previously split into training, validation, and test data, and each subset is required for training the blended ensemble model. The GRU, BiLSTM, and LSTM models are the level 1 sub-models trained on the training data, and each sub-model's weights are optimized using PSO. Following this initial phase of training, the trained level 1 models are applied to the validation data, which effectively serves as the training data for the second-level model. Finally, the test data are used to determine the ultimate forecast and accuracy of the proposed model.
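The weight-optimization step can be sketched with a standard global-best PSO. This is a textbook variant, and the inertia and acceleration coefficients below are illustrative assumptions, not the paper's settings; in the paper this loop would minimize a sub-model's validation error as a function of its weights, while here it is shown against a generic loss function:

```python
import numpy as np

def pso_minimize(loss, dim, n_particles=20, iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `loss` over R^dim with a global-best PSO.

    Each particle keeps a velocity and a personal best, and is pulled
    toward both its personal best and the swarm's global best.
    """
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([loss(p) for p in pos])
    g = int(pbest_val.argmin())
    gbest, gbest_val = pbest[g].copy(), float(pbest_val[g])

    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Velocity update: inertia + cognitive + social terms.
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([loss(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        g = int(pbest_val.argmin())
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), float(pbest_val[g])
    return gbest, gbest_val
```

Calling `pso_minimize` with a loss that evaluates a sub-model's validation RMSE at a given weight vector would return the best weights found by the swarm.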
First, sub-model 1, the GRU model, is trained on the training data. This GRU model consists of four layers, each containing 50 neurons. The model is trained for 100 iterations with a dropout of 0.2 per hidden layer, and PSO is used to train the model weights. After training, the validation dataset is fed to the GRU model to produce its initial predictions, referred to as the GRU validation predictions.
The BiLSTM model, the second sub-model, is trained next. The BiLSTM model we construct has two layers, one of which is dense, with 50 neurons per layer. We again use a dropout of 0.2 per hidden layer and train the model for 100 iterations. As with the GRU model, PSO is used to train the model weights. After training, the validation dataset is fed to the BiLSTM model to produce the BiLSTM validation predictions.
Finally, the LSTM sub-model is trained on the training dataset as the third base model. This model consists of three layers, each containing 32 neurons. The LSTM model undergoes the same training steps as the GRU and BiLSTM models, with PSO used to train its weights. Once the LSTM model is trained, we feed the validation data into it to obtain the LSTM validation predictions.
We create a new dataset of shape p × m (where p is the number of forecasts and m the number of models) by combining the GRU, BiLSTM, and LSTM validation predictions. This newly generated dataset is then used to fit the second-stage meta-learner model (the second level is also called the meta-learner). It consists of a dense neural network with one hidden layer of 32 neurons, using the rectified linear unit (ReLU) activation function. As was done for the validation dataset, the test samples of the original dataset are used to test the sub-models; this generates another p × m test dataset, which is used to test the meta-model. On the basis of these newly generated test predictions from the sub-models, the meta-learner generates the final predictions of our proposed PSO-based ensemble model.
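The level-2 construction above can be sketched as follows. To keep the example self-contained, the one-hidden-layer dense network is replaced here by a least-squares combiner, which is a simplification and not the paper's meta-learner; the p × m matrix construction is the same either way:

```python
import numpy as np

def stack_predictions(*model_preds):
    """Stack each sub-model's predictions into a p x m matrix
    (p forecasts, m models), as described in the text."""
    return np.column_stack(model_preds)

def fit_meta_learner(val_matrix, val_targets):
    """Fit a combiner on the validation p x m matrix. Least squares is
    used here as a simplified stand-in for the paper's one-hidden-layer
    ReLU network."""
    coef, *_ = np.linalg.lstsq(val_matrix, val_targets, rcond=None)
    return coef

def meta_predict(coef, test_matrix):
    """Apply the fitted combiner to a new p x m prediction matrix."""
    return test_matrix @ coef
```

The test-time flow mirrors the validation flow: the sub-models produce a fresh p × m matrix from the test windows, and `meta_predict` turns it into the ensemble's final forecasts.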

Evaluation Metrics
Various error measurements were utilized to evaluate the precision of predictions. We employed three statistical metrics [54]: mean absolute error (MAE), which is less biased toward large mistakes and outliers but may not effectively capture very large errors; standard deviation (SD); and variance (V). In the trials, root mean square error (RMSE) and mean absolute percentage error (MAPE) were also used to gauge the predictive performance of our suggested ensemble model. RMSE is defined as

$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\left(x_i - \hat{x}_i\right)^2}$

where $n$ is the number of forecasts, $x_i$ are the recorded values of the variable being forecasted, and $\hat{x}_i$ are the estimated/forecasted values.
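The error metrics used in the experiments follow directly from their definitions; a minimal sketch:

```python
import numpy as np

def rmse(actual, pred):
    """Root mean square error."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def mae(actual, pred):
    """Mean absolute error."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(np.abs(actual - pred)))

def mape(actual, pred):
    """Mean absolute percentage error (in percent); assumes no
    zero values in `actual`."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(np.abs((actual - pred) / actual)) * 100.0)
```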

Experiment and Dataset
We trained and evaluated our proposed ensemble model using CPU utilization data for four distinct virtual machines extracted from the GWA-T-12 dataset and PlanetLab traces. The data for our experiment were obtained from the Bitbrains GWA-T-12 and PlanetLab traces. Bitbrains is a long-term, large-scale workload trace of a real-world distributed cloud data center whose virtual machines host business-critical applications. We employed a storage area network (SAN) trace containing information for 1250 virtual machines; the consumption of system resources such as memory, CPU, network I/O, and disk I/O was recorded every five minutes. PlanetLab, an environment for exploring cloud-based software, was the second dataset used. PlanetLab currently has over a thousand nodes spread across seven hundred locations worldwide, all of which report their virtual machines' CPU utilization every five minutes in real time. We randomly selected four VM samples exhibiting typical workload patterns from PlanetLab and from the GWA-T-12 dataset's fast-storage trace. The CPU usage of the selected VMs was used as input data for the prediction model. The data were split 60:20:20 into training, validation, and test sets. Before training, the CPU utilization dataset was normalized to the range [0, 1] with a min-max scaler. The CPU usage history window of the four virtual machines, which exhibit a similar pattern, served as the input vector for the ensemble learning model; the objective was to forecast the CPU utilization of the four selected VMs at one-hour intervals. The proposed ensemble model's prediction accuracy was tested using RMSE, MAE, and MAPE. We used the Python programming language and the TensorFlow deep learning framework to evaluate our model. Table 3 illustrates the Pearson correlation scores of the virtual machines that showed common trends for the GWA-T-12 dataset.
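The chronological 60:20:20 split described above can be sketched as follows (no shuffling, so the temporal order of the CPU traces is preserved):

```python
import numpy as np

def split_60_20_20(X, y):
    """Chronological 60/20/20 train/validation/test split of a
    windowed time series dataset."""
    n = len(X)
    i, j = int(n * 0.6), int(n * 0.8)
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])
```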
Clearly, from the data presented in the table, VM CPU usage shows a significant positive association. This means that the VMs are interdependent: the CPU consumption of one VM affects the CPU usage of the others. Thus, the CPU consumption of the VMs served as the inputs for the multiple-input ensemble model used to evaluate prediction accuracy.
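The interdependence reported in Table 3 can be checked with a Pearson correlation matrix over the VM traces; a sketch, where `np.corrcoef` treats each row as one VM's series:

```python
import numpy as np

def vm_correlation_matrix(cpu_traces):
    """Pearson correlation matrix between VM CPU-usage series.

    `cpu_traces` is an m x n array: one row per VM, one column per
    five-minute sample. Entry (i, j) is the Pearson correlation
    between the traces of VM i and VM j."""
    return np.corrcoef(np.asarray(cpu_traces, dtype=float))
```

High off-diagonal entries indicate the common trends that motivate using the VMs' traces jointly as model inputs.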

Prediction Performance Results
The experimental results of the sub-models and the proposed PSO-based ensemble model are illustrated in Tables 4-7. The proposed model has a lower forecasting error than the LSTM, GRU, and BiLSTM sub-models: it reduces prediction error on the RMSE metric by 79.72-91.95% and 82.64-86.91%, as shown in Tables 4 and 5 for the GWA-T-12 dataset and PlanetLab trace, respectively. Throughout the experiments, we compared the predicted CPU utilization of the four VMs with their actual CPU utilization and calculated the MAE, MAPE, and RMSE values. The predicted values can be used to anticipate the CPU usage of the VMs for the next hour; as a result, VM resources can be managed effectively through the efficient, on-demand allocation of resources such as CPU to the VMs hosting the workloads. The proposed blended PSO ensemble model is compared with the LSTM, BiLSTM, and GRU sub-models, which are widely used by cloud service providers to predict future resource workloads so that resources can be provisioned on demand and revoked in advance in order to meet service level agreements. On the basis of the experimental findings, the proposed PSO ensemble model outperforms the alternatives: the MAE is 2.207, the MAPE is 0.545, and the RMSE is 3.146, as shown in Table 6. A similar comparison between the three sub-models and the proposed model on the mean CPU usage of the PlanetLab traces was made using the MAE, MAPE, and RMSE. The experimental data demonstrate that, in comparison to the individual models, the suggested PSO ensemble model reduces error by up to 86.91%. The outcomes show that the suggested method generates forecasts with a higher degree of accuracy; the findings of the experiment are detailed in Tables 6 and 7. As demonstrated in Tables 4-7, the proposed blended PSO ensemble model is superior to all other models in every metric.
The blended PSO-based ensemble model significantly improves the MAE, MAPE, and RMSE values. Figures 6-9 show that the RMSE learning curves of the training and validation data lie close to each other and that our model converges faster and more smoothly than the other models for the GWA-T-12 dataset. The RMSE values of the training and test data show only small variations, which means the model generalizes adequately, as shown in Figure 9.

Conclusions
The widespread adoption of cloud computing has had a significant effect on the efficiency of organizations in a variety of sectors. Autoscaling, the dynamic adjustment of cloud infrastructure by adding or removing resources on the fly, is crucial. Workload requirements, however, are difficult to forecast because they change continuously over time. To predict resource demand, various studies have suggested using time series forecasting; these efforts have mostly concentrated on one aspect of time series prediction, the univariate case. The proposed model does not rely on a single central neural network to create workload forecasts. Numerous variables can affect the VMs' CPU consumption, which therefore tends to fluctuate considerably, and predicting future trends in VMs' CPU utilization is difficult because it is usually impossible for a single model to capture all of the characteristics of the data. The suggested model is a multi-model ensemble that can learn diverse data features while simultaneously suppressing noise, and this is the main contribution of the paper. This study offered a method for predicting the workload of several virtual machines sharing common trends, which may be used to optimize cloud computing. To predict workload demand, i.e., CPU utilization, this study proposes a deep learning architecture that combines multiple recurrent neural networks into one. The blended ensemble model accounts for the temporal variations that affect CPU utilization. To test the viability of the approach and the proposed blending ensemble model, an experiment was performed with several RNN models in the same context and with the same settings. The efficacy of the blended ensemble model was measured in several ways, including mean absolute percentage error, mean absolute error, and root mean squared error.
In our experiment, the blended ensemble model performed considerably better than state-of-the-art methods in all benchmarks, suggesting that the proposed model holds considerable promise for anticipating cloud resource demands for cloud service providers. The method's efficacy was measured by analyzing CPU usage on the GWA-T-12 and PlanetLab traces, and the method was assessed against the LSTM, GRU, and BiLSTM deep learning sub-models. The findings show that the technique has a lower root mean squared error than the other models by a margin of 79.72% to 91.95%.
Due to the large number of parameters involved in deep learning model training, enhancing training performance has been a topic of intense interest and critical need in the field. This work investigates the potential of the particle swarm optimization algorithm to improve the training efficiency and accuracy of the ensemble model. Providers of cloud computing services must be able to accurately forecast cloud workload in order to meet their customers' demands efficiently while minimizing operating expenses. The experiments provide preliminary validation of the proposed model's performance in anticipating cloud workloads, and the findings suggest that cloud service providers can utilize the given ensemble model to distribute VM resources in advance based on workload predictions.
In the future, we will try to build a more robust forecasting model that considers various performance metrics from different datasets, such as the Google cluster and Alibaba traces, providing large-scale flexibility in the forecasting model. We will also try to design an energy-efficient autoscaling framework based on the CPU usage knowledge obtained from the proposed PSO-based ensemble model.