Article

Complementary in Time and Space: Optimization on Cost and Performance with Multiple Resources Usage by Server Consolidation in Cloud Data Center

1 School of Information Science, Guangdong University of Finance and Economics, Guangzhou 510320, China
2 Guangdong Intelligent Business Engineering Technology Research Center, Guangdong University of Finance and Economics, Guangzhou 510320, China
3 School of Information Engineering, Hunan Industry Polytechnic, Changsha 410208, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(19), 9654; https://doi.org/10.3390/app12199654
Submission received: 12 August 2022 / Revised: 16 September 2022 / Accepted: 17 September 2022 / Published: 26 September 2022

Abstract: The recent COVID-19 pandemic has accelerated the use of cloud computing. The surge in the number of users presents cloud service providers with severe challenges in managing computing resources. Guaranteeing the QoS of multiple users while reducing the operating cost of the cloud data center (CDC) is a major problem that needs to be solved urgently. To solve this problem, this paper establishes a cost model based on multiple computing resources in the CDC, which comprehensively considers the hosts' energy cost, the virtual machine (VM) migration cost, and the service level agreement violation (SLAV) penalty cost. To minimize this cost, we design the following solution. We employ a convolutional autoencoder-based filter to preprocess the VMs' historical workloads and use an attention-based RNN method to predict the computing resource usage of the VMs in future periods. Based on the predicted results, we trigger VM migration before a host enters an overloaded state to reduce the occurrence of SLAV. A heuristic algorithm based on the complementary use of multiple resources in space and time is proposed to solve the placement problem. Simulations driven by a real VM workload dataset validate the effectiveness of our proposed method. Compared with existing methods, our proposed method reduces the hosts' energy consumption and SLAV and lowers the total cost by 26.1~39.3%.

1. Introduction

The worldwide COVID-19 pandemic has forced people to change the way they live, socialize, and work. To maintain social distance, more human activities than ever have moved online. In December 2021, Kamouri et al. [1] estimated that approximately 56% of the U.S. workforce could move to online or remote work. McKinsey Consumer Pulse [2] shows that the share of e-commerce in major countries grew at two to five times the pre-COVID-19 rate. Compared to pre-lockdown levels, the usage of Internet services increased by 40% to 100% [3]. The usage of video-conferencing services, such as Zoom, increased tenfold [4]. In addition, to contain the pandemic, massive amounts of residential and medical data were sent to data centers for analysis; for example, China uses health QR codes nationwide to monitor and manage population movements with big data. Behind these large-scale digital activities stands the support of cloud data centers (CDCs). To meet the corresponding computing power needs, people have now embraced cloud computing in all aspects.
With the explosive increase in the use of cloud computing, cloud service providers (CSPs) are also facing severe challenges in the management of CDCs. The problem mainly comes from two aspects: the cost of maintaining the operation of the CDC and the guarantee of Quality of Service (QoS) for users. With the surge in the number of users, to maintain the business needs of existing users and satisfy new users' experience of cloud services, CSPs must increase investment in their CDCs. For example, the SaaS provider Salesforce increased its data center cost by 43% (USD 284 million) in 2022 to meet higher customer service requirements [5]. Such a large financial investment is a heavy economic burden on the operation of an enterprise. A feasible way to alleviate this economic pressure is therefore to optimize the management of the CDC and improve the utilization of its various computing resources. However, improving resource utilization intensifies the competition for computing resources among the users' virtualized computing devices. When the competition on a host with limited resources becomes sufficiently intense, the users' QoS inevitably drops significantly. In other words, there is a certain contradiction between reducing the operating cost of the CDC and guaranteeing the users' QoS, and we need a feasible way to achieve a tradeoff between the two.
Users run a wide variety of services in the CDC, and the corresponding virtual machine (VM) workloads change constantly as services evolve over time. During peak business hours, VMs have increased demands on computing resources such as CPU, memory, disk, and network. During slack periods, the demand for these computing resources is greatly reduced. To adapt to such dynamic workloads, the commonly used solution for managing hosts and VMs in CDCs is to perform server consolidation periodically [6]. By leveraging VM live migration, server consolidation migrates some VMs on hosts with high resource usage to hosts with low resource usage to achieve load balancing in the CDC as far as possible. On the other hand, the Service Level Agreement (SLA) is used to quantitatively describe the QoS of VM usage [7]. Hence, to ensure QoS, the SLA needs to be taken into account to a certain extent when performing server consolidation.
In general, a round of server consolidation is implemented in three sub-steps. The first is host workload detection, which picks out overloaded and underloaded hosts in the cluster. Next is VM selection for overloaded hosts: to reduce the workload on these hosts and the occurrence of SLA violations (SLAV), suitable VMs need to be selected for migration. Finally, the VM placement method maps an appropriate destination host to each VM to be migrated. VM placement problems often have specific optimization goals, such as using the smallest number of hosts to mount a given number of VMs. After implementing VM placement, underloaded hosts need to be addressed. By migrating all VMs on an underloaded host to other suitable hosts as far as possible and shutting down or switching these underloaded hosts to an energy-saving state, the host energy cost of the CDC can be further reduced. The operating cost of the CDC involves many aspects, including host energy consumption, VM migration cost in server consolidation, SLAV penalties, cooling, security, storage, and network bandwidth overhead. The parts directly related to the computing resources used by users (CPU, memory, disk, and network) are host energy consumption, VM migration overhead, and the SLAV penalty. Therefore, in this paper, we comprehensively consider the operating cost of a CDC based on multiple computing resources, establish a corresponding cost model, and propose a corresponding server consolidation solution. Our contributions are as follows:
(A) We formally identify the multi-resource-based cost model for server consolidation, which involves host energy consumption, VM migration, and SLAV. Based on the cost model, the optimization problem is given.
(B) A convolutional auto-encoder-based filter is leveraged to denoise the VM workload traces. Then, we propose an attention-based RNN method to predict the future workloads of the VMs. Based on the prediction results, a host workload detection policy is proposed.
(C) To minimize the total cost of server consolidation, we propose a VM selection policy and a VM placement algorithm which consider the multi-resource demands of VMs in the present and the future.
(D) We conduct simulations to evaluate the performance of our proposed solution ARP-TSCP. The simulation results indicate that ARP-TSCP can reduce host energy consumption by 18.5~30.3%, SLAV cost by 38~52%, and total cost by 26.1~39.3% compared to the baseline methods.
The remainder of the paper is organized as follows. In Section 2, we review the related work. In Section 3, we formally propose the cost model and then define the optimization problem to be solved. In Section 4, we propose heuristic algorithms to solve this problem. In Section 5, we evaluate the performance of the proposed method with simulations driven by real VM workload traces. In Section 6, we conclude the paper and discuss future work.

2. Related Work

In this section, we first survey the energy consumption models and cost models in cloud server consolidation, then we review the server consolidation solutions.

2.1. Server Consolidation Cost Models

The cost of server consolidation in the cloud is mainly related to the host energy consumption, VM migration, and SLAV.
The CPU is one of the most important components in a host, so many previous works [8,9,10,11,12,13,14,15,16,17] have proposed host energy models based only on CPU usage or processor performance. However, a host's energy consumption is not determined by the CPU alone. Therefore, several works [18,19,20,21,22,23,24] have established host energy consumption models based on the usage of multiple resources, including CPU, memory, disk, and network. These works are of great help to us in building the cost model proposed in this paper. However, these models only consider the energy consumption of a host running independently and do not consider the additional energy consumption caused by VM migration in server consolidation.
When addressing server consolidation, costs of VM live migration are primarily considered to be related to CPU [6]. Maziku et al. [25] and Dargie et al. [26] respectively pointed out that the duration of VM migration is related to the memory size of the VM being migrated and the network bandwidth of the host. Therefore, in addition to the CPU, other resources used by the VM migration process should also be included in the cost calculation.
Buyya et al. [6] proposed a CPU-based SLAV calculation method, which was widely adopted by many subsequent works [27,28,29,30,31,32,33,34,35]. However, the QoS of using VMs cannot be measured merely by CPU performance, and SLAV must involve the use of multiple resources. Guan et al. [36] proposed a GPU-oriented SLAV for cloud gaming. Some SLAs [37,38] related to network factors have been proposed, but these models are not entirely built in the context of CDCs.

2.2. Server Consolidation Solutions

Multiple solutions have been proposed to handle server consolidation.
Lin et al. [24] proposed a host energy consumption model based on CPU, memory, and disk and a centralized server consolidation structure called DEM. The master node is aware of the energy consumption of the slave nodes that host VMs to trigger server consolidation. However, this work does not design a VM placement algorithm and does not consider VM migration and SLAV. Li et al. [39] proposed a server consolidation method based on multi-resource constraints. The goal of this method is to reduce energy consumption while ensuring user QoS. This work considers multi-resource constraints in host workload detection and in the design of the VM placement algorithm. However, in terms of the host energy consumption model and SLAV, only the CPU factor is considered. Yadav et al. [31] established a server consolidation model based on CPU usage, aiming to reduce energy consumption and SLAV at the same time. They proposed an adaptive host overload detection method and a bandwidth-aware VM selection strategy. Naeen et al. [40] proposed a stochastic-process-based server consolidation policy to minimize data center costs while satisfying QoS requirements. The proposed energy consumption model, SLAV, and heuristic VM algorithm are based on CPU usage. They also take into account the cost of hosts switching between different states. Sayadnavard et al. [41] proposed a server consolidation method based on multi-resource constraints. The goal is to minimize the number of hosts used by VM placement. When selecting a destination host for each VM to be migrated, a Markov chain model is used to determine whether that host would be overloaded soon. Yuan et al. [42] proposed to use the culture multiple-ant-colony algorithm to solve the server consolidation problem, thereby reducing the energy consumption of the hosts. Their energy consumption model is based on the CPU and does not specifically consider SLAV factors. Mamun et al. [43] studied the server consolidation problem in CDCs with wireless network structures. They proposed an energy consumption model consisting of CPU-oriented host energy consumption and the energy consumption of the various network devices in the network topology. A network-aware VM placement method was proposed to reduce energy consumption. However, they did not consider SLAV.

3. Cost Model and Problem Description

In this section, we first formally describe the cost model based on multiple computational resources in the CDC, and based on this, we give the problem definition.

3.1. Cost Model

In a CDC, the cost related to computational resources mainly involves hosts, VM migrations, and SLAV penalties.
Before giving the specific cost model, we first describe the time horizon and objects of the entire system. There are $N$ heterogeneous hosts in the CDC, forming the host set $H = \{h_1, h_2, \ldots, h_N\}$. The total resources provided by a host $h_i$ are denoted as $C_i = (c_i^{cpu}, c_i^{mem}, c_i^{disk}, c_i^{net})$, where $c_i^{cpu}$, $c_i^{mem}$, $c_i^{disk}$, and $c_i^{net}$ are the total available CPU, memory, disk R/W, and network I/O throughput, respectively. There are $M$ running VMs, forming the VM set $V = \{v_1, v_2, \ldots, v_M\}$. When a user creates a VM $v_j$, the submitted resource requirements are denoted as $D_j = (d_j^{cpu}, d_j^{mem}, d_j^{disk}, d_j^{net})$, where $d_j^{cpu}$, $d_j^{mem}$, $d_j^{disk}$, and $d_j^{net}$ are the total requirements of CPU, memory, disk R/W, and network I/O throughput, respectively.
The life cycle of the CDC is divided into $L$ small, equally long, consecutive time segments $\{t_1, t_2, \ldots, t_L\}$, each of length $T$. In a certain time segment $t_k$, if host $h_i$ is in the working state, then $\lambda_{i,k} = 1$; otherwise, $\lambda_{i,k} = 0$. The amount of resources that $h_i$ can provide is denoted as $R_{i,k} = (r_{i,k}^{cpu}, r_{i,k}^{mem}, r_{i,k}^{disk}, r_{i,k}^{net})$. In $t_k$, the resources required by VM $v_j$ are denoted as $S_{j,k} = (s_{j,k}^{cpu}, s_{j,k}^{mem}, s_{j,k}^{disk}, s_{j,k}^{net})$.
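To make the notation concrete, the following is a minimal Python sketch of these objects as we use them in the rest of this section; the class and field names are ours, chosen for illustration only.

from dataclasses import dataclass, field
from typing import Dict, List

RESOURCES = ("cpu", "mem", "disk", "net")

@dataclass
class Host:
    """A host h_i with total capacities C_i = (c_cpu, c_mem, c_disk, c_net)."""
    capacity: Dict[str, float]           # c_i^cpu, c_i^mem, c_i^disk, c_i^net
    active: bool = True                  # lambda_{i,k}: True if working in t_k
    vm_ids: List[int] = field(default_factory=list)

@dataclass
class VM:
    """A VM v_j; usage[k] holds S_{j,k} = (s_cpu, s_mem, s_disk, s_net)."""
    demand: Dict[str, float]             # D_j, requested at creation time
    usage: List[Dict[str, float]] = field(default_factory=list)

def remaining(host: Host, vms: Dict[int, VM], k: int) -> Dict[str, float]:
    """R_{i,k}: resources h_i can still provide in time segment t_k."""
    used = {r: sum(vms[j].usage[k][r] for j in host.vm_ids) for r in RESOURCES}
    return {r: host.capacity[r] - used[r] for r in RESOURCES}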
We summarize the total cost of the CDC over a given life cycle by analyzing the performance of each computing device in each time segment. In addition to the energy consumption cost of the hosts, it is also necessary to consider the cost of VM migration during each server consolidation and the penalty caused by the occurrence of SLAV. We discuss them separately in the following subsections.

3.1.1. Cost of Host Energy Consumption

Given a host $h_i$, its cost $C_{h_i}$ during operation is mainly related to the electricity price $EP$ and its power $p_{h_i}$, which is:

$C_{h_i} = \sum_{k=1}^{L} \left( EP \times p_{h_i} \times T \times \lambda_{i,k} \right).$ (1)
It should be noted that if h i is powered off or in a power-saving state, its power consumption is negligible, so it will not incur any electricity-related costs.
In CDCs, in order to meet the various business demands of a large number of different users, almost all hosts are built from high-power, high-performance components. Therefore, it is impractical to model host power consumption by analyzing the CPU alone. Basmadjian et al. [44] calculated that the CPU accounts for only about 37% of a host's average energy consumption. Note that when users create or apply for a VM in a CDC, they directly specify the relevant parameters of the CPU, memory, disk, and network devices. We therefore consider the host power consumption to be related to four resources: CPU, memory, disk, and network interface card (NIC). In this paper, we build the power consumption model of the host based on these four resources:
  • CPU power model
Buyya et al. [6] leveraged a linear CPU power consumption model, in which the current energy consumption of the CPU is directly proportional to its usage rate: when the CPU usage increases, the power of the host increases linearly. Although the linear model is simple to calculate, its estimates fall below the actual power consumption values and its accuracy is insufficient [24]. Therefore, we adopt an exponential model to fit the power consumption of the CPU.
Under normal circumstances, even if a working host is in the idle state (CPU usage rate of 0), the CPU still generates a certain amount of energy consumption, which needs to be considered. Hence, the CPU power $p_{i,k}^{cpu}$ of host $h_i$ in $t_k$ can be described as:

$p_{i,k}^{cpu} = p_{idle,i}^{cpu} + (p_{peak,i} - p_{idle,i}) \times (u_{i,k}^{cpu})^{\alpha_i^{cpu}},$ (2)

where $p_{idle,i}^{cpu}$ is the CPU power of $h_i$ in the idle state, $p_{peak,i}$ is the CPU power of $h_i$ when running at full load, $u_{i,k}^{cpu}$ is the usage rate of the CPU at any time in $t_k$, and $\alpha_i^{cpu}$ is the fit index. Here, we assume that the CPU usage at any time in $t_k$ is constant and equal to $u_{i,k}^{cpu}$. For a given CPU model, the powers in the idle state and at full load are fixed values. According to [45], the value of $\alpha_i^{cpu}$ varies with the CPU usage; however, in general, the CPU power is described more accurately when $\alpha_i^{cpu} = 0.75$ [45].
  • Memory power model
The power consumption of memory is related to its reading and writing. Accurately monitoring the read and write speed of memory in real time is too expensive. An alternative low-overhead solution is to estimate the throughput of memory through last level cache (LLC) miss counter available in processors. However, this approach is still impractical because the specific instructions associated with LLC misses are different in different CPU models [22].
In fact, in all public host or VM workload trace datasets, the provided memory-related data are the memory usage in certain periods of time. Therefore, we use the current memory footprint (current usage of memory) $u_{i,k}^{mem}$ to estimate the memory power $p_{i,k}^{mem}$ of host $h_i$:

$p_{i,k}^{mem} = p_{idle,i}^{mem} + \alpha_i^{mem} \times u_{i,k}^{mem},$ (3)

where $p_{idle,i}^{mem}$ is the memory power of $h_i$ in the idle state, $\alpha_i^{mem}$ is the fit index, and $u_{i,k}^{mem}$ is the memory usage at any time in $t_k$. Here, we assume that the memory usage at any time in $t_k$ is constant and equal to $u_{i,k}^{mem}$. Statistically, for a DDR memory system, the memory power is estimated more accurately with $\alpha_i^{mem} = 0.3$ W/GB [46].
  • Disk power model
The power consumption of the disk, $p_{i,k}^{disk}$, is directly related to its read and write throughput; we have:

$p_{i,k}^{disk} = \alpha_i^{r} \times u_{i,k}^{r} + \alpha_i^{w} \times u_{i,k}^{w} + p_{idle,i}^{disk},$ (4)

where $u_{i,k}^{r}$ and $u_{i,k}^{w}$ are the number of bytes read from and written to the disk of $h_i$ at any time in $t_k$, respectively; $\alpha_i^{r}$ and $\alpha_i^{w}$ are the disk read and write fit indices, respectively; and $p_{idle,i}^{disk}$ is the disk power in the idle state. Here, we assume that the numbers of disk read and write bytes at any time in $t_k$ are constant. Kansal et al. [22] discovered that the difference between disk read power and write power is negligible. Hence, Equation (4) can be updated as:

$p_{i,k}^{disk} = p_{idle,i}^{disk} + \alpha_i^{disk} \times u_{i,k}^{disk},$ (5)

where $u_{i,k}^{disk} = u_{i,k}^{r} + u_{i,k}^{w}$, and $\alpha_i^{disk}$ is the fit index.
In the actual operation of a host, the power consumption of the disk depends on whether reads and writes are random or sequential: different read/write modes generate different power consumption. Hence, $\alpha_i^{disk}$ is divided into $\alpha_i^{seq}$ (sequential read and write) and $\alpha_i^{rnd}$ (random read and write). In practice, it is impractical to monitor and detect the read/write mode of the disk. Therefore, for convenience, we assume that the probability of sequential reads and writes caused by all users equals that of random reads and writes. We obtain:

$\alpha_i^{disk} = (\alpha_i^{seq} + \alpha_i^{rnd}) / 2.$ (6)

Lin et al. [24] found that setting $\alpha_i^{seq} = 0.07$ W/MB/s and $\alpha_i^{rnd} = 0.22$ W/MB/s estimates the power consumption of the disk more accurately.
  • Network power model
We mainly discuss the power of NICs. Under the framework of the TCP/IP protocol, we model the NIC power $p_{i,k}^{NIC}$ based mainly on the number of IP packets (the size of an IP packet is 500 bytes, denoted as $ip\_size$) that it can handle per second. We have:

$p_{i,k}^{NIC} = p_{idle,i}^{NIC} + \alpha_i^{NIC} \times u_{i,k}^{NIC},$ (7)

where $\alpha_i^{NIC}$ is the power consumption of each IP packet processed by $h_i$'s NIC, $u_{i,k}^{NIC} = s_{j,k}^{net} / ip\_size$ is the number of IP packets processed at any time in $t_k$, and $p_{idle,i}^{NIC}$ is the power of the NIC in the idle state. Here, we assume that the number of IP packets processed by the NIC at any time in $t_k$ is constant. Garcia-Saavedra et al. [47] found that the value of $\alpha_i^{NIC}$ should be $1.26 \times 10^{-5}$ joules/packet.
Hence, combining Equations (2), (3), (5) and (7), we obtain:

$p_{h_i} = p_{i,k}^{cpu} + p_{i,k}^{mem} + p_{i,k}^{disk} + p_{i,k}^{NIC}.$ (8)

Then, combining with Equation (1), we obtain the energy consumption cost of all hosts, $C_H$:

$C_H = \sum_{i=1}^{N} C_{h_i} = \sum_{i=1}^{N} \sum_{k=1}^{L} \left[ EP \times (p_{i,k}^{cpu} + p_{i,k}^{mem} + p_{i,k}^{disk} + p_{i,k}^{NIC}) \times T \times \lambda_{i,k} \right].$ (9)
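As an illustration, the power models in Equations (2)-(9) can be evaluated as in the following sketch. The function and parameter names are ours; the fit constants are the values cited above ($\alpha_i^{cpu} = 0.75$, $\alpha_i^{mem} = 0.3$ W/GB, $\alpha_i^{disk} = (0.07 + 0.22)/2$ W/MB/s, and $\alpha_i^{NIC} = 1.26 \times 10^{-5}$ J/packet).

IP_SIZE = 500  # bytes per IP packet, as assumed in the network model

def cpu_power(u_cpu, p_idle_cpu, p_peak, p_idle, alpha=0.75):
    """Equation (2): exponential CPU power model, u_cpu in [0, 1]."""
    return p_idle_cpu + (p_peak - p_idle) * u_cpu ** alpha

def mem_power(u_mem_gb, p_idle_mem, alpha=0.3):
    """Equation (3): linear memory power model (W per GB of footprint)."""
    return p_idle_mem + alpha * u_mem_gb

def disk_power(u_disk_mbps, p_idle_disk, alpha=(0.07 + 0.22) / 2):
    """Equations (5)-(6): linear disk power model over R/W throughput (MB/s)."""
    return p_idle_disk + alpha * u_disk_mbps

def nic_power(net_bytes_per_s, p_idle_nic, alpha=1.26e-5):
    """Equation (7): NIC power from IP packets handled per second."""
    return p_idle_nic + alpha * (net_bytes_per_s / IP_SIZE)

def host_energy_cost(usage_per_segment, params, ep, T):
    """Equations (8)-(9): energy cost of one host over all time segments.

    usage_per_segment: list of (u_cpu, u_mem_gb, u_disk_mbps, net_Bps, active)
    params: idle/peak powers of the host's components
    ep: electricity price, T: segment length (consistent units assumed)
    """
    cost = 0.0
    for u_cpu, u_mem, u_disk, u_net, active in usage_per_segment:
        if not active:          # lambda_{i,k} = 0: no electricity-related cost
            continue
        p = (cpu_power(u_cpu, params["p_idle_cpu"], params["p_peak"], params["p_idle"])
             + mem_power(u_mem, params["p_idle_mem"])
             + disk_power(u_disk, params["p_idle_disk"])
             + nic_power(u_net, params["p_idle_nic"]))
        cost += ep * p * T
    return cost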

3.1.2. Cost of VM Migration

We assume that at the beginning of each time segment, the CDC performs server consolidation to balance the CSP's cost against the users' performance. VM migration is an important part of server consolidation. The majority of the data transferred during VM live migration comes from the memory of the VM. Although the VM generates several dirty pages during live migration, Dargie et al. [26] found that the energy consumption of VM live migration is positively related to the memory size of the migrating VM: the larger the VM's memory at the time of migration, the longer the migration takes and the more energy it consumes.
In a CDC, migrating a VM $v_j$ from host $h_i$ to another host $h_{i'}$ requires additional resources provided by $h_i$ to support the process. We assume that $h_i$ reserves enough resources to complete the migration of $v_j$ and that $h_{i'}$ also reserves enough resources to receive $v_j$.
The work conducted by Buyya et al. [6] on server consolidation is based on the assumption that a VM consumes an additional 10% of CPU usage during migration. In this paper, we extend this assumption to the VM's usage of all four computing resources during migration. In addition, in order to speed up batch VM migration and reduce its impact on user services, we assume that the CDC deploys an exclusive NIC and network for migration tasks, where the size of the exclusive migration bandwidth of $h_i$ is $MIG\_NET_i$. Therefore, the VM migrations could be completed within a time segment. This part of the cost is denoted as $C_k^{mig\_net}$. In this paper, the total cost of VM migration of a CDC in a given life cycle is denoted as $C^{mig}$. Since the power model of the CPU is a nonlinear exponential model while the power models of the other resources are linear, $C^{mig}$ can be described as:
$C^{mig} = \sum_{k=1}^{L} \left( C_k^{vcpu} + C_k^{mig} + C_k^{mig\_net} \right),$ (10)

where $C_k^{vcpu}$ is the cost caused by the CPU when migrating VMs within $t_k$, and $C_k^{mig}$ is the cost caused by non-CPU resources when migrating VMs within $t_k$.
$C_k^{mig}$ is described as:

$C_k^{mig} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{i'=1}^{N} \left( EP \times \gamma_{j,i,i',k} \times p_{j,k}^{mig} \times t_{j,k}^{mig} \right),$ (11)

where $\gamma_{j,i,i',k}$ is a 0-1 indicator, $p_{j,k}^{mig}$ is the power of migrating $v_j$, and $t_{j,k}^{mig}$ is the time spent migrating $v_j$. If $v_j$ is migrated from $h_i$ to $h_{i'}$, then $\gamma_{j,i,i',k} = 1$; otherwise, $\gamma_{j,i,i',k} = 0$. Furthermore, it is obvious that $t_{j,k}^{mig} < T$. Since VM memory constitutes the majority of the data transferred during migration, we have:

$t_{j,k}^{mig} = \frac{s_{j,k}^{mem}}{mig\_bw_{i,k}},$ (12)

where $mig\_bw_{i,k}$ is the migration bandwidth allocated by $h_i$ to $v_j$. We assume that the dedicated migration bandwidth is evenly allocated to each migrated VM within $t_k$ on $h_i$. Hence, for a given $h_i$ and $h_{i'}$, we have:

$mig\_bw_{i,k} = \frac{MIG\_NET_i}{\sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{i'=1}^{N} \gamma_{j,i,i',k}},$ (13)

and then

$t_{j,k}^{mig} = \frac{s_{j,k}^{mem} \times \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{i'=1}^{N} \gamma_{j,i,i',k}}{MIG\_NET_i}.$ (14)

Then, we substitute Equation (14) into Equation (11).
We let $p_{j,k}^{vmem}$, $p_{j,k}^{vdisk}$, and $p_{j,k}^{vnet}$ be the memory, disk, and network power of $v_j$ within $t_k$, respectively. The migration cost regarding the memory of $v_j$ is $0.1 \times p_{j,k}^{vmem} = 0.1 \times \alpha_i^{mem} \times s_{j,k}^{mem}$, the migration cost regarding the disk of $v_j$ is $0.1 \times p_{j,k}^{vdisk} = 0.1 \times \alpha_i^{disk} \times s_{j,k}^{disk}$, and the migration cost regarding the network of $v_j$ is $0.1 \times p_{j,k}^{vnet} = 0.1 \times \alpha_i^{NIC} \times u_{i,k}^{NIC}$. Hence:

$p_{j,k}^{mig} = 0.1 \times \left( \alpha_i^{mem} \times s_{j,k}^{mem} + \alpha_i^{disk} \times s_{j,k}^{disk} + \alpha_i^{NIC} \times u_{i,k}^{NIC} \right).$ (15)
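The following sketch evaluates Equations (12)-(15) for the VMs leaving one source host in a segment, under the stated assumption that the dedicated migration bandwidth $MIG\_NET_i$ is shared evenly among them; all names and the dictionary layout are illustrative.

def migration_time(vm_mem, n_migrating, mig_net):
    """Equations (12)-(13): migration time when the host's dedicated
    bandwidth mig_net is shared evenly among n_migrating departing VMs."""
    mig_bw = mig_net / n_migrating
    return vm_mem / mig_bw

def migration_power(s_mem, s_disk, u_nic, a_mem=0.3, a_disk=0.145, a_nic=1.26e-5):
    """Equation (15): non-CPU power of migrating v_j, taken as 10% of the
    VM's memory, disk, and network power on the source host."""
    return 0.1 * (a_mem * s_mem + a_disk * s_disk + a_nic * u_nic)

def migration_cost_k(migrating_vms, mig_net, ep):
    """Per-VM terms of Equation (11) for one source host in segment t_k.

    migrating_vms: list of dicts with 'mem', 'disk', 'nic' usage values
    """
    n = len(migrating_vms)
    return sum(ep * migration_power(v["mem"], v["disk"], v["nic"])
               * migration_time(v["mem"], n, mig_net)
               for v in migrating_vms)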
Based on the power model in Equation (7), we discuss $C_k^{mig\_net}$:

$C_k^{mig\_net} = 2 \times \sum_{i=1}^{N} \left( EP \times \lambda_{i,k} \times p_{i,k}^{mig\_net} \times T \right),$ (16)
where $p_{i,k}^{mig\_net}$ is the power of the dedicated NIC for migration on $h_i$. Since $h_i$ sends the VM data and $h_{i'}$ receives the corresponding data, the NIC power consumption of both needs to be considered. We obtain

$p_{i,k}^{mig\_net} \times T = \left( \alpha_i^{NIC} \times u_{i,k}^{mig\_net} + p_{idle,i}^{NIC} \right) \times T,$ (17)

where $u_{i,k}^{mig\_net} \times T$ is the total VM data transferred from $h_i$ to $h_{i'}$ in time segment $t_k$. Then we have:

$u_{i,k}^{mig\_net} \times T = \sum_{j=1}^{M} \left( \gamma_{j,i,i',k} \times s_{j,k}^{mem} \right).$ (18)

We substitute Equation (18) into Equation (16) and have:

$C_k^{mig\_net} = 2 \times \sum_{i=1}^{N} \left[ EP \times \lambda_{i,k} \times \left( \alpha_i^{NIC} \times \sum_{j=1}^{M} \gamma_{j,i,i',k} \times s_{j,k}^{mem} + T \times p_{idle,i}^{NIC} \right) \right].$ (19)
We discuss $C_k^{vcpu}$ in the following paragraph. A 0-1 indicator $\beta_{i,j,k}$ is used to mark whether VM $v_j$ is running on host $h_i$ at the beginning of $t_k$. If $v_j$ is running on $h_i$, then $\beta_{i,j,k} = 1$; otherwise, $\beta_{i,j,k} = 0$. We assume that the power consumption of the hosts in the CDC is mainly used to keep VMs running. Hence, regarding $h_i$, the term $(p_{peak,i} - p_{idle,i}) \times (u_{i,k}^{cpu})^{\alpha_i^{cpu}}$ in Equation (2) can be presented as:

$(p_{peak,i} - p_{idle,i}) \times (u_{i,k}^{cpu})^{\alpha_i^{cpu}} = (p_{peak,i} - p_{idle,i}) \times \left[ \sum_{j=1}^{M} \left( \beta_{i,j,k} \times s_{j,k}^{cpu} \right) \right]^{\alpha_i^{cpu}}.$ (20)

Then, we take the VM migrations on $h_i$ into consideration:

$(p_{peak,i} - p_{idle,i}) \times \left[ \sum_{j=1}^{M} \left( \beta_{i,j,k} \times s_{j,k}^{cpu} \right) + 0.1 \times \sum_{j=1}^{M} \left( \gamma_{j,i,i',k} \times s_{j,k}^{cpu} \right) \right]^{\alpha_i^{cpu}} = (p_{peak,i} - p_{idle,i}) \times \left\{ \sum_{j=1}^{M} \left[ \left( \beta_{i,j,k} + 0.1 \times \gamma_{j,i,i',k} \right) \times s_{j,k}^{cpu} \right] \right\}^{\alpha_i^{cpu}}.$ (21)

It should be noted that if $\gamma_{j,i,i',k} = 1$, then $\beta_{i,j,k} = 1$. Then, we substitute Equation (21) into Equation (2). We denote the new host energy cost $C_H$ resulting from the updated $p_{i,k}^{cpu}$ as $C_H'$.

3.1.3. SLAV Penalty

In CDCs, once SLAV emerges, the CSP must compensate the relevant users in some form. This is an effective method for CSPs to guarantee QoS after users pay for cloud services. This part of the overhead needs to be included in the cost of the CDC.
We extend the CPU-based SLA definition [13] to CPU SLAV, memory SLAV, disk SLAV, and network SLAV, denoted as $SLAV^{cpu}$, $SLAV^{mem}$, $SLAV^{disk}$, and $SLAV^{net}$, respectively.

$SLAV^{cpu} = \frac{1}{N} \sum_{i=1}^{N} \frac{T_i^{s,cpu}}{T_i^{a,cpu}} \times \frac{1}{M} \sum_{j=1}^{M} \sum_{k=1}^{L} \frac{ud_{j,k}^{d,cpu}}{s_{j,k}^{r,cpu}},$ (22)

where $T_i^{s,cpu}$ is the CPU SLAV duration caused by CPU overload on $h_i$, $T_i^{a,cpu}$ is the working duration of $h_i$, $ud_{j,k}^{d,cpu}$ is the size of the unsatisfied demand for CPU resources as a result of $v_j$'s migration in $t_k$, and $s_{j,k}^{r,cpu}$ is the CPU requested by $v_j$ in $t_k$.
Similarly, we formally define $SLAV^{mem}$, $SLAV^{disk}$, and $SLAV^{net}$ as follows:

$SLAV^{mem} = \frac{1}{N} \sum_{i=1}^{N} \frac{T_i^{s,mem}}{T_i^{a,mem}} \times \frac{1}{M} \sum_{j=1}^{M} \sum_{k=1}^{L} \frac{ud_{j,k}^{d,mem}}{s_{j,k}^{r,mem}},$ (23)

$SLAV^{disk} = \frac{1}{N} \sum_{i=1}^{N} \frac{T_i^{s,disk}}{T_i^{a,disk}} \times \frac{1}{M} \sum_{j=1}^{M} \sum_{k=1}^{L} \frac{ud_{j,k}^{d,disk}}{s_{j,k}^{r,disk}},$ (24)

and

$SLAV^{net} = \frac{1}{N} \sum_{i=1}^{N} \frac{T_i^{s,net}}{T_i^{a,net}} \times \frac{1}{M} \sum_{j=1}^{M} \sum_{k=1}^{L} \frac{ud_{j,k}^{d,net}}{s_{j,k}^{r,net}},$ (25)

where the corresponding parameters are defined analogously to those of $SLAV^{cpu}$; no further explanation is given here.
In a given life cycle of the CDC, the penalty cost caused by the appearance of SLAV is

$C^{SLAV} = pun^{cpu} \times SLAV^{cpu} + pun^{mem} \times SLAV^{mem} + pun^{disk} \times SLAV^{disk} + pun^{net} \times SLAV^{net},$ (26)

where $pun^{cpu}$, $pun^{mem}$, $pun^{disk}$, and $pun^{net}$ are the CPU, memory, disk, and network SLAV penalty price indices, respectively.
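A minimal sketch of Equations (22)-(26): each per-resource SLAV multiplies the average fraction of active time that hosts spend in the SLAV state by the average relative unsatisfied demand caused by migrations. The dictionary fields are our own illustrative layout.

def slav(hosts, vms, resource):
    """Equations (22)-(25) for one resource ('cpu', 'mem', 'disk', 'net')."""
    # Average fraction of active time each host spends in the SLAV state.
    overload = sum(h["t_slav"][resource] / h["t_active"] for h in hosts) / len(hosts)
    # Average relative unsatisfied demand due to VM migrations.
    degradation = sum(
        sum(seg["unmet"][resource] / seg["requested"][resource]
            for seg in v["segments"])
        for v in vms
    ) / len(vms)
    return overload * degradation

def slav_cost(hosts, vms, penalty):
    """Equation (26): total SLAV penalty with per-resource price indices."""
    return sum(penalty[r] * slav(hosts, vms, r)
               for r in ("cpu", "mem", "disk", "net"))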

3.2. Problem Description

In Section 3.1, we discussed the cost models involved in the operating cost of the computing equipment in a CDC: the host energy consumption cost $C_H'$ (with the migration-related CPU cost folded in), the VM migration cost $C^{mig}$, and the SLAV penalty cost $C^{SLAV}$. In this paper, our goal is to minimize the associated operating cost $C$ of the CDC. Based on the above models, we have:

$\mathrm{MIN}\ C = C_H' + \sum_{k=1}^{L} \left( C_k^{mig} + C_k^{mig\_net} \right) + C^{SLAV}.$ (27)
We name this problem the Minimizing Computing Resources Cost (MCRC) problem in server consolidation. The constraints of the MCRC problem are:

$\sum_{i=1}^{N} \beta_{i,j,k} = 1, \quad \forall j, k,$ (28)

$\sum_{i'=1}^{N} \gamma_{j,i,i',k} = 1, \quad \forall i \neq i', j, k,$ (29)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{cpu} \leq c_i^{cpu}, \quad \forall i, k,$ (30)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{mem} \leq c_i^{mem}, \quad \forall i, k,$ (31)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{disk} \leq c_i^{disk}, \quad \forall i, k,$ (32)

and

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{net} \leq c_i^{net}, \quad \forall i, k.$ (33)

Equation (28) indicates that any VM runs on exactly one host in any given time segment. Equation (29) indicates that the destination host of any VM migration exists and is unique. Equations (30)–(33) indicate that the CPU, memory, disk, and network resources provided by each host to its VMs cannot exceed the host's own resource limits, respectively.
Now we analyze the complexity of the MCRC problem. Consider a simple case that satisfies the following conditions: the hosts in the CDC are homogeneous, and the resource requirements of any VM $v_j$ in any time segment $t_k$ are fixed values satisfying Equations (30)–(33). Then, the VM migration cost and the SLAV penalty cost are both 0, and the objective function of the MCRC problem becomes:

$\mathrm{MIN}\ C = C_H.$ (34)

Apparently, the MCRC problem in this simple case reduces to the Bin-Packing problem, in which VMs are the items and hosts are the bins. Since the Bin-Packing problem is NP-hard, the MCRC problem is also NP-hard.

4. Solution for MCRC Problem

Since the MCRC problem is NP-hard, we propose a heuristic solution based on the traditional three-step method for server consolidation. The first step is to detect the overloaded and underloaded hosts; the second step is to select suitable VMs to migrate away from the overloaded hosts; the third step is to select an appropriate destination host for each VM to be migrated.
Before performing host overload detection and VM selection, we first predict the future workload changes of the VMs based on their historical workloads. The purpose of this is to balance the hosts' loads before they become overloaded and cause SLAV, thereby reducing cost.

4.1. VM Workload Prediction

Before predicting the future workload of a VM, we first need to preprocess its workload history. In Section 3.1.1, we assume that within a certain time segment $t_k$, the usage of a given resource by a VM is constant. In practice, this is rarely the case. For example, if the sampling interval for workload records is five minutes (the interval used by most public workload trace datasets), the CPU usage of the VM during those five minutes will vary according to business conditions. The following situation is likely to occur: during the 299 s after the $(k-1)$-th sampling, the CPU usage of a VM is extremely low, but at the 300th second (the $k$-th sampling), the CPU usage instantly jumps to 90%. This record does not reflect that the VM's CPU has been under heavy load for the past five minutes. This implies two points: (1) there is a certain deviation between the historical sampling records and the actual resource usage of the VM; (2) the assumption in Section 3.1.1 deviates somewhat from the actual resource usage of the VM. To minimize the impact of these biases on the final result, we can treat the workload records as containing noise. In this paper, we leverage a convolutional auto-encoder (CAE) to build a filter algorithm that preprocesses the workload of the VM and then adopt an attention-based RNN method for prediction.

4.1.1. A Convolutional Auto-Encoder-Based Filter

CAE has been shown to process time series data well [48].
We use the CAE-based filter to process each of the four resource usage series of a VM separately. We take the historical time series of CPU usage by VM $v_j$ as an example to illustrate the proposed CAE-based filter; we do not describe the mathematical principles of the CAE in detail in this paper.

Let the CPU usage record of $v_j$ in $l$ consecutive time periods be the time series $\{s_{j,1}^{cpu}, s_{j,2}^{cpu}, \ldots, s_{j,l}^{cpu}\}$. After being processed by the encoder and decoder of the CAE-based filter, the denoised time series of CPU usage is $\{s_{j,1}^{'cpu}, s_{j,2}^{'cpu}, \ldots, s_{j,l}^{'cpu}\}$. The neural network structure of the CAE-based filter is shown in Figure 1.

As shown in Figure 1, both the encoder and the decoder are three-layer network structures, and the activation function is $\tanh(\cdot)$. In actual use, the input is the data collected over 48 consecutive time periods (4 h), normalized. The data collected over the last 5000 consecutive time periods are used for training. Figure 2 shows the result of denoising the CPU usage of a VM in the Alibaba CDC [49] with the proposed CAE-based filter.
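Below is a minimal PyTorch sketch of such a CAE-based denoising filter. It follows the description above (three-layer encoder and decoder with Tanh activations over a normalized 48-step window, trained on the last 5000 windows); the channel widths and training hyperparameters are illustrative, since Figure 1 is not reproduced here.

import torch
import torch.nn as nn

class CAEFilter(nn.Module):
    """1-D convolutional auto-encoder used as a denoising filter over a
    normalized usage window of 48 samples (4 h at 5-min sampling)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # three conv layers with Tanh
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv1d(8, 16, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.Tanh(),
        )
        self.decoder = nn.Sequential(            # mirrored three layers
            nn.ConvTranspose1d(32, 16, kernel_size=3, padding=1), nn.Tanh(),
            nn.ConvTranspose1d(16, 8, kernel_size=3, padding=1), nn.Tanh(),
            nn.ConvTranspose1d(8, 1, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, x):                        # x: (batch, 1, 48)
        return self.decoder(self.encoder(x))

# Training: fit the CAE to reconstruct its input; the bottleneck discards
# high-frequency noise in the usage trace.
model = CAEFilter()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
windows = torch.rand(5000, 1, 48)   # placeholder for real normalized traces
for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(model(windows), windows)
    loss.backward()
    opt.step()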

4.1.2. An Attention-Based RNN Prediction Method

The time series data denoised by the CAE-based filter are used to predict the resource usage of the VM in the future time period. The proposed prediction method is based on the attention mechanism, is named the ARP algorithm, and builds on our previous work [50]. The ARP algorithm predicts the usage of the four resources by a VM simultaneously, without separate processing.

The ARP algorithm consists of two parts: the encoder and the decoder. In the encoder, an attention module is used to adaptively select the relevant series. Then, in the decoder, the relevant encoder hidden states are selected via another attention module [50]. We show the neural network structures of the encoder and the decoder in Figure 3 and Figure 4, respectively.
Let $S_{j,l} = \{s_{j,l}^{cpu}, s_{j,l}^{mem}, s_{j,l}^{disk}, s_{j,l}^{net}\}$, where $s_{j,l}^{cpu} = \{s_{j,1}^{cpu}, s_{j,2}^{cpu}, \ldots, s_{j,L}^{cpu}\}$ ($L$ is the size of the window), $s_{j,l}^{mem} = \{s_{j,1}^{mem}, s_{j,2}^{mem}, \ldots, s_{j,L}^{mem}\}$, $s_{j,l}^{disk} = \{s_{j,1}^{disk}, s_{j,2}^{disk}, \ldots, s_{j,L}^{disk}\}$, and $s_{j,l}^{net} = \{s_{j,1}^{net}, s_{j,2}^{net}, \ldots, s_{j,L}^{net}\}$. The attention mechanism in the encoder calculates the encoder attention weight $\alpha_l^q$ based on the previous $l-1$ hidden states and then obtains $S_{j,l}' = \{\alpha_l^1 \times s_{j,l}^{cpu}, \alpha_l^2 \times s_{j,l}^{mem}, \alpha_l^3 \times s_{j,l}^{disk}, \alpha_l^4 \times s_{j,l}^{net}\}$. Then, $S_{j,l}'$ is fed to the LSTM unit in the encoder. The attention mechanism in the decoder calculates the attention weight $\beta_l^{l'}$ based on the previous $l-1$ hidden states. At time $l$, the attention weights of the $L$ encoder hidden states are $\beta_l^{1}$ to $\beta_l^{L}$, respectively. The input information is represented as a weighted sum of the encoder hidden states across all the time steps and is then fed to the LSTM unit of the decoder. The output of the LSTM is the prediction for the next time period. For the specific explanation and mathematical description of the model, such as the calculation of the attention weights, please refer to our previous work [50].
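For intuition, the encoder's input-attention step can be sketched in PyTorch as below: at each step $l$, a weight is computed for each of the four resource series from the previous LSTM states, and the re-weighted input $S_{j,l}'$ is fed to the LSTM cell. This is only a compact sketch of the structure described above; the layer sizes and the exact scoring network are illustrative, and the full model (including the decoder's temporal attention) is specified in [50].

import torch
import torch.nn as nn

class InputAttentionEncoder(nn.Module):
    """Encoder with input attention over the 4 resource series."""
    def __init__(self, n_series=4, window=10, hidden=32):
        super().__init__()
        self.n_series, self.window, self.hidden = n_series, window, hidden
        self.lstm = nn.LSTMCell(n_series, hidden)
        self.attn = nn.Linear(2 * hidden + window, 1)  # scores one series

    def forward(self, x):                      # x: (batch, window, 4)
        b = x.size(0)
        h = x.new_zeros(b, self.hidden)
        c = x.new_zeros(b, self.hidden)
        outputs = []
        for l in range(self.window):
            # Score each series q from the previous (h, c) and its window.
            hc = torch.cat([h, c], dim=1)                      # (b, 2*hidden)
            scores = torch.stack([
                self.attn(torch.cat([hc, x[:, :, q]], dim=1)).squeeze(1)
                for q in range(self.n_series)
            ], dim=1)                                          # (b, 4)
            alpha = torch.softmax(scores, dim=1)               # alpha_l^q
            h, c = self.lstm(alpha * x[:, l, :], (h, c))       # feed S'_{j,l}
            outputs.append(h)
        return torch.stack(outputs, dim=1)     # encoder states, (b, L, hidden)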
As examples, Figures 5–7 demonstrate the CPU prediction results of applying the ARP algorithm ($L = 10$) to the resource usage of three different VMs in the Alibaba CDC [49].
After accurately predicting the resource usage of each VM in the next period, we can perform host workload detection and VM selection.

4.2. Host Workload Detection

The purpose of host overload detection is to avoid and eliminate fierce competition among VMs for resources, thereby reducing the occurrence of SLAV. Common host overload detection methods fall into two categories: the static threshold method and the dynamic threshold method. In the static threshold method, the overload thresholds for the various resources are set to fixed values; when the usages exceed the thresholds, the host is overloaded and SLAV occurs, at which point VMs must be migrated to reduce the load. In the dynamic threshold method, various statistical methods are used to analyze the usage of computing resources by VMs or hosts to determine whether the competition for resources is intense and whether the host is overloaded. The advantage of the static threshold method is that host resources are fully utilized; the disadvantage is that, once a host is overloaded, more overhead is required to resolve the SLAV. The advantage of the dynamic threshold method is that it effectively reduces SLAV, but sometimes host resources are underutilized. Therefore, we combine the advantages of the two and propose an ARP-based fixed-threshold overload detection method (the ARP-FT method).
ARP-FT is a dual detection method. At the beginning of time segment $t_k$, ARP-FT first detects whether the usage of the various resources on $h_i$ exceeds the given thresholds; then, based on the prediction result of the ARP algorithm, it judges whether the usage of the various resources on $h_i$ will exceed the given thresholds in the next period.
Let the overload thresholds be $TH_{up} = \{TH_{up}^{cpu}, TH_{up}^{mem}, TH_{up}^{disk}, TH_{up}^{net}\}$, where $TH_{up}^{cpu}$, $TH_{up}^{mem}$, $TH_{up}^{disk}$, and $TH_{up}^{net}$ all lie in the interval $(0, 1)$. Host $h_i$ is overloaded in time segment $t_k$ when at least one of the following inequalities holds:

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{cpu} > TH_{up}^{cpu} \times c_i^{cpu},$ (35)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{mem} > TH_{up}^{mem} \times c_i^{mem},$ (36)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{disk} > TH_{up}^{disk} \times c_i^{disk},$ (37)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{net} > TH_{up}^{net} \times c_i^{net},$ (38)

$\sum_{j=1}^{M} \beta_{i,j,k+1} \times s_{j,k+1}^{cpu} > TH_{up}^{cpu} \times c_i^{cpu},$ (39)

$\sum_{j=1}^{M} \beta_{i,j,k+1} \times s_{j,k+1}^{mem} > TH_{up}^{mem} \times c_i^{mem},$ (40)

$\sum_{j=1}^{M} \beta_{i,j,k+1} \times s_{j,k+1}^{disk} > TH_{up}^{disk} \times c_i^{disk},$ (41)

$\sum_{j=1}^{M} \beta_{i,j,k+1} \times s_{j,k+1}^{net} > TH_{up}^{net} \times c_i^{net},$ (42)

where the usages for $t_{k+1}$ are the values predicted by the ARP algorithm.
Let the underload thresholds be $TH_{down} = \{TH_{down}^{cpu}, TH_{down}^{mem}, TH_{down}^{disk}, TH_{down}^{net}\}$, where $TH_{down}^{cpu}$, $TH_{down}^{mem}$, $TH_{down}^{disk}$, and $TH_{down}^{net}$ all lie in the interval $(0, 1)$. Host $h_i$ is underloaded in time segment $t_k$ when all of the following inequalities hold:

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{cpu} < TH_{down}^{cpu} \times c_i^{cpu},$ (43)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{mem} < TH_{down}^{mem} \times c_i^{mem},$ (44)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{disk} < TH_{down}^{disk} \times c_i^{disk},$ (45)

$\sum_{j=1}^{M} \beta_{i,j,k} \times s_{j,k}^{net} < TH_{down}^{net} \times c_i^{net},$ (46)

$\sum_{j=1}^{M} \beta_{i,j,k+1} \times s_{j,k+1}^{cpu} < TH_{down}^{cpu} \times c_i^{cpu},$ (47)

$\sum_{j=1}^{M} \beta_{i,j,k+1} \times s_{j,k+1}^{mem} < TH_{down}^{mem} \times c_i^{mem},$ (48)

$\sum_{j=1}^{M} \beta_{i,j,k+1} \times s_{j,k+1}^{disk} < TH_{down}^{disk} \times c_i^{disk},$ (49)

$\sum_{j=1}^{M} \beta_{i,j,k+1} \times s_{j,k+1}^{net} < TH_{down}^{net} \times c_i^{net}.$ (50)
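The ARP-FT rules above can be sketched as two predicate functions (reusing the Host/VM structures from the sketch in Section 3.1; helper names are ours). A host is flagged overloaded if any resource exceeds its threshold in either the current segment or the ARP-predicted next one, and underloaded only if every resource falls below the lower threshold in both.

RESOURCES = ("cpu", "mem", "disk", "net")

def usage(host_vms, k, r):
    """Sum of s_{j,k}^r over the VMs currently mapped to the host."""
    return sum(vm.usage[k][r] for vm in host_vms)

def is_overloaded(host, host_vms, k, th_up):
    """Equations (35)-(42): current OR predicted usage above TH_up."""
    return any(
        usage(host_vms, t, r) > th_up[r] * host.capacity[r]
        for t in (k, k + 1)      # index k+1 holds the ARP-predicted usage
        for r in RESOURCES
    )

def is_underloaded(host, host_vms, k, th_down):
    """Equations (43)-(50): current AND predicted usage below TH_down."""
    return all(
        usage(host_vms, t, r) < th_down[r] * host.capacity[r]
        for t in (k, k + 1)
        for r in RESOURCES
    )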

4.3. VM Selection

VM selection targets the overloaded hosts. We use the ARP algorithm precisely to avoid host overloading and SLAV as much as possible, rather than reactively responding to SLAV after it occurs. Therefore, we assume that in time segment $t_{k+1}$, there may still be SLAV and overloaded hosts in the CDC, but not many. Regarding VM selection in $t_k$ for $h_i$, our priority is to select the VMs that may cause $h_i$ to be overloaded during time segment $t_{k+1}$ and form a list of VMs to be migrated. After the migration of these VMs is completed, if $h_i$ is still overloaded in $t_k$, it is processed accordingly.
Given a host $h_i$ that is overloaded at the beginning of a certain future time segment $t_l$, we define its overload type $ol\_type_{i,l}$. We check which of the inequalities in Equations (35)–(38) are satisfied by $h_i$ to determine which of the CPU, memory, disk, and network resources are involved in its overload state. Let the CPU, memory, disk, and network overload marks be $A$, $B$, $C$, and $D$, respectively. In time segment $t_l$, they are denoted as $A_{i,l}$, $B_{i,l}$, $C_{i,l}$, and $D_{i,l}$, respectively, and their corresponding values are $V(A_{i,l}) = \frac{c_i^{cpu} - r_{i,l}^{cpu}}{TH_{up}^{cpu} \times c_i^{cpu}}$, $V(B_{i,l}) = \frac{c_i^{mem} - r_{i,l}^{mem}}{TH_{up}^{mem} \times c_i^{mem}}$, $V(C_{i,l}) = \frac{c_i^{disk} - r_{i,l}^{disk}}{TH_{up}^{disk} \times c_i^{disk}}$, and $V(D_{i,l}) = \frac{c_i^{net} - r_{i,l}^{net}}{TH_{up}^{net} \times c_i^{net}}$. If $h_i$ satisfies inequalities (35) and (37), that is, $V(A_{i,l}) > 1$ and $V(C_{i,l}) > 1$, then its overload type sequence can be denoted as $(A_{i,l}\ C_{i,l})$. We sort the overloaded resources in $h_i$'s overload type sequence in descending order of the overload mark values. For example, if the initial overload type sequence of $h_i$ is $(A_{i,l}\ B_{i,l}\ C_{i,l}\ D_{i,l})$, and $V(A_{i,l}) = 0.1$, $V(B_{i,l}) = 0.3$, $V(C_{i,l}) = 0.05$, and $V(D_{i,l}) = 0.2$, then the sorted overload type sequence is $(B_{i,l}\ D_{i,l}\ A_{i,l}\ C_{i,l})$. The sorted overload type sequence is the $ol\_type_{i,l}$ of $h_i$ in $t_l$; for the above instance, $ol\_type_{i,l} = (B_{i,l}\ D_{i,l}\ A_{i,l}\ C_{i,l})$.
Given a VM $v_j$, we define its workload type $wl\_type_{j,l}$ in $t_l$. Let its CPU, memory, disk, and network usage marks be $A$, $B$, $C$, and $D$, respectively. In time segment $t_l$, they are denoted as $A_{j,l}$, $B_{j,l}$, $C_{j,l}$, and $D_{j,l}$, respectively, and their corresponding values are $V(A_{j,l}) = \frac{s_{j,l}^{cpu}}{c_{max}^{cpu}}$, $V(B_{j,l}) = \frac{s_{j,l}^{mem}}{c_{max}^{mem}}$, $V(C_{j,l}) = \frac{s_{j,l}^{disk}}{c_{max}^{disk}}$, and $V(D_{j,l}) = \frac{s_{j,l}^{net}}{c_{max}^{net}}$, where $c_{max}^{cpu} = \max\{c_i^{cpu} \mid i \in [1, N]\}$, $c_{max}^{mem} = \max\{c_i^{mem} \mid i \in [1, N]\}$, $c_{max}^{disk} = \max\{c_i^{disk} \mid i \in [1, N]\}$, and $c_{max}^{net} = \max\{c_i^{net} \mid i \in [1, N]\}$. Let the initial workload type sequence of VM $v_j$ be $(A_{j,l}\ B_{j,l}\ C_{j,l}\ D_{j,l})$. We sort the initial sequence in descending order of the resource usage mark values, and the sorted workload type sequence is $wl\_type_{j,l}$. For example, if $V(A_{j,l}) = 0.1$, $V(B_{j,l}) = 0.09$, $V(C_{j,l}) = 0.15$, and $V(D_{j,l}) = 0.02$, then $wl\_type_{j,l} = (C_{j,l}\ A_{j,l}\ B_{j,l}\ D_{j,l})$. Then, we define the complementary workload type $cwl\_type_{j,l}$ of $v_j$ in $t_l$ as the reverse sequence of $wl\_type_{j,l}$.
For the overload type of $h_i$, we select as few VMs as possible to migrate so that the values of $A_{i,l}$, $B_{i,l}$, $C_{i,l}$, and $D_{i,l}$ all become less than or equal to 1. For example, if the overload type of $h_i$ is $(D\ A)$, then VMs with the workload type $(D\ A)$ should be selected for migration as far as possible. Therefore, we need to use $(D\ A)$ as the reference to sort the VMs on $h_i$.
Given the marks $A$, $B$, $C$, and $D$, the corresponding values for host $h_i$ and VM $v_j$ in $t_l$ are $V(A_{i,l}) = r_{i,l}^{cpu} - TH_{up}^{cpu} \times c_i^{cpu}$, $V(B_{i,l}) = r_{i,l}^{mem} - TH_{up}^{mem} \times c_i^{mem}$, $V(C_{i,l}) = r_{i,l}^{disk} - TH_{up}^{disk} \times c_i^{disk}$, $V(D_{i,l}) = r_{i,l}^{net} - TH_{up}^{net} \times c_i^{net}$, $V(A_{j,l}) = s_{j,l}^{cpu}$, $V(B_{j,l}) = s_{j,l}^{mem}$, $V(C_{j,l}) = s_{j,l}^{disk}$, and $V(D_{j,l}) = s_{j,l}^{net}$. Given the overload reference $RF = (E_1 \ldots E_p)$, where $E_g \in \{A, B, C, D\}$ and the order of $E_1 \ldots E_p$ corresponds to that of $ol\_type_{i,l}$, the resource usage of $v_j$ based on $RF$ is denoted as $RU_j = (e_{j,1} \ldots e_{j,p})$, where $e_{j,g} \in \{A, B, C, D\}$ and the order of $e_{j,1} \ldots e_{j,p}$ also corresponds to that of $ol\_type_{i,l}$.
For two different VMs $v_j$ and $v_{j'}$, if $v_j$ is better than $v_{j'}$ based on $RF$, denoted as $v_j > v_{j'}$, then the following is met:

$\sum_{g=1}^{p} \alpha_g^{RF} \times |V(E_g) - V(e_{j,g})| > \sum_{g=1}^{p} \alpha_g^{RF} \times |V(E_g) - V(e_{j',g})|,$ (51)

where $\alpha_g^{RF}$ is the overload reference weight. Resources with higher overload levels should be prioritized when reducing the load; hence, the value of the corresponding $\alpha_g^{RF}$ should be larger. We let the value of $\alpha_g^{RF}$ equal the value of the resource mark in $ol\_type_{i,l}$ that corresponds to the $E_g$ with the same subscript $g$. For instance, assuming $ol\_type_{i,l} = (C_{i,l}\ A_{i,l})$, then $RF = (E_1\ E_2) = (C_{i,l}\ A_{i,l})$, $\alpha_1^{RF} = V(C_{i,l})$, and $\alpha_2^{RF} = V(A_{i,l})$.

If $v_j$ is equivalent to $v_{j'}$ based on $RF$, denoted as $v_j = v_{j'}$, then the following is met:

$\sum_{g=1}^{p} \alpha_g^{RF} \times |V(E_g) - V(e_{j,g})| = \sum_{g=1}^{p} \alpha_g^{RF} \times |V(E_g) - V(e_{j',g})|.$ (52)

If $v_j$ is worse than $v_{j'}$ based on $RF$, denoted as $v_j < v_{j'}$, then the following is met:

$\sum_{g=1}^{p} \alpha_g^{RF} \times |V(E_g) - V(e_{j,g})| < \sum_{g=1}^{p} \alpha_g^{RF} \times |V(E_g) - V(e_{j',g})|.$ (53)
According to the above relationships between VMs based on $RF$, we sort the VMs on $h_i$ from 'good' to 'bad' into an ordered list. Subsequently, the VMs are sequentially taken from the ordered list and placed in the list of VMs to be migrated. After each selection, we determine whether $h_i$ would still be overloaded after removing the resources required by the selected VM. If it would be, we continue to select VMs in order; otherwise, we stop the VM selection.
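A sketch of this selection loop is given below, reusing the is_overloaded predicate from the ARP-FT sketch; the weighted key implements the ordering of Equations (51)-(53). Helper names and the data layout are illustrative.

def vm_key(vm_value, host_value, weights, rf):
    """Weighted distance of Equations (51)-(53) for one VM against the
    overload reference RF; a larger value means 'better' (selected earlier)."""
    return sum(weights[r] * abs(host_value[r] - vm_value[r]) for r in rf)

def select_vms(host, host_vms, k, th_up, host_value, weights, rf):
    """Pick as few VMs as possible so the host leaves the overloaded state."""
    ordered = sorted(
        host_vms,
        key=lambda vm: vm_key(vm.usage[k], host_value, weights, rf),
        reverse=True,                        # from 'good' to 'bad'
    )
    selected, remaining = [], list(ordered)
    for vm in ordered:
        # Stop once the host (current and predicted segment) is no longer
        # overloaded after removing the already-selected VMs.
        if not is_overloaded(host, remaining, k, th_up):
            break
        remaining.remove(vm)
        selected.append(vm)
    return selected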

4.4. VM Placement

In order to make full use of a host's resources, we should fully consider the competition of the various VMs for resources when placing them on the same host. This competition has two aspects: the space and the time of resource usage.
In a given period of time, competition should be evenly distributed over the different resources, rather than having multiple VMs scramble for the same one or two resources while leaving the others idle. We illustrate this with a simple example. Consider the case where there are only two resources, CPU and memory. The CPU and memory provided by host H are both 10 units, and the VMs to be placed are V1–V7. The CPU and memory requirements $(s^{cpu}, s^{mem})$ of these VMs are $(2.40, 1.16)$, $(2.46, 1.11)$, $(2.46, 0.97)$, $(0.73, 2.72)$, $(1.05, 2.13)$, $(0.87, 1.91)$, and $(2.65, 0.97)$, respectively. Figure 8 demonstrates two methods of VM placement, shown in Figure 8a,b, respectively. Obviously, the placement in Figure 8a is more reasonable because it allows the VMs to make full use of both resources of host H. In Figure 8b, the VMs' scramble for resources is concentrated on the CPU, so the utilization rate of H's CPU reaches its highest point while the utilization rate of its memory does not exceed half, resulting in waste.
Across two adjacent time periods, the users' business needs change, and the usage of certain resources of the host may change greatly. When we implement VM placement in $t_k$, we must consider the resource usage of the VMs in $t_{k+1}$ to avoid the destination host becoming overloaded in $t_{k+1}$ after migration. We illustrate this with a simple example. For convenience, only the CPU is considered. During time period $t_k$, the CPU usage of host H is 30%; for $t_{k+1}$, the predicted CPU usage of host H is 60%. The VMs to be placed are V1–V5. Their required CPU usage rates in $t_k$ and $t_{k+1}$, $\{s_k, s_{k+1}\}$, are $\{30\%, 15\%\}$, $\{20\%, 10\%\}$, $\{15\%, 10\%\}$, $\{20\%, 30\%\}$, and $\{10\%, 5\%\}$, respectively. Figure 9 demonstrates two methods of VM placement, shown in Figure 9a,b, respectively. Obviously, the placement in Figure 9a is more reasonable: in the two consecutive time periods, H's CPU load does not exceed the upper limit. The VM placement in Figure 9b lets H host more VMs in $t_k$; however, in $t_{k+1}$, the CPU load of H exceeds the upper limit, leaving H in the overloaded state and causing SLAV, so VM migration must be used to reduce the load. The placement in Figure 9b thus increases the operating cost of the CDC.
Based on this competition in resource space and time, we design a resource-complementary-usage-based VM placement strategy. We want to achieve two goals: (1) ensure that the resources of the target host can be fully utilized during $t_k$; and (2) after placing the VMs on a given host $h_i$, the host will not be overloaded in $t_{k+1}$.
Given a host $h_i$, we define its workload type $WL\_type_{i,k}$ in $t_k$. Let its CPU, memory, disk, and network usage marks be $AW$, $BW$, $CW$, and $DW$, respectively. In time segment $t_k$, they are denoted as $AW_{i,k}$, $BW_{i,k}$, $CW_{i,k}$, and $DW_{i,k}$, respectively, and their corresponding values are $V(AW_{i,k}) = \frac{r_{i,k}^{cpu}}{c_{max}^{cpu}}$, $V(BW_{i,k}) = \frac{r_{i,k}^{mem}}{c_{max}^{mem}}$, $V(CW_{i,k}) = \frac{r_{i,k}^{disk}}{c_{max}^{disk}}$, and $V(DW_{i,k}) = \frac{r_{i,k}^{net}}{c_{max}^{net}}$. Let the initial workload type sequence of host $h_i$ be $(AW_{i,k}\ BW_{i,k}\ CW_{i,k}\ DW_{i,k})$. We sort the initial sequence in ascending order of the resource usage mark values, and the ordered workload type sequence is $WL\_type_{i,k}$. For example, if $V(AW_{i,k}) = 0.1$, $V(BW_{i,k}) = 0.09$, $V(CW_{i,k}) = 0.15$, and $V(DW_{i,k}) = 0.02$, then $WL\_type_{i,k} = (DW_{i,k}\ BW_{i,k}\ AW_{i,k}\ CW_{i,k})$. Then, we define the complementary workload type $CWL\_type_{i,k}$ of $h_i$ in $t_k$ as the reverse sequence of $WL\_type_{i,k}$. In the production environment, the power consumption of the four resources in a host, sorted from largest to smallest, is CPU, memory, disk, and network. Therefore, in a host workload type sequence, if two resource usages are equal, these two resources are sorted according to the order of their resource power consumption levels. Hence, there are $4! = 24$ workload types.
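A sketch of this workload-type computation is shown below, including the power-consumption tie-break just described; the complementary type is simply the reversed sequence. Names are illustrative.

ORDER = ("cpu", "mem", "disk", "net")   # power-consumption tie-break order

def workload_type(remaining, c_max):
    """WL_type_{i,k}: resource names sorted ascending by V(.) = r / c_max;
    ties fall back to the CPU > mem > disk > net power ordering."""
    marks = {r: remaining[r] / c_max[r] for r in ORDER}
    return tuple(sorted(ORDER, key=lambda r: (marks[r], ORDER.index(r))))

def complementary_type(wl_type):
    """CWL_type_{i,k}: the reverse of the workload type sequence."""
    return tuple(reversed(wl_type))

# Example from the text: V(AW)=0.1, V(BW)=0.09, V(CW)=0.15, V(DW)=0.02
# workload_type -> ('net', 'mem', 'cpu', 'disk'), i.e., (DW BW AW CW),
# complementary_type -> ('disk', 'cpu', 'mem', 'net'), i.e., (CW AW BW DW).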
We divide all non-overloaded hosts (including the underloaded ones) into 24 groups by workload type. Then, the 24 groups of hosts are arranged in descending order of the number of hosts they contain to form a list:

$HG = \{WLT\_Host\_List_1, WLT\_Host\_List_2, \ldots, WLT\_Host\_List_{24}\}.$ (54)
Next, we sort the hosts within each $WLT\_Host\_List_i$. We compare two hosts of the same workload type as follows. Given the workload type $(W_{i,1}\ W_{i,2}\ W_{i,3}\ W_{i,4})$, where $W_{i,p} \in \{AW, BW, CW, DW\}$, consider two different hosts $h_i$ and $h_{i'}$ of the same workload type:

If $h_i$ is better than $h_{i'}$, denoted as $h_i > h_{i'}$, then the following is met:

$\sum_{p=1}^{4} V(W_{i,p}) > \sum_{p=1}^{4} V(W_{i',p}).$ (55)

If $h_i$ is equivalent to $h_{i'}$, denoted as $h_i = h_{i'}$, then the following is met:

$\sum_{p=1}^{4} V(W_{i,p}) = \sum_{p=1}^{4} V(W_{i',p}).$ (56)

If $h_i$ is worse than $h_{i'}$, denoted as $h_i < h_{i'}$, then the following is met:

$\sum_{p=1}^{4} V(W_{i,p}) < \sum_{p=1}^{4} V(W_{i',p}).$ (57)

The hosts in each group are sorted in descending order according to the above comparison to form an ordered group. All ordered groups are denoted as:

$\{WLT\_Host\_List\_sorted_1, WLT\_Host\_List\_sorted_2, \ldots, WLT\_Host\_List\_sorted_{24}\}.$ (58)
For each host of a given workload type, in order to take full advantage of all its resources, it is best to pick and place VMs whose complementary workload type is the same as the host's workload type. For instance, if the host has $WL\_type_{i,k} = (DW\ BW\ CW\ AW)$, then $CWL\_type_{i,k} = (AW\ CW\ BW\ DW)$, and VMs of the complementary workload type $(A\ C\ B\ D)$ should be selected and placed on this host. Hence, we divide all VMs to be migrated into 24 groups according to their complementary workload types, denoted as:

$VG = \{cwlt\_VM\_List_1, cwlt\_VM\_List_2, \ldots, cwlt\_VM\_List_{24}\}.$ (59)
Next, we sort the VMs within each group. We compare two VMs of the same complementary workload type as follows. Given the workload type $(W_{j,1}\ W_{j,2}\ W_{j,3}\ W_{j,4})$, where $W_{j,p} \in \{A, B, C, D\}$, consider two different VMs $v_j$ and $v_{j'}$ of the same complementary workload type:

If $v_j$ is better than $v_{j'}$, denoted as $v_j > v_{j'}$, then the following is met:

$\sum_{p=1}^{4} V(W_{j,p}) > \sum_{p=1}^{4} V(W_{j',p}).$ (60)

If $v_j$ is equivalent to $v_{j'}$, denoted as $v_j = v_{j'}$, then the following is met:

$\sum_{p=1}^{4} V(W_{j,p}) = \sum_{p=1}^{4} V(W_{j',p}).$ (61)

If $v_j$ is worse than $v_{j'}$, denoted as $v_j < v_{j'}$, then the following is met:

$\sum_{p=1}^{4} V(W_{j,p}) < \sum_{p=1}^{4} V(W_{j',p}).$ (62)

The VMs in each group are sorted in descending order according to the above comparison to form an ordered group. All ordered groups are denoted as:

$VG' = \{cwlt\_VM\_List\_sorted_1, cwlt\_VM\_List\_sorted_2, \ldots, cwlt\_VM\_List\_sorted_{24}\}.$ (63)
We select a host group $WLT\_Host\_List\_sorted_x$ from $HG$ in sequential order and obtain the VM group $cwlt\_VM\_List\_sorted_y$ whose complementary workload type matches the hosts' workload type. We then take a host from $WLT\_Host\_List\_sorted_x$ in sequence and take VMs from $cwlt\_VM\_List\_sorted_y$ in sequence, placing them on the host. At each step, we need to judge whether the host would be overloaded, in both the current time period and the next time period, after the VM is placed. If not, the VM can be placed on this host and is removed from $cwlt\_VM\_List\_sorted_y$; otherwise, we skip to the next VM in order.
After traversing $cwlt\_VM\_List\_sorted_y$ for the current host, we process the next host in order. The above algorithm is named the Time and Space Complementary VM Placement algorithm (TSCP algorithm), and its pseudo-code is shown in Algorithm 1.
Algorithm 1 TSCP algorithm.
Input: hostlist, vmlist
Output: allocation of the VMs
1:  $HG$ ← get_HG(hostlist)
2:  for each $WLT\_Host\_List_i$ in $HG$ do
3:    $WLT\_Host\_List\_sorted_i$ ← get_sorted($WLT\_Host\_List_i$)
4:  end for
5:  $VG$ ← get_VG(vmlist)
6:  $VG'$ ← get_VG'($VG$)
7:  for each $cwlt\_VM\_List_i$ in $VG$ do
8:    $cwlt\_VM\_List\_sorted_i$ ← get_sorted($cwlt\_VM\_List_i$)
9:  end for
10: for each $WLT\_Host\_List\_sorted_i$ do
11:   $cwlt\_VM\_List\_sorted_{same}$ ← FindSameWLT($WLT\_Host\_List\_sorted_i$, $VG'$)
12:   for each HOST in $WLT\_Host\_List\_sorted_i$ do
13:     for each VM in $cwlt\_VM\_List\_sorted_{same}$ do
14:       if HOST is not overloaded when hosting VM in $t_k$ and $t_{k+1}$ then
15:         allocation.add(VM, HOST)
16:         remove VM from $cwlt\_VM\_List\_sorted_{same}$
17:       end if
18:     end for
19:   end for
20: end for
21: return allocation
After all the above operations are completed, if some VMs have still not been migrated, we perform the following steps:
(Step 1) Implement the TSCP algorithm again, but skip the steps of VM grouping and sorting;
(Step 2) If some VMs have still not been migrated, implement the First-Fit VM placement algorithm for these VMs;
(Step 3) If some VMs have still not been migrated, it indicates that the resources of all working hosts are insufficient, and hosts in energy-saving mode should be booted up.
After this, we perform underloaded host detection. If there are underloaded hosts, we leverage the TSCP algorithm to place all the VMs running on them.

5. Performance Evaluation

In this section, we evaluate the performance of our proposed solution with a real VM workload trace-driven simulation.

5.1. Experiment Setup

Based on the energy consumption analysis and statistics of each host component in the data center by Minartz et al. [51] and Jin et al. [52], we simulated hosts of three sizes, large, medium, and small: $H_{large}$, $H_{medium}$, and $H_{small}$, respectively. Their resource parameters are shown in Table 1, and their power parameters are shown in Table 2. We simulated a CDC consisting of 200 $H_{small}$ hosts, 100 $H_{medium}$ hosts, and 50 $H_{large}$ hosts.
The VM workload trace dataset is provided by the Alibaba CDC [49]. The dataset contains workload traces of 4000 VMs over 8 consecutive days, sampled every five minutes. Each trace records the CPU usage, memory usage, disk I/O, and network throughput of a VM at a sampling moment. We selected the one-day traces of 1000 VMs (288 time segments in total) from the dataset to simulate users' business demands for VMs. The simulation is implemented on the CDC simulator CloudMatrix Lite [53], and the CAE-based filter and the ARP algorithm are implemented with PyTorch [54].
We set the electricity price to $EP = \$0.25$/kWh and the SLAV penalty to a static price $pun^{cpu} = pun^{mem} = pun^{disk} = pun^{net} = \$0.01$. A VM migration consumes 10% extra resources; hence, each host reserves 10% of its resources for migrations. Based on this, we set $TH_{up}^{cpu} = TH_{up}^{mem} = TH_{up}^{disk} = TH_{up}^{net} = 0.9$ and $TH_{down}^{cpu} = TH_{down}^{mem} = TH_{down}^{disk} = TH_{down}^{net} = 0.1$.
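The simulation parameters above can be gathered into a single configuration sketch; the names below are a convenience of this illustration, and CloudMatrix Lite's actual configuration interface may differ.
```python
# Experiment configuration as stated in the text (illustrative layout).
RESOURCES = ("cpu", "mem", "disk", "net")

CONFIG = {
    "electricity_price": 0.25,                     # EP, $/kWh
    "slav_penalty": {r: 0.01 for r in RESOURCES},  # pun_*, $ per violation
    "migration_reserve": 0.10,                     # 10% reserved for migrations
    "th_up": {r: 0.9 for r in RESOURCES},          # TH_up thresholds
    "th_down": {r: 0.1 for r in RESOURCES},        # TH_down thresholds
    "hosts": {"small": 200, "medium": 100, "large": 50},
    "vms": 1000,
    "periods": 288,                                # one day, 5-minute samples
}
```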

5.2. Evaluation

In the evaluation, three overload detection algorithms (MAD [6], IQR [6], and LR [6]), three VM selection algorithms (MMT [6,31,39], MC [6,55], and RS [6]), and one VM placement algorithm (PABFD [6]) are combined into nine methods to compare with our proposed solution. The PABFD placement algorithm and its corresponding energy consumption model only take the host's CPU into account; we therefore modify it to suit our multi-resource scenario. The modified algorithm, PABFDM, is shown as pseudo-code in Algorithm 2.
Algorithm 2 PABFDM algorithm.
Input: hostList, vmList
Output: allocation of the VMs

vmList.sortDecreasingUtilization()
for each VM in vmList do
    minPower ← MAX
    allocatedHost ← NULL
    for each host in hostList do
        if no SLAV on this host and this host meets the CPU, memory, disk, and network resource requirements for VM then
            power ← estimatePower(host, VM)
            if power < minPower then
                minPower ← power
                allocatedHost ← host
            end if
        end if
    end for
    if allocatedHost ≠ NULL then
        allocation.add(VM, allocatedHost)
    end if
end for
return allocation
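A Python sketch of PABFDM follows; `meets_demand` and `estimate_power` are assumed callbacks standing in for the SLAV/resource check and estimatePower in Algorithm 2.
```python
# A minimal sketch of the PABFDM placement (power-aware best-fit
# decreasing over four resources); callback signatures are assumptions.

def pabfdm(vm_list, host_list, meets_demand, estimate_power):
    allocation = []
    # Best-fit decreasing: consider VMs in order of decreasing utilization.
    for vm in sorted(vm_list, key=lambda v: v["util"], reverse=True):
        best_host, min_power = None, float("inf")
        for host in host_list:
            # The host must satisfy all four resource demands without SLAV.
            if meets_demand(host, vm):
                power = estimate_power(host, vm)
                if power < min_power:
                    best_host, min_power = host, power
        if best_host is not None:
            allocation.append((vm, best_host))
    return allocation
```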
The naming rule of the baseline methods is 'xx-xx'. For instance, LR-MMT indicates that the combination uses LR for overload detection and MMT for VM selection. We denote our proposed method as ARP-TSCP.
The metrics involved are host-operation-related costs, SLAV cost, and the number of VM migrations. Since the CPU part of VM migration energy consumption is already counted within the hosts' energy consumption, we use the number of VM migrations to measure the migration cost indirectly.
Figure 10 compares the energy consumption of all hosts under ARP-TSCP and the nine baseline methods over 288 consecutive periods, and Figure 11 compares the total energy consumption of all hosts when these ten methods are used to consolidate servers throughout the day. As Figure 10 shows, in most periods the host energy consumption produced by ARP-TSCP is lower than that produced by the baseline methods; consequently, ARP-TSCP also outperforms the baselines in total host energy consumption. As shown in Figure 11, the total host energy consumption generated by executing ARP-TSCP over one day is about 18.5% lower than that of LR-MMT (the best result of all baseline methods) and about 30.3% lower than that of MAD-RS (the worst result of all baseline methods).
Figure 12 compares the total SLAV of CPU, memory, disk, and network generated by ARP-TSCP and the baseline methods in one day. As can be seen from Figure 12, ARP-TSCP clearly improves on all baseline methods in reducing CPU and memory SLAV, which indicates that predicting workloads and eliminating potential overload states in advance is a feasible approach. In particular, the memory SLAV caused by ARP-TSCP is much smaller than that of the other methods. This is because the baseline methods are essentially designed around the CPU resource alone: their host overload detection strategies consider only CPU overload, and the PABFD algorithm assumes the host can provide unlimited memory to VMs. In an actual production environment, the competition among VMs for host memory is fierce, resulting in frequent memory SLAV. Most of the VMs in the Alibaba workload dataset are not disk-I/O intensive and demand mainly CPU and memory resources [49]. In addition, a NIC throughput of 1 GB/s can satisfy most users' business needs most of the time. Therefore, for both disk and network, the SLAV caused by all methods is lower than that for CPU and memory.
As shown in Equations (22)–(25), the SLAV calculation involves the entire life cycle of a host, so we do not show the SLAV generated in each individual time period. In Figure 13, we compare the total SLAV penalty cost of all methods over a full day. With $pun^{cpu} = pun^{mem} = pun^{disk} = pun^{net}$, the SLAV penalty cost caused by ARP-TSCP is about 38% lower than that of LR-MMT (the best result of all baseline methods) and about 52% lower than that of MAD-RS (the worst result of all baseline methods). Given the importance of CPU and memory in practice, as well as their larger share of host power consumption, if $pun^{cpu}$ and $pun^{mem}$ were set higher than $pun^{disk}$ and $pun^{net}$, the SLAV penalty cost advantage of ARP-TSCP would be even more pronounced.
Figure 14 compares the number of VM migrations triggered by all methods in each of the 288 consecutive periods, and Figure 15 compares the total number of VM migrations caused by the ten methods throughout the day. As Figure 14 shows, the number of migrations induced by ARP-TSCP in a given period is sometimes higher and sometimes lower than that induced by the baseline methods. Accordingly, as shown in Figure 15, the total number of migrations induced by ARP-TSCP holds no particular advantage over that of the baseline methods over the whole day; the two are very close.
In each period, ARP-TSCP and the baseline methods trigger VM migrations for different reasons. ARP-TSCP proactively anticipates host overload and SLAV occurrence in the next period and initiates migrations in advance, whereas the baseline methods passively respond to host overload detected in the current period. Although the total numbers of VM migrations are very close, the triggering mechanisms differ, and this difference shows in the final result: the SLAV caused by ARP-TSCP is much smaller than that caused by LR-MMT, as shown in Figure 13.
Compared with the baseline methods, our proposed ARP-TSCP achieves improvements in host energy consumption, SLAV occurrence, and VM migration cost alike, so it also outperforms the other methods in the final total cost comparison. As shown in Figure 16, ARP-TSCP reduces the CDC operating cost by about 26.1% compared to LR-MMT (the best result of all baseline methods) and by about 39.3% compared to MAD-RS (the worst result of all baseline methods).

6. Conclusions

In this paper, we focus on reducing CDC operating costs while ensuring users' SLA requirements. We first established a cost model based on multiple computing resources in the CDC, taking into account the host energy cost, VM migration cost, and SLAV penalty cost, and defined the MCRC problem in server consolidation on this basis. We then devised a step-by-step solution to this problem. A convolutional autoencoder-based filter preprocesses the VM workload records to reduce sampling noise. An attention-based RNN method then predicts the VMs' computing resource usage in the coming period, allowing us to trigger VM migration before a host enters the overloaded state and thereby reduce the occurrence of SLAV. We designed a heuristic algorithm based on the complementary use of multiple resources in space and time, ARP-TSCP, to solve the VM placement problem. Finally, simulations driven by real VM workload traces verify the effectiveness of our proposed method: compared with the existing server consolidation methods, ARP-TSCP reduces host energy consumption by 18.5~30.3%, SLAV cost by 38~52%, and total cost by 26.1~39.3%.
In the future, we will consider a more comprehensive cost model that accounts for multi-core CPUs, the network topology of the CDC, the use of network equipment, and the cooling system. We will also design a forecasting method that predicts the resource usage of equipment over multiple future time periods, thereby further reducing the generation of SLAV.

Author Contributions

Conceptualization, H.L., Y.X. and Y.S.; methodology, H.L. and Y.X.; software, H.L.; validation, H.L. and Y.S.; formal analysis, H.L.; investigation, H.L. and H.X.; resources, H.L. and H.X.; data curation, H.L. and H.X.; writing—original draft preparation, H.L.; writing—review and editing, Y.X. and Y.S.; visualization, H.L.; supervision, Y.X.; project administration, Y.S.; funding acquisition, H.L., Y.X., H.X. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62002067), the Guangzhou Youth Talent Program (QT20220101174), the Department of Education of Guangdong Province (No. 2020KTSCX039), the SRP of Guangdong Education Dept (2019KZDZX1031), and the Natural Science Foundation of Education of Guizhou Province (No. KY[2017]351).

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. The State of Remote Work 2021. Available online: https://globalworkplaceanalytics.com/whitepapers (accessed on 8 August 2022).
  2. McKinsey Consumer Pulse. Available online: https://www.mckinsey.com/business-functions/growth-marketing-and-sales/our-insights/global-surveys-of-consumer-sentiment-during-the-coronavirus-crisis (accessed on 8 August 2022).
  3. De’, R.; Pandey, N.; Pal, A. Impact of digital surge during Covid-19 pandemic: A viewpoint on research and practice. Int. J. Inf. Manag. 2020, 55, 102171. [Google Scholar]
  4. Branscombe, M. The network impact of the global COVID-19 pandemic. New Stack 2020, 14. Available online: https://thenewstack.io/the-network-impact-of-the-global-covid-19-pandemic/ (accessed on 16 September 2022).
  5. Salesforce Increases Data Center Spend in 2021/22. Available online: https://www.datacenterdynamics.com/en/news/salesforce-increases-data-center-spend-in-202122/ (accessed on 8 August 2022).
  6. Beloglazov, A.; Buyya, R. Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers. Concurr. Comput. Pract. Exp. 2012, 24, 1397–1420. [Google Scholar] [CrossRef]
  7. Aljoumah, E.; Al-Mousawi, F.; Ahmad, I.; Al-Shammri, M.; Al-Jady, Z. SLA in cloud computing architectures: A comprehensive study. Int. J. Grid Distrib. Comput. 2015, 8, 7–32. [Google Scholar] [CrossRef]
  8. Dhiman, G.; Mihic, K.; Rosing, T. A system for online power prediction in virtualized environments using gaussian mixture models. In Proceedings of the 47th Design Automation Conference, Anaheim, CA, USA, 13–18 June 2010; pp. 807–812. [Google Scholar]
  9. Ham, S.; Kim, M.; Choi, B.; Jeong, J. Simplified server model to simulate data center cooling energy consumption. Energy Build. 2015, 86, 328–339. [Google Scholar] [CrossRef]
  10. Kavanagh, R.; Djemame, K. Rapid and accurate energy models through calibration with IPMI and RAPL. Concurr. Comput. Pract. Exp. 2019, 31, e5124. [Google Scholar] [CrossRef]
  11. Gupta, V.; Nathuji, R.; Schwan, K. An analysis of power reduction in datacenters using heterogeneous chip multiprocessors. ACM Sigmetr. Perform. Eval. Rev. 2011, 39, 87–91. [Google Scholar] [CrossRef]
  12. Lefurgy, C.; Wang, X.; Ware, M. Server-level power control. In Proceedings of the Fourth International Conference on Autonomic Computing (ICAC’07), Jacksonville, FL, USA, 11–15 June 2007; p. 4. [Google Scholar]
  13. Beloglazov, A.; Abawajy, J.; Buyya, R. Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing. Future Gener. Comput. Syst. 2012, 28, 755–768. [Google Scholar] [CrossRef]
  14. Rezaei-Mayahi, M.; Rezazad, M.; Sarbazi-Azad, H. Temperature-aware power consumption modeling in Hyperscale cloud data centers. Future Gener. Comput. Syst. 2019, 94, 130–139. [Google Scholar] [CrossRef]
  15. Chen, Y.; Das, A.; Qin, W.; Sivasubramaniam, A.; Wang, Q.; Gautam, N. Managing server energy and operational costs in hosting centers. In Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Banff, AB, Canada, 6–10 June 2005; pp. 303–314. [Google Scholar]
  16. Wu, W.; Lin, W.; Peng, Z. An intelligent power consumption model for virtual machines under CPU-intensive workload in cloud environment. Soft Comput. 2017, 21, 5755–5764. [Google Scholar] [CrossRef]
  17. Lien, C.; Bai, Y.; Lin, M. Estimation by software for the power consumption of streaming-media servers. IEEE Trans. Instrum. Meas. 2007, 56, 1859–1870. [Google Scholar] [CrossRef]
  18. Economou, D.; Rivoire, S.; Kozyrakis, C.; Ranganathan, P. Full-system power analysis and modeling for server environments. In Proceedings of the International Symposium on Computer Architecture, Ouro Preto, Brazil, 17–20 October 2006. [Google Scholar]
  19. Alan, I.; Arslan, E.; Kosar, T. Energy-aware data transfer tuning. In Proceedings of the 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Chicago, IL, USA, 26–29 May 2014; pp. 626–634. [Google Scholar]
  20. Li, Y.; Wang, Y.; Yin, B.; Guan, L. An online power metering model for cloud environment. In Proceedings of the 2012 IEEE 11th International Symposium on Network Computing and Applications, Cambridge, MA, USA, 23–25 August 2012; pp. 175–180. [Google Scholar]
  21. Lent, R. A model for network server performance and power consumption. Sustain. Comput. Inform. Syst. 2013, 3, 80–93. [Google Scholar] [CrossRef]
  22. Kansal, A.; Zhao, F.; Liu, J.; Kothari, N.; Bhattacharya, A. Virtual machine power metering and provisioning. In Proceedings of the 1st ACM Symposium on Cloud Computing, Indianapolis, IN, USA, 10–11 June 2010; pp. 39–50. [Google Scholar]
  23. Lin, W.; Wang, W.; Wu, W.; Pang, X.; Liu, B.; Zhang, Y. A heuristic task scheduling algorithm based on server power efficiency model in cloud environments. Sustain. Comput. Inform. Syst. 2018, 20, 56–65. [Google Scholar] [CrossRef]
  24. Lin, W.; Wang, H.; Zhang, Y.; Qi, D.; Wang, J.; Chang, V. A cloud server energy consumption measurement system for heterogeneous cloud environments. Inf. Sci. 2018, 468, 47–62. [Google Scholar] [CrossRef]
  25. Maziku, H.; Shetty, S. Towards a network aware VM migration: Evaluating the cost of VM migration in cloud data centers. In Proceedings of the 2014 IEEE 3rd International Conference on Cloud Networking (CloudNet), Luxembourg, 8–10 October 2014; pp. 114–119. [Google Scholar]
  26. Dargie, W. Estimation of the cost of VM migration. In Proceedings of the 2014 23rd International Conference on Computer Communication and Networks (ICCCN), Shanghai, China, 4–7 August 2014; pp. 1–8. [Google Scholar]
  27. Li, H.; Li, W.; Wang, H.; Wang, J. An optimization of virtual machine selection and placement by using memory content similarity for server consolidation in cloud. Future Gener. Comput. Syst. 2018, 84, 98–107. [Google Scholar] [CrossRef]
  28. Li, H.; Li, W.; Zhang, S.; Wang, H.; Pan, Y.; Wang, J. Page-sharing-based virtual machine packing with multi-resource constraints to reduce network traffic in migration for clouds. Future Gener. Comput. Syst. 2019, 96, 462–471. [Google Scholar] [CrossRef]
  29. Li, H.; Li, W.; Feng, Q.; Zhang, S.; Wang, H.; Wang, J. Leveraging content similarity among vmi files to allocate virtual machines in cloud. Future Gener. Comput. Syst. 2018, 79, 528–542. [Google Scholar] [CrossRef]
  30. Li, H.; Wang, S.; Ruan, C. A fast approach of provisioning virtual machines by using image content similarity in cloud. IEEE Access 2019, 7, 45099–45109. [Google Scholar] [CrossRef]
  31. Yadav, R.; Zhang, W.; Kaiwartya, O.; Singh, P.; Elgendy, I.; Tian, Y. Adaptive energy-aware algorithms for minimizing energy consumption and SLA violation in cloud computing. IEEE Access 2018, 6, 55923–55936. [Google Scholar] [CrossRef]
  32. Hieu, N.; Di Francesco, M.; Ylä-Jääski, A. Virtual machine consolidation with multiple usage prediction for energy-efficient cloud data centers. IEEE Trans. Serv. Comput. 2017, 13, 186–199. [Google Scholar] [CrossRef]
  33. Esfandiarpoor, S.; Pahlavan, A.; Goudarzi, M. Structure-aware online virtual machine consolidation for datacenter energy improvement in cloud computing. Comput. Electr. Eng. 2015, 42, 74–89. [Google Scholar] [CrossRef]
  34. Arianyan, E.; Taheri, H.; Sharifian, S. Novel energy and SLA efficient resource management heuristics for consolidation of virtual machines in cloud data centers. Comput. Electr. Eng. 2015, 47, 222–240. [Google Scholar] [CrossRef]
  35. Rodero, I.; Viswanathan, H.; Lee, E.; Gamell, M.; Pompili, D.; Parashar, M. Energy-efficient thermal-aware autonomic management of virtualized HPC cloud infrastructure. J. Grid Comput. 2012, 10, 447–473. [Google Scholar] [CrossRef]
  36. Guan, H.; Yao, J.; Qi, Z.; Wang, R. Energy-efficient SLA guarantees for virtualized GPU in cloud gaming. IEEE Trans. Parallel Distrib. Syst. 2014, 26, 2434–2443. [Google Scholar] [CrossRef]
  37. Sahoo, P.; Mohapatra, S.; Wu, S. SLA based healthcare big data analysis and computing in cloud network. J. Parallel Distrib. Comput. 2018, 119, 121–135. [Google Scholar] [CrossRef]
  38. Sun, C.; Bi, J.; Zheng, Z.; Hu, H. SLA-NFV: An SLA-aware high performance framework for network function virtualization. In Proceedings of the 2016 ACM SIGCOMM Conference, Florianopolis, Brazil, 22–26 August 2016; pp. 581–582. [Google Scholar]
  39. Li, Z.; Yan, C.; Yu, L.; Yu, X. Energy-aware and multi-resource overload probability constraint-based virtual machine dynamic consolidation method. Future Gener. Comput. Syst. 2018, 80, 139–156. [Google Scholar] [CrossRef]
  40. Monshizadeh Naeen, H.; Zeinali, E.; Toroghi Haghighat, A. A stochastic process-based server consolidation approach for dynamic workloads in cloud data centers. J. Supercomput. 2020, 76, 1903–1930. [Google Scholar] [CrossRef]
  41. Sayadnavard, M.; Toroghi Haghighat, A.; Rahmani, A. A reliable energy-aware approach for dynamic virtual machine consolidation in cloud data centers. J. Supercomput. 2019, 75, 2126–2147. [Google Scholar] [CrossRef]
  42. Yuan, C.; Sun, X. Server consolidation based on culture multiple-ant-colony algorithm in cloud computing. Sensors 2019, 19, 2724. [Google Scholar] [CrossRef]
  43. Mamun, S.; Ganguly, A.; Markopoulos, P.; Kwon, M.; Kwasinski, A. NASCon: Network-Aware Server Consolidation for server-centric wireless datacenters. Sustain. Comput. Inform. Syst. 2021, 29, 100452. [Google Scholar] [CrossRef]
  44. Basmadjian, R.; Ali, N.; Niedermeier, F.; De Meer, H.; Giuliani, G. A methodology to predict the power consumption of servers in data centres. In Proceedings of the 2nd International Conference on Energy-efficient Computing and Networking, New York, NY, USA, 31 May–1 June 2011; pp. 1–10. [Google Scholar]
  45. Hsu, C.; Poole, S. Power signature analysis of the SPECpower_ssj2008 benchmark. In Proceedings of the (IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA, 10–12 April 2011; pp. 227–236. [Google Scholar]
  46. Karyakin, A.; Salem, K. An analysis of memory power consumption in database systems. In Proceedings of the 13th International Workshop on Data Management on New Hardware, Chicago, IL, USA, 14–19 May 2017; pp. 1–9. [Google Scholar]
  47. Garcia-Saavedra, A.; Serrano, P.; Banchs, A.; Bianchi, G. Energy consumption anatomy of 802.11 devices and its implication on modeling and design. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, Nice, France, 10–13 December 2012; pp. 169–180. [Google Scholar]
  48. Yin, C.; Zhang, S.; Wang, J.; Xiong, N. Anomaly detection based on convolutional recurrent autoencoder for IoT time series. IEEE Trans. Syst. Man, Cybern. Syst. 2020, 52, 112–122. [Google Scholar] [CrossRef]
  49. Lu, C.; Ye, K.; Xu, G.; Xu, C.; Bai, T. Imbalance in the cloud: An analysis on alibaba cluster trace. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 2884–2892. [Google Scholar]
  50. Xi, H.; Yan, C.; Li, H.; Xiao, Y. An Attention-based Recurrent Neural Network for Resource Usage Prediction in Cloud Data Center. J. Phys. Conf. Ser. 2021, 2006, 012007. [Google Scholar] [CrossRef]
  51. Minartz, T.; Kunkel, J.; Ludwig, T. Simulation of power consumption of energy efficient cluster hardware. Comput. Sci.-Res. Dev. 2010, 25, 165–175. [Google Scholar] [CrossRef]
  52. Jin, Y.; Wen, Y.; Chen, Q.; Zhu, Z. An empirical investigation of the impact of server virtualization on energy efficiency for green data center. Comput. J. 2013, 56, 977–990. [Google Scholar] [CrossRef]
  53. Li, H.; Xiao, Y. CloudMatrix Lite: A Real Trace Driven Lightweight Cloud Data Center Simulation Framework. In Proceedings of the 2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 23–25 October 2020; pp. 424–429. [Google Scholar]
  54. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  55. Cao, Z.; Dong, S. Dynamic VM consolidation for energy-aware and SLA violation reduction in cloud computing. In Proceedings of the 2012 13th International Conference on Parallel and Distributed Computing, Applications And Technologies, Beijing, China, 14–16 December 2012; pp. 363–369. [Google Scholar]
Figure 1. The neural network structure of the CAE-based filter.
Figure 2. Example of the CAE-based filter result.
Figure 3. The neural network structure of the encoder of ARP.
Figure 4. The neural network structure of the decoder of ARP.
Figure 5. Example of the ARP algorithm result (1).
Figure 6. Example of the ARP algorithm result (2).
Figure 7. Example of the ARP algorithm result (3).
Figure 8. Example of VM placement on resource space competition: (a) VM placement 1; (b) VM placement 2.
Figure 9. Example of VM placement on time competition: (a) VM placement 1; (b) VM placement 2.
Figure 10. Comparing the energy consumption of hosts by all methods in every time segment.
Figure 11. Comparing the total energy consumption of hosts by all methods.
Figure 12. Comparing the SLAV by all methods regarding four resources.
Figure 13. Comparing the total SLAV penalty cost by all methods.
Figure 14. Comparing the number of VM migrations triggered by all methods in every time segment.
Figure 15. Comparing the total number of VM migrations triggered by all methods.
Figure 16. Comparing the total cost by all methods.
Table 1. Resource parameters of the hosts.

| Host Type | CPU | Memory | Disk Throughput | Network Throughput |
|---|---|---|---|---|
| H_large | 4 × Intel Xeon Northwood CPU (single core) | 8 GB | 399 MB/s | 1 GB/s |
| H_medium | 3 × Intel Xeon Northwood CPU (single core) | 6 GB | 266 MB/s | 1 GB/s |
| H_small | 2 × Intel Xeon Northwood CPU (single core) | 4 GB | 133 MB/s | 1 GB/s |
Table 2. Power parameters of the hosts.

| Host Type | Value | CPU (kW) | Memory (kW) | Disk (kW) | NIC (kW) |
|---|---|---|---|---|---|
| H_large | p_peak | 0.232 | 0.21736 | 0.02106 | 0.002 |
| | p_idle | 0.1124 | 0.17576 | 0.01326 | 0.00078 |
| H_medium | p_peak | 0.174 | 0.10868 | 0.01404 | 0.002 |
| | p_idle | 0.0843 | 0.08788 | 0.00884 | 0.00078 |
| H_small | p_peak | 0.116 | 0.05434 | 0.00702 | 0.002 |
| | p_idle | 0.0562 | 0.04394 | 0.00442 | 0.00078 |
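For reference, the Table 2 parameters can drive a per-host power estimate. The linear interpolation between idle and peak power per component shown below is a common modeling choice and an assumption of this sketch; it may differ in detail from the energy model defined earlier in the paper.
```python
# Sketch: host power from the Table 2 parameters, assuming a linear
# utilization-proportional model per component. Values are the H_large
# row of Table 2, in kW.

PEAK = {"cpu": 0.232, "mem": 0.21736, "disk": 0.02106, "net": 0.002}
IDLE = {"cpu": 0.1124, "mem": 0.17576, "disk": 0.01326, "net": 0.00078}

def host_power_kw(util):
    # util: per-component utilization in [0, 1].
    return sum(IDLE[c] + (PEAK[c] - IDLE[c]) * util[c] for c in PEAK)

# Example: a half-loaded H_large host draws roughly 0.39 kW.
print(round(host_power_kw({"cpu": 0.5, "mem": 0.5, "disk": 0.5, "net": 0.5}), 4))
```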