Optimizing Service Placement for Microservice Architecture in Clouds

Abstract: As microservice architecture becomes more popular than ever, developers intend to transform traditional monolithic applications into service-based applications (composed of a number of services). To deploy a service-based application in clouds, besides the resource demands of each service, the traffic demands between collaborative services are crucial for the overall performance. Poor handling of the traffic demands can result in severe performance degradation, such as high response time and jitter. However, current cluster schedulers fail to place services on the best possible machine, since they only consider the resource constraints but ignore the traffic demands between services. To address this problem, we propose a new approach to optimize the placement of service-based applications in clouds. The approach first partitions the application into several parts while keeping the overall traffic between different parts to a minimum, and then carefully packs the different parts into machines with respect to their resource demands and traffic demands. We implement a prototype scheduler and evaluate it with extensive experiments on testbed clusters. The results show that our approach outperforms existing container cluster schedulers and representative heuristics, leading to much less overall inter-machine traffic.


Introduction
Microservice architecture is a fast-rising trend in application development: it enhances the flexibility to incorporate different technologies, reduces complexity by using lightweight and modular services, and improves the overall scalability and resilience of the system. By definition (Microservices: https://martinfowler.com/tags/microservices.html), the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating through lightweight mechanisms (e.g., an HTTP resource API). The application is then composed of a number of services (a service-based application) that work cohesively to provide complex functionality. Due to the advantages of microservice architecture, many developers intend to transform traditional monolithic applications into service-based applications. For instance, an online shopping application could be divided into a product service, a cart service, and an order service, which can greatly improve the productivity, agility, and resilience of the application. However, it also brings challenges. When deploying a service-based application in clouds, the scheduler has to carefully schedule each service, which may have diverse resource demands, on distributed compute clusters. Furthermore, the network communication between different services needs to be handled well, as the communication conditions significantly influence the quality of service (e.g., the response time of a service). Ensuring the desired performance of service-based applications, especially the network performance between the involved services, becomes increasingly important.
In general, service-based applications involve numerous distributed and complex services, which usually require more computing resources than a single machine can provide. Therefore, a cluster of networked machines or a cloud computing platform (e.g., Amazon EC2 (Amazon EC2: https://aws.amazon.com), Microsoft Azure (Microsoft Azure: https://azure.microsoft.com), or Google Cloud Platform (Google Cloud Platform: https://cloud.google.com)) is generally leveraged to run service-based applications. More importantly, containers are emerging as a disruptive technology for effectively encapsulating the runtime contexts of software components and services, which significantly improves the portability and efficiency of deploying applications in clouds. When deploying a service-based application in clouds, several essential aspects have to be taken into account. First, the services involved in the application often have diverse resource demands, such as CPU, memory, and disk. The underlying machine has to ensure sufficient resources to run each service and at the same time provide cohesive functionality. Efficient resource allocation to each service is difficult, and it becomes more challenging when the cluster consists of heterogeneous machines. Second, the services involved in the application often have traffic demands among them due to data communication, which require meticulous treatment. Poor handling of the traffic demands can result in severe performance degradation, as the response time of a service is directly affected by its traffic situation. Considering the traffic demands, an intuitive solution is to place the services that have large traffic demands among them on the same machine, which achieves intra-machine communication and reduces inter-machine traffic. However, such services cannot all be co-located on one machine due to limited resource capacities. Hence, the placement of service-based applications in clouds is quite complicated. In order to achieve the desired performance of a service-based application, cluster schedulers have to carefully place each service of the application with respect to the resource demands and traffic demands.
Recent cluster scheduling methods mainly focus on cluster resource efficiency or the job completion time of batch workloads [1][2][3]. For instance, Tetris [4], a multi-resource cluster scheduler, adapts heuristics for the multi-dimensional bin packing problem to efficiently pack tasks on a multi-resource cluster. Firmament [5], a centralized cluster scheduler, can make high-quality placement decisions on large-scale clusters via a min-cost max-flow optimization. Unfortunately, these solutions have difficulty handling service-based applications, as they ignore the traffic demands when making placement decisions. Other research works [6,7] concentrate on the composite Software as a Service (SaaS) placement problem and try to minimize the total execution time of composite SaaS. However, they only consider a set of predefined service components for the application placement. For traffic-aware scheduling, relevant research solutions [8,9] have been proposed to handle the virtual machine (VM) placement problem, aiming to optimize network resource usage over the cluster. However, these solutions rely on a certain network topology, while most existing cluster schedulers are agnostic to network topology. In particular, it is hard to obtain network topology information when deploying a service-based application on a virtual infrastructure.
In this paper, we propose a new approach to optimizing the placement of service-based applications in clouds. The objective is to minimize the inter-machine traffic while satisfying the multi-resource demands of service-based applications. Our approach involves two key steps: (1) the requested application is partitioned into several parts while keeping the overall traffic between different parts to a minimum; (2) the parts in the partition are packed into machines with multi-resource constraints. Typically, the partition can be abstracted as a minimum k-cut problem, and the packing can be abstracted as a multi-dimensional bin packing problem. However, both are NP-hard problems [10,11].
To address these problems, we first propose two partition algorithms, Binary Partition and K Partition, which are based on a well-designed randomized contraction algorithm [12], for finding a high-quality application partition. Then, we propose a packing algorithm, which adopts an effective packing heuristic with traffic awareness, for efficiently packing each part of an application partition into machines. Finally, we combine the partition and packing algorithms with a resource demand threshold to find an appropriate placement solution. We implement a prototype scheduler based on our proposed algorithms and evaluate it on testbed clusters. The results show that our scheduler outperforms existing container cluster schedulers and representative heuristics, leading to much less overall inter-machine traffic.

Problem Formulation
In this section, we formulate the placement problem of service-based applications and introduce the objective of this work. The notation used in this work is presented in Table 1.

Table 1. Notation.

| Symbol | Description |
| --- | --- |
| $V_i$ | Vector of available resources on machine $m_i$ |
| $v_i^j$ | Amount of resource $r_j$ available on machine $m_i$ |
| $S$ | A service-based application, composed of a set of services: $S = \{s_1, s_2, \ldots, s_N\}$ |
| $N$ | Number of services in the application |
| $D_i$ | Vector of resource demands of service $s_i$ |
| $d_i^j$ | Amount of resource $r_j$ that service $s_i$ demands |
| $T$ | Matrix of communication traffic between services |
| $t_{ij}$ | Traffic rate from service $s_i$ to service $s_j$ |
| $X$ | A placement solution: $X = [x_{ij}]_{N \times M}$, where $x_{ij} = 1$ if service $s_i$ is to be placed on machine $m_j$, otherwise $x_{ij} = 0$ |

Model Description
We consider a cloud computer cluster composed of a set of heterogeneous machines $\mathcal{M} = \{m_1, m_2, \ldots, m_M\}$, where $M = |\mathcal{M}|$ is the number of machines. We consider $R$ types of resources $\mathcal{R} = \{r_1, r_2, \ldots, r_R\}$ (e.g., CPU, memory, disk, etc.) in each machine. For machine $m_i$, let $V_i = (v_i^1, v_i^2, \ldots, v_i^R)$ be the vector of its available resources, where the element $v_i^j$ denotes the amount of resource $r_j$ available on machine $m_i$.
In the infrastructure as a service (IaaS) model or the container as a service model (e.g., Amazon ECS), users specify the resource demands of VMs or containers (e.g., a combination of CPU, memory, and storage) when submitting deployment requests. Thus, the resource demands are known upon the arrival of service requests. We consider a service-based application composed of a set of services $S = \{s_1, s_2, \ldots, s_N\}$ that are to be deployed on the cluster, where $N = |S|$ is the number of services. For service $s_i$, let $D_i = (d_i^1, d_i^2, \ldots, d_i^R)$ be the vector of its resource demands, where the element $d_i^j$ denotes the amount of resource $r_j$ that service $s_i$ demands. Let the matrix $T = [t_{ij}]_{N \times N}$ denote the traffic between services, where $t_{ij}$ denotes the traffic rate from service $s_i$ to service $s_j$.
We model a placement solution as a 0-1 matrix $X = [x_{ij}]_{N \times M}$. If service $s_i$ is to be deployed on machine $m_j$, then $x_{ij} = 1$. Otherwise, $x_{ij} = 0$.

Objective
To achieve a desired performance of service-based applications, a scheduler should not only consider the multi-resource demands of services but also the traffic situation between services.
As services, especially data-intensive services, often need to transfer data frequently, the network performance directly influences the overall performance. Considering the network dynamics, the placement of the different services of an application is crucial for maintaining the overall performance, particularly when unexpected network latency or congestion occurs in the cluster. Given the traffic situation, the most intuitive solution is to place the services that have a high traffic rate among them on the same machine, so that the co-located services can leverage the loopback interface to get high network performance without consuming actual network resources of the cluster. However, such services cannot all be co-located on one machine due to limited resource capacities. Thus, with the resource constraints, we try to find a placement solution that minimizes the overall traffic between services that are placed on different machines (inter-machine traffic) while satisfying the multi-resource demands of services, so that the objective of this work can be formulated as:

$$\min \sum_{i=1}^{N} \sum_{j=1}^{N} t_{ij} \Bigl(1 - \sum_{k=1}^{M} x_{ik} x_{jk}\Bigr) \qquad (1)$$

Subject to:

$$\sum_{j=1}^{M} x_{ij} = 1, \quad \forall s_i \in S \qquad (2)$$

$$\sum_{i=1}^{N} x_{ij} \, d_i^k \le v_j^k, \quad \forall m_j \in \mathcal{M},\ \forall r_k \in \mathcal{R} \qquad (3)$$

Here $\sum_{k=1}^{M} x_{ik} x_{jk}$ equals 1 when services $s_i$ and $s_j$ are placed on the same machine and 0 otherwise, so only traffic between services on different machines is counted. Equation (2) guarantees that each service is placed on a machine. Equation (3) guarantees that the resource demands on a machine do not exceed its resource capacities. Equation (1) expresses the goal of this work.
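The objective and the capacity constraint can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `inter_machine_traffic` and `is_feasible` are hypothetical, and a placement is encoded as a list `placement[i] = machine index`, which represents the matrix X and satisfies "each service on exactly one machine" by construction.

```python
def inter_machine_traffic(T, placement):
    """Sum t_ij over ordered pairs (i, j) whose services sit on different machines."""
    n = len(T)
    return sum(T[i][j] for i in range(n) for j in range(n)
               if placement[i] != placement[j])


def is_feasible(D, V, placement):
    """Check that, per machine and per resource, total demand <= capacity."""
    used = [[0.0] * len(V[0]) for _ in V]
    for service, machine in enumerate(placement):
        for k, demand in enumerate(D[service]):
            used[machine][k] += demand
    # Small tolerance so that exact fits are not rejected by rounding error.
    return all(used[m][k] <= V[m][k] + 1e-9
               for m in range(len(V)) for k in range(len(V[0])))


# Three services, two machines, two resource types (normalized units).
T = [[0, 5, 1],
     [5, 0, 2],
     [1, 2, 0]]
D = [[0.3, 0.2], [0.4, 0.5], [0.6, 0.6]]
V = [[1.0, 1.0], [1.0, 1.0]]

split = [0, 0, 1]      # services 0 and 1 share machine 0
together = [0, 0, 0]   # everything on machine 0
```

In this toy instance, co-locating the two services with the heaviest mutual traffic leaves only the lighter flows crossing machines, while putting all three services on one machine removes inter-machine traffic but violates the capacity constraint.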

Minimum K-Cut Problem
As a service-based application typically cannot be placed on one machine, an effective partition of the set of services involved in the application is necessary during the deployment. After partition, each subset of the services should be able to be packed into a machine, which means the machine has sufficient resources to run all the services in the subset. Considering the traffic rate between different services, the quality of the partition is crucial for the application performance. To tackle this problem, we first discuss the minimum k-cut problem to understand the problem's complexity.
Let $G = (V, E)$ be an undirected graph, where $V$ is the node set and $E$ is the edge set. In the graph, each edge $e_{u,v} \in E$ has a non-negative weight $w_{u,v}$. A k-cut in graph $G$ is a set of edges which, when removed, partitions the graph into $k$ disjoint nonempty components $G = \{G_1, G_2, \ldots, G_k\}$. The minimum k-cut problem is to find a k-cut with the minimum total weight of edges whose two ends are in different components, which can be computed as:

$$\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \sum_{u \in G_i,\ v \in G_j} w_{u,v}$$

A minimum cut is simply a minimum k-cut when $k = 2$. Figure 1 shows an example of a minimum cut of a graph. There are 2 cuts shown in the figure, and the dashed line is a minimum cut of the graph, as the total weight of the edges cut by the dashed line is the minimum over all cuts.

Given a service-based application, we can represent it as a graph, where the nodes represent services and the weights of the edges represent the traffic rate. Specifically, the traffic rate from service $s_i$ to service $s_j$ and the rate from service $s_j$ to service $s_i$ are represented as two separate edges in the graph. Hence, finding a minimum k-cut of the graph is equivalent to partitioning the application into $k$ parts while keeping the overall traffic between different parts to a minimum. However, for arbitrary $k$, the minimum k-cut problem is NP-hard [10]. Rather than relying on a deterministic algorithm, Karger's algorithm [12] provides an efficient randomized approach to finding a minimum cut of a graph. The basic idea of Karger's algorithm is to randomly choose an edge $e_{u,v}$ from the graph with probability proportional to the weight of edge $e_{u,v}$ and merge node $u$ and node $v$ into one (called edge contraction). In order to find a minimum cut, the algorithm iteratively contracts randomly chosen edges until two nodes remain. The edges that remain at the end are then output by the algorithm. The pseudocode is shown in Algorithm 1.
Algorithm 1: Contraction Algorithm
Input: graph $G = (V, E)$
Output: the edges that remain in $G$
while $|V| > 2$ do
    choose an edge $e_{u,v}$ with probability proportional to its weight;
    merge node $u$ and node $v$ into one node (edge contraction);
return the edges left in $G$

Figure 2 shows an example process of the contraction algorithm ($k = 2$). The algorithm iteratively merges the two nodes of the chosen edge, and all other edges are reconnected to the merged node. For a graph $G = (V, E)$ with $n = |V|$ nodes and $m = |E|$ edges, Karger [12] argues that the contraction algorithm returns a minimum cut of the graph with probability $\Omega(1/n^2)$. Therefore, if we perform the contraction algorithm independently $n^2 \log n$ times, we can find a minimum cut with high probability; the probability of not finding a minimum cut is then $O(1/n)$. For the minimum k-cut, the contraction algorithm is basically the same, except that it terminates when $k$ nodes remain (change $|V| > 2$ to $|V| > k$ in Algorithm 1) and returns all the edges left in the graph $G$. Similarly, the contraction algorithm returns a minimum k-cut of the graph with probability $\Omega(1/n^{2k-2})$. If we perform the algorithm independently $n^{2k-2} \log n$ times, we can obtain a minimum k-cut with high probability. Regarding the time complexity, the contraction algorithm can be implemented to run in strongly polynomial $O(m \log^2 n)$ time [12].
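The contraction procedure described above can be sketched in a few lines of Python. This is a straightforward, unoptimized illustration (it does not attempt the $O(m \log^2 n)$ implementation cited in the text); the function names `contract` and `min_cut`, the edge-list format, and the assumption that the input graph is connected are choices made for this example only.

```python
import random


def contract(nodes, edges, k=2, rng=random):
    """One run of weighted random contraction until k super-nodes remain.

    nodes: iterable of hashable node ids
    edges: list of (u, v, weight) tuples with weight > 0
    Returns (groups, cut_weight): the node sets that remain and the total
    weight of the edges whose endpoints fall in different sets.
    """
    label = {v: v for v in nodes}               # node -> current super-node
    live = [e for e in edges if e[0] != e[1]]   # edges between distinct super-nodes
    remaining = len(label)
    while remaining > k and live:
        # Pick an edge with probability proportional to its weight.
        u, v, _ = rng.choices(live, weights=[w for _, _, w in live])[0]
        keep, drop = label[u], label[v]
        for x in label:                         # merge super-node `drop` into `keep`
            if label[x] == drop:
                label[x] = keep
        remaining -= 1
        # Edges inside a super-node become self-loops and are discarded.
        live = [(p, q, w) for p, q, w in live if label[p] != label[q]]
    groups = {}
    for x, root in label.items():
        groups.setdefault(root, set()).add(x)
    return list(groups.values()), sum(w for _, _, w in live)


def min_cut(nodes, edges, k=2, trials=None, seed=0):
    """Repeat the contraction and keep the lightest cut found."""
    rng = random.Random(seed)
    nodes = list(nodes)
    trials = trials or len(nodes)
    return min((contract(nodes, edges, k, rng) for _ in range(trials)),
               key=lambda result: result[1])


# Two tightly coupled triangles joined by one light edge.
edges = [(0, 1, 10), (1, 2, 10), (0, 2, 10),
         (3, 4, 10), (4, 5, 10), (3, 5, 10),
         (2, 3, 1)]
groups, weight = min_cut(range(6), edges, k=2, trials=200)
```

Because heavy edges are contracted first with high probability, the light bridging edge tends to survive, which is why repeated runs recover the cut that separates the two triangles.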

Placement Algorithm
In this section, we describe the algorithms proposed in this work. The goal of our algorithms is to find a placement solution that minimizes inter-machine traffic while satisfying multi-resource demands.
The key design of our approach includes: (1) application partition based on contraction algorithms, (2) heuristic packing with traffic awareness, and (3) placement finding with threshold adjustment.

Application Partition
In order to make the values of different resources comparable to each other and easy to handle, we first normalize the amount of available resources on machines and the resources that services demand to be fractions of the maximum ones. We define the term $v_{\max}^j$ to be the maximum amount of available resource $r_j$ on a machine. Then the vector $V_i$ of available resources on machine $m_i$ and the vector $D_i$ of resource demands of service $s_i$ are normalized as:

$$V_i = \left(\frac{v_i^1}{v_{\max}^1}, \frac{v_i^2}{v_{\max}^2}, \ldots, \frac{v_i^R}{v_{\max}^R}\right), \qquad D_i = \left(\frac{d_i^1}{v_{\max}^1}, \frac{d_i^2}{v_{\max}^2}, \ldots, \frac{d_i^R}{v_{\max}^R}\right)$$

After normalization, we start partitioning the service-based application. The key question we ask first is how many parts the application should be partitioned into. Considering the multi-resource demands of different services, we introduce a threshold α to determine the number of parts when performing the partition algorithms. The threshold α denotes the upper bound of the resource demands of the partitioned parts, which means we perform the partition algorithms continuously until the total resource demands of each part do not exceed α or no part contains more than one service. A threshold α ∈ [0, 1] (as the resource demands have been normalized) ensures that each part obtained by the partition can be packed into a machine. Figure 3 shows an example of an application partition with threshold α = 0.5. In the figure, the total CPU demands and memory demands of each part do not exceed 0.5. Given a threshold α, we propose two partition algorithms, binary partition and k partition, which are based on the contraction algorithm.
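As a small illustration of the normalization and of the threshold test, the following Python sketch divides every resource dimension by the largest machine capacity in that dimension and then checks whether a part exceeds α in any dimension. The names `normalize` and `exceeds`, and the reading that the threshold applies to each resource dimension separately (as in the Figure 3 example), are assumptions of this sketch.

```python
def normalize(V, D):
    """Scale capacities and demands by the per-resource maximum capacity."""
    R = len(V[0])
    v_max = [max(v[j] for v in V) for j in range(R)]
    V_norm = [[v[j] / v_max[j] for j in range(R)] for v in V]
    D_norm = [[d[j] / v_max[j] for j in range(R)] for d in D]
    return V_norm, D_norm


def exceeds(part, D_norm, alpha):
    """True if the summed demand of `part` exceeds alpha in any resource."""
    R = len(D_norm[0])
    return any(sum(D_norm[s][j] for s in part) > alpha for j in range(R))


# Two machine types (CPU cores, GB RAM) and three services.
V = [[2, 6], [4, 12]]
D = [[1, 3], [0.5, 1.5], [2, 3]]
V_norm, D_norm = normalize(V, D)
```

With these numbers the largest machine becomes (1.0, 1.0), so a threshold of 0.5 means "at most half of the largest machine in every dimension".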

Binary Partition
The idea of the binary partition algorithm is to continuously perform binary partition on the application until the resource demands of each part do not exceed α or no part contains more than one service. The pseudocode is shown in Algorithm 2. The basic process can be described as follows. The algorithm continuously checks the resource demands of each part in the current application partition $P$. The initial partition is $P = \{S\}$, where the entire application is treated as one part. If the total resource demands of a part $S_i$ in $P$ exceed the threshold α and part $S_i$ contains more than one service, the part is selected to be partitioned into 2 parts (binary partition). The algorithm first constructs a graph $G = (V, E)$ based on $S_i$, where the nodes represent services and the weights of the edges represent the traffic rate. As mentioned in Section 3, if we repeatedly perform the contraction algorithm many times, we can obtain a minimum cut with high probability. Considering both the partition quality and the partition speed, we choose to perform the contraction algorithm $n$ times in our algorithm (in an offline setting, it can be set to run $n^2 \log n$ times to get a minimum cut with high probability). Then, according to the minimum cut $G_{\min}$ obtained from the contraction algorithm, it partitions $S_i$ into two parts $\{S_x, S_y\}$. This process is repeated until the resource demands of each part do not exceed the threshold α or no part contains more than one service.
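The binary partition loop can be sketched as follows. This is an illustrative reading of the description above rather than the original Algorithm 2: the helper names `contract_cut` and `binary_partition` are hypothetical, demands are assumed to be already normalized, and the threshold is applied per resource dimension.

```python
import random


def contract_cut(part, T, rng):
    """One weighted contraction run on the subgraph induced by `part`.

    Returns (side_a, side_b, cut_weight). Traffic in both directions is
    kept as two separate edges, as in the graph model of the text.
    """
    label = {s: s for s in part}
    live = [(i, j, T[i][j]) for i in part for j in part
            if i != j and T[i][j] > 0]
    groups = len(part)
    while groups > 2 and live:
        i, j, _ = rng.choices(live, weights=[w for _, _, w in live])[0]
        keep, drop = label[i], label[j]
        for s in label:
            if label[s] == drop:
                label[s] = keep
        groups -= 1
        live = [e for e in live if label[e[0]] != label[e[1]]]
    root = label[part[0]]
    side_a = [s for s in part if label[s] == root]
    side_b = [s for s in part if label[s] != root]
    return side_a, side_b, sum(w for _, _, w in live)


def binary_partition(services, T, D, alpha, seed=0):
    """Split parts in two until each fits under alpha or is a single service."""
    rng = random.Random(seed)
    R = len(D[0])

    def too_big(part):
        return len(part) > 1 and any(
            sum(D[s][j] for s in part) > alpha for j in range(R))

    done, todo = [], [list(services)]
    while todo:
        part = todo.pop()
        if not too_big(part):
            done.append(part)
            continue
        # n contraction runs on a part of n services; keep the lightest cut.
        a, b, _ = min((contract_cut(part, T, rng) for _ in range(len(part))),
                      key=lambda result: result[2])
        todo += [a, b]
    return done


# Services 0-1 and 2-3 exchange heavy traffic; 1-2 exchange very little.
T = [[0, 1000, 0, 0],
     [1000, 0, 1, 0],
     [0, 1, 0, 1000],
     [0, 0, 1000, 0]]
D = [[0.25, 0.25]] * 4
parts = binary_partition(range(4), T, D, alpha=0.5)
```

In this example the whole application exceeds the threshold, so it is cut once along the light 1-2 link, after which both halves fit and the loop stops.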

K Partition
The idea of the k partition algorithm is to directly partition the application into $k$ parts. By iteratively increasing $k$, it terminates when the resource demands of each part do not exceed α or no part contains more than one service. The pseudocode is shown in Algorithm 3. The basic process can be described as follows. The algorithm first constructs a graph $G = (V, E)$ based on the application $S$ and then continuously checks the resource demands of each part in the current application partition $P$, where $P = \{S\}$ initially. If the total resource demands of a part $S_i$ in $P$ exceed the threshold α and part $S_i$ contains more than one service, it increases $k$, which is the number of partitioned parts. As mentioned in Section 3, in order to obtain a minimum k-cut with high probability, we have to perform the contraction algorithm independently $n^{2k-2} \log n$ times. However, the time complexity then increases exponentially with $k$, which is prohibitively high. Thus, we make the time complexity consistent with the binary partition algorithm by sacrificing some probability of finding a minimum k-cut: the k partition algorithm also performs the contraction algorithm $n$ times. Then, according to the minimum k-cut $G_{\min}$ obtained from the contraction algorithm, it partitions the application into $k$ parts $P = \{S_1, S_2, \ldots, S_k\}$. Similarly, this process is repeated until the resource demands of each part do not exceed the threshold α or no part contains more than one service.

Heuristic Packing
Given a partition of the application, the algorithm here packs each part into the heterogeneous machines. Without considering the traffic rate, the problem can be formulated as a classical multi-dimensional bin packing problem, which is known to be NP-hard [11]. When a large number of services are involved in the application, it is infeasible to find the optimal solution in polynomial time. Considering the time complexity and packing quality, we adopt two greedy heuristics in our packing algorithm: Traffic Awareness and Most-Loaded Heuristic. The algorithm is shown in Algorithm 4.

Algorithm 4: Heuristic Packing
Input: partition of the application $P = \{S_1, S_2, \ldots, S_N\}$, vectors of available resources on each machine $\{V_1, V_2, \ldots, V_M\}$
Output: a placement solution $X$
1 Calculate the vector of resource demands of each part as the element-wise sum of the demand vectors of the services in that part

In order to find the best possible machine for part $S_i$, the algorithm calculates two matching factors: $tf$ and $ml$. For machine $m_j$, the factor $tf$ is the sum of the traffic rates between the services in part $S_i$ and the services that have already been assigned to machine $m_j$. The factor $ml$ is a scalar value describing the load situation between the vector of resource demands of part $S_i$ and the vector of available resources on machine $m_j$. Assuming $D_i$ is the resource demand vector of part $S_i$ and $V_j$ is the vector of available resources on machine $m_j$, $ml$ is computed from these two vectors. The higher $ml$ is, the more loaded the machine. The idea of this heuristic is to improve the resource efficiency by packing the part into the most loaded machine. As our main goal is to minimize the inter-machine traffic, the algorithm is designed to first prioritize the machines based on the factor $tf$. If the values of $tf$ are the same, it then prioritizes the machines based on the factor $ml$. Consequently, if all parts in the partition can be packed into machines, the algorithm returns the placement solution. Otherwise, it returns null.
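The two-factor machine selection can be sketched in Python as below. Several details are assumptions made only for this example and are not taken from the paper: the function name `heuristic_pack`, processing parts in the order given, counting traffic in both directions for $tf$, and in particular the concrete form of $ml$ (here the sum over resources of demand divided by remaining capacity), since the defining formula is not available in the text.

```python
def heuristic_pack(parts, T, D, V):
    """Greedy packing: prefer the machine with the most traffic to the part
    (tf) and break ties by a most-loaded score (ml).

    Returns {service: machine index}, or None if some part does not fit.
    """
    R = len(V[0])
    free = [list(v) for v in V]      # remaining capacity per machine
    placed = {}                      # service -> machine index
    for part in parts:
        need = [sum(D[s][k] for s in part) for k in range(R)]
        best, best_key = None, None
        for m, cap in enumerate(free):
            if any(need[k] > cap[k] + 1e-9 for k in range(R)):
                continue             # machine m cannot host this part
            # tf: traffic between this part and services already on machine m.
            tf = sum(T[s][q] + T[q][s] for s in part
                     for q, pm in placed.items() if pm == m)
            # ASSUMPTION: ml = sum_k need_k / free_k (hypothetical definition).
            ml = sum(need[k] / cap[k] for k in range(R) if cap[k] > 0)
            key = (tf, ml)           # tf first, ml only as a tie-breaker
            if best_key is None or key > best_key:
                best, best_key = m, key
        if best is None:
            return None
        for k in range(R):
            free[best][k] -= need[k]
        placed.update({s: best for s in part})
    return placed


# Four small services; service 2 talks to service 0, service 3 talks to nobody.
T = [[0] * 4 for _ in range(4)]
T[2][0] = 5
D = [[0.2, 0.2]] * 4
V = [[1.0, 1.0], [1.0, 1.0]]
placement = heuristic_pack([[0, 1], [2], [3]], T, D, V)
```

Here part [2] follows its traffic partner onto machine 0, and part [3], which has no traffic at all, is still placed on machine 0 because that machine is the more loaded of the two.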

Placement Finding
As discussed before, the threshold α is required by the algorithm in order to partition the application. However, choosing an appropriate fixed threshold α is difficult, as there is no guarantee that the algorithm can find a placement solution through the randomized partition and the heuristic packing under a given threshold α. Intuitively, a higher threshold α results in fewer parts in the partition, which leads to a lower traffic rate between different parts. Thus, we introduce a simple algorithm to find a better threshold α by enumerating from large to small. The algorithm is shown in Algorithm 5. At the beginning, the value of α is 1.0. To adjust the threshold, we set a step value ∆, whose default value is 0.1 and which can be customized by users. In each iteration, with the threshold α, the algorithm first partitions the given application $S$ based on the binary partition algorithm or the k partition algorithm. Note that the algorithm records the latest partition results to avoid repeating the same partition. It then tries to pack all parts in the partition into machines based on the heuristic packing algorithm to find a placement solution for the application.

Algorithm 5: Placement Finding
Input: service-based application $S$, vectors of available resources on each machine

Next, we discuss the time complexity of the proposed algorithm. We assume the number of services is $n$, the number of edges in the service graph is $m$ (i.e., the number of traffic rates $t_{ij} > 0$), and the number of machines is $M$. A service-based application can be partitioned into up to $n$ parts. For each partition, we perform the contraction algorithm $n$ times, and the time complexity of the contraction algorithm is $O(m \log^2 n)$. As we record the latest partition results to avoid repeating the same partition, the time complexity of the overall partition is $O(n^2 m \log^2 n)$. For the heuristic packing, the time complexity is $O(nM + n^2)$, as the overall time complexity of calculating the factor $tf$ is $O(n^2)$. Let $C = 1/\Delta$ denote the number of iterations. The overall time complexity of the proposed algorithm is $O(n^2 m \log^2 n + CnM + Cn^2)$.
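The threshold enumeration described in this section can be sketched as a short driver loop. It is a simplified reading, not the original Algorithm 5: `find_placement`, `partition`, and `pack` are hypothetical names, the partition and packing steps are passed in as callables, and "recording the latest partition" is interpreted here as skipping the packing step when the partition did not change.

```python
def find_placement(partition, pack, alpha=1.0, delta=0.1):
    """Enumerate thresholds from large to small; return the first success.

    partition(alpha) -> list of parts; pack(parts) -> placement or None.
    Returns (alpha, placement), or None if no threshold yields a placement.
    """
    last_parts = None
    steps = round(alpha / delta)
    for i in range(steps):
        a = round(alpha - i * delta, 10)   # avoid floating-point drift
        parts = partition(a)
        if parts == last_parts:            # same partition already failed to pack
            continue
        last_parts = parts
        placement = pack(parts)
        if placement is not None:
            return a, placement
    return None


# Toy stand-ins: one big part above 0.75, two singletons below it;
# only the two-part partition can be packed.
def toy_partition(a):
    return [[0, 1]] if a > 0.75 else [[0], [1]]


def toy_pack(parts):
    return None if len(parts) == 1 else {0: 0, 1: 1}


result = find_placement(toy_partition, toy_pack)
```

With the default step of 0.1 the loop tries at most ten thresholds, which corresponds to the factor $C = 1/\Delta$ in the complexity expression.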

Evaluation
We implement a prototype scheduler in Python, based on our proposed algorithms, for deploying service-based applications on container clusters. In the experiments, we evaluate our scheduler on testbed clusters in the ExoGENI [13] experimental environment.

Experimental Methodology
Cluster. We create two different testbed clusters in ExoGENI for the experiments. For the first cluster, we use 30 homogeneous VMs with 2 CPU cores and 6 GB RAM. To account for heterogeneity, we use 10 VMs with 2 CPU cores and 6 GB RAM and 10 VMs with 4 CPU cores and 12 GB RAM for the second cluster. The homogeneous cluster has 30 VMs and the heterogeneous cluster has 20 VMs, but the total resource capacity is the same.
Workloads. In order to evaluate the proposed algorithms in different scenarios, we use synthetic applications in the experiments. Considering the scale of the testbed cluster, we generate service-based applications composed of 64, 96, and 128 services. For the size of 64, the CPU demand of each service is picked uniformly at random from [30,100], where 100 represents 1 CPU core, and the memory demand is picked at random from [100,300], where 100 represents 1 GB RAM. For the size of 96, the CPU demand is picked at random from [20,67], and the memory demand is picked at random from [67,200]. For the size of 128, the CPU demand is picked at random from [15,50], and the memory demand is picked at random from [50,150]. With these ranges, the total resource demands of the different application sizes are roughly the same. For each application size, we generate 10,000 instances for testing. As the work [14] shows that the log-normal distribution produces the best fit to data center traffic, we generate a traffic demand between each pair of services with probability 0.05 (while ensuring that the application graph is connected), and the traffic rate follows a log-normal distribution (mean = 5 Mbps, standard deviation = 1 Mbps).
Implementation. We implement all proposed algorithms in our prototype scheduler, where the contraction algorithm is based on the parallel implementation [12]. As we proposed two algorithms for application partition, there are two configurations. BP-HP is based on binary partition (BP) and heuristic packing (HP). KP-HP is based on k partition (KP) and heuristic packing.
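A synthetic workload of this kind could be generated along the following lines. The sketch is illustrative only: `generate_app` is a hypothetical name, the stated mean and standard deviation are interpreted as those of the traffic rate itself (so the parameters of the underlying normal distribution are derived from them), and the connectivity guarantee mentioned in the text is not enforced here.

```python
import math
import random


def generate_app(n, cpu_range, mem_range, p=0.05, mean=5.0, sd=1.0, seed=0):
    """Return (D, T): per-service [cpu, mem] demands and an n x n traffic matrix."""
    rng = random.Random(seed)
    D = [[rng.uniform(*cpu_range), rng.uniform(*mem_range)] for _ in range(n)]
    # ASSUMPTION: `mean` and `sd` describe the log-normal variable itself;
    # convert them to the (mu, sigma) of the underlying normal distribution.
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Each ordered pair carries traffic with probability p.
            if i != j and rng.random() < p:
                T[i][j] = rng.lognormvariate(mu, math.sqrt(sigma2))
    return D, T


# One instance of the 64-service configuration (100 = 1 core / 1 GB).
D, T = generate_app(64, cpu_range=(30, 100), mem_range=(100, 300))
rates = [t for row in T for t in row if t > 0]
```

If the other interpretation is intended (5 and 1 as parameters of the underlying normal distribution), `rng.lognormvariate(5, 1)` would be used directly, which yields much larger rates.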
Baselines. As mentioned above, many research efforts have been devoted to the composite Software as a Service (SaaS) placement problem [6,15]. However, they target the placement of a certain set of predefined service components. More importantly, these metaheuristic-based approaches often take minutes or even hours, particularly for large-scale clusters, to generate a placement solution, which makes an online response difficult. Other research work focuses on the traffic-aware VM placement problem [16,17]. However, these solutions rely on a certain network topology, while our approach is agnostic to network topology. Thus, we choose to compare our scheduler with the following schemes:
• Kubernetes Scheduler (KS): the default scheduler in a Kubernetes [18] container cluster tends to distribute containers evenly across the cluster to balance the overall cluster resource usage. Specifically, we add a soft affinity (i.e., pod affinity in Kubernetes) to the services that have traffic between them, as the scheduler then tries to place the services which have affinity between them on the same machine.
• First-Fit Decreasing (FFD): a simple and commonly adopted algorithm for the multi-dimensional bin packing problem [19]. FFD operates by first sorting the services in decreasing order according to a certain resource demand and then packing each service into the first machine with sufficient resources.
• Best-Fit Decreasing (BFD): it places a service in the fullest machine that still has enough capacity. BFD operates by first sorting the machines in decreasing order according to a certain resource capacity and then packing each service into the first machine with sufficient resources.
• Multi-resource Packer (PACK): the idea of this heuristic [4] is to schedule the services in increasing order of alignment between the resource demands of services and the resource availability of machines (i.e., the dot product between the vector of resource demands and the vector of available resources).
• Random (RAND): it randomly picks a service in the application and then packs it into the first machine with sufficient resources.
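For reference, the First-Fit Decreasing baseline as described in the list above can be sketched as follows. The function name `first_fit_decreasing` and the choice of the first resource (index 0) as the sort key are assumptions of this example; the text only says the services are sorted by "a certain resource demand".

```python
def first_fit_decreasing(D, V, key=0):
    """Sort services by demand of resource `key` (descending), then put each
    service on the first machine that still has room in every resource.

    Returns {service: machine index}, or None if some service does not fit.
    """
    order = sorted(range(len(D)), key=lambda s: D[s][key], reverse=True)
    free = [list(v) for v in V]          # remaining capacity per machine
    placement = {}
    for s in order:
        for m, cap in enumerate(free):
            if all(d <= c + 1e-9 for d, c in zip(D[s], cap)):
                for k, d in enumerate(D[s]):
                    cap[k] -= d
                placement[s] = m
                break
        else:
            return None                  # no machine can host service s
    return placement


D = [[0.5, 0.1], [0.7, 0.1], [0.3, 0.1]]
V = [[1.0, 1.0], [1.0, 1.0]]
ffd = first_fit_decreasing(D, V)
```

Note that this baseline never looks at the traffic matrix, which is the reason the text expects it to produce more inter-machine traffic than a traffic-aware placement.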

Comparison with Baselines
Figure 4 shows the successful placement ratio of the different schemes on the two clusters. An application is successfully placed if the algorithm can find a placement solution that places all the involved services, so the ratio is the number of successfully placed applications to the number of all requested applications. We observe that RAND performs worst, as it has no heuristic to pack the services. FFD and BFD perform better than KS and PACK because KS mainly focuses on balancing the resource utilization over the cluster, and PACK focuses on the alignment between resource demands and resource availability. FFD and BFD have been demonstrated to be effective algorithms for multi-dimensional bin packing problems [20]. BP-HP performs comparably to KP-HP, and both slightly outperform the other schemes in this evaluation. This is mainly because the iterative partition and packing with different thresholds improve the probability of finding a placement solution. Moreover, the packing algorithm can pack services tightly due to the most-loaded heuristic. The results for the homogeneous cluster also show that the successful placement ratio increases when the number of services increases. As the total resource demands of the applications of different sizes (different numbers of services) are roughly the same, a smaller number of services results in larger resource demands for each individual service, which easily causes resource fragmentation in the placement. Compared to the homogeneous cluster, the successful placement ratio is much higher in the heterogeneous cluster. As the machines have larger resource capacity in the heterogeneous cluster, it is easier to pack services constrained by multiple resources.
Next, we evaluate the traffic situation of the different schemes. In the evaluation, we only compare the applications for which all services are placed on the cluster by the different algorithms. Figure 5 shows the average co-located traffic ratio of the different schemes, and the error bars represent the maximum and minimum ratio. The co-located traffic is the traffic between the services that are placed on the same machine, so the ratio is the amount of co-located traffic to the amount of all traffic. For minimizing inter-machine traffic, the higher the co-located traffic ratio is, the better the placement solution is. To be specific, we present the co-located traffic ratio in Table 2. We observe that BP-HP and KP-HP significantly outperform the baselines. For the cluster with homogeneous machines, BP-HP improves the average co-located traffic ratio by 24.8% to 38.1%; KP-HP improves the ratio by 22% to 35.3%. For the cluster with heterogeneous machines, BP-HP improves the average co-located traffic ratio by 24.7% to 39.6%; KP-HP improves the ratio by 23.4% to 38.3%. FFD, BFD, PACK, and RAND perform poorly as they only focus on packing the services, without considering the traffic rate. As we set the affinity for the services that have traffic between them in KS, KS tries to put the services with affinity on the same machine. However, KS ignores the concrete traffic rate when making placement decisions. Regarding BP-HP and KP-HP, we find that BP-HP performs slightly better and more stably than KP-HP, but KP-HP may find a better solution in some cases (according to the error bars). Conversely, KP-HP also more easily returns a worse solution. This is mainly because BP-HP performs the contraction algorithm to find a minimum cut with probability $\Omega(1/n^2)$, whereas KP-HP performs the contraction algorithm to find a minimum k-cut with probability $\Omega(1/n^{2k-2})$, which is much lower than that of BP-HP. Thus, the performance of KP-HP varies widely in the experiments. Nevertheless, benefiting from the partition that strives to co-locate large traffic demands and from the traffic-aware packing, both BP-HP and KP-HP can effectively reduce inter-machine traffic when deploying service-based applications on computer clusters.
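The co-located traffic ratio used in this comparison can be computed directly from the traffic matrix and a placement. The sketch below is a hypothetical helper (`colocated_ratio`), with a placement encoded as a list mapping each service index to its machine; returning 1.0 for an application without any traffic is a convention chosen for this example.

```python
def colocated_ratio(T, placement):
    """Fraction of total traffic exchanged between services on the same machine."""
    n = len(T)
    total = sum(map(sum, T))
    same = sum(T[i][j] for i in range(n) for j in range(n)
               if i != j and placement[i] == placement[j])
    return same / total if total else 1.0


T = [[0, 3, 1],
     [3, 0, 0],
     [1, 0, 0]]
ratio = colocated_ratio(T, [0, 0, 1])   # services 0 and 1 share a machine
```

Since the inter-machine traffic is the total traffic minus the co-located traffic, maximizing this ratio is equivalent to minimizing the objective of the placement problem for a fixed application.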

Impact of Threshold α
In this section, we discuss the impact of threshold α on the service-based application placement. To illustrate, we run BP-HP with a fixed threshold α on the cluster with homogeneous machines. Figure 6 shows the successful placement ratio with different values of threshold α. For instance, BP-HP can find a placement solution for 77% of the applications with 64 services when α = 0.5. We observe that the successful placement ratio generally decreases when the value of threshold α increases, and few applications can be successfully placed when α > 0.7. A higher threshold α leads to fewer parts and larger average resource demands per part in the partition, so it becomes harder to pack the parts into machines with multi-resource constraints. To understand the impact on the network traffic, Figure 7 shows the average co-located traffic ratio for each value of threshold α, and the error bars represent the maximum and minimum ratio. It clearly shows that the co-located traffic ratio increases as α increases. However, a larger threshold α also makes the applications harder to pack. Thus, the proposed algorithms search for an appropriate threshold α by enumerating values from large to small.
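The large-to-small enumeration of α can be sketched as a simple loop that stops at the first threshold for which packing succeeds. Everything below is an assumption for illustration: the step of 0.1, the function names, and the toy `toy_partition` and `toy_pack` stand-ins (single resource, first-fit packing), which replace the partition and packing algorithms of the paper only to make the sketch runnable.

```python
def search_threshold(app, partition, pack, alphas=None):
    """Try thresholds from large to small; return the first (alpha, placement) that packs."""
    if alphas is None:
        alphas = [round(1.0 - 0.1 * i, 1) for i in range(10)]  # 1.0, 0.9, ..., 0.1
    for alpha in alphas:
        parts = partition(app, alpha)
        placement = pack(app, parts)
        if placement is not None:
            return alpha, placement
    return None, None


# Hypothetical stand-ins, not the algorithms of the paper.
def toy_partition(app, alpha):
    """Group services greedily so that each part's total demand stays within alpha."""
    parts, current, load = [], [], 0.0
    for name, need in app.items():
        if current and load + need > alpha:
            parts.append(current)
            current, load = [], 0.0
        current.append(name)
        load += need
    if current:
        parts.append(current)
    return parts


def toy_pack(app, parts, capacities=(0.6, 0.6)):
    """First-fit each part onto a machine; return None if some part does not fit."""
    free = list(capacities)
    placement = {}
    for part in parts:
        need = sum(app[s] for s in part)
        for j, room in enumerate(free):
            if need <= room:
                free[j] = room - need
                placement.update({s: j for s in part})
                break
        else:
            return None
    return placement


app = {"a": 0.3, "b": 0.3, "c": 0.3, "d": 0.3}
alpha, placement = search_threshold(app, toy_partition, toy_pack)
```

In this toy run, thresholds 1.0 and 0.9 produce a part that is too large for either machine, and the search settles on the largest threshold that still packs.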

Overhead Evaluation
In this section, we evaluate the overhead by measuring the algorithm runtime and compare it with KS and RAND. In order to compare the algorithm runtime fairly, we also implement the scheduling algorithm of KS in Python, the same language as the other schemes. We conduct this experiment on a dedicated server with an Intel Xeon E5-2630 2.4 GHz CPU and 64 GB memory. Figure 8 shows the average algorithm runtime of different schemes for the heterogeneous cluster (the results for the homogeneous cluster are similar), and the error bars represent the maximum and minimum algorithm runtime. RAND incurs little overhead, as it is a very simple algorithm. Compared with RAND, KS is somewhat more complex, as KS has multiple predicate policies and priority policies to filter and score machines, such as handling the affinity between services. BP-HP and KP-HP are more complicated than the baselines and have noticeably higher overhead. We also observe that the difference between the maximum and minimum algorithm runtime is quite large, as the algorithm runtime heavily depends on the value of threshold α. In the algorithm, a higher threshold α results in fewer iterations, and a lower threshold α causes more iterations. Nevertheless, BP-HP and KP-HP can respond in seconds for different application sizes. In particular, for applications with fewer than 100 services, BP-HP and KP-HP can respond in sub-second time, which is acceptable for online scheduling. Moreover, the most time-consuming part of the proposed algorithms is the application partition, which means that the algorithm runtime should not differ much on large-scale clusters for the same number of services. We believe that the proposed algorithms can also effectively handle the placement problem on large-scale clusters.

Related Work
As the microservice architecture is emerging as a primary architectural style in the service-oriented software industry [21], many research efforts have been devoted to the analysis and modeling of microservice architecture [22][23][24]. Leitner et al. [25] proposed a graph-based cost model for deploying microservice-based applications on a public cloud. Balalaie et al. [26] presented their experience and lessons on migrating a monolithic software architecture to microservices. Amaral et al. [27] evaluated the performance of microservice architectures using containers. However, the performance of service placement schemes received little attention in these works.
Software as a Service (SaaS) is one of the most important services offered by cloud providers, and many works have been proposed for optimizing composite SaaS placement in cloud environments [15]. Yusoh et al. [6] propose a genetic algorithm for the composite SaaS placement problem, which considers both the placement of the software components of a SaaS and the placement of the data of the SaaS. It tries to minimize the total execution time of a composite SaaS. Hajji et al. [7] adopt a new variation of particle swarm optimization (PSO) called Particle Swarm Optimization with Composite Particle (PSO-CP) to solve the composite SaaS placement problem. It considers not only the total execution time of the composite SaaS but also the performance of the underlying machines. Unfortunately, these works target the placement of a certain set of predefined service components, which limits their ability to handle a large number of different services. In addition, plenty of research has been proposed to optimize service placement in edge and fog computing [28,29]. Mennan et al. [30] proposed a service placement heuristic to maximize the bandwidth allocation when deploying community network micro-clouds. It uses the information of network bandwidth and node availability to optimize service placement. In contrast, we consider the constraints of multiple resources rather than just network bandwidth, and we minimize the inter-machine traffic while satisfying the multi-resource demands of service-based applications. Carlos et al. [31] presented a decentralized algorithm for the placement problem to optimize the distance between the clients and the most requested services in fog computing. They assume there are unlimited resources in cloud computing and try to minimize the hop count by placing the most popular services as close to the users as possible. In contrast, our work focuses on the overall network usage of the cluster underlying the cloud, which is modeled as a set of heterogeneous machines.
In recent years, a number of research works have been proposed in the area of traffic-aware VM placement for cloud data centers [16,17]. Meng et al. [8] analyze the impact of data center network architectures and traffic patterns and propose a heuristic approach to reduce the aggregate traffic when placing VMs into the data center. Wang et al. [9] formulate the VM placement problem with dynamic bandwidth demands as a stochastic bin packing problem and propose an online packing algorithm to minimize the number of machines required. However, they only focus on optimizing the network traffic in the data center, without considering the highly diverse resource requirements of the virtual machines. Biran et al. [32] proposed a placement scheme to satisfy the traffic demands of the VMs while meeting the CPU and memory requirements. Dong et al. [33] introduced a placement solution to improve network resource utilization in addition to meeting multiple resource constraints. Both rely on a certain network topology to make placement decisions. Besides, many research efforts have been devoted to scheduling and partitioning on heterogeneous systems [34,35]. In contrast to these works, our work is agnostic to the underlying network topology and aims to minimize the overall inter-machine traffic on the cluster.

Conclusions
In this paper, we investigated the service placement problem for microservice architecture in clouds. In order to find a high-quality partition of service-based applications, we propose two partition algorithms, Binary Partition and K Partition, which are based on a well-designed randomized contraction algorithm. To pack the application efficiently, we adopt the most-loaded heuristic and traffic awareness in the packing algorithm. By adjusting the threshold α, which denotes the upper bound of the resource demands, we can find a better placement solution for service-based applications. We implement a prototype scheduler based on our proposed algorithms and evaluate it on testbed clusters. In the evaluation, we show that our algorithms can improve the ratio of successfully placed applications on the cluster while significantly increasing the ratio of co-located traffic (i.e., reducing the inter-machine traffic). In the overhead evaluation, the results show that our algorithms incur some overhead, but the runtime remains acceptable. We believe that the proposed algorithms are practical for realistic use cases. In the future, we will investigate problem-specific optimizations to improve our implementation and consider resource dynamics in the placement to adapt to more sophisticated situations.

Figure 1. An example of a minimum cut (dashed line).

    G ← G/e_{u,v};  // contract edge e_{u,v}
end
return the cut in G;
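The contraction step above can be illustrated with a small runnable sketch of randomized contraction on a weighted graph. This is not the authors' implementation: the function names, the edge-dictionary representation, the choice of picking edges with probability proportional to their traffic rate, and the default of n + 1 repetitions are all assumptions made here. The sketch also assumes positive edge weights.

```python
import random


def contract_once(nodes, edges, rng):
    """One run of randomized contraction; returns (cut_weight, side_a, side_b).

    edges: dict mapping a node pair (u, v) to a positive weight (assumed format)
    """
    group = {v: {v} for v in nodes}   # super-node label -> original nodes merged into it
    live = dict(edges)                # edges between distinct super-nodes
    while len(group) > 2 and live:
        pairs = list(live)
        # Pick an edge at random, here weighted by its traffic rate (an assumption).
        u, v = rng.choices(pairs, weights=[live[p] for p in pairs])[0]
        group[u] |= group.pop(v)      # contract edge (u, v): merge v into u
        merged = {}
        for (a, b), w in live.items():
            a = u if a == v else a
            b = u if b == v else b
            if a != b:                # edges inside a super-node become self-loops; drop them
                key = tuple(sorted((a, b)))
                merged[key] = merged.get(key, 0) + w
        live = merged
    parts = list(group.values())
    side_a = parts[0]
    side_b = set().union(*parts[1:])  # more than two groups only if the graph is disconnected
    return sum(live.values()), side_a, side_b


def min_cut(nodes, edges, trials=None, seed=0):
    """Repeat the contraction and keep the smallest cut found."""
    rng = random.Random(seed)
    if trials is None:
        trials = len(nodes) + 1
    best = None
    for _ in range(trials):
        candidate = contract_once(nodes, edges, rng)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


# Two tightly coupled groups of services joined by one light edge.
nodes = ["a", "b", "c", "x", "y", "z"]
edges = {("a", "b"): 100, ("b", "c"): 100, ("a", "c"): 100,
         ("x", "y"): 100, ("y", "z"): 100, ("x", "z"): 100,
         ("c", "x"): 1}
weight, side_a, side_b = min_cut(nodes, edges)
```

A single run can miss the minimum cut, which is why the result is taken over several repetitions; in this example the light edge between the two groups is very unlikely to be contracted, so the cut separating them is found almost surely.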

Figure 3. An example of an application partition with threshold α = 0.5.

Figure 4. Comparison of successful placement ratio of different schemes.

Figure 5. Comparison of average co-located traffic ratio of different schemes.

Figure 6. Successful placement ratio on the homogeneous cluster by using BP-HP with different values of threshold α.

Figure 8. Average algorithm runtime of different schemes for the heterogeneous cluster.

Table 1. Notation and Description.
Notation | Description
$\mathcal{M}$ | Set of heterogeneous machines in the cluster: $\mathcal{M} = \{m_1, m_2, \ldots, m_M\}$
$M$ | Number of the machines: $M = |\mathcal{M}|$
$\mathcal{R}$ | Set of resource types: $\mathcal{R} = \{r_1, r_2, \ldots, r_R\}$
$R$ | Number of the resource types: $R = |\mathcal{R}|$
$V_i$ |

Algorithm 2: Binary Partition
Input: service-based application S, threshold α
Output: a partition of the application P = {S_1, S_2, ..., S_N}, where N is the number of parts after partition
P ← {S};
while there exists a part S_i in P whose total resource demands exceed α and that contains more than one service do
    P ← P − {S_i};
    Construct a graph G = (V, E) based on S_i;
    n ← |V|;
    G_min ← G;
    t ← 0;
    repeat
        Perform the contraction algorithm to get a cut G′;
        G_min ← min(G_min, G′);  // Store the smaller cut in G_min
        t ← t + 1;
    until t > n;
    Get a partition {S_x, S_y} of part S_i according to G_min;
    P ← P ∪ {S_x, S_y};
end
return P;

Algorithm 3: K Partition
Input: service-based application S, threshold α
Output: a partition of the application P = {S_1, S_2, ..., S_N}, where N is the number of parts after partition
P ← {S};
Construct a graph G = (V, E) based on S;
n ← |V|;
k ← 1;
while there exists a part S_i in P whose total resource demands exceed α and that contains more than one service do
    G_min ← G;
    k ← k + 1;
    t ← 0;
    repeat
        Perform the contraction algorithm until k nodes remain to get a k-cut G′;
        G_min ← min(G_min, G′);  // Store the smaller k-cut in G_min
        t ← t + 1;
    until t > n;
    Get a partition {S_1, S_2, ..., S_k} of the application S according to G_min;
    P ← {S_1, S_2, ..., S_k};
end
return P;
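The outer loop of Binary Partition can be sketched as follows. To keep the example short and deterministic, an exhaustive minimum cut (`exact_min_cut`, a hypothetical helper that is only practical for a handful of services) stands in for the repeated randomized contraction, and each service is assumed to have a single scalar resource demand rather than a multi-resource vector. All names are illustrative assumptions.

```python
from itertools import combinations


def exact_min_cut(services, traffic):
    """Exhaustive minimum cut of one part; a stand-in for randomized contraction."""
    members = sorted(services)
    first, rest = members[0], members[1:]
    inside = set(members)
    best_weight, best_split = None, None
    # side_x always contains `first`; taking fewer than len(rest) extras keeps side_y non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            side_x = {first, *extra}
            weight = sum(w for (u, v), w in traffic.items()
                         if u in inside and v in inside
                         and (u in side_x) != (v in side_x))
            if best_weight is None or weight < best_weight:
                best_weight, best_split = weight, (side_x, inside - side_x)
    return best_split


def binary_partition(demand, traffic, alpha):
    """Split parts along minimum cuts until every part fits under alpha or is a single service."""
    parts = [set(demand)]
    while True:
        too_big = next((p for p in parts
                        if len(p) > 1 and sum(demand[s] for s in p) > alpha), None)
        if too_big is None:
            return parts
        parts.remove(too_big)
        parts.extend(exact_min_cut(too_big, traffic))


demand = {"a": 0.3, "b": 0.3, "c": 0.3, "d": 0.3}
traffic = {("a", "b"): 10, ("b", "c"): 1, ("c", "d"): 10}
parts = binary_partition(demand, traffic, alpha=0.7)
```

With α = 0.7 the application is cut once along the lightest edge (b, c), which keeps the two heavy traffic pairs together; a smaller threshold such as α = 0.5 forces further cuts through the heavy edges.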

... 1 do
    if part S_i can be packed into machine m_j then
        tf_j ← ∑ t_{uv};  /* Calculate the total traffic rates between part S_i and machine m_j, for any service s_u in S_i and any service s_v packed into machine m_j */
        ml_j ← ...;  /* Calculate the load situation between the vector of resource demands from part S_i and the vector of available resources on machine m_j */
        if tf_j > tf then
            tf ← tf_j; ml ← ml_j; y ← j;
        end
        else if tf_j == tf and ml_j > ml then
            tf ← tf_j; ml ← ml_j; y ← j;
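The selection rule in this fragment (prefer the feasible machine with the largest traffic to the part, and break ties by the load score) can be sketched in Python. The load score used below, the sum of demand-to-available ratios over the resource types, is an assumption made for illustration, since the exact expression is not recoverable from the fragment; the function name and data layout are likewise hypothetical.

```python
def choose_machine(part, demand, free, placed, traffic):
    """Return the index of the machine chosen for `part`, or None if it fits nowhere.

    part:    set of service names to place together
    demand:  dict service -> list of demands, one entry per resource type
    free:    list of available-resource lists, one per machine
    placed:  list of sets with the services already on each machine
    traffic: dict (u, v) -> traffic rate
    """
    need = [sum(col) for col in zip(*(demand[s] for s in part))]
    best = None  # (tf, ml, machine index)
    for j, avail in enumerate(free):
        if any(n > a for n, a in zip(need, avail)):
            continue  # part cannot be packed into machine j
        # tf_j: total traffic between the part and the services already on machine j
        tf_j = sum(w for (u, v), w in traffic.items()
                   if (u in part and v in placed[j]) or (v in part and u in placed[j]))
        # ml_j: assumed load score, larger when the part fills more of what is left
        ml_j = sum(n / a for n, a in zip(need, avail) if a > 0)
        if best is None or tf_j > best[0] or (tf_j == best[0] and ml_j > best[1]):
            best = (tf_j, ml_j, j)
    return None if best is None else best[2]


demand = {"api": [1, 2]}
traffic = {("api", "db"): 5}
chosen = choose_machine({"api"}, demand,
                        free=[[4, 8], [2, 4], [2, 4]],
                        placed=[set(), {"db"}, set()],
                        traffic=traffic)
```

In the example, machine 1 already hosts `db`, so it wins on traffic even though the other machines are also feasible; when no machine shares traffic with the part, the load score alone decides.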