Partitioning DNNs for Optimizing Distributed Inference Performance on Cooperative Edge Devices: A Genetic Algorithm Approach

Abstract: To fully unleash the potential of edge devices, it is popular to cut a neural network into multiple pieces and distribute them among available edge devices to perform inference cooperatively. Up to now, the problem of partitioning a deep neural network (DNN) to achieve optimal distributed inference performance has not been adequately addressed. This paper proposes a novel layer-based DNN partitioning approach to obtain an optimal distributed deployment solution. In order to ensure the applicability of the resulting deployment scheme, this work defines the partitioning problem as a constrained optimization problem and puts forward an improved genetic algorithm (GA). Compared with the basic GA, the proposed algorithm achieves a running time approximately one to three times shorter than that of the basic GA while achieving a better deployment.


Introduction
Internet of things (IoT), edge computing (EC), and artificial intelligence (AI) are three technological pillars of the current industrial revolution [1,2]. One hot topic in applying these techniques is edge intelligence [3][4][5], which involves running deep learning algorithms at the edge of the networks instead of entirely offloading the inferencing tasks to the cloud center. As edge intelligence can alleviate problems such as large bandwidth occupancy, high transmission delay, slow response speed, poor network reliability, and leakage of personal or sensitive information, it is being intensively researched and widely used today. For example, it is common to deploy a trained convolutional neural network (CNN) to the edge for performing real-time video analysis in applications including autonomous driving [6], intelligent monitoring [7], industrial IoT [8], smart cities [9], etc. However, as CNN inference is usually computationally intensive and some CNNs are huge, it is often infeasible to deploy a complete CNN model or perform inference on a single edge device.
On the one hand, the above problem can be addressed by pruning, quantization, or knowledge distillation [10][11][12] to obtain a smaller DNN model before deployment and inferencing. However, these techniques may somewhat sacrifice model inference accuracy. On the other hand, a powerful cloud server is usually adopted to perform part of the inferencing task [13][14][15]. In such a cloud-assisted approach, a DNN model is often divided into two parts: one remains local and the other runs on the remote cloud server. Like offloading a complete DNN model to the cloud server, cloud-assisted approaches also face the problem of private data leakage, which is inherent in cloud computing. Additionally, balancing the calculation accuracy, end-to-end delay, and resource occupancy is also challenging.
To fully unleash the potential of edge devices, it is popular to cut a neural network into multiple pieces and distribute them among available edge devices to perform the inference cooperatively [16,17]. This approach could overcome the problems above by keeping the inferencing process in the edge network. Nevertheless, it is more challenging to partition and distribute a neural network to achieve optimal performance, as it is an NP-hard problem. Although some strategies have been developed in an attempt to split a DNN into several parts effectively [18][19][20], most of them pay more attention to the methodology of reorganizing the network structure rather than optimizing the process for getting an optimal solution from the perspective of the actual system running. Hence, the problem of partitioning a DNN model to achieve optimal deployment has not been adequately addressed.
This paper proposes a novel layer-based partitioning approach to obtain an optimal DNN deployment solution. In order to ensure the applicability of the resulting deployment scheme, the partitioning problem is defined as a constrained optimization problem and an improved genetic algorithm (GA) is proposed to ensure the generation of feasible candidate solutions after each crossover and mutation operation. Compared to the basic GA, the proposed GA results in a running time that is one to three times shorter than that of the basic GA, while obtaining a better deployment. The main contributions of this paper are as follows:
• Firstly, the DNN model partitioning problem is modeled as a constrained optimization problem, and the corresponding formulation is presented.
• Secondly, the paper puts forward a novel genetic algorithm that shortens the solving time by ensuring the validity of chromosomes after crossover and mutation operations.
• Finally, experiments are performed on several existing DNN models, including AlexNet, ResNet110, MobileNet, and SqueezeNet, to present a more comprehensive evaluation.
The remainder of this paper is organized as follows: Section 2 gives an overview of the related work. Section 3 presents the problem definition of the DNN partition problem. Section 4 introduces the details of the proposed algorithm. Section 5 provides the experimental results, and Section 6 concludes the paper.

Literature Review
As most modern DNNs are constructed by layers, such as the convolutional layer, the fully connected layer, and the pooling layer, layer-based partitioning is the most intuitive DNN partitioning strategy. For example, Ref. [14] proposed to partition a CNN model at the end of the convolutional layer, allocating the convolutional layers at the edge and the rest of the fully-connected layers at the host. Unlike this fixed partitioning strategy, recent methodologies have focused on adapting their results to the actual inferencing environment. Generally, depending on the construction of the target deployment environment, existing methods are divided into the following two categories.
According to the basic idea of the cloud-assisted approaches, some studies try to divide a given DNN model into two sets and push the latter part to the cloud server. For example, Ref. [13] designed a lightweight scheduler named Neurosurgeon to automatically partition DNN computation between mobile devices and data centers based on neural network layers. Similarly, Refs. [21][22][23] adopted the same strategy with some further processing. In [21], the authors integrated DNN right-sizing to accelerate the inference by exiting inference early at an intermediate layer. In contrast, Ref. [22] first added early exit points to the original network and then partitioned the reformed network into two parts. To determine the optimal single cut point, [13,21,22] all applied exhaustive searching, while [23] solved the problem with mixed-integer linear programming.
To make full use of the available resources in the edge environment, more DNN partitioning strategies have emerged to divide a DNN model into more than two pieces for distributing the inference task among several edge devices. Generally, based on the object to be partitioned, there are four main kinds of strategies, i.e., partitioning the inputs [24,25], the weights [26], and the layers [18,19], as well as hybrid strategies [17,20,[27][28][29][30]. Partitioning the inputs or weights addresses the large storage requirements of inputs or weights. Partitioning the DNN layers can solve the depth problem of DNN inferencing. Furthermore, the hybrid strategies aim to solve both problems mentioned above. For example, Ref. [27] employed input partitioning after layer-based partitioning to obtain a small enough group of inferencing tasks to be executed. The authors of [20] proposed fused tile partitioning (FTP) to fuse layers and partition them vertically in a grid fashion. The authors of [29] modeled a neural network as a data-flow graph where vertices are input data, operations, or output data and edges are data transfers between vertices. Then, the problem was transformed into a graph partitioning problem.
Nearly all of the above works take inference delay or energy consumption as the optimization objectives. Recently, more studies have begun to focus on the joint optimization of DNN partitioning and resource allocation [31][32][33]. However, it is still an open and critical challenge to achieve an optimal DNN distributed deployment. Unlike existing approaches, this work models the DNN partitioning problem as a constrained optimization problem, aiming to achieve the optimal inference performance with available resources in the edge environment. Moreover, it proposes a novel genetic algorithm to optimize the solving process of the formulated optimization problem.

System Model and Problem Formulation
This section provides an overview of the motivation and fundamental process of the proposed DNN partitioning approach and presents a formal problem description. Suppose that N edge devices and an edge server form an edge network. The edge server acts as a master to receive user requests, partition DNN models, and assign the DNN inferencing tasks to each edge device. Take video-based fall detection in health monitoring as an example. A video-based fall detection application takes a video stream as the input and recognizes whether a human has fallen based on a given neural network. Due to latency and privacy requirements, such applications are best deployed in an edge environment. To avoid the edge server becoming the inference bottleneck of all the edge intelligent applications, it is better to distribute the background inferencing tasks to other edge devices.
As illustrated in Figure 1, after a user deploys a fall detection application in the edge environment through a user interface, the edge server will extract the neural network inferencing task and the corresponding neural network. It partitions and dispatches the neural network according to the current status of each edge device, such as the smart camera, the smart speaker, and the sweeping robot in Figure 1. Then, the neural network is divided into three parts in this example, i.e., p1, p2, and p3, and deployed to the smart camera, the smart speaker, and the sweeping robot, respectively. All these selected devices will cooperate to complete a further distributed inferencing process without the edge server. Specifically, the smart camera will run the partition p1 and send its output to the smart speaker as the input of partition p2. The sweeping robot, in turn, performs the partition p3 after receiving the smart speaker's output, then outputs the recognized result.
As a group of edge devices will cooperate in executing a single DNN, an edge device must receive an input from a preceding device, perform the inferencing task, and deliver the output to the next device. Suppose the DNN model is divided into n pieces and deployed to n different edge devices. Using p_i (i = 1, 2, ..., n) to represent a sub-model of a given DNN and d_j (j = 1, 2, ..., n) to represent a selected device, if a sub-model p_i is deployed to a device d_j, the corresponding execution time t_{i,j} is defined as

t_{i,j} = tc_{i,j} + tr_i + ts_i, (1)

where tc_{i,j} is the time of executing sub-model p_i on device d_j, and tr_i and ts_i are the times for receiving the input of p_i and sending the output of p_i, respectively. If tt_i is used to represent the total transmission time, then tt_i = tr_i + ts_i and t_{i,j} = tc_{i,j} + tt_i. In addition, because not all sub-models can run directly on any edge device, it is also necessary to consider whether an edge device can complete a specific inferencing task according to its current state, for example, whether its available memory and remaining battery capacity are sufficient. Suppose m_j is the size of the available memory on device d_j and rm_i is the memory required to run sub-model p_i. If p_i can be executed on device d_j, the following inequality must hold:

rm_i ≤ m_j. (2)
Similarly, if ep_j is the average running power of device d_j and c_j is the remaining battery capacity of device d_j, then, for p_i to be executable on device d_j, the following inequality must also hold:

ep_j × t_{i,j} ≤ c_j. (3)
Above all, the DNN partitioning problem is formulated as a constrained optimization problem that aims to minimize the total execution time of the DNN inferencing under the given limitations of the edge devices' available memory and energy. The corresponding objective function is formulated as follows:

min ∑_{i=1}^{n} ∑_{j=1}^{n} α_{i,j} × t_{i,j}, subject to the memory and energy constraints above. (4)

Here, α_{i,j} is a coefficient indicating whether sub-model p_i will be assigned to device d_j, and its value is either 0 or 1. When α_{i,j} = 1, sub-model p_i will be assigned to device d_j; otherwise, p_i will be assigned to a device other than d_j. Assuming that the model will be divided into n parts and each part will be uniquely deployed to one specific device, then for any device d_j (j = 1, 2, ..., n), the equation ∑_{i=1}^{n} α_{i,j} = 1 holds; the same is true for any sub-model p_i (i = 1, 2, ..., n): ∑_{j=1}^{n} α_{i,j} = 1. All of these α_{i,j} (∀i, j ∈ {1, ..., n}) form an n-by-n matrix. The goal is to find the specific matrix that yields the shortest execution time under the constraints of memory and energy consumption.
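To make the model concrete, the objective and constraints above can be sketched in Python as follows. This is an illustrative sketch, not code from the paper; all function and variable names are assumptions mirroring the symbols in the formulation.

```python
def total_time(alpha, tc, tt):
    """Total inference time for an assignment matrix alpha, where
    alpha[i][j] = 1 iff sub-model p_i runs on device d_j,
    tc[i][j] is the compute time of p_i on d_j, and
    tt[i] = tr_i + ts_i is the transmission time of p_i."""
    n = len(alpha)
    return sum(alpha[i][j] * (tc[i][j] + tt[i])
               for i in range(n) for j in range(n))

def feasible(alpha, rm, m, ep, c, tc, tt):
    """Check the memory (rm_i <= m_j) and energy (ep_j * t_ij <= c_j)
    constraints for every assigned (sub-model, device) pair."""
    n = len(alpha)
    for i in range(n):
        for j in range(n):
            if alpha[i][j]:
                t_ij = tc[i][j] + tt[i]
                if rm[i] > m[j] or ep[j] * t_ij > c[j]:
                    return False
    return True
```

The optimization problem then amounts to searching over the permutation matrices alpha for the one minimizing total_time while keeping feasible true.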

The Proposed Genetic Algorithm
A genetic algorithm (GA) searches for an optimal solution by simulating the process of natural evolution. When solving complex combinatorial optimization problems, a GA can usually obtain better optimization results faster than some conventional optimization algorithms. This section first analyzes the problems the basic genetic algorithm faces when solving the optimization problem formulated in the previous section. On this basis, it puts forward the ideas for the improvements in this work and then describes the corresponding algorithms in detail.

Problems of Applying Basic GA for DNN Partitioning
The chromosome coding scheme is the fundamental element of GA. In solving the above DNN partitioning problem, assume that each chromosome represents an actual distributed deployment solution. Suppose the required DNN model has l layers, which will be divided into n (n ≤ l) pieces and deployed to n different edge devices. To ensure that each sub-model contains only continuous layers to avoid extra data transmission costs, this study constructs a matrix of n rows and l columns to represent a specific deployment scheme, i.e., a chromosome in the GA. For example, the following matrix denotes the deployment scheme that partitions a DNN model with seven layers into three parts and distributes it to three different devices:

    1 1 0 0 0 0 0
    0 0 1 1 1 0 0
    0 0 0 0 0 1 1
If L_i is the i-th (i = 1, 2, ..., l) layer in a given DNN model, this chromosome represents that L_1 and L_2 will be grouped together and deployed on device d_1; L_3, L_4, and L_5 will be grouped together and deployed on device d_2; and L_6 and L_7 will be grouped together and deployed on device d_3.
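As a brief sketch (the helper name is hypothetical, not from the paper), such a deployment matrix can be built directly from the layer groups:

```python
def deployment_matrix(groups, l):
    """Build the n-by-l 0/1 deployment matrix, where groups[j] lists the
    (1-based) layer indices assigned to device d_{j+1}."""
    mat = [[0] * l for _ in groups]
    for j, layers in enumerate(groups):
        for layer in layers:
            mat[j][layer - 1] = 1
    return mat

# The seven-layer, three-device example from the text:
deployment_matrix([[1, 2], [3, 4, 5], [6, 7]], 7)
```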
The basic process of GA starts with generating an initial population, i.e., a set of chromosomes following the above coding scheme. Then, it runs through a loop of individual evaluation, selection, crossover, and mutation until the given termination condition is satisfied. The crossover operation plays a core role in GA: it acts on a pair of chromosomes and generates new individuals by exchanging parts of the chromosomes of two parent individuals. Figure 2 shows a simple example of computation in a partially mapped crossover operator. In Figure 2, C_1 and C_2 are the two parent individuals, while C_11 and C_21 are two new individuals generated by swapping the subsections of each parent included in the rectangles. It is not difficult to find that the assumption that each sub-model only contains continuous layers is broken during the above crossover operation. For example, layers L_2 and L_5 are grouped together and deployed to device d_2 in the left new individual C_11, while layers L_2, L_5, and L_6 are grouped together and L_1, L_3, and L_4 are grouped together in the right new individual C_21. Such deployments will lead to extra network bandwidth and equipment energy consumption caused by repeated transmission between devices. For example, if the DNN is deployed according to C_11, the output of L_1 will be sent from d_1 to d_2, then the output of L_2 will be sent back from d_2 to d_1, and, in turn, the output of L_4 will be sent from d_1 to d_2 again. As a result, the intermediate results need to be transferred four times among the three devices, twice as many as when deployed according to C_1.
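The extra transfers caused by non-contiguous grouping can be counted directly: each pair of adjacent layers placed on different devices requires one inter-device transmission. The following sketch (illustrative names, not from the paper) reproduces the example above:

```python
def transmission_count(assignment):
    """Number of inter-device transfers for a layer-to-device assignment
    vector: one transfer per adjacent layer pair on different devices."""
    return sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)

c1 = [1, 1, 2, 2, 2, 3, 3]   # contiguous grouping (C_1): 2 transfers
c11 = [1, 2, 1, 1, 2, 3, 3]  # after naive crossover (C_11): 4 transfers
```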

The Proposed Improvement
To ensure that individuals remain reasonable, i.e., only group continuous layers together, after crossover and mutation, this work proposes to distinguish partitioning from deployment by constructing two types of chromosomes: partitioning chromosomes and deployment chromosomes. A partitioning chromosome represents a certain partitioning scheme, and a deployment chromosome represents a specific deployment scheme. Figure 3 shows an example of the relationship between a DNN structure, a partitioning chromosome, and a deployment chromosome. In Figure 3, there is a DNN with seven layers. A partitioning chromosome is represented by a one-dimensional vector whose length is l − 1 (l is the number of layers in a given DNN model), and each gene is a possible cut point. The given partitioning chromosome represents that the DNN is divided into three parts by splitting at the end of L_2 and L_5. According to this partitioning scheme, a group of corresponding deployments can be generated. The meaning of the example deployment chromosome in Figure 3 is the same as introduced in the section above.
Based on the description above, there is a one-to-many relationship between partitioning chromosomes and deployment chromosomes. Specifically, if n devices participate in the collaborative inferencing, n! different deployment chromosomes can be generated from one partitioning chromosome. Conversely, only one partitioning chromosome can be extracted from a given deployment chromosome. The details of the conversion algorithms between partitioning chromosomes and deployment chromosomes are presented in Algorithms 1 and 2 below.
Algorithm 1 begins by initializing two empty sets, DC_lines and DC, to store the possible lines of a deployment chromosome and the generated deployment chromosomes, respectively. Then, all possible lines for a deployment chromosome are constructed through lines 4 to 15. Finally, the loop from lines 16 to 19 composes these possible lines in a random order to construct n specific deployment chromosomes, which make up the deployment chromosome set DC.
Extracting a partitioning chromosome from a given deployment chromosome is more straightforward than generating deployment chromosomes based on a given partitioning chromosome. As shown in Algorithm 2, it only needs to read through the input deployment chromosome and compare whether every two adjacent elements are the same or not (see the for loop in Algorithm 2). If the two adjacent elements are the same, append a 0 to the vector pc. Otherwise, append a 1 to pc (see the if-else statement in Algorithm 2). At last, after checking the last pair of elements, the corresponding partitioning chromosome is achieved.
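The two conversions can be sketched in Python as follows. This is a minimal sketch under the vector encodings described above (a 0/1 cut-point vector of length l − 1 for partitioning, a layer-to-device assignment vector for deployment); it is not the paper's pseudocode, and all names are illustrative.

```python
import itertools
import random

def deployments_from_partition(pc, num):
    """Sketch of Algorithm 1: derive up to `num` deployment chromosomes
    from a partitioning chromosome pc, where pc[k-1] == 1 means the DNN
    is cut after layer k."""
    l = len(pc) + 1
    pieces, start = [], 1
    for k, bit in enumerate(pc, start=1):
        if bit:                              # cut after layer k
            pieces.append(list(range(start, k + 1)))
            start = k + 1
    pieces.append(list(range(start, l + 1)))
    n = len(pieces)
    perms = list(itertools.permutations(range(1, n + 1)))
    random.shuffle(perms)                    # sample from the n! device orders
    return [[dev for piece, dev in zip(pieces, perm) for _ in piece]
            for perm in perms[:num]]

def partition_from_deployment(dc):
    """Sketch of Algorithm 2: emit 1 wherever two adjacent layers run on
    different devices, 0 otherwise."""
    return [0 if a == b else 1 for a, b in zip(dc, dc[1:])]
```

For the seven-layer example with cuts after L_2 and L_5, pc = [0, 1, 0, 0, 1, 0] yields deployment chromosomes such as [1, 1, 2, 2, 2, 3, 3], and partition_from_deployment recovers the original pc from any of them.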

Algorithm 1: Deployment Chromosome Generation Algorithm
Input: a partitioning chromosome pc, the number of deployment chromosomes to be generated n
Output: n deployment chromosomes

On this basis, the basic GA needs to be improved in the following two aspects.

• On the one hand, the initial population generation needs to be modified according to the above chromosome classification. The initialization process is divided into two steps: first, randomly generating a partitioning population; then, deriving the corresponding deployment population based on Algorithm 1.

• On the other hand, after selecting excellent individuals from the deployment population, the corresponding partitioning population should be extracted based on Algorithm 2. Then, crossover and mutation should be performed on these partitioning chromosomes, and corresponding deployment individuals should be generated to produce a new deployment population.
To summarize, Algorithm 3 shows the complete framework of the improved genetic algorithm in this paper. As mentioned above, if there are n candidate devices, each partitioning chromosome can directly derive n! different deployment chromosomes. To control the population size, the algorithm adopts a proportion p_dc (0 < p_dc ≤ 1); it then only needs to generate n! × p_dc deployment chromosomes for each partitioning chromosome. According to the optimization objectives described in Section 3, the fitness function is defined as follows.
fitness(dc) = 1 / (∑_{i=1}^{n} ∑_{j=1}^{n} α_{i,j} × t_{i,j})   if dc satisfies all constraints
fitness(dc) = 10^{-6}                                           if dc does not satisfy all constraints    (6)

In the above fitness function, dc is the α matrix in the formulated problem definition (shown in Equation (4)) used for calculating the fitness of a specific deployment chromosome; a feasible deployment with a shorter total execution time thus receives a higher fitness.
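As a sketch, the fitness evaluation can be written in Python as below. Treating the fitness of a feasible deployment as the reciprocal of its total inference time is an assumption consistent with the minimization objective in Equation (4); the flat-vector encoding and all names are illustrative, not from the paper.

```python
def fitness(assignment, tc, tt, rm, m, ep, c):
    """Fitness of a deployment chromosome given as a flat vector:
    assignment[i] = j means sub-model p_i runs on device d_j (0-based).
    Feasible deployments score the reciprocal of their total inference
    time (assumed form); infeasible ones get a near-zero constant."""
    total = 0.0
    for i, j in enumerate(assignment):
        t_ij = tc[i][j] + tt[i]              # compute + transmission time
        if rm[i] > m[j] or ep[j] * t_ij > c[j]:
            return 1e-6                      # violates memory or energy limit
        total += t_ij
    return 1.0 / total
```

Returning a tiny positive constant rather than zero keeps infeasible individuals selectable with negligible probability, so selection pressure still drives the population toward feasible regions.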

Performance Evaluation
This section evaluates the performance of the proposed DNN partitioning method on four real-world CNNs. It presents experimental results and compares them to other existing methodologies to demonstrate that the proposed algorithm can execute given CNN inference on a group of distributed collaborative edge devices in a shorter time.

Experiment Setting
To provide a comprehensive comparison, four common CNNs designed for running on edge devices are adopted, namely AlexNet [34], ResNet110 [35], MobileNet [36], and SqueezeNet [37]. These CNNs have diverse numbers of layers, memory requirements, and performance. The CNN training and inferencing are based on the Cifar-10 data set [38], which consists of 32 × 32 color images divided into ten classes.
In addition, a simulated distributed system with seven devices with different configurations is set up, as shown in Table 1. To obtain the performance description of the candidate devices (D in Algorithm 3), PALEO [39] is adopted, an analytical performance model that can efficiently and accurately model the expected scalability and performance under a given deployment assumption. Based on PALEO, the memory requirements and execution time of different DNN layers on each given device are evaluated. In PALEO, the execution time of a single DNN layer consists of the time it takes to receive input from the upper layer, the time it takes for the current layer's computation, and the time it takes to write the output to local memory. On this basis, the energy consumption required to perform a given layer on a specific device is calculated by multiplying the device power by the execution time.
The other parameter values in Algorithm 3 are set as follows. The initial size of the partitioning chromosome population is five. To ensure fairness in the comparison with the basic GA, the population size of deployment chromosomes is fixed at 50, the same as the population size set in the basic GA. In both the basic GA and the improved GA, the crossover probability is 0.5, the mutation probability is 0.01, and the maximum iteration number is 200; the algorithm terminates when the optimal fitness value remains unchanged for 50 consecutive generations. In the basic GA, the chromosomes are generated according to the structure shown in Section 4.1.
The experiments are executed on a laptop with an AMD Ryzen7 5700U CPU and 16 GB memory in a PyCharm environment. The following results are collected by running the same algorithm ten times as a group.

Comparison of Inference Performance
To achieve the optimal distributed deployment scheme, the partitioning optimization and deployment optimization are considered either separately or simultaneously. The following experiments first compare the inferencing delay under a given partition scheme and then compare both the inferencing delay and the average device energy cost when considering partitioning and deployment simultaneously.

Comparison in Considering Partitioning Optimization and Deployment Optimization Separately
To obtain an optimal partition scheme, the experiment adopts the DNN partitioning algorithms proposed in [13,40] to optimally divide a given DNN into two parts and calculates the optimal deployment by exhaustive searching. The average value, maximum value, minimum value, mode, and standard deviation are obtained by running each algorithm ten times. In Table 2, method-1 and method-2 denote partitioning the DNN based on [13] and [40], respectively. As shown in Table 2, the average inferencing time achieved by distributing the inference according to the proposed algorithm is superior to that of the other methods. In addition, the proposed algorithm results in smaller standard deviation values when partitioning most of these CNNs, which means the proposed algorithm is relatively more stable.

Comparison in Considering Partitioning Optimization and Deployment Optimization Simultaneously
As a genetic algorithm is an approximate algorithm, it cannot ensure obtaining the absolute optimal solution. This experiment adopts an exhaustive method to obtain the optimal inferencing delay and energy cost in any given setting as the baseline, and then compares the corresponding results by running basic GA and the improved GA in considering partitioning optimization and deployment optimization simultaneously. The following experimental results are all average values from running each algorithm ten times in every test case.
Firstly, Figure 4 shows the comparison of the inferencing delay in each test case. In this figure, each sub-picture refers to a different CNN, where the horizontal axis represents the number of partitions and the vertical axis represents the average inferencing time in milliseconds. It demonstrates that the inferencing delay resulting from the proposed GA is closer to the optimal value than that of the basic GA. In addition, the trends of the proposed GA and the optimal value are also more similar. A similar conclusion can be drawn from Figure 5, which compares the average device energy costs in each test case. As a result, the proposed algorithm can produce better deployments under different scenarios compared to the basic genetic algorithm. The performance of some solutions is even close to that of the optimal deployment generated by the exhaustive method.

Comparison of Algorithm Efficiency
From the perspective of the actual running process of an intelligent application system, as mentioned in Section 3, the usability of each edge device may keep changing. It is therefore necessary to dynamically divide and deploy a DNN according to the latest status of the edge devices when a request arrives. In this scenario, the algorithm's execution time adds to the actual system response time. Therefore, this section compares the running times of the exhaustive method, the basic genetic algorithm, and the proposed method in different deployment scenarios. Table 3 shows the detailed comparison results. The table shows that the running times of all three algorithms increase significantly with the growing number of devices or DNN layers. However, the improved GA needs the least time to obtain a better solution. For example, in partitioning AlexNet, the running time of the improved GA is approximately 1.84×, 1.26×, and 166.74× shorter than that of the exhaustive method in the respective scenarios. In partitioning ResNet110 into three parts, the improved GA reduces the running time by a factor of 712.13 compared to the exhaustive method and by a factor of nearly 3 compared to the basic GA. It can be seen that as the problem size grows, the proposed GA has better execution efficiency.

Conclusions
This paper establishes a dynamic DNN partitioning and deployment system model to represent the actual application requirements of distributed DNN inferencing in an edge environment. On this basis, the problem of optimal deployment-oriented DNN partitioning is modeled as a constrained optimization problem. Considering that the crossover and mutation operators in a basic genetic algorithm may produce many infeasible solutions, this work distinguishes two types of chromosomes, i.e., partitioning chromosomes and deployment chromosomes. It performs the crossover and mutation operations on partitioning chromosomes to ensure that reasonable deployment chromosomes are generated, produces new deployment chromosomes based on the updated partitioning population, and selects excellent deployment individuals for the next iteration. The experimental results show that the proposed algorithm not only results in a shorter inferencing time and a lower average device energy cost, but also needs less time to achieve an optimal deployment.
To further improve this work, a potential future research direction is to reduce the computational workload on the CPU by constructing proper mathematical models. In addition, 3D image-related applications will be considered in the future.