1. Introduction
Internet of things (IoT), edge computing (EC), and artificial intelligence (AI) are three technological pillars of the current industrial revolution [1,2]. One hot topic in applying these techniques is edge intelligence [3,4,5], which involves running deep learning algorithms at the edge of the network instead of entirely offloading inferencing tasks to the cloud center. As edge intelligence can alleviate problems such as large bandwidth occupancy, high transmission delay, slow response speed, poor network reliability, and leakage of personal or sensitive information, it is being intensively researched and widely used today. For example, it is common to deploy a trained convolutional neural network (CNN) to the edge for real-time video analysis in applications including autonomous driving [6], intelligent monitoring [7], industrial IoT [8], and smart cities [9]. However, as CNN inference is usually computationally intensive and some CNNs are huge, it is often infeasible to deploy a complete CNN model or perform inference on a single edge device.
On the one hand, the above problem can be solved by pruning, quantization, or knowledge distillation [10,11,12] to obtain a smaller DNN model before deployment and inferencing. However, these techniques may sacrifice some model inference accuracy. On the other hand, a powerful cloud server is often adopted to perform part of the inferencing task [13,14,15]. In such a cloud-assisted approach, a DNN model is typically divided into two parts: one remains locally and the other runs at the remote cloud server. As with offloading a complete DNN model to the cloud server, cloud-assisted approaches also face the problem of private data leakage, which is inherent in cloud computing. Additionally, balancing the calculation accuracy, end-to-end delay, and resource occupancy is also challenging.
To fully unleash the potential of edge devices, it is popular to cut a neural network into multiple pieces and distribute them among available edge devices to perform the inference cooperatively [16,17]. This approach can overcome the problems above by keeping the inferencing process inside the edge network. Nevertheless, partitioning and distributing a neural network to achieve optimal performance is more challenging, as it is an NP-hard problem. Although some strategies have been developed in an attempt to split a DNN into several parts effectively [18,19,20], most of them pay more attention to the methodology of reorganizing the network structure than to optimizing the process of obtaining an optimal solution from the perspective of the actually running system. Hence, the problem of partitioning a DNN model to achieve an optimal deployment has not been adequately addressed.
This paper proposes a novel layer-based partitioning approach to obtain an optimal DNN deployment solution. To ensure the applicability of the resulting deployment scheme, the partitioning problem is defined as a constrained optimization problem and an improved genetic algorithm (GA) is proposed that guarantees the generation of feasible candidate solutions after each crossover and mutation operation. The proposed GA achieves a running time one to three times shorter than that of the basic GA while obtaining a better deployment. The main contributions of this paper are as follows:
Firstly, the DNN model partitioning problem is modeled as a constrained optimization problem and the corresponding formal problem definition is introduced.
Secondly, the paper puts forward a novel genetic algorithm that shortens the solving time by ensuring the validity of chromosomes after the crossover and mutation operations.
Finally, experiments are performed on several existing DNN models, including AlexNet, ResNet110, MobileNet, and SqueezeNet, to present a comprehensive evaluation.
The remainder of this paper is organized as follows:
Section 2 gives an overview of the related work.
Section 3 presents the problem definition of the DNN partition problem.
Section 4 introduces the details of the proposed algorithm.
Section 5 provides the experimental results, and
Section 6 concludes the paper.
2. Literature Review
As most modern DNNs are constructed from layers, such as convolutional layers, fully connected layers, and pooling layers, layer-based partitioning is the most intuitive DNN partitioning strategy. For example, Ref. [14] proposed to partition a CNN model at the end of the convolutional layers, allocating the convolutional layers at the edge and the remaining fully connected layers at the host. Unlike this fixed partitioning strategy, recent methodologies have focused on adapting their results to the actual inferencing environment. Generally, depending on the construction of the target deployment environment, existing methods fall into the following two categories.
According to the basic idea of the cloud-assisted approaches, some studies try to divide a given DNN model into two parts and push the latter part to the cloud server. For example, Ref. [13] designed a lightweight scheduler named Neurosurgeon to automatically partition DNN computation between mobile devices and data centers based on neural network layers. Refs. [21,22,23] adopted the same strategy while performing some further processing. In [21], the authors integrated DNN right-sizing to accelerate the inference by exiting early at an intermediate layer. In contrast, Ref. [22] first added early exiting points to the original network and then partitioned the reformed network into two parts. To determine the optimal single cut point, [13,21,22] all applied exhaustive search, while [23] solved the problem with mixed-integer linear programming.
To make full use of the available resources in the edge environment, more DNN partitioning strategies have been emerging that divide a DNN model into more than two pieces to distribute the inference task among several edge devices. Generally, based on the object to be partitioned, there are four main kinds of strategies, i.e., partitioning the inputs [24,25], the weights [26], or the layers [18,19], as well as hybrid strategies [17,20,27,28,29,30]. Partitioning the inputs or weights addresses the large storage requirements of big inputs or weight tensors, while partitioning the DNN layers addresses the depth problem of DNN inferencing; the hybrid strategies aim to solve both problems. For example, Ref. [27] employed input partitioning after layer-based partitioning to obtain a group of inferencing tasks small enough to be executed. The authors of [20] proposed fused tile partitioning (FTP) to fuse layers and partition them vertically in a grid fashion. The authors of [29] modeled a neural network as a data-flow graph in which vertices are input data, operations, or output data and edges are data transfers between vertices; the problem was then transformed into a graph partitioning problem.
Nearly all of the above works take inference delay or energy consumption as the optimization objective. Recently, more studies have begun to focus on the joint optimization of DNN partitioning and resource allocation [31,32,33]. However, achieving an optimal distributed DNN deployment remains an open and critical challenge. Unlike existing approaches, this work models the DNN partitioning problem as a constrained optimization problem, aiming to achieve the optimal inference performance with the available resources in the edge environment. Moreover, it proposes a novel genetic algorithm to optimize the solving process of the formulated optimization problem.
3. System Model and Problem Formulation
This section provides an overview of the motivation and fundamental process of the proposed DNN partitioning approach and presents a formal problem description. Suppose that N edge devices and an edge server form an edge network. The edge server acts as a master to receive user requests, partition DNN models, and assign the DNN inferencing tasks to the edge devices. Take video-based fall detection in health monitoring as an example. A video-based fall detection application takes a video stream as input and recognizes whether a person has fallen based on a given neural network. Due to latency and privacy requirements, such applications are best deployed in an edge environment. To avoid the edge server becoming the inference bottleneck of all the edge intelligent applications, it is better to distribute the background inferencing tasks to other edge devices.
As illustrated in Figure 1, after a user deploys a fall detection application in the edge environment through a user interface, the edge server extracts the neural network inferencing task and the corresponding neural network. It partitions and dispatches the neural network according to the current status of each edge device, such as the smart camera, the smart speaker, and the sweeping robot in Figure 1. In this example, the neural network is divided into three parts, denoted $m_1$, $m_2$, and $m_3$, and deployed to the smart camera, the smart speaker, and the sweeping robot, respectively. All these selected devices cooperate to complete the distributed inferencing process without the edge server. Specifically, the smart camera runs partition $m_1$ and sends its output to the smart speaker as the input of partition $m_2$. The sweeping robot, in turn, performs partition $m_3$ after receiving the smart speaker's output, and then outputs the recognized result.
As a group of edge devices will cooperate in executing a single DNN, an edge device must receive an input from a preceding device, perform the inferencing task, and deliver the output to the next device. Suppose the DNN model is divided into $n$ pieces and deployed to $n$ different edge devices. Using $m_i$ to represent a sub-model of a given DNN and $d_j$ to represent a selected device, if a sub-model $m_i$ is deployed to a device $d_j$, the corresponding execution time $T_{i,j}$ is defined as

$$T_{i,j} = t_{i,j}^{comp} + t_{i,j}^{recv} + t_{i,j}^{send}, \tag{1}$$

where $t_{i,j}^{comp}$ is the time of executing sub-model $m_i$ on device $d_j$, and $t_{i,j}^{recv}$ and $t_{i,j}^{send}$ are the times for receiving the input of $m_i$ and sending the output of $m_i$, respectively. If $t_{i,j}^{trans}$ is used to represent the total transmission time, then $t_{i,j}^{trans} = t_{i,j}^{recv} + t_{i,j}^{send}$ and $T_{i,j} = t_{i,j}^{comp} + t_{i,j}^{trans}$.
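The timing model above can be sketched in a few lines; the function and variable names here are illustrative assumptions, not the paper's implementation.

```python
def execution_time(t_comp, t_recv, t_send):
    """Total time for one sub-model on one device: computation time
    plus the time to receive its input and send its output."""
    t_trans = t_recv + t_send  # total transmission time, as in the text
    return t_comp + t_trans

# Example: 12 ms of computation, 3 ms receive, 2 ms send.
print(execution_time(12.0, 3.0, 2.0))  # 17.0
```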
In addition, because not all sub-models can run directly on any edge device, it is also necessary to consider whether an edge device can complete a specific inferencing task according to its current state; for example, whether its available memory and its remaining battery capacity are sufficient. Suppose $M_j$ is the size of the available memory on device $d_j$ and $mem_i$ is the memory required for running sub-model $m_i$. If $m_i$ can be executed on device $d_j$, the following inequality must hold:

$$mem_i \le M_j. \tag{2}$$

Similarly, if $P_j$ is the average running power of device $d_j$ and $E_j$ is the remaining battery capacity of device $d_j$, then if $m_i$ can be executed on device $d_j$, the following inequality must also hold:

$$P_j \cdot T_{i,j} \le E_j. \tag{3}$$
Above all, the DNN partitioning problem is formulated as a constrained optimization problem that minimizes the total execution time of the DNN inferencing under the given limitations of the edge devices' available memory and energy. The corresponding objective function is formulated as follows:

$$\min \sum_{i=1}^{n}\sum_{j=1}^{n} x_{i,j}\,T_{i,j} \quad \text{s.t.} \quad \text{(2), (3)}, \quad \sum_{j=1}^{n} x_{i,j} = 1, \quad \sum_{i=1}^{n} x_{i,j} = 1, \quad x_{i,j} \in \{0,1\}. \tag{4}$$

Here, $x_{i,j}$ is a coefficient indicating whether sub-model $m_i$ will be assigned to device $d_j$, and its value is either 0 or 1. When $x_{i,j} = 1$, sub-model $m_i$ will be assigned to device $d_j$; otherwise, $m_i$ will be assigned to a device other than $d_j$. Assuming that the model is divided into $n$ parts and each part is uniquely deployed to one specific device, then for any device $d_j$, the equation $\sum_{i=1}^{n} x_{i,j} = 1$ holds; the same is true for any sub-model $m_i$, i.e., $\sum_{j=1}^{n} x_{i,j} = 1$. All of these coefficients form an $n$-by-$n$ matrix. The goal is to find a specific matrix with the shortest execution time under the constraints of memory and energy consumption.
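The objective and its constraints can be evaluated with a short sketch; the helper names and the list-of-lists matrix representation are assumptions for illustration.

```python
def total_time(X, T):
    """Objective value: sum of x_ij * T_ij over all sub-models i and devices j."""
    n = len(X)
    return sum(X[i][j] * T[i][j] for i in range(n) for j in range(n))

def feasible(X, mem_req, mem_avail, power, battery, T):
    """Check the memory and energy constraints for every assignment x_ij = 1,
    plus the one-to-one mapping between sub-models and devices."""
    n = len(X)
    for i in range(n):
        for j in range(n):
            if X[i][j] == 1:
                if mem_req[i] > mem_avail[j]:          # constraint (2)
                    return False
                if power[j] * T[i][j] > battery[j]:    # constraint (3)
                    return False
    ok_rows = all(sum(X[i]) == 1 for i in range(n))
    ok_cols = all(sum(X[i][j] for i in range(n)) == 1 for j in range(n))
    return ok_rows and ok_cols

# A 3-sub-model, 3-device assignment (a permutation matrix).
X = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
T = [[5, 2, 9], [3, 7, 4], [6, 1, 8]]
print(total_time(X, T))  # 13
```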
4. The Proposed Genetic Algorithm
Genetic algorithms (GAs) are a method of searching for an optimal solution by simulating the natural evolution process. When solving complex combinatorial optimization problems, a GA can usually obtain better optimization results faster than some conventional optimization algorithms. This section first analyzes the problems faced by the basic genetic algorithm when solving the optimization problem formulated in the previous section. On this basis, it puts forward the ideas for the improvements in this work and then describes the corresponding algorithms in detail.
4.1. Problems of Applying Basic GA for DNN Partitioning
The chromosome coding scheme is the fundamental element of a GA. In solving the above DNN partitioning problem, assume that each chromosome represents an actual distributed deployment solution. Suppose the required DNN model has $l$ layers, which will be divided into $n$ pieces and deployed to $n$ different edge devices. To ensure that each sub-model contains only continuous layers, avoiding extra data transmission costs, this study constructs a matrix of $n$ rows and $l$ columns to represent a specific deployment scheme, i.e., a chromosome in the GA. For example, the following matrix denotes the deployment scheme that partitions a DNN model with seven layers into three parts and distributes it to three different devices:

$$\begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}$$
If $l_i$ is the $i$th layer in a given DNN model, this chromosome represents that $l_1$ and $l_2$ will be grouped together and deployed on device $d_1$, $l_3$, $l_4$, and $l_5$ will be grouped together and deployed on device $d_2$, and $l_6$ and $l_7$ will be grouped together and deployed on device $d_3$.
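Decoding such a matrix and checking the continuous-layer property can be sketched as follows; the function names and the list-of-lists encoding are illustrative assumptions.

```python
def decode(chromosome):
    """Map each device (row) to the list of layer indices (columns) it runs."""
    return {dev: [layer for layer, bit in enumerate(row, start=1) if bit]
            for dev, row in enumerate(chromosome, start=1)}

def contiguous(chromosome):
    """A valid individual groups only consecutive layers on each device."""
    return all(layers == list(range(layers[0], layers[0] + len(layers)))
               for layers in decode(chromosome).values() if layers)

M = [[1, 1, 0, 0, 0, 0, 0],
     [0, 0, 1, 1, 1, 0, 0],
     [0, 0, 0, 0, 0, 1, 1]]
print(decode(M))      # {1: [1, 2], 2: [3, 4, 5], 3: [6, 7]}
print(contiguous(M))  # True
```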
The basic process of a GA starts with generating an initial population, i.e., a set of chromosomes following the above coding scheme. It then runs through a loop of individual evaluation, selection, crossover, and mutation until the given termination condition is satisfied. The crossover operation plays a core role in a GA: it acts on a pair of chromosomes and generates new individuals by exchanging parts of the chromosomes of two parent individuals.
Figure 2 shows a simple example of a partially mapped crossover operation. In Figure 2, the two parent individuals generate two new individuals by swapping the subsections included in the rectangles. It is not difficult to find that the assumption that each sub-model contains only continuous layers is broken during the above crossover operation: in both new individuals, non-adjacent layers end up grouped together on the same device. Such deployments lead to extra network bandwidth and equipment energy consumption caused by repeated transmissions between devices, because an intermediate result must be sent away from a device and later sent back whenever the execution order leaves that device and returns to it. In the example of Figure 2, the intermediate results of a new individual need to be transferred four times among the three devices, twice as many as with the parent deployments.
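The transfer-doubling effect can be demonstrated by counting device switches along the layer order; the per-layer device-vector representation below is an illustrative assumption.

```python
def transfer_count(assignment):
    """Number of inter-device transfers when layers run in order.
    `assignment[k]` is the device that runs layer k+1."""
    return sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)

# Contiguous grouping: layers 1-2 on d1, 3-5 on d2, 6-7 on d3.
print(transfer_count([1, 1, 2, 2, 2, 3, 3]))  # 2
# After an invalid crossover, layers of two devices interleave,
# doubling the number of transfers.
print(transfer_count([1, 1, 2, 3, 2, 3, 3]))  # 4
```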
4.2. The Proposed Improvement
To ensure valid individuals that only group continuous layers together after crossover and mutation, this work proposes to distinguish partitioning from deployment by constructing two layers of chromosomes, i.e., partitioning chromosomes and deployment chromosomes. A partitioning chromosome represents a certain partitioning scheme and a deployment chromosome represents a specific deployment scheme.
Figure 3 shows an example of the relationship between a DNN structure, a partitioning chromosome, and a deployment chromosome.
In Figure 3, there is a DNN with seven layers. A partitioning chromosome is represented by a one-dimensional vector whose length is $l-1$ ($l$ is the number of layers in a given DNN model), and each gene marks a possible cut point. The given partitioning chromosome represents that the DNN is divided into three parts by splitting at the ends of $l_2$ and $l_5$. According to this partitioning scheme, a group of corresponding deployments can be generated. The meaning of the example deployment chromosome in Figure 3 is the same as introduced in the above section.
Based on the description above, there is a one-to-many relationship between a partitioning chromosome and its deployment chromosomes. In particular, if there are $n$ devices to participate in the collaborative inferencing, there would be $n!$ different deployment chromosomes generated from one partitioning chromosome that splits the model into $n$ pieces. Conversely, only one partitioning chromosome can be abstracted from a given deployment chromosome. The details of the conversion algorithms between partitioning chromosomes and deployment chromosomes are presented in Algorithms 1 and 2 below.
Algorithm 1 begins with initializing two empty sets to store the possible rows of a deployment chromosome and the generated deployment chromosomes, respectively. Then, all possible rows for a deployment chromosome are constructed through lines 4 to 15. Finally, the loop from lines 16 to 19 composes these possible rows in a random order to construct n specific deployment chromosomes, which make up the resulting deployment chromosome set.
Extracting a partitioning chromosome from a given deployment chromosome is more straightforward than generating deployment chromosomes from a given partitioning chromosome. As shown in Algorithm 2, it only needs to read through the input deployment chromosome and compare whether every two adjacent elements are the same or not (see the for loop in Algorithm 2). If the two adjacent elements are the same, a 0 is appended to the resulting vector; otherwise, a 1 is appended (see the if-else statement in Algorithm 2). After checking the last pair of elements, the corresponding partitioning chromosome is obtained.
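Both conversions can be sketched compactly; the names are illustrative, and a deployment chromosome is flattened here to a per-layer device vector rather than the paper's matrix form.

```python
from itertools import permutations

def to_partitioning(deployment):
    """Algorithm 2 sketch: emit 1 where two adjacent layers sit on
    different devices (a cut point), otherwise 0."""
    return [0 if a == b else 1 for a, b in zip(deployment, deployment[1:])]

def to_deployments(partitioning, devices):
    """Algorithm 1 sketch: enumerate the deployments consistent with the
    cut points by assigning each contiguous group to a distinct device."""
    pieces = partitioning.count(1) + 1
    sizes, run = [], 1
    for cut in partitioning:
        if cut:
            sizes.append(run)
            run = 1
        else:
            run += 1
    sizes.append(run)
    for perm in permutations(devices, pieces):
        deployment = []
        for dev, size in zip(perm, sizes):
            deployment += [dev] * size
        yield deployment

print(to_partitioning([1, 1, 2, 2, 2, 3, 3]))  # [0, 1, 0, 0, 1, 0]
```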
Algorithm 1: Deployment Chromosome Generation Algorithm
Algorithm 2: Partitioning Chromosome Extraction Algorithm
On this basis, the basic GA needs to be improved in the following two aspects.
On the one hand, the initial population generation needs to be modified according to the above chromosome classification. The initialization process is divided into two steps: first, a partitioning population is randomly generated; then, the corresponding deployment population is derived based on Algorithm 1.
On the other hand, after selecting excellent individuals from the deployment population, the corresponding partitioning population should be extracted based on Algorithm 2. Crossover and mutation are then performed on these partitioning chromosomes, and corresponding deployment individuals are derived to produce a new deployment population.
To summarize, Algorithm 3 shows the complete framework of the improved genetic algorithm in this paper. As mentioned above, if there are $n$ candidate devices, each partitioning chromosome can directly derive up to $n!$ different deployment chromosomes. To control the population size, the algorithm adopts a proportion $p$ ($0 < p \le 1$); then, it only needs to generate $p \cdot n!$ deployment chromosomes for each partitioning chromosome.
Algorithm 3: The Framework of the Proposed Genetic Algorithm
Algorithm 3 first predicts and stores the execution time of each DNN layer according to Equation (1) for calculating individual fitness (lines 2 to 4). Lines 5 and 6 initialize a deployment population. The while statement from line 9 to line 19 is the main loop of the algorithm. First, line 10 updates the current number of iterations and line 11 selects outstanding individuals from the current deployment population. Then, lines 12 and 13 extract the corresponding partitioning chromosomes from the selected individuals and perform crossover and mutation to generate a new partitioning population. Line 14 constructs a new deployment population from the selected individuals and the new partitioning chromosomes. The stop condition checks whether the individual with the maximum fitness has remained unchanged for a specified number of consecutive iterations; if so, the loop is exited. Otherwise, the search continues until the maximum number of iterations MAXGEN is reached. Finally, the algorithm returns the deployment chromosome corresponding to the current maximum fitness as the final optimal deployment scheme.
According to the optimization objectives described in Section 3, the fitness function is defined as follows:

$$fitness(X) = \frac{1}{\sum_{i=1}^{n}\sum_{j=1}^{n} x_{i,j}\,T_{i,j}}$$

In the above fitness function, $X$ is the $n \times n$ assignment matrix in the formulated problem definition (shown in Equation (4)) used for calculating the fitness of a specific deployment chromosome: the shorter the total execution time of a deployment, the larger its fitness.
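Assuming the fitness is the reciprocal of the total execution time from Equation (4), so that maximizing fitness minimizes time, a minimal sketch:

```python
def fitness(X, T, eps=1e-9):
    """Reciprocal-time fitness: a deployment with a shorter total
    execution time receives a larger fitness value. `eps` guards
    against division by zero for degenerate inputs."""
    n = len(X)
    total = sum(X[i][j] * T[i][j] for i in range(n) for j in range(n))
    return 1.0 / (total + eps)

# Two sub-models on two devices: total time 2 + 3 = 5, fitness 0.2.
X = [[1, 0], [0, 1]]
T = [[2, 5], [5, 3]]
print(fitness(X, T))
```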