1. Introduction
In the next few years, a burst in the number of InternetofThings (IoT) devices is expected [
1,
2,
3]. IoT devices present many sensors and can generate a large amount of data per second, which will prevent these data from being sent to the cloud for processing due to the high and variable latency and limited bandwidth of current networks [
1,
4]. Thus, an approach to process the large amount of data generated by the IoT and to efficiently use the IoT limited resources is fog computing, which allows the applications or part of them to be executed closer to the devices or on the devices themselves [
5].
To achieve the billions of devices estimated for the IoT system, many of them will have to be constrained, for instance, in size and cost. A constrained device presents limited hardware in comparison to the current devices connected to the Internet. Recently, a classification of constrained devices has been proposed, showing the increasing importance of them in the IoT [
6]. These devices are constrained due to their embedded nature and/or size, cost, weight, power, and energy. Considering that these constraints impact on the amount of memory, computational power, communication performance, and battery life, these resources must be properly employed to satisfy applications requirements. The proposed classification not only differentiates more powerful IoT devices such as smartphones and singleboard computers such as Raspberry Pi from constrained devices but also delimits the IoT scope, which does not include servers either desktop or notebook computers.
To obtain valuable information from the vast amount of data generated by the IoT, deep learning can be used since it can extract automatic features from the data and strongly benefits from large amounts of data [
7]. Nevertheless, deep learning techniques often present a high computational cost, which brings more challenges in using resourcelimited devices even if we only consider executing the inference phase of these methods. These constrained devices may impact an application that has as requirements realtime responses or a high inference rate, for instance.
The size and computational requirements of current Deep Neural Networks (DNNs) may not fit constrained IoT devices. Two approaches are commonly adopted to enable the execution of DNNs on this type of device. The first approach prunes the neural network model so that it requires fewer resources. The second approach partitions the neural network and executes in a distributed way on multiple devices. In some works that employ the first approach, pruning a neural network results in accuracy loss [
8,
9,
10]. On the other hand, several works can apply the first approach to reduce DNN requirements and enable its execution on limited devices without any accuracy loss [
11,
12,
13]. However, it is important to notice that, even after pruning a DNN, its size and computational requirements may still prevent the DNN from being executed on a single constrained device. Therefore, our focus is on the second approach. In this scenario, the challenge of how to distribute the neural network aiming to satisfy one or more requirement arises.
Some Machine Learning (ML) and IoT frameworks that offer the infrastructure to distribute the neural network execution to multiple devices already exist such as TensorFlow, Distributed Artificial Neural Networks for the Internet of Things (DIANNE), and DeepX [
14,
15,
16]. However, they require the user to manually partition the neural network and they limit the partitioning into a perlayer approach. The perlayer partitioning may prevent neural networks from being executed on devices with more severe constraint conditions, for instance, some devices from the STM32 32bit microcontroller family [
17]. This may happen because there may be a single DNN layer whose memory requirements do not fit the available memory of these constrained devices. On the other hand, other generalpurpose, automatic partitioning tools such as SCOTCH [
18] and METIS [
19] do not take into account the characteristics of neural networks and constrained devices. For this reason, they provide a suboptimal result or, in some cases, they are not able to provide any valid partitioning.
Recently, we proposed KernighanandLinbased Partitioning [
20], an algorithm to automatically partition neural networks into constrained IoT devices, which aimed to reduce the number of communications among partitions. Communication reduction is important so that the network is not overloaded, a situation that can be aggravated in a wireless connection shared with several devices. Even though reducing communication may help any system, in several contexts, one of the main objectives is to optimize (increase) the inference rate, especially on applications that need to process a data stream [
5,
21,
22,
23].
In this work, we extend this preliminary work and propose Deep Neural Networks Partitioning for Constrained IoT Devices (DN${}^{2}$PCIoT), an algorithm to automatically partition DNNs into constrained IoT devices, including inference rate maximization or communication reduction as objective functions. Additionally, for both objective functions, this new algorithm accounts more precisely for the amount of memory required by the shared parameters and biases of Convolutional Neural Networks (CNNs) in each partition. This feature allows our algorithm to provide valid partitionings even when more constrained setups are employed in the applications.
We are concerned with scenarios in which data are produced within constrained devices and only constrained devices such as the ones containing microphones and cameras are available to process these data. Although constrained devices equipped with cameras might not be constrained in some of their resources, we have to consider that only part of these resources is available for extra processing. After all, the devices have to execute their primary task in the first place.
Several IoT resources can be considered when designing an IoT solution to improve quality of service. The main IoT issues include the challenges in the network infrastructure and the large amount of data generated by the IoT devices, but other requirements such as security, dependability, and energy consumption are equally important [
24]. Additionally, minimizing communication is important to reduce interference in the wireless medium and to reduce the power consumed by radio operations [
25]. These issues and requirements usually demand a tradeoff among the amount of memory, computational power, communication performance, and battery life of the IoT devices. For instance, by raising the levels of security and dependability, offloading processing to the cloud, and/or processing data on the IoT devices, energy consumption is raised as well, impacting on the device battery life. In this work, we are concerned with the requirement that many DNN applications presents: the DNN inference rate maximization. Our objective is to treat the large amount of data generated by the IoT devices by executing DNNs on the devices themselves. We also address some of the challenges in the network infrastructure by reducing communication between IoT devices.
We use the inference rate maximization objective function to partition the LeNet CNN model using several approaches such as perlayer partitionings provided by popular ML frameworks, partitionings provided by METIS, and by our algorithm DN${}^{2}$PCIoT. We show that DN${}^{2}$PCIoT starting from random partitionings or DN${}^{2}$PCIoT starting from partitionings generated by the other approaches results in partitionings that achieve up to 38% more inferences per second than METIS. Additionally, we also show that DN${}^{2}$PCIoT can produce valid partitionings even when the other approaches cannot. The main contributions of this article are summarized as follows:
the DN${}^{2}$PCIoT algorithm that optimizes partitionings aiming for inference rate maximization or communication reduction while properly accounting for the memory required by the CNNs’ shared parameters and biases;
a case study whose results show that the DN${}^{2}$PCIoT algorithm is capable of producing partitionings that achieve higher inference rates and that it is also capable of producing valid partitionings for very constrained IoT setups;
a case study of popular ML tools such as TensorFlow, DIANNE, and DeepX, which may not be able to execute DNN models on very constrained devices due to their perlayer partitioning approach;
a study of the METIS tool, which indicates that it is not an appropriate tool to partition DNNs for constrained IoT setups because it may not provide valid partitionings under these conditions;
an analysis of the DNN model granularity results to show that our DNN with more grouping minimally affects the partitioning result;
an analysis of how profitable it is to distribute the inference rate execution among multiple constrained devices; and
a greedy algorithm to reduce the number of communications based on the available amount of memory of the devices.
This paper is organized as follows.
Section 2 provides the background in CNNs and neural networks represented as a dataflow graph; it also presents the related work in ML and IoT tools and in generalpurpose, automatic partitioning algorithms.
Section 3 presents the DN
${}^{2}$PCIoT algorithm.
Section 4 explains how LeNet was modeled, the adopted approaches, and the experiment setups.
Section 5 presents and discusses the results. Finally,
Section 6 provides the conclusions.
3. Proposed Deep Neural Networks Partitioning for Constrained IoT Devices (DN${}^{2}$PCIoT)
The DN${}^{2}$PCIoT algorithm is inspired by the Kernighan and Lin’s approach, which attempts to find a better solution than its initial partitioning by swapping vertices from different partitions. The Kernighan and Lin’s algorithm avoids some local minimum solutions by allowing swaps that produce a partitioning that is worse than the previous one. This situation can happen if such a swap is the best operation at some point in the algorithm.
DN${}^{2}$PCIoT accepts a dataflow graph as the input for the neural network, in which the vertices represent the neural network neurons (input data, operations, or output data), and the edges represent data transfers between the vertices. This same approach is used in SCOTCH and METIS. DN${}^{2}$PCIoT also receives a target graph, which contains information about the devices (the number of them in the system, computational power, communication performance, and system topology) in a way similar to SCOTCH.
To work with more than two partitions, the original Kernighan and Lin’s heuristic repeatedly applies its twopartition algorithm to pairs of subsets of the partitions. This approach may fall into local minima and we avoid some of these local minima by allowing the algorithm to work with multiple partitions by considering swaps between any partitions during the whole algorithm.
The swap operation in the original Kernighan and Lin’s algorithm also led to other local minima since it was limited to produce partitions with the same number of vertices of the initial partitioning. To solve this limitation, we introduced a “move” operation, in which the algorithm considers moving a single vertex from one partition to another, without requiring another vertex from the destination partition to be moved back to the source partition of the first vertex.
In the case of the communication reduction objective (or cost) function, this move operation allows all vertices to be moved to a single partition and, thus, the communication would be zero, which is the best result for this objective function. However, the dataflow graph containing the neural network model may not fit a single memoryconstrained IoT device due to memory limitations. Hence, we added memory requirements for each vertex in the dataflow graph and modified the graph header to contain information about the shared parameters and biases for CNNs. Furthermore, we designed the DN${}^{2}$PCIoT algorithm to consider the amount of memory of the devices as a restriction for the algorithm, i.e., the operations cannot be performed if there is not sufficient memory in the partitions. This feature allowed the initial partitioning and any partitioning in the middle and at the end of the algorithm to be unbalanced. At this point, unlike SCOTCH and METIS, DN${}^{2}$PCIoT could always produce valid partitionings.
The DN${}^{2}$PCIoT algorithm also includes a feature to factor redundant edges out of the cost computation. Redundant edges represent the transfer of the same data between partitions, which happens when there are multiple edges from one vertex to vertices that are assigned to the same partition. Neither SCOTCH nor METIS considers redundant edges, i.e., they show a number of communications that are much larger than the real value that must be indeed transferred.
Another feature that is not present in SCOTCH or METIS is the account for shared parameters and biases in the memory computation. The shared parameters are an important feature in CNNs because they can greatly reduce the amount of memory required to store the neural network. Besides that, they also help in the training phase and in avoiding overfitting as there are fewer parameters to train. A conservative solution that could be applied to SCOTCH or METIS would be to copy each set of shared parameters and biases for every vertex that needs them, however, the resultant graph would require much more memory than it is really necessary. DN${}^{2}$PCIoT accounts for shared parameters and biases only when they are necessary, i.e., there is one corresponding set of shared parameters and biases per partition only if there is at least one neuron that needs it in the partition. This feature allows DN${}^{2}$PCIoT to produce valid partitionings that require a realistic amount of memory and to produce valid partitionings even for very constrained devices, unlike METIS.
Finally, we designed DN
${}^{2}$PCIoT to produce partitionings that maximize the neural network inference rate or reduce the amount of transferred data per inference. Other objective functions can be easily employed in DN
${}^{2}$PCIoT due to its design. Different from METIS, which reduces the number of communications while attempting to balance the computational load and memory requirements in the hope of achieving good computational performance, DN
${}^{2}$PCIoT directly optimizes the partitioning for inference rate maximization, using the equations explained in
Section 2.2 and
Section 2.3. In the inference rate maximization, the device or connection between devices that most limit the result is the maximum inference rate that some partitioning can provide.
The pseudocode of DN
${}^{2}$PCIoT is listed in Algorithm 1. The first step of the algorithm is to initialize
bestP, which contains the best partitioning found so far, with the desired initial partitioning. This initial partitioning can be randomgenerated or defined by the user in an input file, which can be the result of another partitioning tool, for instance. It is worth noting that neither METIS nor SCOTCH can start from a partitioning obtained by another algorithm. After that, the algorithm runs in
epochs, which are the iterations of the outer loop (Lines 3–26). This outer loop runs some epochs until the best partitioning found so far is no longer improved. For each epoch, the algorithm first unlocks all the vertices (Line 5) and initializes the current partitioning with the best partitioning found so far (Line 6). After that, the inner loop, which is a
step of the epoch, searches for a better partitioning and updates the current and best partitionings (Lines 7–25). In each step, DN
${}^{2}$PCIoT seeks the best operation locally to identify which operation (swap or move) according to the objective function is better for the current partitioning (Line 8). This function is further detailed in Algorithm 2. Then, the best operation chosen in this function is applied to the current partitioning and the corresponding vertices are locked (Lines 10–15), i.e., they are not eligible to be chosen until the current epoch finishes. The best operation in each step may worsen the current partitioning because, if there are no operations that improve the partitioning, then the best operation is the one that increases the cost minimally. If there are no valid operations, the current step and the epoch finish (Lines 16–18). This happens when all vertices are locked or when there are unlocked vertices, but they cannot be moved or swapped due to memory constraints, i.e., if they are moved or swapped, then the partitioning becomes invalid. When the current partitioning is updated, its cost is compared to the best partitioning cost (Line 21) and, if the current partitioning cost is better, then the best partitioning is updated and the
bestImproved flag is set to
true so that the algorithm runs another epoch to attempt a better partitioning.
Figure 2 shows the flowchart related to Algorithm 1 and represents a general view of our proposal (DN
${}^{2}$PCIoT).
Algorithm 1 DN${}^{2}$PCIoT algorithm. 
 1:
function DN${}^{2}$PCIoT($initialPartitioning$)  2:
$bestP\leftarrow initialPartitioning$;  3:
repeat  4:
$bestImproved\leftarrow false$;  5:
$unlockAllNodes\left(\right)$;  6:
$currentP\leftarrow bestP$;  7:
while there are unlocked nodes do  8:
$op\leftarrow findBestValidOperation\left(currentP\right)$;  9:
/* Perform the operation */  10:
if $op.type=SWAP$ then  11:
$currentP.swap(op.v1,op.v2)$;  12:
$lockVertex(op.v1)$; $lockVertex(op.v2)$;  13:
else if $op.type=MOVE$ then  14:
$currentP.move(op.v,op.targetPartition)$;  15:
$lockVertex(op.v)$;  16:
else if $op.type=INVALID$ then  17:
/* No valid operations */  18:
break;  19:
end if  20:
/* Update the best partitioning */  21:
if $currentP.cost\left(\right)<bestP.cost\left(\right)$ then  22:
$bestP\leftarrow currentP$;  23:
$bestImproved\leftarrow true$;  24:
end if  25:
end while  26:
until $bestImproved=false$  27:
return $bestP$;  28:
end function

Algorithm 2 shows the pseudocode for the findBestValidOperation() function. First, the algorithm initializes the op type with “invalid”. If this function returns this value, then there are no operations that maintain the partitioning valid. After that, a loop runs through all the unlocked vertices searching for the best valid operation for each vertex in this set (Lines 3–32). For each vertex, the algorithm searches for the best move for it (Lines 4–15) and the best swap using this vertex (Lines 16–31). In the best move search, a loop runs through all the partitions (Lines 5–15). In this loop, the algorithm changes the current partition of the vertex being analyzed (Line 6), checks if the partitioning remains valid (Line 7), calculates the new cost of this partitioning according to the objective function (Line 8), checks if this new partitioning has a better cost than the current one (larger inference rate or fewer communications) or if no valid operation was found so far (Line 9), and updates, if necessary, bestCost with the better value and op with the move operation and the corresponding vertex and partition (Lines 10–12). In the best swap search, another loop runs through all the unlocked vertices (Lines 16–31). In this loop, the algorithm changes the current partition of both vertices that are being analyzed (Lines 17–19), checks if the partitioning remains valid (Line 20), calculates the new cost of this partitioning according to the objective function (Line 21), checks if this new partitioning has a better cost than the current one (larger inference rate or fewer communications) or if no valid operation was found so far (Line 22), and updates, if necessary, bestCost with the better value and op with the swap operation and the corresponding vertices and partitions (Lines 23–25). At the end of the loop, the original partitions of the vertices being analyzed are restored to proceed with the swap search (Lines 28–29). After the outer loop finishes, the best operation found in this function is returned to DN${}^{2}$PCIoT (or the “invalid” operation, if no valid operations were found).
Algorithm 2findBestValidOperation function. 
 1:
functionfindBestValidOperation($currentP$)  2:
$op.type\leftarrow INVALID$;  3:
for $i\leftarrow unlocked.first$ to $unlocked.last$ do  4:
$originalPi\leftarrow currentP\left[i\right]$;  5:
for $p\leftarrow 1$ to $numberOfPartitions$ do  6:
$currentP\left[i\right]\leftarrow p$;  7:
if $validPartitioning\left(currentP\right)=true$ then  8:
$cost\leftarrow computeCost\left(currentP\right)$;  9:
if $cost<currentP.cost$ or $op.type=INVALID$ then  10:
$bestCost\leftarrow cost$;  11:
$op\leftarrow moveOp(i,p)$;  12:
$op.type\leftarrow MOVE$;  13:
end if  14:
end if  15:
end for  16:
for $j\leftarrow unlocked.first$ to $unlocked.last$ do  17:
$originalPj\leftarrow currentP\left[j\right]$;  18:
$currentP\left[i\right]\leftarrow originalPj$;  19:
$currentP\left[j\right]\leftarrow originalPi$;  20:
if $validPartitioning\left(currentP\right)=true$ then  21:
$cost\leftarrow computeCost\left(currentP\right)$;  22:
if $cost<currentP.cost$ or $op.type=INVALID$ then  23:
$bestCost\leftarrow cost$;  24:
$op\leftarrow swapOp(i,originalPi,j,originalPj)$;  25:
$op.type\leftarrow SWAP$;  26:
end if  27:
/* Restore current partitioning */  28:
$currentP\left[i\right]\leftarrow originalPi$;  29:
$currentP\left[j\right]\leftarrow originalPj$;  30:
end if  31:
end for  32:
end for  33:
return $op$;  34:
end function

4. Methodology
In this section, we show the LeNet models and the device characteristics that we used in the experiments as well as the experiment details and approaches.
4.1. LeNet Neural Network Model
In this work, we used the original LeNet5 DNN architecture [
27] as a case study. Although LeNet is the first successful CNN, its lightweight model is suitable for constrained IoT devices. In this paper, we show that even a lightweight model such as LeNet requires partitioning to execute on constrained IoT devices. Furthermore, several works have been recently published using LeNet [
40,
41,
42], causing this CNN to be still relevant nowadays.
The LeNet neurons were grouped into vertices. The neurons in the depth dimension of the LeNet convolution and pooling layers were grouped into one vertex because two neurons in these layers in the same position of width and height but different positions in depth present the same communication pattern. Thus, a partitioning algorithm would tend to assign these vertices to the same partition. For the inference rate, this modeling only affects the number of operations that a vertex will need to calculate. In the fully connected layers, as the width and height have size one, the depth was not modeled as having size one because this would limit too much the partitioning and the constrained devices able to execute this partitioning. For instance, only one setup of our experiments would fit a partitioning with this grouping, which is the least memoryconstrained setup that we used in this work.
Two versions of LeNet were modeled:
LeNet 1:1: the original LeNet with 2343 vertices (except for the depth explained above); and
LeNet 2:1: LeNet with 604 vertices, in which the width and height of each convolution and pooling layer were divided by two, except for the last pooling layer, and the depth of the fully connected layers was divided by four, i.e., every four neurons in each of these layers were grouped to form one vertex in the model.
Figure 3 shows the dataflow graph of each LeNet version with the following perlayer data: the number of vertices in height, width, and depth, the layer type, and the amount of transferred data in byte required by each edge in each layer. In
Figure 3, the cubes represent the original LeNet neurons and the circles and ellipses represent the dataflow graph vertices.
The grouping of the LeNet neurons reduces the dataflow graph size as we can see by the difference in the number of vertices for each graph. This reduction decreases the partitioning execution time so that we can perform more experiments in a shorter time frame. LeNet 1:1 is a more finegrained model, thus, it may achieve better results than a less finegrained model such as LeNet 2:1. We are aware that this approach constrains the partitioning algorithm because it cannot assign vertices in the original graph to different partitions since they are now grouped. However, in this work, we also want to show that a coarsegrained model such as LeNet 2:1 can achieve comparable results to a finegrained model such as LeNet 1:1 and, thus, can be employed for partitionings with adequate performance. It is also important to highlight that our approach for grouping the vertices is different from the METIS multilevel approach and we show that DN${}^{2}$PCIoT produces better results than METIS.
Finally,
Table 3 shows the number of shared parameters and biases per layer for each layer and the amount of memory and computation (the number of FLOP per inference) required by each vertex per layer per LeNet model. It is worth noting that, in the LeNet model used in this work, the pooling layers present biases and trainable coefficients. In this table and hereafter, the convolution layers are represented by
C, the pooling layers are represented by
P, and the fully connected layers, by
FC.
4.2. Device Characteristics
Four different devices inspired the setups that we used in the experiments. These setups are progressively constrained in memory, computational power, and communication performance and these values are shown in
Table 4. The first column shows the maximum number of devices allowed to be used in each experiment. The second column shows the name of the device model that inspired each experiment. The third column shows the amount of Random Access Memory (RAM) that each device provides, which is available in the respective device datasheet [
43,
44,
45,
46]. The amount of RAM that each device provides varies from 16 KiB to 388 KiB (1 Kibibyte (KiB) = 1024 B). The fourth column represents the estimated computational performance of each device, which varies from 1.6 MFLOP/s to 180 MFLOP/s. Finally, communication is performed through a wireless medium. As this medium is shared with all the devices, the communication performance is an estimation that depends on the number of devices used in the experiments. Therefore, considering connections able to transfer up to 300 Mbits/s, the communication performance for each device varies from 9.4 KiB/s to 6103.5 KiB/s.
The reasoning for the maximum number of devices allowed to participate in the partitioning is the following. As the amount of memory provided by each device decreases, we need to employ more devices to enable a valid partitioning. Furthermore, the memory of shared parameters and biases should be taken into account when choosing the number of devices in an experiment because of the experiments that start with randomgenerated partitionings. To make these experiments work, each device should be able to contain at least one vertex of each neural network layer and its respective shared parameters and biases. This condition, in some cases, may increase the number of needed devices. For instance, the memory needed for LeNet (to store intermediate results, parameters, and biases) is 546.625 KiB if each layer is entirely assigned to one device. If the devices provide up to 64 KiB, it is possible to achieve valid partitionings using nine devices to fit the LeNet model. However, to start with randomgenerated partitionings and, thus, requiring that each device can contain at least one vertex of each layer and its respective shared parameters and biases, the number of required devices increases to 11 to produce valid randomgenerated partitionings.
For each experiment, the communication links between each device present the same performance, which is constant during the whole partitioning algorithm. The difference in the communication performance for the most constrained setups (with 56 and 63 devices) is due to the different number of devices sharing the same wireless connection. Thus, for the experiment in which the system can employ up to 63 devices for the partitioning, the communication links perform a little worse than with up to 56 devices, although the same device models with the same available memory and computational power are used.
4.3. Types of Experiments
For each setup in
Table 4, two experiments were performed:
the freeinputlayer experiment, in which all the LeNet model vertices were free to be swapped or moved; and
the lockedinputlayer experiment, in which the LeNet input layer vertices were initially assigned to the same device and, then, they were locked, i.e., the input layer vertices could not be swapped or moved during the whole algorithm.
The freeinputlayer experiments allow all the vertices to freely move from one partition to the others, including the input layer vertices. These experiments represent situations in which the device that produces the input data is not able to process any part of the neural network and, thus, must send its data to nearby devices. In this case, we would have to add more communication to send the input data (the LeNet input layer) from the device that contains these data to the devices chosen by the approaches in this work. However, as the increased amount of transferred data involved in sending the input data to nearby devices is fixed, it does not need to be shown here. On the other hand, the lockedinputlayer experiments represent situations in which the device that produces the input data can also perform some processing, therefore, no additional cost is involved in this case.
Nine partitioning approaches were employed for each experiment listed in this subsection (for each setup and free and locked inputs). These approaches are explained in the next subsections and the corresponding visual partitionings are shown for the approaches that cause it to be necessary. It is worth noting that these visual partitionings are not considered the results of this paper and are shown here for clarification of the approaches.
4.4. Per Layers: UserMade perLayer Partitioning (Equivalent to Popular Machine Learning Frameworks)
The first approach to performing the experiments is the perlayer partitioning performed by the user. In this approach, the partitioning is performed per layers, i.e., a whole layer should be assigned to a device. This partitioning is offered by popular ML tools such as TensorFlow, DIANNE, and DeepX. TensorFlow allows a finegrained partitioning, but only if the user does not use its implemented functions for each neural network layer type.
Considering the LeNet model [
27] used in both versions of our experiments, it is possible to calculate the layer that requires the largest amount of memory. This layer is the second fully connected layer (the last but one LeNet layer), which requires 376.875 KiB for the parameters, the biases, and to store the layer final result. Thus, when considering the constrained devices chosen for our experiments (
Table 4), it is possible to see that there is only one setup that is capable of providing the necessary amount of memory that a LeNet perlayer partitioning requires. This setup is the least constrained in our experiments and allows a maximum of two devices to be employed in the partitioning.
In the perlayer partitioning approach, the partitioning is performed by the user, so we partitioned LeNet for the first setup and show the resultant partitioning in
Figure 4. In this figure, only the partitioning for LeNet 2:1 is shown because the partitioning for LeNet 1:1 is equivalent. It is worth noting that each color in this figure and in all the figures that represent visual partitionings corresponds to a different partition.
4.5. Greedy: A Greedy Algorithm for Communication Reduction
The second approach is a simple algorithm that aims to reduce communication. In this algorithm, whose pseudocode is listed in Algorithm 3, the layers are assigned to the same device in order until it has no memory to fit some layer. Next, if there is any space left in the device and the layer type is convolution, pooling, or input, then a twodimensional number of vertices (width and height) that fit the rest of the memory of this device are assigned to it or, if the layer is fully connected, then a number of vertices that fit the rest of the memory of this device are assigned to it. After that or if there is any space left in the device, the next layer or the rest of the current layer is assigned to the next device and the process goes on until all the vertices are assigned to a device. It is worth noting that Algorithm 3 assumes that there is a sufficient amount of memory provided by the setups for the neural network model. Furthermore, this algorithm can partition graphs using fewer devices than the total number of devices provided. This algorithm contains two loops that depend on the number of layers (L) and the number of devices (D) of the setup, which render the algorithm complexity equals to O(L+D). However, it is worth noticing that both L and D are usually much smaller than the number of neurons of the neural network.
Figure 5 shows the visual partitioning using the greedy algorithm for each experiment. It is worth noting that this algorithm works both for the freeinputlayer and the lockedinputlayer experiments because the input layer could be entirely assigned to the same device for all setups. Furthermore, as the partitioning scheme is similar for LeNet 2:1 and LeNet 1:1 in the experiments with 2, 4, and 11 devices, only the partitionings for LeNet 2:1 are shown in
Figure 5a–c for the sake of simplicity. For the experiments with 56 and 63 devices, the greedy algorithm results in the same partitioning because the same device model is employed in these experiments. However, as LeNet 2:1 employs 44 devices and LeNet 1:1 employs 38 devices, both results are shown in
Figure 5d,e, respectively.
Algorithm 3 Greedy algorithm for communication reduction. 
 1:
functiongreedyAlgorithm($lenet$, $setup$)  2:
for $layer\leftarrow 1$ to $lenet.numberOfLayers$ do  3:
for $device\leftarrow setup.first$ to $setup.last$ do  4:
if $lenet\left[layer\right].memory\ne 0$ then  5:
if $lenet\left[layer\right].memory\le device.memory$ then  6:
assign $layer$ to $device$;  7:
else if $lenet\left[layer\right].type$ = $conv$ or $pooling$ or $input$ then  8:
assign a 2D number of vertices that fit $device$;  9:
else if $lenet\left[layer\right].type$ = $fully$$connected$ then  10:
assign the number of vertices that fits $device$;  11:
end if  12:
$device.memory\leftarrow device.memoryassigned$;  13:
$lenet\left[layer\right].memory\leftarrow lenet\left[layer\right].memoryassigned$;  14:
else  15:
break;  16:
end if  17:
end for  18:
end for  19:
end function

4.6. iRgreedy: UserMade Partitioning Aiming for Inference Rate Maximization
The third approach is a partitioning performed by the user that aims for the inference rate maximization. The rationale behind this greedy approach is to equally distribute the vertices of each layer to each device since all the experiments present a homogenous setup. Thus, this approach employs all the devices provided for the partitioning. Besides that, again, the partition vertices are chosen in two dimensions for the input, convolution, and pooling layers.
Figure 6a shows the visual partitioning for the 11device freeinput LeNet 2:1 experiment. It is worth noting that, for the two and fourdevice freeinput experiments, the partitioning follows the same pattern. For the 2, 4, and 11device lockedinput experiments, only the input layer partitioning was changed to be assigned to only one device. Thus, these partitionings are not shown here.
For the 56 and 63device experiments, it was not possible to employ the same rationale due to memory issues. Thus, for these experiments, the rationale was to start by the layers that require most memory and assign to the same device the largest number of vertices possible of that layer. Furthermore, in these experiments, the vertices were assigned in a perline way because the layers were not equally distributed to all the available devices. This approach reduces the number of copies of the shared parameters and biases and, thus, allows for a valid partitioning. For the lockedinput experiments, besides the input layer being changed to be entirely assigned to only one device, some adjustments had to be performed to produce valid partitionings.
Figure 6b,c show the visual partitionings for the 56device freeinput LeNet 2:1 and LeNet 1:1 partitioning, respectively. Additionally, in the 63device experiments with LeNet 2:1, the vertices in the same positions of the first layers were assigned to the same devices to reduce communication. The visual partitioning, in this case, is shown in
Figure 6d. As the approach for 63 devices in LeNet 1:1 was the same for the 56 devices, the visual partitioning for 63 devices in LeNet 1:1 is not shown here either. The two approaches detailed in this subsection are greedy and, therefore, are also called inference rate greedy approach (
iRgreedy) in the rest of this paper.
4.7. METIS
In this approach, the program gpmetis from METIS was used to automatically partition LeNet and compare the results with our approaches. The reason to choose METIS is that it is considered a widely known stateoftheart tool used to automatically partition graphs for general purpose.
This tool offers several parameters that can be modified by the user like the number of partitions, the number of different partitionings to compute, the number of iterations for the refinement algorithms at each stage of the uncoarsening process, the maximum allowed load imbalance among the partitions, and the algorithm’s objective function. The number of partitions corresponds to the maximum number of devices allowed to be employed in each setup described in
Section 4.2. Thus, as METIS attempts to balance all the constraints, it always employs the maximum number of devices in each experiment. All the other parameters listed here were varied in our tests and, for our inference rate maximization objective function, the maximum allowed load imbalance among the partitions parameter was substituted for the maximum allowed load imbalance among partitions per constraint, which allows using different values for memory and the computational load. It is worth noting that, for the objective function parameter, both functions were used:
edgecut minimization and total communication volume minimization. These parameters are detailed in the METIS manual [
47].
For the lockedinput experiments, the LeNet graph had the vertices from the input layer removed to run METIS with a small difference between the constraints proportion (target weights in METIS) related to the amount of memory and computational load that the input layer requires. After METIS performs the partitioning, the input layer is plugged back into the LeNet graph and we calculate the cost (inference rate or amount of transferred data) and if this partitioning is valid.
4.8. DN${}^{2}$PCIoT 30R
The fifth approach that we used for the experiments was the application of DN${}^{2}$PCIoT starting from randomgenerated partitionings. This approach executed DN${}^{2}$PCIoT 30 times starting from different randomgenerated partitionings and we report the best value achieved in these 30 executions. It is worth noting that this approach was only executed for the LeNet 2:1 model due to the more costly execution of LeNet 1:1 starting from a random partitioning. This was the only approach that did not employ LeNet 1:1. Furthermore, DN${}^{2}$PCIoT can discard some devices when they are not necessary, i.e., if DN${}^{2}$PCIoT finds a better partitioning with fewer devices.
4.9. DN${}^{2}$PCIoT after Approaches
The last approach corresponds to the execution of the proposed DN${}^{2}$PCIoT starting from partitionings obtained by the other approaches considered in this work. Thus, four experiments were performed in this approach: DN${}^{2}$PCIoT after per layers, DN${}^{2}$PCIoT after greedy, DN${}^{2}$PCIoT after iRgreedy, and DN${}^{2}$PCIoT after METIS. This approach also allows the partitionings to employ fewer devices than the maximum number of devices allowed in each experiment. It is worth noting that no other approach in this paper can start from a partitioning obtained by another algorithm and try to improve the solution based on this initial partitioning.
5. Experimental Results
In this section, we show the results for all the experiments discussed in
Section 4 (varying number of devices, free and locked input layer, and all the approaches) for the inference rate maximization objective function. After that, we show the pipeline parallelism factor for each setup to compare the performance of a single device to the distribution performance. Finally, the results of the inference rate maximization are plotted along with results for communication reduction to see how optimizing for one objective function affects the other. Our approaches (greedy algorithm, iRgreedy approach, DN
${}^{2}$PCIoT 30R, and DN
${}^{2}$PCIoT after the other approaches) were compared to two literature approaches: the perlayers approach (equivalent to popular ML frameworks such as TensorFlow, DIANNE, and DeepX) and METIS. We implemented DN
${}^{2}$PCIoT using C++ and executed the experiments on Linuxbased operating systems.
5.1. Inference Rate Maximization
Table 5 shows the results for the inference rate maximization objective function for the approaches detailed in
Section 4 that are compared to DN
${}^{2}$PCIoT.
Table 6 shows the results for the inference rate maximization objective function for DN
${}^{2}$PCIoT 30R and DN
${}^{2}$PCIoT after all the approaches in
Table 5. It is worth noting that both
Table 5 and
Table 6 present normalized results, i.e., these results are normalized by the maximum inference rate achieved in each experiment. For instance, in the freeinput twodevice experiments, considering both
Table 5 and
Table 6, the maximum inference rate was achieved by DN
${}^{2}$PCIoT after METIS with LeNet 2:1. We take this value and divide it by each result of the freeinput twodevice experiments. Thus, we have a value of 1.0 for the maximum inference rate in the DN
${}^{2}$PCIoT after METIS with LeNet 2:1 and the values of the other approaches reflect how many times the inference rate was worse than the maximum inference rate. In the first column of both tables, the number indicates the maximum number of devices allowed in each experiment, “free” refers to the freeinputlayer experiments, and “locked” refers to the lockedinputlayer experiments. For each experiment in the first column of both tables, there is a range of colors in which the red color represents the worst results while the green color represents the best results. Intermediate results are represented by yellow. As discussed in
Section 4, some approaches were not able to produce valid partitionings and this is represented by an “x”. For LeNet 1:1, as it is a large graph with 2343 vertices, we had to interrupt some executions and we report the best value found followed by an asterisk (“*”).
As general results, it is possible to see in
Table 5 and
Table 6 that DN
${}^{2}$PCIoT 30R and DN
${}^{2}$PCIoT after approaches led to the best values for all the experiments. DN
${}^{2}$PCIoT 30R produced results that range from intermediate to the best results, with only 20% of the experiments yielding intermediate results. The DN
${}^{2}$PCIoT 30R results show the robustness of DN
${}^{2}$PCIoT, which can achieve reasonable results even when starting from random partitionings.
There are some important conclusions that we can draw from
Table 5. The perlayer partitioning was the most limiting approach when considering constrained devices because it could only partition the model for the least constrained device setup, which used two devices. It is worth noting that this approach is the one offered by popular ML tools such as TensorFlow, DIANNE, and DeepX. Thus, we show that these tools are not able to execute DNNs in a distributed way in very constrained setups. Moreover, the perlayer partitioning produced suboptimal results for the only setup that it could produce valid partitionings. The quality of these results was due to the heavy unbalanced partitioning in the perlayer approach, which overloaded one device while a low load was assigned to the other device, as the least constrained setup offered two devices.
The stateoftheart tool METIS also led to suboptimal results because it attempts to balance all the constraints, which are memory and computational load. Additionally, several partitionings provided by METIS were invalid because METIS does not consider a limit for the amount of memory in each partition. METIS could not produce any valid partitionings at all for the 56 and 63device experiments because METIS cannot properly account for the memory required by the shared parameters and biases of CNNs. One way to solve this issue would be to add the memory required by the shared parameters and biases to every vertex that needs them, even if the vertices were assigned to the same partition. However, this solution would require much more memory and no partitioning using this solution would be valid for the setups used in this work. Thereby, we gave METIS the LeNet model without the memory information required by the shared parameters and biases in the hope that it would produce valid partitionings since our setups provided enough memory for LeNet and one full set of shared parameters and biases for each device. Unfortunately, METIS was not able to produce any valid partitionings in any of the 56 and 63device constrained setups.
Finally, the greedy algorithm and the iRgreedy approach are simple approaches. Although they produced poor results, they could produce valid partitionings for all the proposed setups. Thus, considering the ability to produce valid partitionings, these approaches demonstrated to be better than METIS and the perlayer partitioning offered by popular ML frameworks in the proposed setups.
In
Table 6, we can see that DN
${}^{2}$PCIoT starting from the partitioning produced by the other approaches also achieved results that range from intermediate to the best results. When comparing to the stateoftheart tool METIS, DN
${}^{2}$PCIoT after METIS could improve the METIS result by up to 38%. Additionally, DN
${}^{2}$PCIoT after approaches is a better approach when compared to DN
${}^{2}$PCIoT 30R because DN
${}^{2}$PCIoT after approaches do not need to run 30 times to attempt to find the best partitioning as in the DN
${}^{2}$PCIoT 30R. Furthermore, the single execution required by DN
${}^{2}$PCIoT after each approach may run faster than DN
${}^{2}$PCIoT 30R because it starts from the intermediate result achieved by the other approaches instead of a random partitioning that usually requires several epochs to stop.
DN${}^{2}$PCIoT after the greedy algorithm result also show the robustness of DN${}^{2}$PCIoT because the greedy algorithm produced the worst results mostly. Nevertheless, DN${}^{2}$PCIoT after greedy could improve the poor results of the greedy algorithm up to 11.1 times, yielding at least intermediate results in comparison to the other approaches.
The LeNet 1:1 model runs in a considerably larger time than the LeNet 2:1 model due to the difference in the number of vertices and edges of the graphs. When comparing the two LeNet models used in the experiments, it is possible to see that DN${}^{2}$PCIoT for LeNet 2:1 led to the best result in 80% of the experiments. Thus, the results for the proposed setups suggest that it is possible to employ LeNet 2:1 for faster partitionings with limited impact on the results.
To conclude, our results show that DN${}^{2}$PCIoT starting from 30 randomgenerated partitionings and DN${}^{2}$PCIoT after the other approaches achieved the best results for inference rate maximization in all the proposed experiments and should be employed when partitioning CNNs for execution on multiple constrained IoT devices.
5.2. Pipeline Parallelism Factor
After showing the results for the inference rate maximization objective function, it is interesting to look at the pipeline parallelism factor to check if there is gain or loss when distributing the neural network execution. In the first column of
Table 7, we have the device model and the maximum number of devices allowed to be used in each setup. The second column shows the inference rate if the entire LeNet model fit one device’s memory, i.e., the inference rate based on the computational power of the devices. In this column, it is possible to see that the diminishing computational power affects the inference rate performance, as expected. In the third column, there is the best inference rate achieved in the corresponding experiments of the previous subsection. Finally, the fourth column shows the pipeline parallelism factor, which is the best inference rate achieved in the experiments (third column) divided by the inference rate if the entire LeNet fit one device’s memory (second column). It is worth noting that the larger is the parallelism factor, the better, and results that are less than one indicate that there is some loss in the distribution of the neural network execution.
In
Table 7, it is possible to note that there is a gain in the inference rate performance in using 2, 4, 56, and 63 devices. For the 11device experiment, the communication among the partitions surpasses the distribution computational gain and negatively affects performance. In this case, we have 72% of the performance offered by a single device. However, we have to remember that this device cannot execute this model alone due to its memory limit. Additionally, it is worth noting that all the experiments in
Table 7 were limited by the communication performance among the devices.
For the last device model, used in the 56 and 63device experiments, we have different values for the best inference rate in the experiments due to the communication link among them, which is less powerful in the 63device experiment and, consequently, its result is worse than for 56 devices, showing that communication impacts on the inference rate in this setup. Furthermore, the computational power of the most constrained devices used in these experiments is so low that we have gains of 4.7 and 3.9 when distributing LeNet, even considering the communication overhead. These results show that, even if we could execute LeNet in a single device, it would be more profitable to distribute the execution to achieve a higher inference rate, except for the 11device setup.
It is important to note that, with this distribution, we enable such a constrained system to execute a CNN like LeNet. This would not be possible if only a single constrained device were employed due to the lack of memory. However, in the most constrained setups, the inference rate may be low. This can be the case, for instance, in an anomaly detection application that classifies incoming images from a camera. As most surveillance cameras generate 6–25 frames per second [
48], most of the setups presented in this work satisfy the inference rate requirement for this application. Nonetheless, the most constrained setups do not satisfy the inference rate requirement of this application, thus the system may lose some frames. In the worst case, we still have 71% of the required inference rate (
$17.65/25$), allowing the system to execute the application, even if the inference rate is not ideal.
Additional time may be required for synchronization so that a system provides correct results. The synchronization guarantees that all the data that a vertex needs to calculate its computation arrive in its inputs. Techniques such as retiming [
49] can be applied to the partitionings provided by DN
${}^{2}$PCIoT to enforce synchronization. Such a technique would increase the amount of memory required to execute the CNN in a distributed form. Although this is an important issue for the deployment of CNNs on constrained IoT devices, in this work, we are not concerned by it because one of our aims is to show how better DN
${}^{2}$PCIoT can be in partitioning CNNs for constrained IoT devices when comparing to one of the stateoftheart partitioning algorithms, which does not include synchronization overhead as well.
5.3. Inference Rate versus Communication
Minimizing communication is important to reduce interference in the wireless medium and to reduce the power consumed by radio operations. Common realtime applications that need to process data streams in a small period of time such as anomaly detection from camera images, for instance, the detection of vehicle crashes and robberies, may require a minimum inference rate so that there is no frame loss while reducing communication or even energy consumption is desirable so that the network is not overloaded and device energy life is augmented. On the other hand, applications that process data at a lower rate such as nonrealtime image processing may require a small amount of communication so that device battery life is augmented while desirable characteristics are the network nonoverload and inference rate maximization.
Thus, in this subsection, we want to show how optimizing for one of the objective functions, for instance, inference rate maximization, affects the other, for instance, communication reduction. For this purpose,
Figure 7 presents the results of
Section 5.1 for the inference rate maximization along with their respective values for the amount of transferred data per inference for each partitioning. We also plotted in these graphs results for the communication reduction objective function, which allow for a fair comparison in the amount of transferred data. For instance, when the objective function is the inference rate, the amount of transferred data may be larger than when the objective function is communication reduction. The inverse may also occur for the inference rate. These results were obtained by executing all the approaches discussed in
Section 4, including DN
${}^{2}$PCIoT 30R and DN
${}^{2}$PCIoT after the other approaches with the communication reduction objective function.
Each graph in
Figure 7 corresponds to one setup. In this figure, “comm” in the legend parentheses stands for when the approach used the communication reduction objective function, “inf” stands for the inference rate maximization objective function, “free” stands for the freeinput experiment, and “locked” stands for the lockedinput experiment. It is worth noting that each approach in the legend corresponds to two points in the graphs of
Figure 7, one for the execution of LeNet 2:1 and one for LeNet 1:1. DN
${}^{2}$PCIoT 30R is an exception because it was executed only for LeNet 2:1, thus each approach with DN
${}^{2}$PCIoT 30R in the legend corresponds to only one point in the graphs. Another exception is the perlayer partitioning, which yielded the same result for both LeNet models and, thus, its results are represented by only one point. In this subsection, we do not distinguish the two LeNet versions employed in this work because our focus is on the approaches and so that the graphs do not get polluted.
As we want to maximize the inference rate and minimize the amount of transferred data, the best tradeoffs are the ones on the right and bottom side of the graph, i.e., in the southeast position. We draw the Pareto curve [
50] using the results for inference rate maximization and communication reduction achieved by all the approaches listed in
Section 4 to show the best tradeoffs and we divided the graphs into four quadrants considering the minimum and maximum values for each objective function. These quadrants help the visualization and show within which improvement region each approach fell.
In
Figure 7a, for the twodevice experiments, the Pareto curve contains two points, which correspond to the freeinput DN
${}^{2}$PCIoT after METIS for the inference rate maximization and most of the lockedinput DN
${}^{2}$PCIoT after approaches for communication reduction. The only approach that falls within the southeast quadrant is the freeinput DN
${}^{2}$PCIoT after METIS for the inference rate maximization, which is the best tradeoff between the inference rate and the amount of transferred data for this setup. Although several points fell within the southeast quadrant, it is worth noting that the three points that are closest to this best tradeoff all correspond to the freeinput DN
${}^{2}$PCIoT for the inference rate maximization, showing the robustness of DN
${}^{2}$PCIoT.
In
Figure 7b, for the fourdevice experiments, the approach that falls both in the Pareto curve and closest to the southeast quadrant is the freeinput DN
${}^{2}$PCIoT after
iRgreedy when reducing communication. Therefore, this approach presents the best tradeoff for the fourdevice setup.
Six points compose the Pareto curve for the 11device experiments in
Figure 7c. Three of these points falls in the best tradeoff quadrant and are the freeinput DN
${}^{2}$PCIoT after
iRgreedy for communication reduction and free and lockedinput METIS for inference rate maximization. In this case, the final choice for the best tradeoff depends on which condition is more important: if the application requires a larger inference rate, then METIS is the appropriate choice. On the other hand, if the application requires a smaller amount of communication, then DN
${}^{2}$PCIoT after
iRgreedy for communication reduction is a better approach.
Six points also compose the Pareto curve for the 56device experiments in
Figure 7d. In this graph, the approach that falls both in the Pareto curve and closest to the southeast quadrant is the freeinput DN
${}^{2}$PCIoT 30R when maximizing the inference rate. Therefore, this approach presents the best tradeoff for the 56device setup.
Finally, in
Figure 7e, for the 63device experiments, the approach that falls both in the Pareto curve and closest to the southeast quadrant is the freeinput DN
${}^{2}$PCIoT 30R when maximizing the inference rate. This approach presents the best tradeoff for the 63device setup.
Back to the example of anomaly detection in
Section 5.2, in which the application requirements involve a minimum inference rate of around 24 inferences per second while reducing communication is desirable, we can choose the best tradeoffs for each setup analyzed in this subsection. In
Figure 7a–c, for the setups with 2, 4, and 11 devices, respectively, all the points in the Pareto curve satisfy the application requirement of a minimum inference rate. Thus, we can choose the points that provide the minimum amount of communication. However, in
Figure 7d,e, for the setups with 56 and 63 devices, respectively, the points in the Pareto curve with the minimum amount of communication do not satisfy the application requirement of the minimum inference rate. Hence, we have to choose the points with the largest inference rate in the Pareto curve of each setup, which require more communication. These results evidence the lower computational power of the devices used in the 56 and 63device setup.
Our results suggest that our tool also deliver the best tradeoffs between the inference rate and communication, with DN${}^{2}$PCIoT providing more than 90% of the results that belong to the Pareto curve. DN${}^{2}$PCIoT after the approaches or DN${}^{2}$PCIoT starting from 30 random partitionings achieved the best tradeoffs for the proposed setups, although these approaches only aim at one objective function. Thus, DN${}^{2}$PCIoT 30R and DN${}^{2}$PCIoT after approaches are adequate strategies when both communication reduction and inference rate maximization are needed, although it is possible to improve DN${}^{2}$PCIoT with a multiobjective function containing both objectives.
5.4. Limitations of Our Approach
Our algorithm presents a computational complexity of O(N${}^{5}$), in which N is the number of vertices of the dataflow graph. Thus, the grouping of the neural network neurons may be necessary so that the algorithm executes in a feasible time. As our results suggest, the LeNet version that groups more neurons presents a limited impact on the results while the algorithms may execute faster, as the problem size is smaller. Other algorithms such as METIS performs an aggressive grouping and, thus, can execute in a feasible time. However, it is worth noting that, with 30 executions, our algorithm achieves results that are close to the best result that DN${}^{2}$PCIoT can achieve for an experiment. On the other hand, we had to execute METIS with many different parameters to achieve valid partitionings and find the best result that METIS can get, adding up to more than 98,000 executions. Thus, METIS execution time is also not negligible.
Current CNNs such as VGG and ResNet would require more constrained devices and/or devices with a larger amount of memory so that partitioning algorithms can produce valid partitionings. However, as they are also composed of convolution, pooling, and fully connected layers, the partitioning patterns [
20] tend to be similar. Additionally, as current CNNs present more neurons, strategies that groups more neurons similar to LeNet 2:1 or in multilevel partitioning algorithms such as METIS may also be required so that the partitioning algorithm executes in a feasible time.
Other strategies that we can use to reduce our algorithm execution time are to start from partitionings obtained with other tools and to interrupt execution as soon as the partitioning achieves a target value or the improvements are smaller than a specified threshold. Our algorithm can also be combined with other strategies such as the multilevel approach, which automatically groups graph vertices, but without the shortcomings of METIS, which are suboptimal values and invalid partitionings. Even with the limitations of our approach, the results suggest that there is a large space for improvements when we consider constrained devices and compare to wellknown approaches.