1. Introduction
Deep learning has demonstrated astonishing performance in various fields such as computer vision, natural language processing, and games over the past several years [1,2,3], and there is no doubt that deep models will play a critical role in machine intelligence in the future. However, deep models require massive amounts of computation and large memory [4], so they are difficult to deploy in embedded systems where computational resources and energy are tightly constrained [5]. To address this challenge, much research effort has been devoted to developing domain-specific computing systems for deep learning using ASICs and FPGAs [6,7]. Most of these systems are built upon the von Neumann architecture, where instructions, implicit or explicit, and data are stored in memories separated from processors. This approach is practical, but more radical changes may be necessary to cope with the rapidly increasing demand for computing power in deep learning [8].
Another approach starts from the idea that we can create more powerful computers by learning from the most advanced existing intelligent system, our brain [9]. While some neuromorphic systems still use von Neumann architectures, many tailor even the fundamental architecture for drastic improvements in computing power and energy efficiency, resulting in non-von Neumann systems. In the non-von Neumann neuromorphic systems, each weight in a neural network is stationary within a processing element during the entire evaluation of the neural network [10,11]. In neuromorphic systems, processing elements, each with its own local memory, are arranged in two different ways. The first type arranges neuron-like processing elements in a grid, and each of them stores a fixed number of weights. This type is referred to as the matrix-based neuromorphic system [12]. The second type implements synapse-like processing elements at each crosspoint of a crossbar. This type is referred to as the crossbar-based neuromorphic system [13,14] and is also considered a processing-in-memory (PIM) architecture [15]. Resistive devices like memristors are usually used as the synaptic elements.
Given a trained neural network with high-precision weights, the weights must be quantized because neuromorphic systems usually employ low-precision synapses in order to integrate a large number of synapses onto a chip [16]. This process incurs quantization error, leading to accuracy loss. Quantization-aware training can reduce this loss, but post-training quantization is often needed or preferred because it does not require time-consuming training or the full-size dataset, which is often unavailable for various reasons such as privacy and proprietary concerns [17].
In weight-stationary neuromorphic systems, units in neural networks often cannot be mapped one-to-one onto the neurons in the hardware because the neurons have a fixed, small number p of synapses. Thus, given a trained network, a unit with more than p connections must be assigned to and evaluated with more than one hardware neuron, which can be viewed as decomposing the unit into multiple units. There are many ways to perform this decomposition for a given unit. Because of the per-neuron dynamic fixed-point, this decomposition affects the accuracy loss during weight quantization. Moreover, it is possible to minimize the accuracy loss further if we are allowed to use extra hardware resources. In [12], the authors formulate this problem as a dual-objective optimization problem and propose two heuristics, the sorting-based algorithm (SBA) and the packing-based algorithm (PBA). The SBA finds a good partitioning in terms of accuracy loss without using extra neurons, whereas the PBA reduces the accuracy loss further at a marginal cost in extra neurons. However, the previous work targets matrix-based neuromorphic systems, which are not the mainstream neuromorphic architecture at the moment. For crossbar-based neuromorphic systems, this optimization problem becomes more difficult due to the constraints imposed by crossbars: each neuron can no longer be decomposed independently as in the previous work. In this paper, we address the neural network decomposition problem for mapping onto crossbar-based neuromorphic systems for the first time and propose a novel decomposition algorithm.
2. Problem Formulation
Our algorithm primarily targets RRAM-based computing systems such as PRIME [13].
Figure 1 illustrates our hardware model of RRAM-based computing systems. In the RRAM core, each cell consists of a single programmable resistor. Let us consider a bitline in the core. If voltages V_i are applied to the wordlines simultaneously as shown in Figure 1, the current from the i-th wordline to the bitline becomes I_i = V_i G_i, where G_i is the conductance of the i-th resistor, and the bitline current becomes the sum of the currents flowing from each wordline by Kirchhoff's current law: I = Σ_i V_i G_i. In this analog way, each bitline can compute a dot-product value, and the crossbar array can perform matrix-vector multiplication, which is the key primitive in machine learning workloads. Our model assumes a programmable shifter for each column so that the dot-product values can be shifted in parallel. Most recent RRAM-based computing systems are equipped with shifters right below the RRAM core [13,14,18]. Some RRAM-based systems simply use the fixed-point representation for weights, but considering the huge performance gap and the small hardware cost difference between the dynamic fixed-point and the fixed-point [19], it is more suitable to adopt the dynamic fixed-point as in PRIME. In the dynamic fixed-point representation, a group of numbers shares a scaling factor as in the fixed-point format, but the scaling factor is determined dynamically depending on the numbers in the group, as in the floating-point format. When the group size is 1, i.e., each number has its own scaling factor, the dynamic fixed-point reduces to a floating-point format. We assume that the scaling factor is a power of 2 and that the exponent values are programmed into the shifters.
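The analog dot-product and the per-column shifter described above can be sketched numerically. This is an idealized sketch with hypothetical names and sizes; a real core operates on currents and conductances rather than floating-point numbers.

```python
import numpy as np

def crossbar_mvm(G, V, shifts):
    """Idealized RRAM crossbar: G[i][j] is the conductance between
    wordline i and bitline j, and V[i] is the voltage on wordline i.
    By Kirchhoff's current law, bitline j collects sum_i V[i] * G[i][j];
    the per-column programmable shifter then scales column j by 2**shifts[j]."""
    currents = G.T @ V                  # one analog dot product per bitline
    return currents * np.exp2(shifts)   # per-column shifts applied in parallel

G = np.array([[1.0, 2.0],
              [3.0, 4.0]])              # 2 wordlines x 2 bitlines
V = np.array([0.5, 0.25])
out = crossbar_mvm(G, V, shifts=np.array([0, -1]))
print(out)  # bitline currents [1.25, 2.0], column 1 shifted down to 1.0
```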
Figure 2 shows a mapping example. Each unit with four connections is decomposed into two units with two connections each, to be mapped onto 2 × 3 crossbars. The outputs of the two decomposed neurons are added by a neuron whose weights are 1, but this type of neuron is not considered in this paper because very few of them are needed for any practical crossbar size. The red dotted part is mapped onto a single crossbar. Unlike the typical per-layer dynamic fixed-point, the target system allows a scaling factor per (decomposed) neuron, reducing quantization error. If we use the minimum number of neurons, 2, the quantization error in the mean-square sense becomes 0.0003. On the other hand, if we use one extra neuron, it is reduced to 0.0001 by taking advantage of one extra scaling factor. This motivates us to develop a mapping algorithm that uses hardware neurons wisely to reduce the accuracy loss induced by quantization error.
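The benefit of an extra scaling factor can be reproduced with a toy quantizer. This is a minimal sketch, not the paper's exact quantizer; the weight values, the 4-bit resolution, and the split point are made up for illustration.

```python
import numpy as np

def dfx_quantize(w, m=4):
    # Dynamic fixed-point: the whole group shares one power-of-2 scaling
    # factor 2**E, chosen so the largest magnitude fits in m signed bits.
    qmax = 2 ** (m - 1) - 1
    E = int(np.ceil(np.log2(np.max(np.abs(w)) / qmax)))
    return np.round(w / 2.0 ** E) * 2.0 ** E

w = np.array([0.5, 0.25, 0.01, 0.02])      # hypothetical neuron weights
whole = dfx_quantize(w)                     # one scaling factor for all four
split = np.concatenate([dfx_quantize(w[:2]), dfx_quantize(w[2:])])
mse_whole = np.mean((w - whole) ** 2)       # small values are flushed to zero
mse_split = np.mean((w - split) ** 2)       # extra factor fits the small values
```

Splitting the weights into two groups lets the second group use a much finer scaling factor, so `mse_split` comes out far smaller than `mse_whole`, mirroring the 0.0003 versus 0.0001 example above.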
Now we formally define our problem. The two parts of a bipartite graph are denoted by U and V, and the set of edges is denoted by E. The set of nodes of a tree is denoted by O, and the set of nodes at the i-th level is denoted by O_i. A layer of a neural network is represented by a bipartite graph G, where U and V are associated with the inputs and the outputs of the layer, respectively. A decomposition of a graph G is a family of edge-disjoint subgraphs of G such that every edge of G belongs to exactly one of the subgraphs.
The subgraphs in a decomposition can be decomposed further, so decompositions can be hierarchical. A hierarchical decomposition of a graph G is represented by a tree where each node is associated with a subgraph of G: the subgraphs associated with the children of a node form a decomposition of the subgraph associated with that node. We will consider two-level hierarchical decompositions only. We call the first-level decomposition the global decomposition, and the second-level decomposition the local decomposition. The set of all possible two-level decompositions is denoted by .
A graph G is mappable if |U| ≤ p and |V| ≤ q, i.e., it fits within a single crossbar. A mappable decomposition of a graph G is a two-level decomposition of G such that all subgraphs at the first level are mappable and all subgraphs at the second level are mappable. The set of all mappable decompositions is denoted by . When we map the layer according to a mappable decomposition, the weights are quantized using a dynamic fixed-point quantizer Q. The quantizer is applied on a leaf node basis, and the weights associated with a leaf node are quantized together. If W denotes the weight parameters of the layer, the quantized parameters are denoted by Q(W). Then, the neural network decomposition problem is formulated as a dual-objective combinatorial optimization problem: over all mappable decompositions, minimize both the loss increase J(Q(W)) − J(W) induced by quantization and the number of hardware neurons used, where J is the loss function. Considering the enormous solution space, it seems intractable to solve this problem optimally, so we propose heuristics for this problem.
4. Local Decomposition Methods
After the global decomposition by k-SDA, a local decomposition method is applied to each resulting subgraph independently. It selects some important vertices in the subgraph, and unlike VDA, the edges of each selected vertex are partitioned into two or more groups. In VDA, the vertices are mapped to the neurons in a crossbar one-to-one. If some vertices are mapped to two or more neurons, the quantization error can be reduced. To utilize the k spare neurons effectively, we need to select the most salient vertices, in the sense that reducing the quantization error of those vertices will have the greatest effect on the training loss.
4.1. Candidate Selection
To select the salient vertices, we rank the vertices in each subgraph using a score metric S. Consider a vertex v, and let w be the weights associated with v; the i-th element of a vector is denoted by the subscript i. The weights of v are quantized together by the dynamic fixed-point quantizer Q. A vertex with a negligible quantization error will have a negligible effect on the training loss. Thus, we can use the mean-square quantization error (MSQE) as a score metric, written as S_MSQE(v) = (1/|w|) Σ_i (w_i − Q(w)_i)².
However, the training loss has a different sensitivity to each vertex, and these sensitivities should be taken into account. Usually, this is done by considering the first-order derivative of the loss. However, when training is done, the first-order derivative is close to zero, so we use the second-order derivative as in Optimal Brain Damage (OBD) [20]. Let a_v denote the activation of a vertex v when the original weights W are used, and let â_v be the activation of v when the quantized weights Q(W) are used. Then, similar to OBD, the contribution of v to the loss increase can be approximated by the second-order term S_H(v) = (1/2)(∂²J/∂a_v²)(â_v − a_v)². Moreover, after training is done, the Fisher information is an approximation to the second-order derivative, so we can use the Fisher information [21], which is easier to compute because the standard backpropagation for stochastic gradient descent (SGD) can provide it. Thus, we can write S_F(v) = E[(∂J/∂a_v)²](â_v − a_v)².
These score metrics enable us to select a given number of neurons to be decomposed. Finally, we introduce ways to fill up the spare neurons using this selection method.
4.2. k Split-in-Two (KS2)
In this scheme, the k most salient vertices in the subgraph are selected. Then, the edges of each selected vertex are partitioned into two groups. This process is performed independently for each vertex, and this problem was explored thoroughly in our previous work [12], where the packing-based algorithm (PBA) and the sorting-based algorithm (SBA) were proposed. For the partitioning, we can use either of them.
4.3. One Split-in-k (1SK)
This heuristic selects the single most salient vertex in each subgraph. Then its edges are split into k + 1 partitions. For the split, we again use PBA or SBA. This algorithm works best if there is one vertex from which most of the accuracy loss comes.
Figure 5 illustrates how KS2 and 1SK fill up the spare neurons in a crossbar with two extra neurons. KS2 selects two neurons because there are two spare neurons, and each selected neuron is decomposed into two. 1SK selects the first neuron and splits it three ways, filling up the two spare neurons from the first neuron alone.
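The two filling schemes reduce to a simple allocation rule. A sketch with hypothetical names; `scores` would come from the metrics of Section 4.1.

```python
def allocate_spares(scores, k, scheme="KS2"):
    """Decide how many hardware neurons each vertex receives, given k
    spare neurons in the crossbar and per-vertex saliency scores.
    KS2: the k most salient vertices are each split in two.
    1SK: the single most salient vertex is split into k + 1 parts."""
    ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    parts = [1] * len(scores)              # default: one neuron per vertex
    if scheme == "KS2":
        for v in ranked[:k]:
            parts[v] = 2                   # split-in-two for the top-k vertices
    else:                                  # 1SK
        parts[ranked[0]] += k              # all spares go to the top vertex
    return parts

scores = [0.4, 0.1, 0.3]                   # hypothetical saliency scores
print(allocate_spares(scores, k=2, scheme="KS2"))  # [2, 1, 2]
print(allocate_spares(scores, k=2, scheme="1SK"))  # [3, 1, 1]
```

Both schemes consume exactly k extra neurons; they differ only in whether the spares are spread across k vertices or concentrated on one.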
5. Experimental Results
We have implemented the proposed decomposition algorithm using PyTorch [22]. We have used six convolutional neural networks, VGG11, VGG13, VGG16 [2], ResNet18, ResNet50 [23], and Mobilenet_v2 [24], and the ILSVRC2012 dataset. Their original accuracies on the dataset are 69.02%, 69.93%, 71.59%, 69.79%, 76.15%, and 71.88%, respectively. The original networks are mapped onto p × q crossbars (i.e., a crossbar has q neurons, each with p synapses). We use the per-neuron dynamic fixed-point format to represent weights (i.e., each neuron has a scaling factor that is shared across the weights of the neuron). The synapses in the crossbars can store m-bit weights. The scaling factor is set to 2^(E−b), where E is a non-negative integer determined dynamically and b is the exponent bias. We assume that each neuron has an e-bit storage for E; thus, the range of E becomes [0, 2^e − 1]. For all experiments, we use , and unless stated otherwise.
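The per-neuron exponent selection under the e-bit storage constraint can be sketched as below. The parameter values m = 4, e = 4, and b = 10 are hypothetical, chosen only to make the example concrete.

```python
import math

def choose_exponent(w, m=4, e=4, b=10):
    """Pick the stored exponent E for one neuron's weight group.
    The scaling factor is 2**(E - b); E is the smallest value that lets
    the largest magnitude fit in m signed bits, clamped to the range
    [0, 2**e - 1] representable in e bits of storage."""
    qmax = 2 ** (m - 1) - 1
    E = math.ceil(math.log2(max(abs(x) for x in w) / qmax)) + b
    return min(max(E, 0), 2 ** e - 1)

print(choose_exponent([0.9, 0.01]))   # 8  -> scaling factor 2**(8 - 10)
print(choose_exponent([1000.0]))      # 15 -> clamped to the e-bit maximum
```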
We use two metrics to evaluate the algorithms. The first is the "accuracy loss", defined as the original accuracy minus the accuracy of the mapped network. The second is the "neuron overhead", defined as the ratio of the number of extra neurons used to the minimum number of required neurons.
We have several options to choose from in designing our k-sparse decomposition algorithm (k-SDA). The chosen options are validated empirically. Figure 6 compares the two local decomposition methods for k-SDA. For the candidate selection, we use the mean-square quantization error (MSQE). Note that we use the sorting-based algorithm of [12] to decompose a neuron into two. KS2 clearly outperforms 1SK. This result suggests that decomposing a neuron into two already provides a sufficient reduction in the quantization error. Thus, we choose KS2 as our default local decomposition method.
Figure 7 compares our k-SDA(Hessian) and k-SDA(MSQE) to the baseline, VDA. VDA always uses the minimum number of required neurons and has zero neuron overhead, but it does not provide a knob to control the trade-off. Our two k-SDA variants reduce the accuracy loss significantly at a reasonable cost in extra neurons. Between them, k-SDA(Hessian) takes into account the sensitivity of the accuracy to the quantization error and can select better candidates to decompose. Moreover, it provides a knob to trade off neuron usage against accuracy. For the rest of the experiments, we use Hessian for the candidate selection and KS2(SBA) for the local decomposition.
Table 1 summarizes the results of VDA and our k-SDA. For k-SDA, we use . When the synapse resolution is low, our proposed method provides higher accuracy improvements, which suggests that we can potentially lower the synapse resolution by relying on the proposed technique. As is widely known, Mobilenet is very sensitive to quantization error because of its reduced number of parameters, requiring a high synapse resolution. Nevertheless, the improvement from the proposed method is consistent across the various networks.