1. Introduction
Artificial neural networks are becoming an integral part of modern-day life. This technology consists of two stages: a training phase and an inference phase. The training phase is computationally expensive and typically outsourced to cluster or cloud computing. It takes place only occasionally, perhaps only once. The inference phase is implemented on the device running the application and is repeated whenever the neural network is used. This work solely targets the inference phase after the neural network has been successfully trained.
The inference phase consists of scalar nonlinearities and matrix–vector multiplications. The former are much easier to implement than the latter. The target of this work is to reduce the computational cost of the following task: multiply an arbitrary vector with a constant matrix. At the first layer of the neural network, the arbitrary vector is the input to the neural network. At a subsequent layer, it is the output of the activation functions of the respective previous layer. The constant matrices are the weight matrices of the layers that were found in the training phase and that stay fixed for all inference cycles of the neural network.
The computing unit running the inference phase need not be a general-purpose processor. With neural networks being more and more frequently deployed in low-energy devices, it is attractive to employ dedicated hardware. For some of them, e.g., field programmable gate arrays or application-specific integrated circuits with a reprogrammable weight memory, e.g., realized in static random-access memory, the data center has the option to update the weight matrices whenever it wants to reconfigure the neural network. Still, the matrices stay constant most of the time. In this work, we will not address those updates, but focus on the most computationally costly effort: the frequent matrix–vector multiplications within the dedicated hardware.
Besides the matrix–vector multiplications, memory access is currently also considered a major bottleneck in the inference phase of neural networks. However, technological solutions to the memory access problem, e.g., stacked dynamic random-access memory utilizing through-silicon vias [1] or emerging nonvolatile memories [2], are being developed and are expected to be available soon. Thus, we will not address memory-access issues in this work. Note also that the use in neural networks is just one, though a very prominent one, of the many applications of fast matrix–vector multiplication. In fact, we were originally motivated by beamforming in wireless multiantenna systems [3,4], but think that neural networks are even better suited for our idea, as they update their matrices much less frequently. Fast matrix–vector products are also important for applications in other areas of signal processing, compressive sensing, numerical solvers for partial differential equations, etc. This opens up many future research directions based on linear computation coding.
Various works have addressed the problem of simplifying matrix–matrix multiplications utilizing certain recursions that result in subcubic time complexity of matrix–matrix multiplication (and matrix inversion) [5,6]. However, these algorithms and their more recent improvements, to the best of our knowledge, do not help for matrix–vector products. This work is not related to that group of ideas.
Various other studies have addressed the problem of simplifying matrix–vector multiplications in neural networks utilizing structures of the matrices, e.g., sparsity [7,8]. However, this approach comes with severe drawbacks: (1) It does not allow the training phase and the inference phase to be designed independently of each other. This restricts interoperability, hinders efficient training, and compromises performance [9]. (2) Sparsity alone does not necessarily reduce computational cost, as it may require higher accuracy, i.e., a larger word length for the nonzero matrix elements. In this work, we utilize neither structures of the trained matrices nor structures of the input data. The vector and matrix to be multiplied may be totally arbitrary. They may, but need not, contain independent identically distributed (IID) random variables, for instance.
It is not obvious that, without any specific structure in the matrix, significant computational savings are possible over state-of-the-art methods implementing matrix–vector multiplications. In this work, we will develop a theory to explain why such savings are possible and provide a practical algorithm that shows how they can be achieved. We also show that these savings are very significant for typical matrix sizes in present-day neural networks: by means of the proposed linear computation coding, the computational cost, measured in the number of additions and bit shifts, is reduced several times. A gain close to half the binary logarithm of the matrix size is very typical. Recent FPGA implementations of our algorithm [10] show that the savings counted in lookup tables are even higher than the savings counted in additions and bit shifts. In this paper, however, we are concerned with the theoretical and algorithmic side of linear computation coding. We leave details on reconfigurable hardware and neural networks as topics for future work.
The paper is organized as follows: In Section 2, the general concept of computation coding is introduced. A reader who is only interested in linear functions, but not in the bigger picture, may well skip this section and go directly to Section 3, where we review the state of the art and define a benchmark for comparison. In Section 4, we propose our new algorithm. Section 5 and Section 6 study its performance by analytic and simulative means, respectively. Section 7 discusses the trade-off between the cost and the accuracy of the computations. Section 8 summarizes our conclusions and gives an outlook on future work.
Matrices are denoted by boldface uppercase letters; vectors are not explicitly distinguished from scalar variables. The sets $\mathbb{Z}$ and $\mathbb{R}$ denote the integers and reals, respectively. The identity matrix, the all-zero matrix, the all-one matrix, the expectation operator, the sign function, matrix transposition, and Landau's big-O operator are denoted by $\mathbf{I}$, $\mathbf{0}$, $\mathbf{1}$, $\mathsf{E}[\cdot]$, $\mathrm{sign}(\cdot)$, ${(\cdot)}^{\mathsf{T}}$, and $\mathsf{O}(\cdot)$, respectively. Indices to constant matrices express their dimensions. The notation ${\left\|\cdot\right\|}_{0}$ counts the number of nonzero entries of the vector- or matrix-valued argument and is referred to as the zero norm. The inner product of two vectors is denoted by $\langle \cdot;\cdot\rangle$.
2. Computation Coding for General Functions
The approximation by an artificial neural network is the current state of the art for computing a multidimensional function efficiently. Other, yet undiscovered, approaches may exist as well. Thus, we define computation coding for general multidimensional functions. Subsequently, we discuss the practically important case of linear functions, i.e., matrix–vector products, in greater detail.
The best starting point to understand general computation coding is ratedistortion theory in lossy data compression. In fact, computation coding can be interpreted as a lossy encoding of functions with a side constraint on the computational cost of the decoding algorithm. As we will see in the sequel, it shares a common principle with lossy source coding: Random codebooks, if suitably constructed, usually perform well.
Computation coding consists of computation encoding and computation decoding. Roughly speaking, computation encoding is used to find an approximate representation $m(x)$ for a given and known function $f(x)$ such that $m(x)$ can be calculated for most arguments x in some support $\mathcal{X}$ with low computational cost and $m(x)$ approximates $f(x)$ with high accuracy. Computation decoding is the calculation of $m(x)$. Formal definitions are as follows:
Definition 1. Given a probability space $(\mathcal{X},{P}_{\mathcal{X}})$ and a metric $d:\mathcal{F}\times \mathcal{F}\mapsto \mathbb{R}$, a computation encoding with distortion D for given function $f:\mathcal{X}\mapsto \mathcal{F}$ is a mapping $m:\mathcal{X}\mapsto \mathcal{M}\subseteq \mathcal{F}$ such that ${\mathsf{E}}_{x\in \mathcal{X}}\left[d(f(x),m(x))\right]\le D$.
Definition 2. A computation decoding with computational cost C for given operator $\mathsf{C}$ is an implementation of the mapping $m:\mathcal{X}\mapsto \mathcal{M}$ such that $\mathsf{C}[m(x)]\le C$ for all $x\in \mathcal{X}$.
The computational cost operator $\mathsf{C}[m(\xb7)]$ measures the cost to implement the function $m(\xb7)$. It reflects the properties of the hardware that executes the computation.
Computation coding can be regarded as a generalization of lossy source coding. If we consider the identity function $f(x)=x$ and the limit $C\to \infty $, computation coding reduces to lossy source coding with $m(x)$ being the codeword for x. Ratedistortion theory analyzes the tradeoff between distortion D and the number of distinct codewords. In computation coding, we are interested in the tradeoff between distortion D and computational cost C. The number of distinct codewords is of no or at most subordinate concern.
The expectation operator in the distortion constraint of Definition 1 is natural to readers familiar with rate-distortion theory. From a computer science perspective, it follows the philosophy of approximate computing [11]. Nevertheless, hard constraints on the accuracy of computation can be addressed via distortion metrics based on the infinity norm, which enforces a maximum tolerable distortion.
The computational cost operator may also include an expectation. Whether this is appropriate or not depends on the goal of the hardware design. If the purpose is minimum chip area, one usually must be able to deal with the worst case and an expectation can be inappropriate. Power consumption, on the other hand, overwhelmingly correlates with average computational cost.
The above definitions shall not be confused with related, but different, definitions in the literature on approximation theory [12]. There, the purpose is rather to prove theoretical achievability bounds than to evaluate algorithms. The approach to distortion is similar. Complexity, however, is measured as the growth rate of the number of bits required to achieve a given upper bound on distortion. This is quite different from the computational cost in Definition 2.
4. Proposed Scheme for Linear Computation Coding
The shortcoming of the mailman algorithm is the restriction that the wiring matrix must be a permutation. Thus, it performs no computations except for multiplying the codebook matrix with the output of the wiring. The size of the codebook matrix grows exponentially with the number of computations it executes. As a result, the matrix dimension must be huge to achieve even reasonable accuracy.
We cure this shortcoming by allowing for a few additional entries in the wiring matrix. To keep the computational cost as low as possible, we follow the philosophy of the CORDIC algorithm and allow all nonzero entries to be signed powers of two only. We do not restrict the wiring matrix to be a rotation, since there is no convincing reason to do so. The computational cost is dominated by the number of nonzero entries in the wiring matrix; it is not particularly related to the geometric interpretation of this matrix.
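To make the shift-and-add idea concrete, here is a minimal sketch (not the paper's implementation; the sparse `(col, sign, exp)` encoding and the function name are our own illustrative choices) of how a wiring matrix whose nonzero entries are signed powers of two acts on an integer-valued vector using only additions and bit shifts:

```python
def wiring_apply(wiring, x):
    """y = W x for a sparse wiring matrix W whose nonzero entries are signed
    powers of two, encoded per row as (col, sign, exp) triples with
    W[row][col] = sign * 2**exp. Only additions and bit shifts are used;
    x is assumed to hold integers (fixed-point values)."""
    y = []
    for row in wiring:
        acc = 0
        for col, sign, exp in row:
            term = x[col] << exp      # fixed shift: essentially free in hardware
            acc += term if sign > 0 else -term
        y.append(acc)
    return y

# W = [[4, 0], [-2, 1]] in the sparse (col, sign, exp) encoding:
W = [[(0, +1, 2)],                # row 0:  +2^2 * x[0]
     [(0, -1, 1), (1, +1, 0)]]    # row 1:  -2^1 * x[0] + 2^0 * x[1]
assert wiring_apply(W, [3, 5]) == [12, -1]
```

The number of additions equals the number of nonzero entries minus the number of rows, which is exactly the cost measure used later in this section.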
The important point, as the analysis in Section 5 will show, is to keep the aspect ratio of the codebook matrix exponential. This means the number of rows N relates to the number of columns K as

$$K={2}^{RN}$$

for some constant R, which is ${\mathrm{log}}_{2}T$ in the mailman algorithm, but can also take other values, in general. Thus, the number of columns scales exponentially with the number of rows. Alternatively, one may transpose all matrices and operate with logarithmic aspect ratios. However, codebook matrices that are not far from square perform poorly.
4.1. Aspect Ratio
An exponential or logarithmic aspect ratio is not a restriction of generality. In fact, it gives more flexibility than a linear or polynomial aspect ratio. Any matrix with a less extreme aspect ratio can be cut horizontally or vertically into several submatrices with more extreme aspect ratios, and the proposed algorithm can be applied to these submatrices independently. A square $256\times 256$ matrix, for instance, can be cut into 32 submatrices of size $8\times 256$. Even matrices whose aspect ratio is super-exponential or sub-logarithmic do not pose a problem: they can be cut into submatrices vertically or horizontally, respectively.
Horizontal cuts are trivial. We simply write the matrix–vector product $\mathbf{T}x$ as

$$\mathbf{T}x=\begin{bmatrix}{\mathbf{T}}_{1}\\ {\mathbf{T}}_{2}\\ \vdots \end{bmatrix}x=\begin{bmatrix}{\mathbf{T}}_{1}x\\ {\mathbf{T}}_{2}x\\ \vdots \end{bmatrix}$$

such that each submatrix has an exponential aspect ratio and apply our matrix decomposition algorithm to each submatrix. Vertical cuts work as follows:

$$\mathbf{T}x=\left[{\mathbf{T}}_{1}\phantom{\rule{4pt}{0ex}}{\mathbf{T}}_{2}\phantom{\rule{4pt}{0ex}}\cdots \right]\begin{bmatrix}{x}_{1}\\ {x}_{2}\\ \vdots \end{bmatrix}={\mathbf{T}}_{1}{x}_{1}+{\mathbf{T}}_{2}{x}_{2}+\cdots$$

Here, the input vector x must be cut accordingly. Furthermore, the submatrix–subvector products ${\mathbf{T}}_{1}{x}_{1},{\mathbf{T}}_{2}{x}_{2},\cdots$ need to be summed up, which requires only a few additional computations. In the sequel, we assume that the aspect ratio is either exponential, i.e., $\mathbf{T}$ is wide, or logarithmic, i.e., $\mathbf{T}$ is tall, without loss of generality.
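The two cuts can be sketched as follows; this is an illustrative toy example (the function names and the small $4\times 4$ size are our own choices, standing in for the $8\times 256$ example above), verifying that stacking the products of row blocks and summing the products of column blocks both reproduce the full product:

```python
def matvec(T, x):
    """Plain matrix-vector product for reference."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in T]

def horizontal_cut(T, x, rows_per_block):
    """Stack the products of row blocks: T x = [T1 x; T2 x; ...]."""
    y = []
    for i in range(0, len(T), rows_per_block):
        y.extend(matvec(T[i:i + rows_per_block], x))
    return y

def vertical_cut(T, x, cols_per_block):
    """Sum the products of column blocks: T x = T1 x1 + T2 x2 + ..."""
    y = [0] * len(T)
    for j in range(0, len(T[0]), cols_per_block):
        block = [row[j:j + cols_per_block] for row in T]
        yb = matvec(block, x[j:j + cols_per_block])
        y = [a + b for a, b in zip(y, yb)]   # the few extra additions noted above
    return y

T = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, -1, 2, 0]
assert horizontal_cut(T, x, 2) == matvec(T, x)
assert vertical_cut(T, x, 2) == matvec(T, x)
```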
4.2. General Wiring Optimization
For a given distortion measure $d(\cdot ,\cdot )$ (see Definition 1 for details), a given upper limit C on the computational cost, a given wide target matrix $\mathbf{T}$, and a given codebook matrix $\mathbf{B}$, we find the wiring matrix $\mathbf{W}$ such that

$$\mathbf{W}=\underset{\mathsf{\Omega}:\mathsf{C}[\mathsf{\Omega}x]\le C}{\mathrm{argmin}}\phantom{\rule{4pt}{0ex}}\mathsf{E}\left[d(\mathbf{T}x,\mathbf{B}\mathsf{\Omega}x)\right],$$

where the operator $\mathsf{C}[\cdot]$ measures the computational cost.
For a tall target matrix $\mathbf{T}$, run the decomposition algorithm (11) with the transpose of $\mathbf{T}$ and transpose its output. In that case, the wiring matrix multiplies the codebook matrix from the left, not from the right. Unless specified otherwise, we will consider wide target matrices in the sequel, without loss of generality.
4.2.1. Multiple Wiring Matrices
Wiring optimization allows for a recursive procedure. The argument of the computational cost operator $\mathsf{C}[\cdot]$ is a matrix–vector multiplication itself. It can also benefit from linear computation coding by a decomposition of $\mathsf{\Omega}$ into a codebook and a wiring matrix via (11). However, it is not important that such a decomposition approximates $\mathsf{\Omega}$ very closely; only the overall distortion of the linear function $\mathbf{T}x$ is relevant. This leads to a recursive procedure to decompose the target matrix $\mathbf{T}$ into the product of a codebook matrix $\mathbf{B}$ and multiple wiring matrices such that the wiring matrix $\mathbf{W}$ in (6) is given as

$$\mathbf{W}={\mathbf{W}}_{1}{\mathbf{W}}_{2}\cdots {\mathbf{W}}_{L}$$

for some finite number L of wiring matrices. Any of those wiring matrices is found recursively via

$${\mathbf{W}}_{\ell}=\underset{\mathsf{\Omega}:\mathsf{C}[\mathsf{\Omega}x]\le {C}_{\ell}}{\mathrm{argmin}}\phantom{\rule{4pt}{0ex}}\mathsf{E}\left[d\left(\mathbf{T}x,\mathbf{B}{\mathbf{W}}_{1}\cdots {\mathbf{W}}_{\ell -1}\mathsf{\Omega}x\right)\right]$$

with ${\sum}_{\ell =1}^{L}{C}_{\ell}=C$. This means that $\mathbf{B}$ serves as a codebook for ${\mathbf{W}}_{1}$ and $\mathbf{B}{\mathbf{W}}_{1}\cdots {\mathbf{W}}_{\ell -1}$ serves as a codebook for ${\mathbf{W}}_{\ell}$.
Multiple wiring matrices are useful if the codebook matrix $\mathbf{B}$ is computationally cheap but poor from a distortion point of view. The product of a computationally cheap codebook matrix $\mathbf{B}$ with a computationally cheap wiring matrix ${\mathbf{W}}_{1}$ can serve as a codebook $\mathbf{B}{\mathbf{W}}_{1}$ for subsequent wiring matrices that performs well with respect to both distortion and computational cost.
Multiple wiring matrices can also be useful if the hardware favors serial over fully parallel processing. In this case, circuitry for multiplying with ${\mathbf{W}}_{\ell}$ can be reused for the subsequent multiplication with ${\mathbf{W}}_{\ell -1}$. Note that in the decomposition phase, wiring matrices are preferably calculated in increasing order of the index ℓ, while in the inference phase, they are applied in decreasing order of ℓ, at least for wide matrices.
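The order of operations in the inference phase can be sketched as follows (a hedged illustration; the helper names are our own): the product $\mathbf{B}{\mathbf{W}}_{1}\cdots {\mathbf{W}}_{L}x$ is evaluated right to left, so the wiring matrices are applied in decreasing order of their index before the codebook matrix.

```python
def matvec(A, x):
    """Plain matrix-vector product for reference."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def apply_decomposition(B, Ws, x):
    """Evaluate B W1 ... WL x right to left: W_L is applied first,
    W_1 last, and the codebook matrix B at the end."""
    for W in reversed(Ws):
        x = matvec(W, x)
    return matvec(B, x)
```

In a serial hardware realization, the loop body corresponds to the reused multiplier circuitry mentioned above.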
4.2.2. Decoupling into Columns
The optimization problems (11) and (13) are far from trivial to solve. A pragmatic approach to simplify them is to decouple the optimization of the wiring matrix column by column.

Let ${t}_{k}$ and ${w}_{k}$ denote the $k$th columns of $\mathbf{T}$ and $\mathbf{W}$, respectively. We approximate the solution to (11) columnwise as

$${\tilde{w}}_{k}=\underset{\omega :\mathsf{C}[\langle \omega ;\xi \rangle ]\le s}{\mathrm{argmin}}\phantom{\rule{4pt}{0ex}}d\left({t}_{k},\mathbf{B}\omega \right)$$

with $s=C/K$. This means we do not approximate the linear function $\mathbf{T}x$ with respect to the joint statistics of its input x. We only approximate the columns of the target matrix $\mathbf{T}$, ignoring any information on the input of the linear function. While in (11) the vector x may have particular properties, e.g., restricted support or certain statistics that are beneficial to reduce distortion or computational cost, the vector $\xi$ in (14) is not related to x and must be general.

The wiring matrix $\tilde{\mathbf{W}}$ resulting from (14) will fulfill the constraint $\mathsf{C}\left[\tilde{\mathbf{W}}x\right]\le C$ only approximately. The computational cost operator does not decouple columnwise, in general.
4.3. Computational Cost
To find a wiring matrix in practice, we need to measure computational cost. In the sequel, we do this by solely counting additions. Sign changes are cheaper than additions, and their number is often roughly proportional to the number of additions. Shifts are much cheaper than additions; fixed shifts actually come at no cost in dedicated hardware such as ASICs and FPGAs. Multiplications are counted as multiple shifts and additions.
We define the nonnegative function $\mathrm{csd}:\mathbb{R}\mapsto \mathbb{Z}$ as follows:

$$\mathrm{csd}(t)=\mathrm{min}\left\{S\ge 0:t={\sum}_{s=1}^{S}\pm {2}^{{c}_{s}},\phantom{\rule{4pt}{0ex}}{c}_{s}\in \mathbb{Z}\right\}$$

This function counts how many signed binary digits are required to represent the scalar t, cf. Example 2. The number of additions to directly calculate the matrix–vector product $\mathsf{\Omega}x$ via the CSD representation of $\mathsf{\Omega}\in {\mathbb{R}}^{N\times K}$ is thus given by the function $\mathrm{csda}:{\mathbb{R}}^{N\times K}\mapsto \mathbb{Z}$ as

$$\mathrm{csda}(\mathsf{\Omega})={\sum}_{n=1}^{N}\left({\sum}_{k=1}^{K}\mathrm{csd}({\omega}_{k,n})-1\right)$$

In (16), ${\omega}_{k,n}$ denotes the $(n,k)$th element of $\mathsf{\Omega}$. The function $\mathrm{csda}(\cdot)$ is additive with respect to the rows of its argument. With respect to the column index, we have to consider that adding $k>1$ terms requires only $k-1$ additions. This means the function $\mathrm{csda}(\cdot)$ does not decouple columnwise (although it does decouple rowwise). For the columnwise decoupled wiring optimization in Section 4.2.2, this means that $\mathrm{csda}(\tilde{\mathbf{W}})\ne \mathrm{csda}(\mathbf{W})$ in general.
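The two counting functions can be sketched as follows; this is our own illustration, assuming integer matrix entries (real entries would first be scaled to fixed point), and the convention that an all-zero row contributes no additions is our reading rather than a statement from the text:

```python
def csd(t):
    """Number of nonzero digits in the canonical signed-digit (nonadjacent
    form) representation of the integer t."""
    t = abs(int(t))
    count = 0
    while t:
        if t & 1:
            t -= 1 if t % 4 == 1 else -1   # digit +1 or -1, chosen to clear runs of ones
            count += 1
        t >>= 1
    return count

def csda(M):
    """Additions needed to compute M x directly from the CSD digits:
    per row, the total digit count minus one (an all-zero row is taken
    to contribute nothing)."""
    return sum(max(sum(csd(v) for v in row) - 1, 0) for row in M)

assert csd(7) == 2                 # 7 = 8 - 1: two signed digits instead of three ones
assert csd(12) == 2                # 12 = 16 - 4
assert csda([[7, 12], [0, 1]]) == (2 + 2 - 1) + (1 - 1)
```

The `csd(7)` case illustrates why signed digits pay off: the plain binary form 111 needs three digits, while $8-1$ needs only two.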
Setting

$$\mathsf{C}\left[\mathsf{\Omega}x\right]=\mathrm{csda}(\mathsf{\Omega})$$

in (13), we measure computational cost in terms of elementwise additions. Our goal is to find algorithms for which

$${\sum}_{\ell =1}^{L}\mathrm{csda}({\mathbf{W}}_{\ell})\le C.$$

Although the optimization in (13) implicitly ensures this inequality, it is not clear how to implement such an algorithm in practice. Even if we restrict it to $\mathsf{C}\left[\langle \omega ;x\rangle \right]=\mathrm{csda}({\omega}^{\mathsf{T}})$, the optimization (14) is still combinatorial in $\mathrm{csda}({\omega}^{\mathsf{T}})$.
If the matrix contains only zeros and signed powers of two, the function $\mathrm{csda}(\cdot)$ can be written as

$$\mathrm{csda}(\mathsf{\Omega})={\left\|\mathsf{\Omega}\right\|}_{0}-N$$

in terms of the zero norm. The approximation (20) was used in the preliminary conference versions of this work [4,42,43]. In the sequel, we continue with the exact number of additions as given in (16).
While counting the number of additions by means of the zero norm is helpful to emphasize the similarities of linear computation coding with compressive sensing, it enforces one, though minor, unnecessary restriction: the constraint for the matrix $\mathsf{\Omega}$ to contain signed powers of 2 as nonzero elements. The wiring matrix $\mathbf{W}$ forms linear combinations of the columns of the codebook matrix $\mathbf{B}$, cf. (6). If we form linear combinations only of different codewords, the zero norm formulation in (19) is perfectly fine. If we do not want to be bound by the unnecessary constraint that codewords may not be used twice within one linear combination, we have to resort to the more general formulation in (16). While for large matrices the performance is hardly affected, for small matrices this does make a difference. This is one of several reasons why, in the preliminary conference versions of this work [4,42,43], the decomposition algorithm does not perform so well for small matrices.
4.4. Codebook Design
For codebook matrices, the computational cost depends on the way they are designed. Besides being easy to multiply with a given vector, a codebook matrix should be designed such that no pair of columns is collinear. A column that is collinear with another one is almost obsolete: it hardly helps to reduce the distortion, while it still costs additions to compute. In an early conference version of this work [42], we proposed finding the codebook matrix by sparse quantization of the target matrix. While this results in significant savings of computational cost over the state of the art, there are even better designs for codebook matrices. Three of them are detailed in the sequel.
4.4.1. Binary Mailman Codebook
In the binary mailman codebook, only the all-zero column is obsolete. It is shown in Appendix B that the multiplication of the binary mailman matrix with an arbitrary vector requires at most $2K-5$ and $K-2$ additions for column vectors and row vectors, respectively. The main issue with the binary mailman codebook is its lack of flexibility: it requires the matrix dimensions to fulfill $K={2}^{N}$. This may restrict its application.
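The cheap multiplication rests on the recursive structure of the binary mailman matrix, which can be sketched as below. This is our own illustration: the row convention $A[i][k]=(k\gg i)\,\&\,1$ is an assumption, and we make no claim that this sketch attains the exact $2K-5$ count of Appendix B; it only shows that additions suffice.

```python
def mailman_matvec(x):
    """A x for the N x 2^N binary mailman matrix with A[i][k] = (k >> i) & 1,
    computed recursively with additions only: the top N-1 rows act on the
    elementwise sum of the two halves of x, the last row is the plain sum
    of the upper half."""
    if len(x) == 2:
        return [x[1]]                    # base case: A = [0 1]
    half = len(x) // 2
    lo, hi = x[:half], x[half:]
    return mailman_matvec([a + b for a, b in zip(lo, hi)]) + [sum(hi)]
```

For example, with $N=3$ and a length-8 input, the result agrees with the direct product against the matrix whose $k$th column is the binary expansion of $k$.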
4.4.2. Two-Sparse Codebook
We choose the alphabet $\mathcal{S}\subset \{0,\pm {2}^{0},\pm {2}^{1},\pm {2}^{2},\cdots \}$ as a subset of the signed powers of two augmented by zero. Then, we find K vectors of length N such that no pair of vectors is collinear and each vector has zero norm equal to either 1 or 2. For sufficiently large sizes of the subset, those vectors always exist. These vectors are the columns of the codebook matrix; their ordering is irrelevant. It turns out to be useful to restrict the magnitude of the elements of $\mathcal{S}$ to the minimum that is required to avoid collinear pairs.
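One way such a codebook could be enumerated is sketched below. This is our own construction, not taken from the paper: fixing the first nonzero entry of each column to $+1$ is merely a convenient normalization that rules out collinear pairs, since within a support pair the remaining entry alone then identifies the direction.

```python
from itertools import combinations

def two_sparse_codebook(N, K, max_exp=2):
    """K pairwise non-collinear columns of length N with zero norm 1 or 2,
    nonzero entries drawn from signed powers of two (first nonzero fixed
    to +1 as a normalization)."""
    alphabet = [s * 2 ** e for e in range(max_exp + 1) for s in (+1, -1)]
    cols = []
    for i in range(N):                       # zero norm 1: one column per axis
        col = [0] * N
        col[i] = 1
        cols.append(col)
        if len(cols) == K:
            return cols
    for i, j in combinations(range(N), 2):   # zero norm 2: supports {i, j}
        for b in alphabet:
            col = [0] * N
            col[i], col[j] = 1, b
            cols.append(col)
            if len(cols) == K:
                return cols
    return cols   # fewer than K columns found: enlarge max_exp
```

For $N=4$ and $K=16$, the enumeration yields 16 columns with pairwise distinct directions.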
4.4.3. Self-Designing Codebook
We set $\mathbf{B}={\mathbf{B}}_{0}{\mathbf{B}}_{1}$ with ${\mathbf{B}}_{0}=\left[{\mathbf{I}}_{N}\phantom{\rule{4pt}{0ex}}{\mathbf{0}}_{N\times (K-N)}\right]$ and find the $K\times K$ matrix ${\mathbf{B}}_{1}$ via (14) for $C=K$, interpreting it as a wiring matrix for some given auxiliary target matrix $\tilde{\mathbf{T}}$. The auxiliary target matrix may, but need not, be identical to $\mathbf{T}$. The codebook designs itself, taking the auxiliary target matrix as a model.
4.4.4. Codebook Evolution
If multiple wiring matrices are used, the codebook evolves towards the target matrix. For multiple wiring matrices, the previous approximation of the target matrix serves as a codebook. Thus, with an increasing number of wiring matrices, the codebook gets closer and closer to the target matrix, no matter what initial codebook was used.
Codebook evolution can become a problem if the target matrix is not a suitable codebook, e.g., if it contains collinear columns or is rank-deficient. In such cases, multiple wiring matrices should be avoided or their number kept small.
Codebook evolution can also be helpful. This is the case if the original codebook is worse than the target matrix, e.g., because it shall be computationally very cheap, as for the self-designing codebook.
4.4.5. Cutting Diversity
Target matrices that are neither wide nor tall are cut into wide or tall submatrices via (9) and (10), respectively. However, there are various ways to cut them: there is no need to form each submatrix from adjacent rows or columns of the original matrix. In fact, the choice of rows or columns is arbitrary. Some cuts may lead to submatrices that are good codebooks, other cuts to worse ones. These various possible cuts provide many options, which allow us to avoid submatrices that are bad codebooks. They provide diversity to ensure a certain quality in case of codebook evolution.
4.5. Greedy Wiring
Greedy wiring is one practical way to cope with the combinatorial nature of (14). It is demonstrated in Section 5 and Section 6 to perform well and is briefly summarized below:
1. Start with $s=0$ and $\omega ={\mathbf{0}}_{N\times 1}$.
2. Update $\omega$ such that it changes in at most a single component.
3. Increment s.
4. If $s\le C/K$, go to step 2.
For quadratic distortion measures, this algorithm is equivalent to matching pursuit [44]. Note that orthogonal matching pursuit as in [45] is not applicable, since restricting the coefficients to signed powers of two results in a generally suboptimal least-squares solution that does not necessarily satisfy the orthogonality property.
4.6. Pseudocode of the Algorithm Used for Simulations
In Section 4, many options for linear computation coding are presented, with various trade-offs between performance and complexity; some of them are even NP-hard, which prevents them from being implemented unless the target matrices are very small. In order to clarify the algorithm we used in our simulation results, we provide its pseudocode here.

Algorithm 1 requires a zero-mean target matrix $\mathbf{T}$ as input. The number of additions per row for the ℓth wiring matrix is a free design variable, which is conveniently set to unity. Any initial codebook can be used. Algorithm 1 calls the subroutine Algorithm 2 to perform the decomposition in (14) by means of greedy wiring.
Algorithm 1 Algorithm used in the simulation results.
1: procedure MatrixFactorization
2:   $\mathbf{T}\leftarrow$ zero-mean target matrix
3:   $\mathbf{B}\leftarrow$ identity matrix of the same size as $\mathbf{T}$, or another codebook matrix
4:   ${S}_{\ell}\leftarrow$ number of additions per row for the $\ell$th wiring matrix
5:   $\ell \leftarrow 0$
6:   loop:
7:   if $\mathbf{T}$ and $\mathbf{B}$ differ too much then
8:     $\ell \leftarrow \ell +1$
9:     $\mathbf{W}(\ell )\leftarrow$ SubroutineGreedyWiring($\mathbf{T},\mathbf{B},{S}_{\ell}$)
10:     $\mathbf{B}\leftarrow \mathbf{W}(\ell )\,\mathbf{B}$
11:     goto loop
12:   else
13:     return the matrix factors $\mathbf{W}(1),\dots ,\mathbf{W}(\ell )$
Algorithms 1 and 2 are suited for tall matrices, as used in the simulation section. This is in contrast to the wide matrices used in Section 4.2 to Section 4.5. For wide instead of tall matrices, just transpose both inputs and outputs of the two algorithms.
Algorithm 2 Subroutine used in Algorithm 1.
1: procedure SubroutineGreedyWiring($\mathbf{T},\mathbf{B},S$)
2:   $k\leftarrow$ number of rows in $\mathbf{T}$
3:   $\mathbf{W}\leftarrow k\times k$ all-zero matrix
4:   outer loop:
5:   if $S\ge 0$ then
6:     $k\leftarrow$ number of rows in $\mathbf{T}$
7:     inner loop:
8:     if $k>0$ then
9:       $n\leftarrow$ index of the row in $\mathbf{B}$ that, if scaled by a signed power of 2, is closest to the $k$th row of $\mathbf{T}$
10:       $\mathbf{W}(k,n)\leftarrow \mathbf{W}(k,n)+$ the signed power of 2 that was used to find $n$
11:       $k$th row of $\mathbf{T}\leftarrow k$th row of $\mathbf{T}-\mathbf{W}(k,n)\times n$th row of $\mathbf{B}$
12:       $k\leftarrow k-1$
13:       goto inner loop
14:     $S\leftarrow S-1$
15:     goto outer loop
16:   else
17:     return the matrix $\mathbf{W}$
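For reference, the two listings can be condensed into the following hedged Python sketch (squared distortion, one addition per row per wiring matrix, identity-like initial codebook; the fixed factor count as a stopping rule and the nearest-power-of-two rounding are our simplifications, not part of the listings above):

```python
import math

def pow2_quantize(c):
    """Nearest signed power of two to c (0 stays 0); illustrative rounding."""
    if c == 0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(c))), c)

def greedy_wiring(T, B, adds_per_row=1):
    """Algorithm 2, condensed: build W such that W B approximates T, each row
    of W receiving at most adds_per_row + 1 signed-power-of-two entries."""
    W = [[0.0] * len(B) for _ in T]
    R = [row[:] for row in T]                       # residual rows
    for _ in range(adds_per_row + 1):
        for k in range(len(T)):
            best = None
            for n, b in enumerate(B):
                nb = sum(v * v for v in b)
                if nb == 0.0:
                    continue                        # skip all-zero codebook rows
                c = pow2_quantize(sum(r * v for r, v in zip(R[k], b)) / nb)
                err = sum((r - c * v) ** 2 for r, v in zip(R[k], b))
                if best is None or err < best[0]:
                    best = (err, n, c)
            _, n, c = best
            W[k][n] += c
            R[k] = [r - c * v for r, v in zip(R[k], B[n])]
    return W

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def factorize(T, num_factors=6):
    """Algorithm 1, condensed: the identity-like codebook evolves towards T
    through successive wiring matrices."""
    B = [[float(i == j) for j in range(len(T[0]))] for i in range(len(T))]
    Ws = []
    for _ in range(num_factors):
        W = greedy_wiring(T, B)
        B = matmul(W, B)                            # codebook evolution
        Ws.append(W)
    return Ws, B                                    # B now approximates T
```

Running `factorize` on a small zero-mean target matrix yields wiring matrices with at most two signed-power-of-two entries per row, whose product approximates the target increasingly well as factors are added.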
5. Performance Analysis
In order to analyze the expected distortion, we resort to a columnwise decoupling of the wiring optimization into target vectors and greedy wiring, as in Section 4.2.2 and Section 4.5, respectively. We assume that the codebook and the target vectors are IID Gaussian random vectors and analyze the mean-square distortion for this random ensemble. The IID Gaussian codebook is chosen solely because it simplifies the performance analysis. In practice, the IID Gaussian random matrix $\mathbf{B}$ must be replaced by a codebook matrix with low computational cost but similar performance. Simulation results in Section 6 will show that practical codebooks perform very similarly to IID Gaussian ones.
5.1. Exponential Aspect Ratio
The key point for the good performance of the multiplicative matrix decomposition in (6) is the exponential aspect ratio: the number of columns K of the codebook matrix scales exponentially with the number of its rows N. For a linear computation code, we define the code rate as

$$R=\frac{{\mathrm{log}}_{2}K}{N}.$$

The code rate is a design parameter that, as we will see later on, has some impact on the trade-off between distortion and computational cost.
The exponential scaling of the aspect ratio is fundamental. It is a consequence of the extreme-value statistics of large-dimensional random vectors: Consider the correlation coefficients (inner products normalized by the Euclidean norms) of N-dimensional real random vectors with IID entries in the limit $N\to \infty$. For any set of those vectors whose size is polynomial in N, the squared maximum of all correlation coefficients converges to zero as $N\to \infty$ [46]. Thus, the angle $\alpha$ in Figure 1 becomes a right angle and the norm of the angle error is lower bounded by the norm of the target vector. However, for an exponentially large set of size ${2}^{RN}$ with rate $R>0$, the limit for $N\to \infty$ is strictly positive and given by rate–distortion theory as $1-4^{-R}$ [47]. The asymptotic squared relative error of approximating a target vector by an optimal real scaling of the best codeword is therefore $4^{-R}$. The residual error vector can be approximated by another vector of the exponentially large set to bring the total squared error down to $4^{-2R}$. Applying that procedure s times, the squared error decays exponentially in s. In practice, the scale factor cannot be an arbitrary real number, but must be quantized. This additional error is illustrated in Figure 1 and labeled distance error, as opposed to the previously discussed angle error.
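The successive-refinement idea described above can be sketched numerically. The following is a minimal illustration, not the algorithm proposed later in this work: it assumes an IID Gaussian codebook, quantizes the scale factor to the nearest power of two, and greedily re-approximates the residual in each step.

```python
import numpy as np

rng = np.random.default_rng(0)

N, R = 12, 1.0
K = int(2 ** (R * N))                      # exponential aspect ratio K = 2^(RN)
B = rng.standard_normal((N, K))            # IID Gaussian codebook
t = rng.standard_normal(N)
t /= np.linalg.norm(t)                     # unit-norm target vector

def approximate(t, B, steps):
    """Greedy s-step approximation of t by power-of-two-scaled codewords."""
    norms = np.linalg.norm(B, axis=0)
    approx = np.zeros_like(t)
    for _ in range(steps):
        r = t - approx                                        # current residual
        k = np.argmax(np.abs(B.T @ r) / norms)                # best codeword
        v = (B[:, k] @ r) / norms[k] ** 2                     # optimal real scale
        v_q = np.sign(v) * 2.0 ** np.round(np.log2(abs(v)))   # nearest power of two
        approx += v_q * B[:, k]
    return approx

errs = [np.linalg.norm(t - approximate(t, B, s)) ** 2 for s in (1, 2, 3)]
print(errs)   # squared error decays roughly geometrically in the number of steps
```

Each step strictly reduces the residual norm, and the error decays roughly exponentially in the number of steps, in line with the argument above.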
5.2. Angle Error
Consider a unit-norm target vector $t\in {\mathbb{R}}^{N}$ that shall be approximated by a scaled version of one out of K codewords ${b}_{k}\in {\mathbb{R}}^{N}$ that are random and jointly independent. Denoting the angle between the target vector t and the codeword ${b}_{k}$ by ${\alpha}_{k}$, we can write the norm of the angle error as
$$ \|a_k\| = \sin \alpha_k . $$
The correlation coefficient between the target vector t and the codeword ${b}_{k}$ is given as
$$ \rho_k = \frac{t^{\mathsf{T}} b_k}{\|t\|\,\|b_k\|} = \cos \alpha_k . $$
The angle error and the correlation coefficient are thus related by
$$ \|a_k\|^2 = 1 - \rho_k^2 . $$
We will study the statistical behavior of the correlation coefficient in order to learn about the minimum angle error.
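These geometric relations are easy to verify numerically. The following sketch (with an arbitrary Gaussian codeword; all names are chosen for illustration) checks that the angle error is orthogonal to the codeword and that its squared norm equals $1-\rho_k^2$ for a unit-norm target.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 8
t = rng.standard_normal(N)
t /= np.linalg.norm(t)                  # unit-norm target vector
b = rng.standard_normal(N)              # one random codeword

rho = (t @ b) / np.linalg.norm(b)       # correlation coefficient (||t|| = 1)
v = (t @ b) / (b @ b)                   # optimal scale factor (orthogonal projection)
a = t - v * b                           # angle error vector

print(a @ b)                            # ~0: angle error orthogonal to codeword
print(a @ a - (1 - rho ** 2))           # ~0: ||a||^2 = 1 - rho^2
```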
Let ${\mathrm{P}}_{{\rho}^{2}|t}(r\,|\,t)$ denote the cumulative distribution function (CDF) of the squared correlation coefficient given the target vector t. Since the codewords follow a unitarily invariant distribution, the conditional CDF does not depend on t. In the sequel, we choose t to be the first unit vector of the coordinate system, without loss of generality.
The squared correlation coefficient ${\rho}_{k}^{2}$ is known to be distributed according to the beta distribution with shape parameters $\frac{1}{2}$ and $\frac{N-1}{2}$ (Section III.A in [48]), given by
$$ \mathrm{P}_{\rho^2|t}(r\,|\,t) = \mathrm{B}\!\left(\tfrac{1}{2},\tfrac{N-1}{2};\,r\right). $$
Here, $\mathrm{B}(a,b;x)$ denotes the regularized incomplete beta function [49]. It is defined as
$$ \mathrm{B}(a,b;x) = \frac{\int_0^x u^{a-1}\,(1-u)^{b-1}\,\mathrm{d}u}{\int_0^1 u^{a-1}\,(1-u)^{b-1}\,\mathrm{d}u} $$
for $x\in[0,1]$ and zero otherwise. With (24) and (26), the distribution of the squared angle error is thus given by
$$ \mathrm{P}_{\|a_k\|^2}(x) = 1 - \mathrm{B}\!\left(\tfrac{1}{2},\tfrac{N-1}{2};\,1-x\right). $$
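As a quick sanity check of the beta law (shape parameters $\frac{1}{2}$ and $\frac{N-1}{2}$ imply $\mathrm{E}[\rho_k^2]=\frac{1}{N}$), the mean squared correlation can be estimated by Monte Carlo. This sketch assumes IID Gaussian codewords and, without loss of generality, the first unit vector as target.

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 16, 200_000

t = np.zeros(N)
t[0] = 1.0                               # target: first unit vector (WLOG)
b = rng.standard_normal((trials, N))     # IID Gaussian codewords
rho2 = (b @ t) ** 2 / np.einsum('ij,ij->i', b, b)

# Beta(1/2, (N-1)/2) has mean (1/2) / (1/2 + (N-1)/2) = 1/N
print(rho2.mean())                       # close to 1/16 = 0.0625
```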
5.3. Distance Error
Consider the right triangle in Figure 1. The squared Euclidean norms of the angle error ${a}_{k}$ and the codeword ${b}_{k}$ scaled by the optimal factor ${v}_{k}$ obey
$$ \|a_k\|^2 + v_k^2\,\|b_k\|^2 = 1 $$
for a target vector of unit norm. The distance error ${d}_{k}$ is maximal if the magnitude of the optimal scale factor ${v}_{k}$ lies exactly in the middle of two adjacent powers of two, say ${p}_{k}$ and $2{p}_{k}$. In that case, we have
$$ |v_k| = \frac{3}{2}\,p_k , $$
which results in
$$ \|d_k\| = \frac{p_k}{2}\,\|b_k\| = \frac{|v_k|}{3}\,\|b_k\| . $$
Due to the orthogonal projection, the magnitude of the optimal scale factor is given as
$$ |v_k| = \frac{\left|t^{\mathsf{T}} b_k\right|}{\|b_k\|^2} = \frac{|\rho_k|}{\|b_k\|} . $$
Thus, the distance error obeys
$$ \|d_k\| \le \frac{|\rho_k|}{3} , $$
with equality if the optimal scale factor is three quarters of a signed power of two.
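The bound can be checked numerically. The sketch below quantizes positive scale factors to the nearer of the two adjacent powers of two and confirms that the relative quantization error never exceeds $1/3$:

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.uniform(0.1, 10.0, size=100_000)     # positive optimal scale factors

p = 2.0 ** np.floor(np.log2(v))              # power of two just below v
q = np.where(v - p <= 2 * p - v, p, 2 * p)   # nearer of p and 2p
delta = np.abs(v - q) / v                    # relative quantization error

print(delta.max())                           # at most 1/3, attained near v = 3p/2
```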
It will turn out useful to normalize the distance error as
$$ \delta_k = \frac{\|d_k\|}{|\rho_k|} $$
and to specify its statistics by the CDF ${\mathrm{P}}_{\delta}(\delta )$, in order to avoid the statistical dependence on the angle error. The distance error is a quantization error, and such errors are commonly assumed to be uniformly distributed [13]. Following this assumption, the average squared normalized distance error is easily calculated as
$$ \mathrm{E}\!\left[\delta_k^2\right] = 3\int_0^{1/3} \delta^2\,\mathrm{d}\delta = \frac{1}{27} . $$
Unless the angle $\alpha$ is very small, the distance error is significantly smaller than the angle error. Their averages become equal at an angle of $\operatorname{arccot}\sqrt{27}\approx {11}^{\circ}$.
Note that the factor $1/27$ slightly differs from the factor $1/28$ in Example 2. As in Example 2, the number to be quantized is uniformly distributed within some interval. Here, however, the interval boundaries are not signed powers of two. This leads to a minor increase in the power of the quantization noise.
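Under the uniformity assumption, both constants quoted above follow directly. This sketch reproduces them numerically (a symmetric uniform model for the normalized distance error is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# normalized distance error, modeled uniform on [-1/3, 1/3]
delta = rng.uniform(-1 / 3, 1 / 3, size=1_000_000)
print((delta ** 2).mean())                      # close to (1/3)^2 / 3 = 1/27

# averages of squared distance and angle error coincide at tan^2(a) = 1/27,
# i.e., a = arccot(sqrt(27))
print(np.degrees(np.arctan(1 / np.sqrt(27))))   # about 10.9 degrees
```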
5.4. Total Error
Since the distance error and the angle error are orthogonal to each other, the total squared error is simply given as
$$ \epsilon_k^2 = \|a_k\|^2 + \|d_k\|^2 = 1 - \left(1-\delta_k^2\right)\rho_k^2 . $$
Conditioning on the normalized distance error, the total squared error is distributed as
$$ \mathrm{P}_{\epsilon_k^2|\delta}(x\,|\,\delta) = 1 - \mathrm{B}\!\left(\tfrac{1}{2},\tfrac{N-1}{2};\,\frac{1-x}{1-\delta^2}\right). $$
The unconditional distribution is simply found by marginalization over $\delta$.
As the columns of the codebook matrix are jointly independent, we conclude that for the minimum total squared error
$$ \epsilon^2 = \min_{k\in\{1,\dots,K\}} \epsilon_k^2 $$
we have
$$ \mathrm{P}_{\epsilon^2|\delta}(x\,|\,\delta) = 1 - \left[1 - \mathrm{P}_{\epsilon_k^2|\delta}(x\,|\,\delta)\right]^K . $$
For ${\mathrm{P}}_{\delta}(\delta )$ having support in the vicinity of $\delta =0$, the minimum total squared error is shown in Appendix C to converge to
$$ \epsilon^2 \to 4^{-R} $$
for exponential aspect ratios.
The large-matrix limit (40) does not depend on the statistics of the normalized distance error; it is indifferent to the accuracy of the quantization. This looks counterintuitive and requires some explanation. To understand the effect, consider a hypothetical equiprobable binary distribution of the normalized distance error with one of the point masses at zero. If we discard all codewords that lead to a nonzero distance error, we force the distance error to zero. On the other hand, we lose half of the codewords, so the rate is reduced by $\frac{1}{N}$. In the limit $N\to \infty $, however, that comes for free. If the distribution of the distance error has any nonzero probability accumulated in some vicinity of zero, a similar argument can be made. This argument is not new: it is common in forward error correction coding, where it is known as the expurgation argument [50].
The CDF is depicted in
Figure 2 for a uniform distribution of the distance error. For increasing matrix size, it approaches the unit step function. The difference between the angle error and the total error is small.
The median total squared error for a single approximation step is depicted in Figure 3 for various rates R. Note that the computational cost per matrix entry scales linearly with R for fixed K. In order to allow a fair comparison, the total squared error is exponentiated with $1/R$. For large matrices, it converges to the asymptotic value of $\frac{1}{4}$ found in (40), which is approached slowly from above. For small matrices, it deviates strongly from that value. While very small matrices favor low rates, medium-sized matrices favor moderately high rates between 1 and 2. Choosing the rate too large, e.g., $R=\frac{5}{2}$, also leads to degradations.
We prefer to show the median error instead of the average error that was used in the conference versions [4,43] of this work. The average error is strongly influenced by rare events, i.e., codebook matrices with many nearly collinear columns. Such rare bad events, however, can easily be avoided by means of cutting diversity, cf. Section 4.4.5. The median error reflects the case that is typical in practice.
5.5. Total Number of Additions
The exponential aspect ratio has the following impact on the trade-off between distortion and computational cost: For $K+{S}_{\ell}$ choices from the codebook, the wiring matrix ${\mathbf{W}}_{\ell}$ contains $K+{S}_{\ell}$ nonzero signed digits according to approximation (20). Due to the columnwise decomposition, this amounts to $1+{S}_{\ell}/K$ of them per column. At this point, we must distinguish between wide and tall matrices.
 Wide Matrices:
For the number of additions, the number of nonzero signed digits per row is relevant, as each row of ${\mathbf{W}}_{\ell}$ is multiplied by the input vector x when calculating the product $h={\mathbf{W}}_{\ell}x$. For the standard choice of square wiring matrices, counting per column versus counting per row hardly makes a difference on average. Though the counts may vary from row to row, the total number of additions is approximately equal to ${S}_{\ell}$.
 Tall Matrices:
The transposition converts the columns into rows. Thus, ${S}_{\ell}$ is exactly the number of additions.
In order to approximate an $N\times K$ target matrix $\mathbf{T}$ with $N=\frac{1}{R}{\log}_{2}K=\mathsf{O}(\log K)$ rows, we need approximately ${S}_{\ell}$ additions. For any desired distortion D, the computational cost of the product ${\mathbf{W}}_{\ell}x$ is thus smaller than the number of entries of the target matrix by a factor of $\mathsf{O}(\log K)$. This is the same scaling as in the mailman algorithm. At this scaling, however, the mailman algorithm only achieves a fixed distortion D, which depends on the size of the target matrix. The proposed algorithm can achieve arbitrarily low distortion by choosing ${S}_{\ell}$ appropriately large, regardless of the matrix size.
Computations are also required for the codebook matrix. All three codebook matrices discussed in Section 4.4 require at most K and $2K$ additions for tall and wide matrices, respectively. Adding the computational costs of the wiring and codebook matrices, we obtain the total number of additions
$$ C \approx S + 2K $$
with
$$ S = \sum_{\ell=1}^{L} S_\ell . $$
Normalizing to the number of elements of the $N\times K$ target matrix $\mathbf{T}$, we have
$$ \tilde{C} = \frac{C}{NK} \approx \frac{S+2K}{NK} . $$
The computational cost per matrix entry vanishes with increasing matrix size. This behavior fundamentally differs from the state-of-the-art methods discussed in Examples 1 and 2, where the matrix size has no impact on the computational cost per matrix entry.
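To make the vanishing cost per entry concrete, the following sketch evaluates the cost model from above for growing K. The choice of three wiring additions per column and the wide-matrix codebook cost of $2K$ additions are illustrative assumptions, not values from this work:

```python
import math

def cost_per_entry(K, R=1.0, S_per_col=3):
    """Approximate additions per entry of an N x K target matrix."""
    N = math.log2(K) / R                 # exponential aspect ratio: K = 2^(RN)
    S = S_per_col * K                    # total wiring additions (assumption)
    return (S + 2 * K) / (N * K)         # wiring plus codebook, per entry

for K in (2 ** 8, 2 ** 12, 2 ** 16):
    print(K, cost_per_entry(K))          # decays like O(1 / log K)
```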
Slightly fewer additions are required overall if a $K\times K$ square matrix is cut into tall submatrices than if it is cut into wide submatrices: although the vertical cut requires $K/N$ surplus additions due to (10), it saves approximately K additions due to (41) in comparison to the horizontal cut.
7. Computation–Distortion Trade-Off
Combining the results of Section 5.4, Section 5.5, and Section 6.1, we can relate the mean-square distortion to the number of additions. Section 6.1 empirically confirmed that the distortion decays as predicted by (40), at least for rates close to unity. Combining (40) and (21), we have
$$ D \approx 4^{-R S/K} $$
for matrices with large dimensions. Furthermore, (43) relates the number of additions per matrix entry $\tilde{C}$ to S and N as
$$ \tilde{C} \approx \frac{S}{NK} = \frac{S}{N\,2^{RN}} . $$
Combining these three relations, we obtain
$$ D \approx K^{-2\tilde{C}} . $$
This is a much more optimistic scaling than in Example 2. For scalar linear functions, Booth's CSD representation reduces the mean-square error by the constant factor of 28 per addition. With linear computation coding as proposed in this work, the factor 28 in (5) turns into ${K}^{2}$, the squared matrix dimension, in (47). Since we are free to choose either tall or wide submatrices, the relevant matrix dimension here is the maximum of the number of rows and columns.
We may relate the computational cost $\tilde{C}$ in (47) to the computational cost $\left|\mathcal{B}\right|$ in (5), the benchmark set by Booth's CSD representation. For a given SQNR, this results in
$$ \frac{\tilde{C}}{\left|\mathcal{B}\right|} = \frac{\log_2 28}{2\,\log_2 K} . $$
For q-bit signed integer arithmetic, the SQNR is approximately given by $\mathrm{SQNR}\approx {4}^{q-1}$. Thus, we obtain
$$ \tilde{C} \approx \frac{q-1}{\log_2 K} . $$
Although this formula is based on asymptotic considerations, it is quite accurate for the benchmark example: instead of the actual 77% savings compared to CSD, which were found by simulation in Section 6.3.1, it predicts savings of 79%. However, it does not allow for the inclusion of adaptive assignments of CSDs.