An Updated Survey of Efficient Hardware Architectures for Accelerating Deep Convolutional Neural Networks

Abstract: Deep Neural Networks (DNNs) are nowadays common practice in most Artificial Intelligence (AI) applications. Their ability to go beyond human precision has made these networks a milestone in the history of AI. However, while on the one hand they deliver cutting-edge performance, on the other hand they require enormous computing power. For this reason, numerous optimization techniques at the hardware and software level


Introduction
In the era of Big Data and the Internet of Things (IoT), Artificial Intelligence (AI) has found an ideal environment and a continuous stream of data from which to learn and thrive. In recent years, research and development in AI, and more specifically in its subset Machine Learning (ML), has grown exponentially, spreading across several disciplines and covering many applications. ML comprises several algorithms and paradigms, of which the most impactful are the brain-inspired techniques. Among these, Deep Learning (DL) [1], which is based on Artificial Neural Networks (ANNs), has surpassed human accuracy, as shown in Figure 1a. Since its recent appearance, DL has shown many advantages over previous techniques, chiefly the ability to work directly on large quantities of raw data, combined with a very deep architecture. While the lack of preprocessing makes the process more streamlined, stacking a large number of layers increases the accuracy, making DL the technique par excellence. These networks are called Deep Neural Networks (DNNs) and cover a wide range of applications, for instance, business and finance [2][3][4], healthcare such as cancer detection [5][6][7], robotics [8,9], and computer vision [10][11][12].
As mentioned above, these networks are very complex and computation- and memory-hungry. It is therefore necessary to provide suitable, specialized hardware platforms for executing such algorithms over a consistent data stream. The layers of DNNs can reach a considerable size of up to hundreds of thousands of activations; consequently, the matrix-vector multiplications of an entire network can require up to a few billion multiply-and-accumulate (MAC) operations. As shown in Figure 1b, neural networks pose processing requirements in two different phases, i.e., training during the design phase and inference during the deployment phase. Inference is run after the neural network has been trained and is used to classify or derive predictions from the given inputs in real-world scenarios. While in inference the network only performs the forward pass, during training it performs both the forward pass and the backward pass. During the latter, the prediction is compared with the label, and the error is used to update the weights through the backpropagation process. As a consequence, training requires a much greater computational effort than inference. In recent years, DL applications have been moving towards mobile platforms such as IoT/Edge nodes and smart cyber-physical systems (CPS) devices [13][14][15]. These devices often have stringent constraints in terms of latency, power, and energy, for instance, due to their real-time and battery-powered nature. Moreover, moving the computation from the cloud to the edge reduces the privacy and security threats to which various DL systems are subjected, hence increasing the need for embedded DL [16,17]. In short, there is a growing demand for specialized hardware accelerators with optimized memory hierarchies that can meet the enormous compute and memory requirements of different types of complex DNNs, while maintaining a reduced power and energy envelope.
Over the past decade, several architectures have been proposed for the acceleration of DL algorithms. Many papers and surveys on this topic have been produced [18][19][20][21]. However, due to the rapid development of DNN hardware, these surveys have either become obsolete or do not represent the emerging trends. To this end, this paper aims to provide an up-to-date survey covering the state of the art of the last 3 years. The work is therefore not intended as a substitute for existing surveys, but rather as a continuation of them. In the following sections, we deal with the latest architectures, with the main focus on new types of dataflow, reconfigurable architectures, variable precision, and sparsity. The reconfigurable architecture, sometimes coupled with adaptable bitwidth, is a flexible solution to accommodate different types of networks and is likely to become the standard in the future. In fact, researchers have shown that networks can be compressed [22,23] and represented with a number of bits that keeps decreasing as techniques are refined. Finally, sparsity is actively exploited to eliminate unnecessary operations and to further lower the power envelope of the accelerators, while also making them faster and more effective. This also motivates the need for sparse DNN accelerators like [24][25][26].
This paper is organized as follows: Section 2 discusses the background related to DNNs, describing the different models and their key features. Section 3 analyzes dataflows designed for energy-efficient architectures. The discussion goes from temporal and spatial architectures to other important themes such as sparsity, variable bitwidth precision, and reconfigurable architectures. Section 4 shows the typical memory hierarchy used in accelerators and the methodologies to reduce the power it consumes. Finally, the paper provides the key takeaways in the conclusion. The acronyms used in this paper are reported in Table A1.

Background: Deep Neural Networks
An Artificial Neural Network (ANN), henceforth called Neural Network (NN), is a mathematical model inspired by biological neural networks. However, the NN model is too simple to faithfully replicate the behavior of its biological counterpart. An NN is formed by interconnected nodes, as in a graph, that are organized in layers (see Figure 2). A layer of input nodes receives signals from the outside, which are then processed by intermediate layers called hidden layers. The result is finally produced by the last layer, also called the output layer. An NN with more than three hidden layers is defined in the literature as a Deep Neural Network (DNN) [27]. In short, DNNs/NNs are mainly black-box model representations of a given function. That is, the model and its parameters are learned by finding the transfer function (composed of multiple layers of neurons, weights, and activation functions) from input to output through an extensive training process.
The graphs of NNs are directed, i.e., the connections are oriented, and can be acyclic or cyclic. If the NN is acyclic, it is called feedforward, and the output depends only on the current input. If instead the NN is cyclic, it is called recurrent, and the output also depends on the previous inputs. Recurrent NNs are, therefore, models with state/memory.
The nodes of NNs are the neurons, graphically and mathematically described in Figure 3 and Equation (1). A neuron receives $n$ inputs $(x_1, x_2, \ldots, x_n)$ and returns a scalar output $y$. For a given neuron, the inputs are multiplied by the weights $(w_1, w_2, \ldots, w_n)$ and summed together with a bias term $b$. A non-linear function $\sigma(\cdot)$, called activation function, is then applied to determine the output of the neuron. Common activation functions are the Rectified Linear Unit (ReLU), the Sigmoid, and the Hyperbolic tangent.
[Figure 3: neuron with its inputs and output]
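The neuron described above can be sketched in a few lines of Python. From the textual description, the computation has the form $y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right)$. The function names (`neuron`, `relu`, `sigmoid`) and the sample values are illustrative assumptions, not taken from the original work.

```python
import math


def relu(z):
    """Rectified Linear Unit: negative values are clamped to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid, squashing z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus the bias, followed by the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values: 0.5*1.0 - 1.0*2.0 + 0.25*3.0 + 0.5 = -0.25, so ReLU gives 0.0
y = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], bias=0.5)
```

Swapping `activation=sigmoid` changes only the final non-linearity; the multiply-and-accumulate part, which dominates the hardware cost discussed later, stays the same.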

Training and Inference
NNs learn to achieve the desired results by modifying their internal parameters, i.e., weights and biases. The phase in which the network learns is called training. Once the network has been trained and deployed in the real world, it can be used on previously unseen inputs during the inference phase.
One of the most used learning paradigms is supervised learning, thanks to the large amount of labeled data that has become available in the so-called big-data era. Supervised learning requires labeled data, i.e., input-output pairs, where the output is the result that the network should obtain from the related input. Supervised learning consists of three steps repeated until convergence:
1. Forward pass: the input is fed into the network, which produces an output.
2. Backward pass: a loss $L$ is computed by comparing the produced output with the desired output. The loss $L$ is then used by the backpropagation algorithm [28], which applies the chain rule of calculus to compute the gradient $\partial L / \partial w$ for each weight (and bias) of the network.
3. Parameters update: each weight and bias is updated by an amount proportional to its gradient. All the gradients can be multiplied by the same factor, called the learning rate, or more complex optimization algorithms can be used, such as Gradient Descent with Momentum [29] or Adam [30].
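The three steps above can be illustrated on the smallest possible case, a single linear neuron $y = w x + b$ trained with plain gradient descent. This is only a sketch under stated assumptions: the squared-error loss $L = (y - t)^2 / 2$, the function name `train_linear_neuron`, the hyperparameters, and the toy data set are all chosen for illustration and do not come from the original work.

```python
def train_linear_neuron(samples, learning_rate=0.1, epochs=1000):
    """Fit y = w * x + b to (x, target) pairs with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for x, target in samples:
            # 1. Forward pass: compute the output for the current input
            y = w * x + b
            # 2. Backward pass: for L = (y - target)^2 / 2 the chain rule gives
            #    dL/dw = (y - target) * x and dL/db = (y - target)
            error = y - target
            grad_w += error * x
            grad_b += error
        # 3. Parameters update: step proportional to the (averaged) gradient
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Toy labeled data generated from y = 2x + 1 (illustrative values)
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear_neuron(samples)
```

In a real DNN, step 2 propagates the gradient backwards through every layer, which is why training costs considerably more than the forward-only inference.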
Other learning paradigms are unsupervised learning and reinforcement learning. Unsupervised learning works with unlabeled data and consists of finding patterns and structures that the data may have in common. Reinforcement learning involves the network (agent) interacting with an environment. An interpreter assesses the correctness of the interactions and returns a reward or punishment to the agent, which aims to maximize the reward.

Layers
As described in the previous paragraphs, neurons are organized in layers that can have different shapes and characteristics. This section presents a short overview of the layers that are most commonly used in NNs.
Fully Connected (FC) Layers. In FC layers, the neurons are arranged in the shape of a vector (see Figure 4). Considering a layer with $C_o$ neurons and $C_i$ inputs, each neuron $c_o$ receives all the $C_i$ inputs (Equation (2)). Therefore, each neuron has $C_i$ weights and the total number of weights of the layer is $C_o \times C_i$. The number of inputs and outputs of an FC layer can be high. Consequently, the weight matrix can also have a significant size, making it a critical element, especially on hardware platforms with limited memory. However, it is not always necessary for each neuron to analyze the totality of the inputs, and convolutional layers have been introduced to solve this problem.
Convolutional (Conv) Layers. The inputs and outputs of a Conv layer are organized in 2D grids, defined as feature maps (FM). Each input/output can be formed by multiple feature maps: the number of input/output feature maps is referred to as the number of input/output channels. The neurons of the Conv layer, rather than analyzing the whole input, receive only a sub-grid of dimension $H_k \times W_k$ ($C_i \times H_k \times W_k$ if there are multiple input channels). Horizontally adjacent neurons process grids of adjacent inputs separated by $S$ positions, where $S$ is a parameter known as stride. As shown in Figure 5, neurons that produce values belonging to the same output feature map (OFM) usually share the same kernel of weights. Therefore, each neuron of OFM $c_o$ uses a kernel of $C_i \times H_k \times W_k$ weights, and the total number of weights is $C_o \times C_i \times H_k \times W_k$. Equation (3) describes the operations performed in the Conv layer.
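As a rough illustration of the Conv layer described above, the following pure-Python sketch slides a shared kernel over the input feature maps with stride $S$ and counts the layer's weights. It assumes no padding and omits the activation function; the names `conv2d` and `conv_weight_count` and the nested-list data layout are hypothetical choices made for this example.

```python
def conv2d(ifm, kernels, biases, stride=1):
    """Direct convolution without padding.

    ifm:     input feature maps, indexed as ifm[c_i][row][col]
    kernels: weights, indexed as kernels[c_o][c_i][i][j]
    biases:  one bias per output channel
    Returns the output feature maps, indexed as ofm[c_o][row][col].
    """
    c_i, h_i, w_i = len(ifm), len(ifm[0]), len(ifm[0][0])
    h_k, w_k = len(kernels[0][0]), len(kernels[0][0][0])
    h_o = (h_i - h_k) // stride + 1
    w_o = (w_i - w_k) // stride + 1
    ofm = []
    for c_o, kernel in enumerate(kernels):
        fm = []
        for y in range(h_o):
            row = []
            for x in range(w_o):
                acc = biases[c_o]
                # Each output value is C_i * H_k * W_k multiply-and-accumulates
                for c in range(c_i):
                    for i in range(h_k):
                        for j in range(w_k):
                            acc += kernel[c][i][j] * ifm[c][y * stride + i][x * stride + j]
                row.append(acc)
            fm.append(row)
        ofm.append(fm)
    return ofm


def conv_weight_count(c_o, c_i, h_k, w_k):
    """Total number of weights of a Conv layer (biases excluded)."""
    return c_o * c_i * h_k * w_k


# One 3x3 input channel, one 2x2 kernel of ones (illustrative values)
ifm = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
kernels = [[[[1, 1], [1, 1]]]]
ofm = conv2d(ifm, kernels, biases=[0])  # [[[12, 16], [24, 28]]]
```

The six nested loops are the loop nest that accelerator dataflows reorder and parallelize; an FC layer is the special case in which the kernel covers the whole input.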
Normalization Layers. Batch Normalization (BN) layers are often inserted at various points in neural networks, after Conv or FC layers. As can be seen from Equation (4), which describes the BN layers, the values are processed so that their mean is zero and their variance is 1. $\gamma$ and $\beta$ are two trainable parameters inserted to integrate normalization in the training phase.
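A minimal sketch of the normalization described above, assuming the common formulation $y = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$, where $\mu$ and $\sigma^2$ are the mean and variance of the batch. The small constant $\epsilon$ (added for numerical stability), the function name `batch_norm`, and the default values are assumptions of this example.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars to zero mean and unit variance,
    then apply the trainable scale (gamma) and shift (beta)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # eps is an assumed stabilizer that avoids division by zero
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With `gamma=1.0` and `beta=0.0` the output has mean zero and variance close to 1; during training the two parameters let the network learn a different scale and offset if that is more useful.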
The BN layers have two primary purposes. First, they contribute to accelerating the convergence of the training phase: since the values always maintain a constant distribution, the network does not have to adapt to different ranges at each training step. Second, they avoid the saturation of the values inside the network. Since saturating non-linear functions, such as Sigmoid, Tanh, or Softmax, are often used, having values with zero mean and variance 1 prevents too many values from being saturated, which would cause a considerable loss of information and slow down the training process.
Pooling Layers. The main purpose of the pooling layers is to shrink the size of the feature maps within the network, to decrease the number of parameters and the number of operations to be performed. In the pooling layers, different sub-grids of the input feature maps are selected. For each sub-grid, a single value is calculated, which is a statistical metric of the group, e.g., the maximum value (MaxPooling) or the average value (AvgPooling). The sub-grids are usually of equal size, adjacent, and non-overlapping.
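The pooling operation described above can be sketched as follows, for the usual case of square, adjacent, non-overlapping sub-grids. The function names `pool2d` and `average` and the sample feature map are illustrative assumptions.

```python
def average(values):
    """Arithmetic mean, used for AvgPooling."""
    return sum(values) / len(values)


def pool2d(fm, size, metric=max):
    """Apply `metric` to each non-overlapping size x size sub-grid of fm.

    metric=max gives MaxPooling, metric=average gives AvgPooling.
    Rows or columns that do not fill a complete sub-grid are dropped.
    """
    h, w = len(fm), len(fm[0])
    out = []
    for y in range(0, h - size + 1, size):
        row = []
        for x in range(0, w - size + 1, size):
            window = [fm[y + i][x + j] for i in range(size) for j in range(size)]
            row.append(metric(window))
        out.append(row)
    return out


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
max_pooled = pool2d(fm, 2)            # [[6, 8], [14, 16]]
avg_pooled = pool2d(fm, 2, average)   # [[3.5, 5.5], [11.5, 13.5]]
```

A 2×2 pooling window reduces the number of values in the feature map by a factor of four, which shrinks the work of every following layer.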

DNN Models
Since their origin, DNNs have been developed in a large number of very diverse models in order to achieve high accuracy. The first DNN model to become famous was LeNet [28], a Convolutional Neural Network (CNN) that was used to recognize handwritten digits. The real boom for CNNs, which are the most widely used networks for object detection and recognition, came in 2012 when AlexNet [31] won the ILSVRC competition [32] by outperforming the earlier methods. Since then, also thanks to the increased availability of computational hardware and memory resources, DNN models have become more and more complex and precise. Table 1 outlines a timeline of the models that have become most popular, describing the innovations introduced compared to previous models.

Energy-Efficient Architectures
The panorama of hardware solutions for the development and deployment of DNNs is vast, ranging from general-purpose solutions (CPUs and GPUs) and programmable solutions (FPGAs) to special-purpose ASICs. It is not straightforward to define which of these is the best solution, as it strictly depends on the application and the corresponding design constraints, including time-to-market. For example, an edge IoT chip will require a small-area, energy-efficient solution, while a cloud computing server for ML will demand a lot of flexibility. Moreover, a tight time-to-market may impose the use of GPUs or embedded GPUs as the only feasible platform option.
In the following subsections, the pros and cons of each solution are discussed in detail, making a clear distinction between temporal and spatial architectures (see Figure 6). These two architectures have a similar computational structure, consisting of a set of multiple processing units. However, while the units of spatial architectures can have internal control, in temporal architectures the control is centralized. An analogous distinction can be made for the memory: in spatial architectures, the units can have a register file (RF) to store data, while in temporal architectures, the units have no memory capacity. Moreover, in spatial architectures, the units can also be interconnected to exchange data. In summary, the computational units of temporal architectures are typically ALUs, while those of spatial architectures are complex Processing Elements (PEs) that can potentially support articulated data movement patterns.

Temporal Architectures
CPUs and GPUs belong to the category of temporal architectures. Vector CPUs have multiple ALUs that can process multiple data in parallel. Most of them adopt the Single-Instruction Multiple-Data (SIMD) execution model, which applies a single instruction to different data simultaneously. Similarly, GPUs are formed by many processing cores, and they use the Single-Instruction Multiple-Threads (SIMT) execution model. CPUs and GPUs are general-purpose chips that must be able to support an extensive range of applications. For this reason, it is infrequent to find hardware optimizations specific to ML and DNNs. A commonly adopted approach is to better adapt the application to the chosen hardware platform. For example, the convolutional layer, using sub-grids of the original feature maps, requires discontinuous accesses to the memory. In [45], it is shown how to optimize the storage of feature maps to decrease the number of discontinuous accesses to the memory. Since the libraries for Basic Linear Algebra Subprograms (BLAS) are highly optimized, it is also possible to perform convolution lowering [46,47] to transform the convolution into a General Matrix Multiplication (GeMM), or to move to the frequency domain through the Fast Fourier Transform (FFT) [48] and perform a point-wise multiplication of matrices.
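Convolution lowering can be illustrated with the widely used im2col scheme: every input sub-grid is unrolled into one row of a matrix, so that the whole layer becomes a single matrix multiplication. The sketch below is a simplified, single-input-channel version without padding; the helper names (`im2col`, `matmul`, `conv_as_gemm`) are hypothetical, and the naive `matmul` merely stands in for an optimized BLAS GeMM call.

```python
def im2col(fm, h_k, w_k, stride=1):
    """Unroll every h_k x w_k sub-grid of fm into one row of the patch matrix."""
    h_o = (len(fm) - h_k) // stride + 1
    w_o = (len(fm[0]) - w_k) // stride + 1
    rows = []
    for y in range(h_o):
        for x in range(w_o):
            rows.append([fm[y * stride + i][x * stride + j]
                         for i in range(h_k) for j in range(w_k)])
    return rows


def matmul(a, b):
    """Naive matrix product; a real implementation would call a BLAS GeMM."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def conv_as_gemm(fm, kernels, stride=1):
    """Convolve one feature map with several kernels via a single GeMM.

    Returns a (H_o * W_o) x C_o matrix: one row per output position,
    one column per kernel (output channel).
    """
    h_k, w_k = len(kernels[0]), len(kernels[0][0])
    patches = im2col(fm, h_k, w_k, stride)
    # Flatten each kernel into a column, in the same order used by im2col
    weight_matrix = [[k[i][j] for k in kernels]
                     for i in range(h_k) for j in range(w_k)]
    return matmul(patches, weight_matrix)


fm = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 1], [1, 1]], [[1, 0], [0, -1]]]
result = conv_as_gemm(fm, kernels)  # [[12, -4], [16, -4], [24, -4], [28, -4]]
```

The price of lowering is data duplication: overlapping sub-grids are copied into several rows of the patch matrix, trading memory for the regular access pattern that GeMM libraries exploit.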
Among the different available technologies, CPU cores are the least used for DNN inference and training. CPUs have the advantage of being easily programmable to perform any kind of task. Still, their throughput is limited by the small number of cores and, therefore, by the small number of operations executable in parallel. Figure 7 compares the number of cores of CPUs and GPUs. The Intel Xeon Platinum 9222, a high-end server processor priced at over USD 10,000, has a number of floating-point operations per second per Watt (FLOPS/W) similar to that of the 2014 Nvidia GT 740 GPU, priced below USD 100 (∼12 GFLOPS/W). High-end GPUs, with FLOPS/W in the order of TERAs, significantly surpass any CPU. However, some attempts have recently been made to accelerate DNN deployment (inference in particular) on CPUs. At the instruction level, Intel introduced DL Boost, a set of features that includes AVX-512 Vector Neural Network Instructions (AVX-512-VNNI), part of the AVX-512 instructions [49], to accelerate CNN algorithms, and the Brain floating-point format (bfloat16) [50]. The Brain floating-point format is a 16-bit format that uses a floating radix point and has a dynamic range similar to that of the 32-bit IEEE 754 single-precision floating-point format. bfloat16 is also supported by ARMv8.6-A and is included in AMD's ROCm libraries. As for ML libraries, Intel has created BigDL [51], a distributed deep learning library for the acceleration of DNN algorithms on CPU clusters. There is also an Intel distribution of Caffe [52], a popular deep learning framework, targeting Intel Xeon processors.
Figure 7. Comparison of the number of cores of CPUs and GPUs. CPU and GPU models have been selected for different targets, e.g., personal computers or servers, and different price ranges. For the CPUs, the gray and black lines correspond to the minimum and the maximum number of cores of a family, respectively. For the GPUs, the black lines represent the number of CUDA cores, and the gray line represents the Tensor cores present in the Nvidia Tesla V100 only.
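The relation between bfloat16 and the IEEE 754 single-precision format mentioned above can be made concrete: bfloat16 keeps the sign bit and the 8 exponent bits of a 32-bit float and only the top 7 mantissa bits, which is why the dynamic range is preserved while precision drops. The sketch below converts by plain truncation of the lower 16 bits; this is a simplifying assumption, since real hardware and libraries typically round to nearest. The function names are hypothetical.

```python
import struct


def float32_to_bfloat16_bits(value):
    """Return the 16-bit bfloat16 pattern of value, obtained by truncation.

    The value is first encoded as an IEEE 754 single-precision number and
    then the upper 16 bits (sign, 8-bit exponent, 7-bit mantissa) are kept.
    """
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    return bits32 >> 16


def bfloat16_bits_to_float(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


one = float32_to_bfloat16_bits(1.0)                                   # 0x3F80
approx_pi = bfloat16_bits_to_float(float32_to_bfloat16_bits(3.14159))  # 3.140625
huge = bfloat16_bits_to_float(float32_to_bfloat16_bits(1e38))          # still finite
```

A value such as `1e38`, far beyond the range of the IEEE 754 16-bit half-precision format, survives the conversion with a relative error below 1%, while `3.14159` loses its lower digits.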
GPUs are the current workhorses for DNN inference and especially training. They contain up to thousands of cores (see Figure 7) to work efficiently on highly parallel algorithms. Matrix multiplications, the core operations of DNNs, belong to this class of parallel algorithms. Among GPU manufacturers, Nvidia can be considered the winner of the AI challenge. In fact, the most popular DL frameworks, such as TensorFlow [53], PyTorch [54], or Caffe [52], support execution on Nvidia GPUs through the Nvidia cuDNN library [55], a GPU-accelerated library of primitives for DNNs with highly optimized implementations of standard layers. DL frameworks make it possible to describe very complex neural networks in a few lines of code and run them on GPUs without needing to know GPU programming. cuDNN is part of CUDA-X AI [56], a collection of Nvidia's GPU acceleration libraries that accelerate DL and ML.
At the hardware level, Nvidia has combined Tensor cores [57] with traditional CUDA cores in some of its platforms. Tensor cores are a new structure designed to accelerate large matrix operations and perform mixed-precision Matrix Multiply-and-Accumulate (MMAC) calculations in a single operation. The recently announced Nvidia Ampere A100 supports a new numerical format called TensorFloat-32 (TF32) that has the range of 32-bit floating-point (FP32) numbers and the precision of 16-bit floating-point numbers, using a 19-bit representation. The TF32 format on the A100 architecture provides a 10x performance increase compared to the FP32 format on the V100 architecture [58]. Moreover, the Tensor cores in the A100 architecture are optimized to exploit sparsity for an additional 2x performance boost (see Section 3.2.1 for more on sparsity in NNs).

Spatial Architectures: FPGAs and ASICs
DNN accelerators implemented on FPGAs (Field-Programmable Gate Arrays) and ASICs (Application-Specific Integrated Circuits) usually fall into the category of spatial architectures. FPGAs and ASICs differ substantially. The primary purpose of FPGAs is programmability, to implement any possible design. They are relatively cost-effective with a short time-to-market, and the design flow is simple. However, FPGAs cannot be optimized for the various requirements of different applications, are less energy-efficient, and have lower performance than ASICs. On the contrary, ASICs need to be designed and produced for a specific application that cannot be changed over time. The design flow is consequently more complex, and the production cost is higher, but the resulting chip is highly optimized and energy-efficient. In this section, however, no distinction will be made between ASIC- and FPGA-based implementations. FPGAs are, in fact, often used to prototype what will then be developed on ASICs.
A hardware accelerator for DNNs (implemented on ASIC or FPGA) typically consists of an array of PEs for computation (see Figure 8). The PEs are interconnected by a Network-on-Chip (NoC) designed to achieve the desired data movement scheme. The three levels of the memory hierarchy are the Register Files (RFs) in the PEs, which store data for inter-PE movements or accumulations, the Global Buffer (GB), which stores enough values to feed the PEs, and the off-chip memory, usually a DRAM. As seen in the Background Section, the operations in DNNs are mostly simple Multiply-and-Accumulate (MAC) operations, but they need to be performed on a considerable amount of data. The real bottleneck in DNN computation is memory access. Therefore, one of the key design issues for the memory hierarchy is to reduce the DRAM accesses, since they have a high latency and energy cost. The reuse of the data stored in smaller, faster, and low-energy memories (GB and RFs) is favored.

[Figure 8: PEs array; each PE contains a Register File (RF)]
Given the memory problem, during the development of the first accelerators for DNNs, the focus was placed on investigating efficient dataflows, i.e., spatial and temporal mappings of operations on PEs, which would reduce the number of off-chip memory accesses by reusing the data already stored in the GB and RFs as much as possible. Moreover, operations and data movement must be orchestrated to achieve good throughput as well. From these studies, various accelerators were born that exploit the different possibilities of data reuse offered by DNNs, CNNs in particular. Three kinds of data reuse are identified in Conv layers. As for FC layers, there is an input reuse opportunity, since all input values are reused to calculate the output of each neuron.
The first accelerators developed were traditionally classified according to the type of data reuse they exploit. In particular, accelerators are classified as follows:
• Weight Stationary (WS): the weights are stored in the RFs of the PEs and kept fixed, while the inputs are distributed in coordination with the movement of the partial sums between the PEs to obtain the correct results. These accelerators exploit weight reuse and convolutional reuse. Popular WS accelerators are [59,60].
In recent years, to further improve the performance and energy efficiency of accelerators, attention has been focused on new strategies. This survey discusses those that have led to the best results and attracted the most research effort, namely sparsity exploitation, variable bitwidth accelerators, and reconfigurable accelerators.

Accelerators with Sparsity Exploitation
The exploitation of sparsity involves taking advantage of the high number of zeros present in the matrices of weights and activations, which are therefore sparse matrices. Sparsity is mainly due to two factors. First, given the redundancy of the weights in an NN, it is usually possible to prune them, i.e., set many values to zero [65][66][67]. Second, the frequent use of ReLU as the activation function sets all the negative values in the activation matrices to zero. The number of non-zero values can be reduced to 20-80% and 50-70% for weights and activations, respectively. This can be used to avoid multiplications, since the result of a multiplication by zero is known in advance, and to compress the data when stored in memory. Among the best-known sparsity compression methods, the most used are Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Compressed Image Size (CIS), and Run Length Coding (RLC), as depicted in Figure 9. These techniques are effective and add minimal overhead when implemented in hardware. CSR and CSC compress the sparse matrix into three different arrays. The first one holds the non-zero values, the second contains the column indices (for CSR) or the row indices (for CSC), while the third records the number of non-zero elements in the rows (CSR) or columns (CSC) of the matrix. CIS is composed of an array of non-zero values and a matrix of the same size as the original one, which marks the positions of the values contained in the array. This representation is the most hardware-friendly since it often does not require any decompression mechanism. Finally, RLC compresses the original data by indicating for each value how many times it is repeated.
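Two of the compression schemes above, CSR and RLC, can be sketched in pure Python. The CSR layout follows the conventional textbook formulation, in which the third array is a row-pointer array holding the cumulative count of non-zero values up to each row; this is an assumption about the exact layout, and the hardware encodings used by the accelerators cited in this survey may differ. All function names are hypothetical.

```python
def csr_encode(matrix):
    """Compress a dense matrix into (values, column indices, row pointers).

    row_pointers[r + 1] - row_pointers[r] is the number of non-zero
    values in row r (assumed textbook CSR layout).
    """
    values, col_indices, row_pointers = [], [], [0]
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(col)
        row_pointers.append(len(values))
    return values, col_indices, row_pointers


def csr_decode(values, col_indices, row_pointers, n_cols):
    """Rebuild the dense matrix from its CSR representation."""
    matrix = []
    for r in range(len(row_pointers) - 1):
        row = [0] * n_cols
        for k in range(row_pointers[r], row_pointers[r + 1]):
            row[col_indices[k]] = values[k]
        matrix.append(row)
    return matrix


def rlc_encode(sequence):
    """Run Length Coding: list of (value, number of repetitions) pairs."""
    runs = []
    for v in sequence:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]


dense = [[0, 0, 3],
         [4, 0, 0],
         [0, 0, 0],
         [0, 5, 6]]
values, cols, ptrs = csr_encode(dense)   # [3, 4, 5, 6], [2, 0, 1, 2], [0, 1, 2, 2, 4]
runs = rlc_encode([0, 0, 0, 7, 7, 0])    # [(0, 3), (7, 2), (0, 1)]
```

CSC is obtained by applying the same procedure column by column, i.e., by encoding the transposed matrix.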
Accelerators that exploit sparsity have different architectures that allow adapting the computation to the sparse matrices (see Table 2 for a comparison). Cnvlutin [68] uses the CSR scheme to compress the activations but does not consider the sparsity of the weights. In Cambricon-X [69], the PEs store the compressed weights for asynchronous computation but do not exploit the sparsity of the activations. SCNN [24] uses the CSC scheme for both weights and activations. The values are delivered to an array of multipliers, and the resulting scattered products are summed using a dedicated interconnection mesh. SparTen [70] is based on the SCNN architecture, but it improves the distribution of the operations to the multipliers to reduce the overhead. EIE [71] compresses the weights with the CSC scheme and has zero-skipping ability for null activations. Moreover, high energy savings are obtained by avoiding the use of DRAM. Similarly, NullHop [26] applies the CIS scheme to the weights and skips the null activations. ZeNA [72] is the first zero-aware accelerator that can skip the operations with null results induced by both null weights and null activations. SqueezeFlow [25] takes a mathematical approach to the sparsity problem and introduces concise convolution rules to avoid the operations with a null result. The RLC scheme is applied to the weights. SqueezeFlow supports both sparse and dense models. Eyeriss v2 [73] also supports both sparse and dense models. It applies the CSC scheme to weights and activations, which are kept compressed both in memory and during computation. For higher flexibility, a hierarchical mesh is used for the PE interconnections. The Unique Weight CNN (UCNN) accelerator [74] proposes a generalization of the sparsity problem. Rather than exploiting only the repetition of weights with zero value, it exploits the repetition of weights with any value by reusing CNN sub-computations and reducing the model size in memory.
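The common idea behind the zero-skipping accelerators above can be shown with a dot product that performs a MAC only when both operands are non-zero. This is a purely behavioral sketch, not a model of any specific architecture; the function name `zero_skipping_dot` and the sample vectors are assumptions made for illustration.

```python
def zero_skipping_dot(weights, activations):
    """Dot product that skips every MAC whose result is known to be zero.

    Returns the accumulated result and the number of MACs actually executed.
    """
    acc, macs = 0, 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue  # product is zero: no multiplication, no accumulation
        acc += w * a
        macs += 1
    return acc, macs


# Pruned weights and ReLU-style activations (illustrative values)
weights = [0, 2, 0, -1, 3]
activations = [5, 0, 7, 4, 1]
result, executed = zero_skipping_dot(weights, activations)  # (-1, 2)
```

Here only 2 of the 5 MACs are executed. Skipping only null activations or only null weights corresponds to dropping one of the two conditions, at the cost of executing more ineffectual operations.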

Variable Bitwidth Accelerators
A DNN can have over a hundred million parameters. Therefore, the memory required to store the data during computation is huge. One of the main techniques used to relax the memory constraints is bitwidth reduction. Rather than expressing the numbers in the IEEE 754 32-bit floating-point format, it is possible to represent them in fixed-point format [75], reducing the bitwidth as much as possible without affecting the accuracy significantly. This strategy not only reduces the occupied memory but also decreases the power consumption associated with computation [76,77]. It has been demonstrated that inference for most DNNs can be performed using 8-bit fixed-point values without accuracy degradation [78,79]. However, several studies [80,81] have shown that each layer of a DNN has a different impact on the accuracy, and it is, therefore, possible to use a fine-grained quantization where the bitwidth of the weights and activations is different in each layer.
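Bitwidth reduction to a fixed-point format can be sketched as follows. The example assumes a signed two's-complement representation with a configurable number of fractional bits, rounding to the nearest code and saturating on overflow; these choices, the function names, and the default bitwidths are illustrative assumptions rather than the scheme of any cited work.

```python
def quantize(values, total_bits=8, frac_bits=4):
    """Map real values to signed fixed-point integer codes.

    A code q represents the real number q / 2**frac_bits. Values outside
    the representable range saturate to the minimum or maximum code.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    return [max(q_min, min(q_max, round(v * scale))) for v in values]


def dequantize(codes, frac_bits=4):
    """Convert fixed-point codes back to real values."""
    return [c / (1 << frac_bits) for c in codes]


codes = quantize([0.5, -1.25, 3.3, 100.0])   # [8, -20, 53, 127]
restored = dequantize(codes)                 # [0.5, -1.25, 3.3125, 7.9375]
```

The value `3.3` is rounded to the nearest representable step of 1/16, while `100.0` saturates. A per-layer choice of `total_bits` and `frac_bits` corresponds to the fine-grained quantization mentioned above.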
To exploit not only the memory gain deriving from the bitwidth reduction but also the lower power consumption, hardware accelerators with flexible-bitwidth arithmetic have been developed (see Table 3 for a comparison). Stripes [82] implements variable bitwidth with bit-serial computation. The latency increases linearly with the bitwidth, but this increment can be compensated by heavier exploitation of the inherent parallelism of DNNs. Besides, multipliers, which are one of the most considerable sources of energy consumption excluding memory accesses, are no longer necessary. The bit-serial computation engine consists, in fact, of AND gates and adders only. Stripes is a hybrid architecture that fixes the bitwidth of the weights and provides flexibility for the activations. UNPU [83] has a very similar bit-serial computation engine, but the bitwidth of the activations is fixed to 16 bits while the weights have 1-bit to 16-bit flexibility. The Loom [84] architecture is fully temporal, since both weights and activations have variable bitwidth and are processed serially. To achieve this flexibility, Loom adopts bit-serial multiplication, which, however, requires the transposition of the inputs. For a more efficient implementation, Loom transposes the outputs rather than the inputs, but the overhead is not negligible. Bit Fusion [85] implements the flexible bitwidth spatially, with an array of PEs that are combined differently depending on the required bitwidth. In detail, the overall computation is partitioned into 2-bit × 2-bit multiplications, followed by shifted additions. BitBlade [86] is based on the Bit Fusion accelerator, but it further optimizes the architecture by eliminating the shift-add logic using bitwise summation.
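The two families of flexible-bitwidth arithmetic described above, temporal (bit-serial) and spatial (composition of 2-bit multiplications), can be sketched behaviorally. Both functions are simplified illustrations under explicit assumptions: activations and multiplier operands are unsigned integers, sign handling is not modeled, and the function names are hypothetical. They are not cycle-accurate models of Stripes or Bit Fusion.

```python
def bit_serial_dot(weights, activations, act_bits):
    """Bit-serial dot product: one activation bit is processed per step.

    Each step only ANDs the weights with a single activation bit and adds
    the selected weights; no multiplier is needed. The latency grows
    linearly with act_bits. Activations are assumed unsigned.
    """
    acc = 0
    for bit in range(act_bits):
        partial = 0
        for w, a in zip(weights, activations):
            if (a >> bit) & 1:      # AND gate: weight passes if the bit is 1
                partial += w
        acc += partial << bit        # weight the partial sum by 2**bit
    return acc


def multiply_2bit_bricks(a, b, bits):
    """Multiply two unsigned `bits`-wide integers (bits must be even)
    using only 2-bit x 2-bit products followed by shifted additions."""
    result = 0
    for i in range(0, bits, 2):
        for j in range(0, bits, 2):
            a_part = (a >> i) & 0b11
            b_part = (b >> j) & 0b11
            result += (a_part * b_part) << (i + j)
    return result


dot = bit_serial_dot([3, -2, 5], [5, 7, 1], act_bits=3)   # 3*5 - 2*7 + 5*1 = 6
product = multiply_2bit_bricks(13, 11, bits=4)            # 143
```

In the bit-serial case, halving the activation bitwidth halves the number of steps; in the spatial case, a 4-bit multiplication uses 4 of the 2-bit bricks and an 8-bit one uses 16, so narrower layers leave more bricks available for parallel work.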

Reconfigurable Accelerators
Given the increasing interest in deep learning, a wide variety of models with very different features and layers have emerged, as seen in Section 2.3. However, the majority of ASIC/FPGA accelerators for DNNs are designed and optimized to support only one type of dataflow. It can be complex to map different layers on these accelerators equally efficiently. To allow for more widespread and mass deployment of ASIC/FPGA accelerators, flexible and easily reconfigurable designs are required to support different types of layers and models.
FlexFlow [87] and DNA [88] are two accelerators that support a flexible dataflow to exploit the different types of reuse and parallelism of Conv layers. However, they only target CNNs. MPNA [89], on the other hand, provides dedicated PE array units for Conv and FC layers. In [90], a reconfigurable ASIC processor targeting hybrid NNs, i.e., networks with different types of layers, is presented. The PEs are organized into two 16×16 arrays and are divided into general PEs and super PEs. The former support Conv and FC layers, while the latter support Pooling layers, RNN layers, and non-linear activation functions. Each PE has two 8-bit multipliers that can be used separately or jointly to form a 16-bit multiplier, allowing for a variable bitwidth. The arrays can be arbitrarily partitioned into sub-arrays to process multiple layers or networks in parallel. Project Brainwave [91] is an FPGA platform used in Microsoft servers for real-time AI. The core of Project Brainwave is the NPU, a spatially distributed microarchitecture with up to 96 thousand MACs. The architecture achieves flexibility by working with vectors and matrices as data types, using efficient matrix-vector multipliers and multifunction units that can be programmed to implement a function chosen from a broad set. For easy programming, the architecture has a custom SIMD instruction set. MAERI [92] has a set of PEs, each containing a register file and a multiplier. The reconfigurability is obtained through the interconnections: the activations and weights are delivered to the PEs by a fully configurable distribution tree, and the outputs of the multipliers are collected by a configurable adder tree. SIGMA [93] introduces the Flexible Dot Product Engine (Flex-DPE), which has a structure similar to MAERI, with the multipliers arranged in a 1D structure. Thanks to highly flexible distribution and reduction networks, multiple variable-sized dot products can be performed in parallel. The flexible distribution network also allows SIGMA to accelerate sparse networks. To maximize energy efficiency, DNPU [94] proposes a heterogeneous architecture with two processors, one optimized for Conv layers and CNNs, the other targeting FC layers and Recurrent Neural Networks. Cerebras has recently announced the Cerebras Wafer Scale Engine (WSE), the largest chip ever built, which is specialized for DL computing only. The WSE has a huge number of flexible cores that support general operations (e.g., arithmetic, logical, and load/store operations) but are optimized in particular for tensor operations. The memory, in the order of gigabytes, is on-chip and distributed.
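The idea of performing multiple variable-sized dot products on a single 1D array of multipliers, as described above for MAERI and SIGMA, can be illustrated with a minimal Python sketch. This is not the implementation of either accelerator: the function name `flexible_dot_products` and the `segment_sizes` parameter are hypothetical, and the configurable distribution and reduction networks are modeled simply as list pairing and per-segment summation.

```python
from typing import List, Sequence


def flexible_dot_products(weights: Sequence[float],
                          activations: Sequence[float],
                          segment_sizes: Sequence[int]) -> List[float]:
    """Model a 1D multiplier array whose reduction network is split into segments.

    Hypothetical helper: each entry of `segment_sizes` is the length of one
    independent dot product mapped onto a contiguous group of multipliers.
    """
    if len(weights) != len(activations):
        raise ValueError("each multiplier needs one weight and one activation")
    if sum(segment_sizes) != len(weights):
        raise ValueError("segments must cover the multiplier array exactly")

    # Step 1: every multiplier produces one product (in hardware, in parallel).
    products = [w * a for w, a in zip(weights, activations)]

    # Step 2: the reduction network adds up each contiguous segment separately.
    results = []
    start = 0
    for size in segment_sizes:
        results.append(sum(products[start:start + size]))
        start += size
    return results


# Eight multipliers shared by three dot products of sizes 3, 2, and 3.
print(flexible_dot_products([1, 2, 3, 4, 5, 6, 7, 8],
                            [1, 1, 1, 2, 2, 1, 0, 1],
                            [3, 2, 3]))  # prints [6, 18, 14]
```

Changing `segment_sizes` without changing the array size mimics, under these simplifying assumptions, how a reconfigurable reduction tree lets the same multipliers serve layers with different dot-product lengths.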

Memory
Optimizing the hardware implementation of basic operations is certainly promising from an energy point of view, but inefficient memory management can nullify these gains for two main reasons. First, each DNN is composed of billions of multiply-and-accumulate (MAC) operations between activations and weights, and each MAC requires three memory accesses: two for the factors and one for the writeback of the product. Second, the energy cost of an off-chip DRAM access is three orders of magnitude larger than that of a simple floating-point addition [71]. These two issues become even more marked considering the current DNN trend towards increasingly complex models, whose size is scaled up for better accuracy. In this scenario, the reuse of data through an efficient dataflow and a conscious use of memory are the basis of the co-design of efficient architectures.
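To give a feel for the numbers involved, the following Python sketch counts the MACs of a single Conv layer and the corresponding memory accesses when no data is reused, using the three accesses per MAC stated above. The layer dimensions are hypothetical and chosen only for illustration, and the function names are not taken from any cited work.

```python
def conv_macs(h_out: int, w_out: int, c_in: int, c_out: int,
              k_h: int, k_w: int) -> int:
    """Number of MACs in one Conv layer (one input sample, no grouping)."""
    # Every output value needs c_in * k_h * k_w multiply-and-accumulate steps.
    return h_out * w_out * c_out * c_in * k_h * k_w


def accesses_without_reuse(macs: int, accesses_per_mac: int = 3) -> int:
    """Memory accesses if every MAC reads two factors and writes one result."""
    return macs * accesses_per_mac


# Hypothetical layer: 56x56 output, 64 input and 64 output channels, 3x3 kernel.
macs = conv_macs(h_out=56, w_out=56, c_in=64, c_out=64, k_h=3, k_w=3)
print(macs)                          # prints 115605504
print(accesses_without_reuse(macs))  # prints 346816512
```

Even this single mid-sized layer generates hundreds of millions of accesses in the no-reuse case, which is why the dataflow and the memory hierarchy matter as much as the arithmetic units.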
Generally, a hardware accelerator for DNNs presents a hierarchical memory structure composed of a few levels, as shown in Figure 10. The outermost level is typically a DRAM, in which all the network weights and the activations of the current layer are stored. Part of these data is periodically moved to a lower level, close to the processing elements. This intermediate level consists of three SRAM buffers: one for the input activations, one for the weights, and one for the output activations. Typically, the activation and weight buffers are separated since they may use different bit widths. The lowest level, instead, consists of local memory elements, such as registers, located within the PEs. These are responsible for the data reuse chosen by the dataflow policy. Moreover, additional memory elements can be inserted depending on the specific design and dataflow. Thanks to the memory hierarchy, there is no direct communication between the accelerator and the CPU. The CPU loads the data into the DRAM and, when present, programs the register file. Each location of the register file corresponds to a specific parameter of a DNN layer, i.e., input size, output size, number of filters, and filter size. The control unit that drives the accelerator relies on the data contained in the register file to organize the loops related to the dataflow and to generate the memory addresses needed to move the data from the DRAM to the buffers. Since the DRAM is the most power-hungry element of the hierarchy, many different techniques have been proposed to reduce the number of DRAM accesses. For example, Stoutchinin et al. [95] proposed an analytical model for the optimization of the memory bandwidth in CNN loop nests. They showed that, with minor interface changes, it is possible to reduce the memory bandwidth by a factor of 14×. Li et al. [96], instead, proposed SmartShuttle, an adaptive layer tiling scheme able to minimize the off-chip DRAM accesses. This scheme can exploit different data reuse paradigms, switching from one to another in order to better fit the tiling to different layer sizes. Although the previous works reduce the number of memory accesses, they do not consider the latency and energy per access. Putra et al. [97] tried to optimize these two factors as well for further performance enhancement.
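The role of the register file and the control unit described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the design of any cited accelerator: the `LayerRegisters` class and the `input_addresses` function are hypothetical, and the sketch assumes a single-channel input stored in row-major order, a square filter, and no padding.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class LayerRegisters:
    """Hypothetical register-file contents describing one Conv layer."""
    in_height: int
    in_width: int
    filter_size: int  # square filter, no padding (assumption)
    stride: int = 1

    @property
    def out_height(self) -> int:
        return (self.in_height - self.filter_size) // self.stride + 1

    @property
    def out_width(self) -> int:
        return (self.in_width - self.filter_size) // self.stride + 1


def input_addresses(regs: LayerRegisters,
                    base: int = 0) -> Iterator[Tuple[Tuple[int, int], List[int]]]:
    """Yield, for each output position, the DRAM addresses of its input window."""
    for oy in range(regs.out_height):          # loop over output rows
        for ox in range(regs.out_width):       # loop over output columns
            window = [
                base + (oy * regs.stride + ky) * regs.in_width
                + (ox * regs.stride + kx)
                for ky in range(regs.filter_size)
                for kx in range(regs.filter_size)
            ]
            yield (oy, ox), window


# The CPU "programs" the registers; the control unit then derives the loops.
regs = LayerRegisters(in_height=4, in_width=4, filter_size=3)
for position, addrs in input_addresses(regs):
    print(position, addrs)
```

In this toy model, consecutive windows share most of their addresses; tiling schemes such as those cited above aim to fetch such shared data from the DRAM only once.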

The memory hierarchy described above is a generic structure used by most accelerators. In some cases, however, designs optimized for the target application have been adopted, in which key hardware elements are removed. This is the case of ShiDianNao [61], where the whole accelerator is embedded inside a phone camera sensor, eliminating the need for an intermediate DRAM to store the image data. The absence of this memory, coupled with an efficient data access pattern, leads to a 60× energy saving compared to the previous architecture DianNao [64].
It is therefore clear that memory is one of the most sensitive points of the entire architecture and that, for a low-power system, its size and bandwidth must be dimensioned correctly. Wei et al. [98] proposed a framework for FPGAs that efficiently allocates the on-chip memory by exploiting the layer diversity and the lifespan of the intermediate buffers.
Although sparsity and sparse models were primarily designed to avoid unnecessary operations between null activations and weights, they indirectly reduce both the memory accesses and the memory size. Model pruning and the Rectified Linear Unit (ReLU) produce null weights and null activations, respectively. The resulting sparse matrices can be compressed, requiring less memory. Moreover, removing the useless operations (multiplications by zero) sharply reduces the required memory bandwidth, speeding up the execution of the DNN, as mentioned in Section 3.2.1.
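The effect of sparsity on storage and on the number of operations can be illustrated with a small Python sketch. The compression format used here, a list of (index, value) pairs, is only one simple choice made for illustration and is not claimed to be the format of any accelerator discussed in this survey; the function names and the example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def relu(values: Sequence[float]) -> List[float]:
    """ReLU turns every negative pre-activation into a null activation."""
    return [max(0.0, v) for v in values]


def compress(values: Sequence[float]) -> List[Tuple[int, float]]:
    """Store only the non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(values) if v != 0]


def sparse_dot(compressed_acts: Sequence[Tuple[int, float]],
               weights: Sequence[float]) -> Tuple[float, int]:
    """Dot product that skips null activations and pruned (null) weights.

    Returns the result and the number of multiplications actually performed.
    """
    total, performed = 0.0, 0
    for index, activation in compressed_acts:
        weight = weights[index]
        if weight != 0:  # a pruned weight contributes nothing, so skip it
            total += activation * weight
            performed += 1
    return total, performed


pre_activations = [-1.0, 2.0, -3.0, 0.5, 0.0, 4.0]
weights = [1.0, 0.0, 2.0, 4.0, 1.0, 0.5]  # the weight at index 1 is pruned

acts = compress(relu(pre_activations))
print(acts)                       # prints [(1, 2.0), (3, 0.5), (5, 4.0)]
print(sparse_dot(acts, weights))  # prints (4.0, 2)
```

In this example only three of the six activations are stored, and only two of the six multiplications of the dense dot product are executed, while the result is unchanged.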
Logic-in-memory (LIM) is another method of reducing or even eliminating memory accesses. This technique integrates part of the computational logic directly into the memory, so that the data can be processed without having to be extracted, as in the design proposed by Khwa et al. [99]. However, this approach can be implemented only in some cases (for example, binary networks) and is not feasible with complex state-of-the-art networks [100][101][102].

Hardware Metrics and Comparison
Comparisons between different hardware platforms or accelerators are not always straightforward, as designers often report performance in terms of the target application. For example, a GPU for server applications is difficult to compare with an ASIC- or FPGA-based accelerator for embedded applications, since their power envelopes and the amounts of data to process are at opposite ends of the spectrum. However, there are some standard metrics that researchers rely on to define the performance of the hardware, mainly the area, the power consumption, and the number of operations per second.
Area. The area, generally expressed in square millimeters or square micrometers, represents the portion of silicon required to contain all the necessary logic. It strictly depends on the technology node used during the hardware synthesis process and on the size of the on-chip memory.
Power. The power consumption is dictated by the device's power envelope and by the application for which it was designed. Battery-powered devices, for example, require extremely efficient accelerators that can overcome the pJ per MAC barrier. Energy per MAC is a metric widely used to express the efficiency of the computational side of an architecture. However, the power consumption must also include the contribution of both on-chip and off-chip memories, as they are the primary source of consumption.
Throughput. Throughput defines how often an accelerator can complete a convolution or, rather, a complete inference. Throughput and latency are determined by the device's operating frequency coupled with the memory bandwidth. Usually, this metric is expressed in billions of operations per second (Gop/s) or in billions of MACs per second (GMAC/s). Considering that a MAC consists of two operations (a multiplication and a sum), the ratio between Gop/s and GMAC/s is 2 to 1.
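The relationships between these metrics can be made concrete with a short Python sketch. The figures used (256 MAC units, 200 MHz, 0.5 W) are hypothetical and do not describe any accelerator in this survey. The peak throughput assumes that every MAC unit completes one MAC per clock cycle and that the memory bandwidth never stalls the computation, so it should be read as an upper bound.

```python
def peak_gmacs(num_mac_units: int, freq_hz: float) -> float:
    """Peak throughput in GMAC/s, assuming one MAC per unit per cycle."""
    return num_mac_units * freq_hz / 1e9


def gops(gmacs: float) -> float:
    """One MAC counts as two operations (a multiplication and a sum)."""
    return 2.0 * gmacs


def energy_per_mac_pj(power_w: float, gmacs: float) -> float:
    """Average energy per MAC in picojoules: power divided by MAC rate."""
    macs_per_second = gmacs * 1e9
    return power_w / macs_per_second * 1e12


# Hypothetical accelerator: 256 MAC units at 200 MHz, consuming 0.5 W.
throughput = peak_gmacs(num_mac_units=256, freq_hz=200e6)
print(f"{throughput:.1f} GMAC/s")                            # 51.2 GMAC/s
print(f"{gops(throughput):.1f} Gop/s")                       # 102.4 Gop/s
print(f"{energy_per_mac_pj(0.5, throughput):.2f} pJ/MAC")    # 9.77 pJ/MAC
```

Note that the energy figure depends on which power is used: taking only the power of the compute logic gives the efficiency of the computational side, whereas including the on-chip and off-chip memories gives a system-level value.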
Other metrics also help to characterize an accelerator, such as its flexibility and scalability with respect to new network models or variable bitwidths. As mentioned above, accelerators tend to be very application-dependent, so comparisons between them are often complicated and should be based on common datasets or models.
A comparison between many of the aforementioned models is provided in Table 4, where different hardware platforms and different accelerator models are presented. For each architecture, the main hardware metrics discussed above are reported. As expected, general-purpose architectures have greater area and power consumption than special-purpose architectures, as they are not optimized for a specific application.

Figure 1. (a) How artificial intelligence has evolved over the years. (b) The processes of training and inference in comparison.

Figure 2. An abstract example of a neural network.

Figure 3. A graphical model of the neuron.

Figure 4. A fully connected layer with Ci = 5 and Co = 4.

Figure 6. Comparison between the temporal and spatial architectures.

Figure 7. Comparison of the number of cores in CPUs and GPUs. CPU and GPU models have been selected for different targets, e.g., personal computers or servers, and different price ranges. For the CPUs, the gray and black lines correspond to the minimum and the maximum number of cores of a family, respectively. For the GPUs, the black lines represent the number of CUDA cores, and the gray line represents the Tensor cores, present in the Nvidia Tesla V100 only.

Figure 8. Typical design of a hardware accelerator for DNNs.

• Weight reuse: a kernel of weights is reused Ho × Wo times for each sub-grid of the input feature maps;
• Input reuse: the input feature maps are reused Co times to compute each output feature map;
• Convolutional reuse: when the weight kernel slides through the input feature maps, the sub-grids used for computation usually overlap. The input values that fall in the overlapping region are reused to compute two or more output values.
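The three kinds of reuse listed above can be quantified with a short Python sketch. The function name `reuse_factors` and the layer dimensions are hypothetical. The convolutional reuse is reported as an upper bound for an input value away from the borders, assuming a square kernel of size k and no padding: such a value can fall in at most ceil(k / stride) windows along each dimension.

```python
from typing import Dict


def reuse_factors(h_out: int, w_out: int, c_out: int,
                  k: int, stride: int = 1) -> Dict[str, int]:
    """Reuse opportunities of one Conv layer (square k x k kernel assumed)."""
    # Each weight is applied once per output position of its feature map.
    weight_reuse = h_out * w_out
    # Each input feature map contributes to every output feature map.
    input_reuse = c_out
    # Maximum number of overlapping windows that contain one input value.
    windows_per_dim = (k + stride - 1) // stride  # ceil(k / stride)
    convolutional_reuse = windows_per_dim * windows_per_dim
    return {
        "weight_reuse": weight_reuse,
        "input_reuse": input_reuse,
        "max_convolutional_reuse": convolutional_reuse,
    }


# Hypothetical layer: 56x56 outputs, 64 output channels, 3x3 kernel, stride 1.
print(reuse_factors(h_out=56, w_out=56, c_out=64, k=3))
# prints {'weight_reuse': 3136, 'input_reuse': 64, 'max_convolutional_reuse': 9}
```

With a stride equal to the kernel size the windows no longer overlap, and the sketch correctly reports a convolutional reuse of 1.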

Figure 10. Generic memory hierarchy for a DNN accelerator.

Table 1. Comparison of the most popular models in the history of DNNs.
• Output Stationary (OS): each partial sum is kept fixed in a PE, and accumulation is performed until the final sum is reached, while the weights and inputs are distributed in various ways to the PEs. These accelerators can exploit convolutional reuse. Popular OS accelerators are [61,62].
• Row Stationary (RS): this dataflow jointly maximizes the reuse of all data, i.e., inputs, weights, and partial sums. In this dataflow, the operations of a row of the convolution are mapped to the same PE. The weights are kept stationary in the PEs. For instance, Eyeriss [63] is an RS accelerator.
• No Local Reuse (NLR): this dataflow reduces the accelerator area by eliminating the RFs from the PEs and reading data only from GBs. There is no data reuse. DianNao [64] is an NLR accelerator.
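As an illustration of the Output Stationary dataflow described above, the following Python sketch computes a 1D convolution in which each PE owns exactly one output and accumulates its partial sum locally. It is a behavioral model under simplifying assumptions (1D data, unit stride, one weight broadcast to all PEs per time step) and does not reproduce the mapping of any cited accelerator; the function name is hypothetical.

```python
from typing import List, Sequence


def conv1d_output_stationary(inputs: Sequence[float],
                             weights: Sequence[float]) -> List[float]:
    """1D convolution (no padding, stride 1) with stationary partial sums."""
    num_pes = len(inputs) - len(weights) + 1
    if num_pes <= 0:
        raise ValueError("the input must be at least as long as the kernel")

    # One partial sum per PE; it never leaves the PE until it is complete.
    partial_sums = [0.0] * num_pes

    # Time loop: at step k, weight k is broadcast to all the PEs.
    for k, weight in enumerate(weights):
        # Spatial loop: in hardware, all PEs would execute this in parallel.
        for pe in range(num_pes):
            partial_sums[pe] += weight * inputs[pe + k]

    # Only the final sums are written back to the next memory level.
    return partial_sums


print(conv1d_output_stationary([1, 2, 3, 4, 5], [1, 0, -1]))
# prints [-2.0, -2.0, -2.0]
```

At consecutive time steps, neighboring PEs read the same input values (PE 1 at step 0 and PE 0 at step 1 both use `inputs[1]`), which is the overlap that convolutional reuse exploits.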

Table 2. Comparison of accelerators that exploit sparsity.

Table 3. Comparison of accelerators that support variable bitwidth operations.

Table 4. Comparison between accelerators implemented on different hardware platforms.