Accelerating Neural Network Inference on FPGA-Based Platforms—A Survey

: The breakthrough of deep learning has started a technological revolution in various areas such as object identiﬁcation, image/video recognition and semantic segmentation. Neural network, which is one of representative applications of deep learning, has been widely used and developed many efﬁcient models. However, the edge implementation of neural network inference is restricted because of conﬂicts between the high computation and storage complexity and resource-limited hardware platforms in applications scenarios. In this paper, we research neural networks which are involved in the acceleration on FPGA-based platforms. The architecture of networks and characteristics of FPGA are analyzed, compared and summarized, as well as their inﬂuence on acceleration tasks. Based on the analysis, we generalize the acceleration strategies into ﬁve aspects—computing complexity, computing parallelism, data reuse, pruning and quantization. Then previous works on neural network acceleration are introduced following these topics. We summarize how to design a technical route for practical applications based on these strategies. Challenges in the path are discussed to provide guidance for future work.


Introduction
In machine learning, deep neural network (DNN) has shown great improvement over traditional algorithms [1]. DNN models have been proposed in many areas such as image classification, detection and segmentation. However, as these models have become increasingly accurate, their data and computing resource requirements have also increased [2]. DNN models have become deeper and have developed great accuracy with high storage complexity and computation, which demands to design specialized accelerators on the deployment platforms for these models. The deployments can be summarized into two kinds-cloud and edge. The cloud deployment needs to transport the data from sensors to data centers. Training and inference of models are executed in data centers. The edge computing performs network inference closed to where data is produced, and models can be pre-trained in data centers. Marchisio et al. [3] splits these deployments into four use-case scenarios of DNNs: (1) offline DNN training in data centers, (2) inference in data centers, (3) online learning on edge device, (4) inference on edge device. In this paper, we focus on the 4th scenarios.
Mostly, edge device cannot provide huge memory and computation resource for DNNs. The power consumption for network inference is also limited. Therefore, the first scenario focuses on designing and offline training highly optimized models through methods like pruning, quantization, shift, and so forth. In another word, it is software optimization. Typical models like ResNet [4], Yolov3 [5] achieve high recognition rate as well as frame rate but have hundreds of millions parameters, which means storage

Convolutional Neural Network
The typical architecture of CNN comprises convolutional layers, pooling layers, activation function, batch normalization, dropout and fully connected layers. These structures are all related with particular computing methods, so the implementation of these submodules is the basic of acceleration. The convolutional layer is composed of a set of convolutional kernels where each neuron acts as a kernel [11]. The feature map is divided into several small blocks by convolutional kernels contain a set of weights and the kernels are slidden on the map. Main operations in convolutional layer are multiplying weights with corresponding elements in feature map and accumulating the results. The purpose of convolutional layer is extracting features from figures. The pooling layer has similar structures and computing operation to the convolutional layers. The difference are the size of convolutional kernels and the sliding step size. The function of pooling layers is to sum up similar information in the neighborhood of the receptive field and output the dominant response within this local region [12]. Activation function follows after convolutional layers and adds non-linear characteristic into feature combination. The convolutional layer and pooling layer are just multiplication and accumulation, which means they are linear process and cannot approximate nonlinearity. However, nonlinearity exists widely in reality scenario so activation functions such as sigmoid, tanh, maxout, SWISH, ReLU, MISH are introduced after convolutional layers to perfect the approximation ability of CNN. Batch normalization is adopted to solve the problem about the internal covariance shift in feature maps which causes slow convergence. Batch normalization unifies the distribution of feature-map values by setting them to zero mean and unit variance [13]. The computing operation of this process is different from others above so new computing module needs to be designed. Dropout skips some connections and units randomly to improve the generalization of CNNs. It is important and efficient in training, not inference, as well as acceleration design. Fully connected layer is introduced at the end of the network to serve as a classifier. Different from the convolutional layer and pooling layer, which are partial processes, the fully connected layer takes input from feature extraction stages and globally analyses the output of all the preceding layers [14]. Nevertheless, the computing structure is still convoluted.
In conclusion, the computing operations in CNN can be divided into three categoriesconvolution, activation function and batch normalization but the inner connections of convolution vary from different layers and requires diverse implementation strategies.

Recurrent Neural Network
Recurrent neural networks are feedforward neural networks augmented by the inclusion of edges that span adjacent time steps, introducing a notion of time to the model [15]. The connections between adjacent time steps are called recurrent edges, which form cycles connecting from a node to itself across time. The architectures of CNNs and RNNs are similar, including convolution layers, pooling layers, activation functions, and so forth. The difference is in the hidden layer, where the output of the neuron will be saved and transmitted to the next time step with a special weight. Figure 1 shows the comparison of CNNs and RNNs. Recurrent edges are exhibited in Figure 2 but the real implementation would be more complex. Take Long Short-Term Memory (LSTM) as an example, the memory cell which performs the function of connect adjacent time steps consists of input node acting as activation function for each time step, input gate deciding maintained new information, internal state keeping constant error, forget gate deciding maintained history information and output gate serving as the output of current time. The complicated architecture of recurrent edges leads to harder acceleration methods than CNNs.

Implementation Requirements
According to the analysis of the architectures of CNNs and RNNs, complicated data routes versus unmatched hardware topology and heavy memory use versus limited on-chip memory space are the contradictions which cause implementation difficulties. Since data routes/connections are complicated and hardware cannot effectively implement them without optimization, we need to split the connections into small pieces and designing processing module will be easier. The performance can be better if parallel computing is adopted in the pieces. Pruning is an additional and no-conflict method because it can reduce the connections by cut them away. However, other than splitting, performance degradation appears due to pruning. The memory use in hardware relates to more limitations. Firstly, on-chip memory is too small to meet the whole model requirement. Secondly, bandwidth and read-write speed (usually determined by clock frequency) are taken into consideration if leveraging off-chip memory. So, it is better to read a part of data into on-chip memory and make all use of it and throw it away without future call. Combining above two method, we can realize an efficient implementation architecture of DNNs on edge platforms. Problems of parameters will be introduced in the Section 4 along with quantization and computing complexity.

Resource-Limited Platform
Besides the architecture of DNNs, the platform of acceleration is another significant module. This part we research four kinds of platforms-FPGA, GPU, ASIC and MCU. They all resource-limited and require optimizations for network implementation. The advantages and disadvantages of each platform are taken into account to conclude the trends of acceleration strategies on them.

FPGA-Based Acceleration
FPGA shows tremendous potential for NN acceleration because of its programmability, enabling the developers to consider both the logic and the algorithm and implement specific logic for target computation. For typical FPGA-based accelerators, there are two architectures, SoC FPGA and standard FPGA. Figure 3 shows their structures. All FPGA accelerators consist of two parts, FPGA and CPU. The difference is whether the FPGA and CPU are arranged on one chip. CPUs are commonly ARMs in edge scenarios.  Both CPUs and FPGAs work with their own external memory and can access each other's memory through inner connections. The SoC FPGA provides a higher inner bandwidth between the CPU and FPGA, while the standard FPGA offers more logic and clock resources. However, the connections between FPGA and ARM are not an important factor to acceleration performance. ARM module cannot afford massive computing operations, so it serves as a switch to launch the whole system and provides some non-linear computing capabilities such as non-maximum suppression. FPGA module is a programmable device which provide many multiplying units and on-chip memory resource. All computation is executed on FPGA. During the calculation, ARM can also shift the index address of input or output data according to the computing steps and transmit it to FPGA. Furthermore, the on-chip storage of both platforms is too small to reach the requirements of popular NN models. Thus, the measurement of choosing which platform relies on performance and resource of chips on it rather than structures in generic scenarios.
It is necessary to introduce external storage units or develop massive FPGAs such as Virtex UltraScale+ VU19P launched by Xilinx or Stratix10 GX 10M announced by Intel. Until now, developers have still worked on common FPGAs with few resources. Table 1 presents the latest accelerator designs on FPGA-based platforms and compares their performance. In general, FPGA-based designs have achieved outstanding energy efficiency, which is important in mobile and edge applications. However, the speed still needs to be improved. As is summarized in Section Section 1, strategies taken in FPGA-based acceleration follow five guidelines-computing complexity, pruning, quantization, computing parallelism and data reuse. Computing complexity can be reduced by shift operation, look-table, approximate computing, stochastic computing and Winograd algorithm. The adoption of special instruction sets such as RSIC-V or ISA is also helpful. Pruning in FPGA acceleration mostly focus on structural methods because irregular tunnels are incompatible with the design philosophy of FPGA. Quantization tends to transform long float-point data to short fix-point one due to the gap between the computing burden of two data forms. Computing parallelism and data reuse are methods taken in accelerator design. The typical rule in acceleration is to make the best of on-chip memory and minimize data transmission from or to off-chip memory. If data transmission is inevitable, bandwidth should be fully utilized. We elaborate these methods as well as relative samples in Section 4.

GPU-Based Acceleration
GPUs are still the most widely used processors in neural network development. A typical GPU consists of many of arithmetic logic units (ALUs) for data processing and a few caches for data retransmission. Controllers merge multiple data accesses into fewer. GPUs can afford massive, parallel and pipelined computation which is important in DNN inference. Meanwhile, the high parallel structure and high inner bandwidth of a GPU determine its excellent performance in DNN training processes. However, common GPUs is not a good choice for edge application due to high power consumption, large space requirement and severe thermal design.
In addition to common GPUs, embedded GPUs have become the other development platforms because of the limits of applications of common GPUs in mobile scenarios. Taking the NVIDIA Jetson TX2 as an example, Figure 4 shows the structure of the TX2. The GPU core in Jetson TX2 is small and have low complexity, enough for network inference but hard to train DNNs on this platform. Jetson TX2 can afford 1.3 TFLOPS (Tera Floating-point operations per second) with single precision floating point data.  Acceleration on embedded GPUs centers on lightweight design of network, namely, quantization and pruning. Unstructured pruning is still ineffective on the GPUs, just like on FPGAs. Nevertheless, quantization methods are unlimited because of professional floating-point units (FPUs) in GPUs. But the units support different precision floating point numbers so consideration about data format is necessary. The deployment of DNNs on embedded GPUs, as well as the acceleration strategies, relies on frameworks such as Tensorflow, Pytorch, which lead to short development times and less program difficulty, compared to FPGAs.

ASIC Acceleration
ASIC designs can achieve higher efficiency than FPGA-based accelerators but require a much longer development cycle and higher cost. They are optimized for particular algorithms or even specific computational operations. Well-known ASIC products include TPU [23], DianNao series [24][25][26][27], and so forth. Table 2 shows some of the latest ASIC accelerators and their performance. The development time and difficulty of ASIC platforms are far beyond embedded GPUs or FPGAs. The co-design of hardware and software makes ASIC chips the most efficient platform with relatively smaller areas. Meanwhile, the co-design, which is maturely realized on embedded GPUs by development frameworks and can be realized through Integrated Development Environment (IDE) on FPGAs (FPGAs act as customize platform that can build specific circuits by code), in ASIC demand developers with professional hardware and software knowledge. In general, ASIC is a good choice for commercial and industrial application rather than explorative research.

Microprogrammed Control Unit Acceleration
Microprogrammed Control Unit (MCU) is another striving direction in hardware acceleration. MCUs have been widely promoted in various application because of cheap cost and stable performance. Namely, deploying and accelerating DNNs on MCUs is the most economical choice. However, MCU, just like its name, Sis good at controlling rather than computing. Complicated logic control units, large cache and high clock frequency guarantee that MCUs can execute multi processes in short latency. Meanwhile, limited floating point units lead to poor computing performance. So lightweight networks are suitable for MCUs. Quantization on MCUs aims at fixed point data with short bit width to reduce computing burden. Structured pruning is also helpful on MCUs. NNoM [35] is proposed as a framework to effectively implement neural networks on MCUs. Evaluating the development time and difficulty of DNN deployment on MCUs is hard on account of a lack of published works but we consider it as a relatively simple task without potentiality bounded by hardware resource.

Why Choose FPGA
After discussing all these resource-limited platforms and barriers in acceleration, we are going to illustrate why we prefer FPGA as acceleration platform. Five aspects are taken into account including-development time (t), development difficulty (d), cost (c), flexibility ( f ), throughput (p). In fact, these indicators vary in different designs even they are in the same hardware, not to mention that ASIC platforms are actually different hardware chips each. Our comparison relies on our research which could be imperfect and reaches compromise, qualitative standards among diverse designs. Development time. The development time of ASIC is the longest without doubt. Then is FPGA. Embedded GPU and MCU is similar if mature frameworks are adopted.
Development difficulty. ASIC is still the first in this aspect due to its complicated design process combine hardware and software. FPGA is the next. MCU is the third one because of its limited hardware resource. Embedded GPU is the last.
Cost. It is obvious that MCU is the cheapest among four categories. Embedded GPU is the second one after mass production. FPGA is a little expensive if we need high performance. Dedicated chips have the potential to become economical if be massproduced. However, ASIC is now still an expensive, rare choice.
Flexibility. FPGA is the most flexible device due to its programmability. Embedded GPU, MCU and AISC are all immutable designs so the flexibility of them is limited.
Throughput. As is shown in Tables 1 and 2, ASIC and FPGA share approximate performance level. Embedded GPU shows excellent capacity of computing floating point data. MCU, limited by poor computing resource, has poor throughput. Figure 5 intuitively compares these platforms and illustrates merits and demerits of each kind chip. Based on Figure 5, we summarize our reasons for choosing FPGA. Firstly, the programmability of FPGA, namely flexibility, shows compatibility of diverse algorithms and acceleration strategies. Secondly, the development time and difficulty are moderate, not a circumstance to ASIC. Thirdly, throughput of FPGA is enough for network deployment, even though it is not up to embedded GPU. Although the price of high-performance FPGAs is not very cheap, we think it is acceptable considering its benefits.

Computing and Memory Oriented Accelerating
In this part, we expound how the architecture and parameter of DNNs determine the acceleration strategies on FPGA-based platforms. The parameter determines the computing form in the networks while the architecture of DNNs is the significant factor of program structure. Illustrating from five points: computing complexity, pruning, quantization, computing parallelism and data reuse, we refer to the latest works of other developers, summarizing and comparing different methods.

Multiplication Optimization
The computing complexity comes from multiplications in network inference, so it is a direct optimization to replace or remove part of multiplications. In [20], the authors propose DiracDeltaNet, which is based on ShuffleNetV2, where all of the 3 × 3 convolutions and 3 × 3 depth wise convolutions are replaced with shift operations and 1 × 1 convolutions. The shift operators aggregate spatial information by copying nearby pixels directly to the center position [20]. A 3 × 3 convolution needs to traverse 9 points. If the number of channels is greater than 9, in order to traverse the information of all pixels, they divide channels into 9 groups and adopt the same shift in each group, while the rest channels choose the center point. However, the contribution of each channel to the output in this approach needs to be evaluated according to the model and data set. An ideal allocation should consider not only the redundancy of features but also the contributions of each shifted feature.
Baluja et al. [36] eliminates all multiplications and float-point operations by deploying a precomputed multiplication table. The authors receive |A| activations after quantizing nonlinear activation functions and confirm |W | weights in the neural network. Then, they compute all of the multiplications and store the results in a A × W table. Figure 6 shows the working process. According to the input of the layer, we can calculate the index and find the multiplication results in a simple 1D array. This approach does avoid the consumption of processing elements, but it presents new challenges for both storage and index by requiring scaling of the boundaries of the multiplication table in the face of different activation functions, and by calculating more multiplication tables if different quantization of weights is adopted.  Table  (last row is for bais, last column is for final layer output) Figure 6. Using a stored multiplication table to avoid multiplication [36]. Calculations are transformed into search operations, which obviously reduces computation.

Approximate Computing
The rising performance demands are expected to outpace the growth in resource budgets; hence, over provisioning of resources alone will not solve the conundrum that awaits the computing industry in the near future [37]. Approximate computing (AC) has become a promising method for solving this problem. In [38], the authors propose an AC strategy that trains a neural network to mimic an approximable code region. This method enables the compiler to invoke a low-power NPU to replace the original code. In [39], the researchers present a technique to accelerate approximable code regions on limitedprecision analog hardware by a NN approach. ApproxANN [40] considers the impact of each neuron on the output quantity and energy consumption and obtains a criticality ranking, whereas unimportant neurons have higher priorities for approximation. Then, the authors use iterative heuristics to decide the number of neurons to be approximated and the approximation strategy. A method is proposed in [41] to transform any given neural network to an approximate neural network (AxNN) and a quality-configurable neuromorphic processing engine (qcNPE) to execute the AxNNs. Xu et al. adopt iterative training in the quality control of approximate computing. They propose an optimization framework to coordinate the training of the classifier and accelerator with a judicious selection of training data [42]. In [43], a new method of approximate computation is introduced, namely, dynamic-voltage-accuracy-frequency-scaling (DVAFS), which can dynamically balance the energy and accuracy. DVAFS reuses inactive arithmetic cells under reduced precision to improve the energy efficiency.

Stochastic Computing
Stochastic computing (SC) is a very unique algorithm that represents and processes information in the form of digitized probabilities [44]. It has low computational complexity but needs long computation times and shows low accuracy. Figure 7 shows a demonstration of SC. The data in SC are coded as the probabilities of observing a 1 at a bit-stream with a given length. The output of the AND gate is the product of two probabilities if two bit-streams are suitably uncorrelated or independent. So, improvement of the accuracy of a stochastic computation requires an exponential increase in the bit-stream length as well as the computation time. A high degree of error tolerance is another feature of SC, especially for transient or soft errors caused by process variations or cosmic radiation [44]. Changing a single bit of output, as shown in Figure 7b, the result would not change greatly because each bit of the output in SC enjoys the same weight.
Stochastic computing can radically simplify the hardware implementation of arithmetic units and has the potential to bring the success of DCNNs to embedded systems [45]. SC-DCNN [46] is the first framework that applies SC to DNN, according to the authors. They propose the most efficient implementations of SC to inner product/convolution, pooling and activation functions. For example, they replace the conventional AND gate of SC to a 16-bit approximate parallel counter (APC) in order to improve the accuracy; they split each of four bit-streams into several segments and infer the largest bit-stream according to the largest segment of the four candidates in order to reduce the latency of the pooling; and they adopt Btanh as an activation function to address different bit-stream lengths. In [47], near-zero weights is removed to improve the accuracy of SC when it was adopted in a DNN. Their experiments show that the XNOR operation, which represents multiplication in bipolar encoding, causes a very large error if the near-zero weights are introduced into the computing. They also scale the weights to a large range in such a way that the weights would become far from the zero center.
SC is also inefficient with storage; thus, some researchers convert bit-streams into binary numbers to avoid the overhead of SC and reduce computing time. BISC-MVM [48] has been developed by introducing a novel stochastic number generator (SNG). By simplifying and restructuring the computation process from BN-to-SN conversion to an SC process and to SN-to-BN conversion, the authors apply this method in DCNN acceleration, reducing the computing time. In [45], authors introduce normalization and drop-out to the SC-based DCNN framework. The authors use an approximate parallel counter, a near-max pooling block and an SC-based rectified linear activation unit to extract the features and propose a novel SC-based normalization design. Researchers have found that SC multiplication would be more accurate after logarithmic quantization and integrates SC and logarithmic quantization [49]. SkippyNN [50] proposes a differential multiply-and-accumulate unit called the DMAC to reduce the computation time of SC-based multiplications in convolutional layers. In the SC domain, the computation time of this product is determined by the length of the bit-streams. For example, x j × w j takes w j clock cycles in BISC-MVM [48]. Thus, SkippyNN uses ∆w = w j − w i in the calculation to reduce the computation time. The weights are reordered in ascending order so that ∆w < w j . The Winograd algorithm was first discovered by Toom [51] and Cook [52] and it was generalized by Winograd [53]. In [54] this method was introduced into neural networks. The Winograd algorithm can decrease multiplications in the convolution at the cost of increasing a small number of additions and shifts which shows great potential for hardware accelerators because most hardware can perform addition by consuming negligible logic and power resources. In [55], researchers propose the application of Winograd on FPGAs. To minimize the bandwidth requirement, they design a line-buff structure that caches the feature map for the Winograd algorithm to reuse the data and propose an efficient Winograd PE to enhance the parallelism during the convolution operation progress. UniWiG [56] is a unified architecture that can accelerate Winograd-based convolution and general matrix multiplication (GEMM) on the same process elements (PEs). Previous studies, such as [55,57,58], all design especially designed PEs for Winograd-based convolution, which means that the CONV layers and FC layers need separate PEs, thus leading to heavy resource utilization. Moreover, the Winograd algorithm is mostly used to accelerate convolutions with a small kernel size, and large kernels need direct convolution. UniWiG transforms the Winograd operation into matrix multiplication by a blocked Winograd filtering algorithm in order to perform the GEMM and Winograd algorithm on the same data path.

Loop Unrolling
To increase the parallelism of computation, the basic strategy is loop unrolling. Loop unrolling is a multi-dimensional expansion of convolution operation. It can be divided into four categories-unrolling convolution kernel, unrolling input channel, unrolling output feature map, unrolling output channel. Though all of these methods utilize the parallelism in network inference, only unrolling convolution kernel and unrolling input channel are commonly used considering their combination with data reuse. Figure 8a demonstrates the process of unrolling convolution kernel. The width and height of th kernel is kx and ky. After initialization, the multiplication of kx × ky weights and features are parallelly performed in a single cycle and accumulated to get intermediate results which are stored in on-chip memory. The final results are restored in off-chip memory and will be transmitted to on-chip memory when involved in subsequent computing.   Figure 8b shows the expansion of the input channel. Pixels in the same region of each channel multiply the relative weights in the kernel in a single clock cycle. Whole convolution operation needs all pixels in feature map are traversed. We can accomplish it with two approaches-feature map precedence or kernel weight precedence, which determine whether we calculate all multiplications about one pixel then move to the next or we prefer to completely utilize one weight and never require it afterwards. We will expound the options and relevant conditions in the part of data reuse.

Pipeline
Another path to realizing high parallelism hides in the loop processing during inference. Single loop contains the data read and write. The generic approach only performs an operation in one clock cycle. Therefore, at least eight clock cycles are needed to complete 3 loops. A method called pipeline are introduced to read, write and calculate at the same time, which can improve the effective utilization rate of bandwidth and data storage. In [59], three versions of pipeline are presented: single loop optimization, nested loop optimization and array partition optimization. Figure 9a is the common flow of loop processing with three steps-reading, computing and writing. Figure 9b shows how a pipeline reduces processing time from 8 clock cycles to 3 by executing reading, computing and writing simultaneously. Nested loop optimization adopts a strategy named rewinding to deal with nested loops. When the first loop is finished, there will be an idle period for reestablishing a new reading and writing operations for the next loop. Therefore, we rewind the address from end to start in advance before the first loop finished and preload the data for the next loop. In this way, the first and second loop are connected just like a single one. When the above methods are employed, the inner bandwidth of on-chip memory becomes a bottleneck. Only one element can be updated form BRAM in one clock cycle. If one loop is much longer than others, it would make other computing to wait for its accomplishment. On account of this phenomenon, array partition optimization is introduced to loop processing, where a single loop is divided into two part and processed in two data routes. It can decrease the inner-layer latency obviously. Other methods such as dataflow, ping-pong operation follow the same guideline. RD  (d) Array partition optimization Figure 9. Comparison of processing with and without pipeline [59]. Pipeline can increase the efficiency and speed of the computing process, especially there are sparse hardware resources. According to the application situation, the method can be divided into three categories: single loop optimization, nested loop optimization, and array partition optimization. These techniques can be combined.

In-Memory Processing
In-memory processing (IMP) is a newly developed technology which is suitable for parallel computing. The computation is moved from digital logic to analog domain. The application of IMP in hardware acceleration can be divided into several types-one type is using the inherent dot-product characteristics of the crossbar architecture to accelerate matrix multiplication [60]; another type is to utilize the analog nature of eNVMs to implement a neuromorphic network; and the third type is to avoid ADC/DAC blocks in the IMP by implementing logic using memristor switching. Most IMP technologies depend on emerging nonvolatile memory technologies (eNVMs), such as phase change memory (PCM) [61] and resistive RAM (RRAM) [62], especially RRAM which has been wide used.
RRAM becomes a promising solution in the implementation of CNNs on embedded platforms because of its capacity for calculating the matrix-vector product with high precision [63], high parallelism and the characteristic of energy efficiency. RRAM devices are able to support a large number of signal connections within a small footprint by taking advantage of the ultra-integration density [64]. RRAM can realize a resistive cross-point structure called the crossbar. The RRAM crossbar can store weights as conductance values of cells, and feature maps are converted into input voltage signals. The output feature maps can be read out through the accumulated currents on the bitlines. Figure 10 shows the basic structure of the RRAM crossbar and the operations on it, which are elementary units in the computing process. The crossbar and operations on it are independent. M × N times M-V products which replace multiplications can be performed concurrently and intermediate results are the accumulated currents on the bitlines. RRAM offer concurrent computing ability based on its circuit configuration while methods we proposed above center on program structure. Namely, they are orthogonal because those methods are software design, but RRAM is an excellent hardware module for realizing them. FPGA and RRAM are also compatible. There are several challenges while applying RRAM in NN accelerators. It mainly suffers from two challenges: parametric variation and switching variation [65]. Parametric variation is caused by imperfect fabrication such as line-edge roughness, oxide thickness fluctuations and random discrete dopants [66]. Switching variation is due to driving circuits. Any changes in the current or voltage during programming would cause a large variation in the resistance. Consequently, the resistance of a memristor might not be changed as required. In [67], researchers present a fault-tolerant training method as well as an online fault detection system for RRAM-based neural network adaption. Researchers avoid mapping the large-weight synapses to the abnormal memristors by deriving a weight-memristor mapping for variations and defects [65].
In addition to the fact that faults can occur during implementation, there is divergence between NN networks and RRAM structures. First, RRAM is used as an analog computing device, and it is difficult to directly store large amounts of analog intermediate results, while some other functions in the CNN, such as max pooling, are difficult to implement in analog circuits [64]. Therefore, the analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) are necessary for RRAM, which causes more than 98% consumption of power and area. In [64], the DACs are eliminated by quantizing the data between the layers into single bits and reduces the employment of ADCs while merging the results of RRAM crossbars by using quantized data as selection signals. Second, RRAM crossbar-based architectures cannot take advantage of the sparsity of neural networks since RRAM is a dense device. In [68], researchers propose a neural network computation architecture called SNrram, which is based on RRAM to leverage the sparsity in both the weights and activations. In [69], researchers present MaxNVM, which is a codesign of sparse coding and eNVM technologies (i.e., RRAM). The authors find a balance between the density and reliability.

Loop Tiling
In addition to computing, we must allocate the on-chip memory reasonably. Due to cost and other factors, the on-chip memory is commonly not large enough to meet the requirements of storing all the weights, input feature maps and intermediate calculation results on the chip. Therefore, data must be stored in off-chip memory but there is a big gap between bandwidth of off-chip and on-chip memory. In the convolution computation, the problem of reading off-chip memory is involved both in and between layers. In the layer, blocks of weights or pixels need to be read from off-chip memory and intermediate should be stored in on-chip memory while final results are stored in off-chip memory (determined by memory space) and reload from it if used. Between the layer, we have to load the new pixels and weights of next layer which are saved in off-chip memory. In order to avoid the network inefficiency caused frequent memory reading and writing, efficient memory architecture design and reasonable data reuse are necessary.
Loop Tiling is the basic strategy. As is shown in Figure 11, the input feature map is divided into blocks of Tix × Tiy size and kernel weights are in Tkx × Tky blocks. Only Ti f input channels and To f output channel are taken into account in a single part of whole loop. After the loop is partitioned, input feature maps and weights of block size are respectively read and stored on the chip during calculation. In order to reduce the reading and writing times of off-chip storage to the greatest extent, it is necessary to carry out data reuse after block circulation, and the choice of data reuse needs to be considered comprehensively according to block size and block calculation times. In addition, as we mentioned above, the block loop after cyclic partitioning can also apply a higher-level flow design like pipeline or dataflow to improve the parallelism of elements reading, computing, and writing.  After introducing loop tiling, another choice appears. We have the final results of several regions in the layer and start to calculate the next part. Whether should we wait the whole layer finished or we just start to calculate the next layer with existing results along with the next part of current layer? The answer depends on the resource consumption and bandwidth requirement of current layer. The key point is that only calculating a single layer needs to write results back to DRAM which could be directly transmitted and used in the next layer. If parameters are huge, such as the first and second layer, and the rest resource can afford the computing of the next layer, we should start the next layer to save bandwidth.
If the layer is small with more kernels and channels, namely huge potential parallelism with fewer parameters, starting the next layer is a not wise attempt. Furthermore, if the bandwidth cannot reach the requirements of the first case (we need to write intermediate results to BRAM and final results back to DRAM) and on-chip memory is big enough (the second case requires results for next layer computing, so they are stored in BRAM), the second scheme is better.

Parameter Reordering
When the block inputs are load into chip, latency exist owning to discrete memory access since we initialize parameters and stored them in the order but in usage, they are not continuous due to loop tiling. So, reordering the parameters as we required can improve the utilization efficiency of bandwidth. Figure 12 demonstrates the flow of parameter reordering. Inputs whether pixels or weights are saved in the order of they are gathered not in the logic order. The continuous access to DRAM (commonly serves as off-chip memory) determines large burst length, namely, high bandwidth. Figure 12. Parameter reordering. According to loop tilling, only a small share of input map is introduced. Thus, the sequence of the feature map can be reorganized to speed the process of memory reading.

Near-Memory Process
The long data path of reading or writing is another cause of latency in memory access which makes near-memory process (NMP) an effective strategy. Fundamentally different from IMP, the underlying principle of NMP is processing in proximity of memory-by physically placing monolithic compute units (GPU, FPGA, ASIC, CPU and CGRA) closer to monolithic memory-to minimize the data transfer cost [60]. Previous work, such as EXECUBE [70] and DIVA [71], is not being widely applied due to the cost and manufacturability of the implementation. Along with the progress in die stacking technology, the challenges in NMP have been alleviated, and newly developing memories contribute to the application of NMP in the hardware acceleration of neural networks. TensorDIMM [72] is a NMP architecture for embeddings and tensor operations. This method gathers tensors in "near-memory" and copies a single, reduced result to GPU memory, which can reduce the latency of the gathering as well as the size of the data transmission. Moreover, commodity DRAM devices are still leveraged in this method, so the cost and manufacturability are acceptable.
In [73], authors concentrate on accelerating the training in hardware accelerators. The authors propose a near-memory acceleration engine called NTX, which is an FP streaming coprocessor on TCDM. Combining NTX with a general RSIC-V processor core on a TCDM that provides a shared memory space with single-cycle access, this method is proven to require less area and power consumption as well as less latency.
These considerations are based on ideal conditions. In practice, the storage read latency, data transmission latency, time consumption of computing exists, and the existence of such factors makes optimization more complicated. How to use pipeline covering the time delay, maximum bandwidth utilization and processing elements, minimize the latency in the accelerator? It is hard to give the optimal decision and various factors in the actual situation require deliberateness and experiments.

Reconfigurable Convolutional Kernels
The reconfigurable convolutional kernel is proposed in [74]. The authors find that interval parameters in a convolution do not change over a long time in weight stationary CNNs, which provides feasibility in adopting reconfiguration. This work presents a scheme that uses a variant of Chapman's KCM technique [75], fast LUT -based reconfiguration [76,77], pipelined compressor trees [78,79] and faithful rounding [80]. The authors construct their reconfigurable architecture based on LUTs, which is the fundamental unit in FPGAs. The reconfigurable cell consists of two LUTs controlled by the same input. One of the LUTs is called shadow LUT, which means that it is not involved in current calculations and can be activated in a single clock cycle. The reconfigurable time can be ignored if it is faster than the convolution operations with the same parameters, and results can be achieved by switching the select signal during the computation. However, more memory resources are required. This method provides a new direction of data reuse.

Pruning
Weight reduction is a common method for making the models sparse. There are many approaches that have been proposed. One kind is to approximate the weight matrix with a low-rank representation [1], which has been applied in [81][82][83][84], and so forth. Another method is pruning. By utilizing the inherent redundancy in the neural network, pruning directly removes weights with small absolute values with negligible accuracy loss. In [85][86][87], this solution is applied in different processes, and divergent pruning criteria can be used for compact networks.
Pruning can be divided into two categories: structured pruning and unstructured pruning. Unstructured pruning just remove the connections but reserve the neurons. A neuron will not be deleted if at least one connection exists. The unrestricted sparsity lead to irregular computation and memory accesses, limiting the realizable parallelism, especially when the implementation is on FPGAs [19] or GPUs [88]. Unique encoding formats like CSC, CSR are adopted to optimize memory access. Structured pruning delete unimportant neurons so the network is still density. This is hardware friendly but the importance of all connections linking to the removed neurons is difficulty for detailed analysis.
An algorithm combining unstructured pruning and structured pruning is proposed in [85]. A hardware-friendly compact model is generated. They trim unimportant connections and neurons according to the weight and size of the output, removing connections (neurons) if and only if the weight (neuron output) is smaller than a predefined threshold. Specifically, it is divided into two steps: deleting weights and deleting neurons. They treat connections with low effective weights as irrelevant and remove them. In each iteration, only the least important weights (such as the top 1%) in each layer are removed, and then the entire DNN is retrained to restore its performance. After each iteration of insignificant weight pruning, if all inputs or outputs of a neuron are deleted, the neuron pruning is performed to remove the neuron. Furthermore, they propose a local region convolution algorithm, which allows the convolution kernel to convolve only the regions of the image of interest. This method drastically reduces the number of Floating-point operations per second (FLOP).
In [19], a proposed pruning method called "bank-balanced sparsity" is used in the long short-term memory (LSTM) network on the FPGA platform. In a bank-balanced sparsity (BBS) pattern, each matrix is split into multiple equal-sized banks, and each bank has the same number of nonzero values [19]. The relatively large weights would remain so that the model accuracy can be maximally maintained. Parallelism can also be utilized inside and between the banks instead of using block sparsity. Figure 13 shows an example of the comparison of unstructured sparsity, block sparsity and BBS with a 50% sparsity ratio. As shown in Figure 13, unstructured sparsity selects the smallest number (50%) of weights globally; block sparsity divides the entire matrix into 8 banks with a 2 × 2 size and represents each bank with their average weight; and bank-balanced sparsity splits each matrix row into 2 equal banks and fine-grained prunes within every bank. In order to realize effective implementation of BBS in FPGA, they put forward the Compressed Sparse Banks (CSB) encoding format using the balance characteristic of the BBS, eliminating the need for decoding.  Figure 13. Comparison of unstructured sparsity, block sparsity and bank-balanced sparsity [19]. Unstructured sparsity selects the smallest number (50%) of weights globally; block sparsity divides the entire matrix into 8 banks with a 2 × 2 size and represents each bank with their average weight; and bank-balanced sparsity splits each matrix row into 2 equal banks and fine-grained prunes within every bank.
In [89], the sparsity in Resistive Random-Access Memory (ReRAM) accelerators is different from that in common methods because of the characteristics of ReRAM. This kind of accelerator consists of multiple processing engines (PEs) connected with on-chip interconnects [89]. Each PE is composed of multiple computation units (CUs) that have multiple crossbar arrays. The acceleration of the convolution and fully connected layers depends on these crossbars. Weights are stored as conductance values of ReRAM cells, and feature maps are converted into input voltage signals; the output feature maps can be read out through the accumulated currents on the bitlines [89]. Therefore, we must find all-zero rows/columns of a crossbar array for compression and all-zero decomposed input bits of a crossbar array to exploit the activation sparsity. In [90], ReCom is proposed to regularize the distribution of zero weights and to find more all-zero rows/columns. In [68], all-zero filters are used to compress the models. A special method based on k-means clustering is presented in [91]. Authors exploit the weight and activation sparsity together by practical fine-grained OU-based computations [89]. In [92], authors put forward an approach that reduces the computation by the ReLU function.
Generic sparsification exposes several inefficiencies on GPUs [93]. Similar to FPGAs, GPUs are not optimized for sparse matrices since they have different rows and columns from the original matrix. It is difficult to partition the workload evenly into GPUs [94]. Second, the number of nonzero elements in each row is unknown until runtime, which makes it difficult to choose an optimal tiling scheme for data reuse [88]. Furthermore, the long latency of the memory access is nonnegligible even if a sparse matrix theoretically has less computation. In [88], researchers run 2 sparsity on an NVIDIA Tesla V100 GPU, and the result shows that even if we increase the sparsity to 96%, the sparse layer cannot achieve the same performance of the dense layer, only 73% of it. Therefore, to realize the advantages of sparsity on GPUs, we need optimized methods. A pruning methodology called Vector Sparse is presented with iterative vector wise sparsification and retraining for CNNs and RNNs [88]. Collaboratively designed with Tensor Core, the ultimate accuracy of Vector Sparse exhibits a negligible difference between the network having 75% sparsity and the dense network, while Vector Sparse is 63% faster than the dense counterparts on the GPU CUDA Cores.
ASIC CNN accelerators show higher performance and efficiency than other platform accelerators, and the sparsity technique can be more individual. Eyeriss [30] uses a power gating unit to power off the multiplier when the activation or weight is zero [95]. EVI-SION [31] adopts the same strategy. The sparsity of the activations and weights can be used to improve the energy efficiency but not the performance because the total clock cycles remain unchanged. Eyeriss v2 [96] integrates a new PE architecture that enables sparse weights and input activations to be handled directly in the compressed domain, resulting in not only energy efficiency but also throughput improvement. STICKER [33] proposes autonomous neural network processors and multi-sparsity compatible convolution PE arrays, routing nonzero results to 2-way memory banks and ultimately realizing high energy efficiency and computation acceleration at the same time. Moreover, [95] presents an N-way group association architecture to reduce the output memory overhead in sparse CNN accelerators.

Quantization
Quantization is a parameter-level optimization strategy. The influence of parameters on network inference focuses the speed and precision. The weights and activations are represented and stored by floating point data in most neural networks, which retains information but leads to slow computing, especially on FPGA. Replacing this representation with low-bit and fixed-bit data can not only reduce bandwidth usage and memory storage space but also simplify the computation and reduce the cost of each operation, nevertheless, sacrificing the precision. In this part, we discuss three quantization methods: linear quantization, nonlinear quantization and binary neural networks.
Linear quantization represented the floating points with fixed points. The large gap between the range of the floating points and fixed points makes most of the weights and activations underflow or overflow if the nearest fixed points is adopted. Scaling and biasing factors can avoid this phenomenon by changing the value range of parameters. In [81], authors indicate that the range of the weights and activations in a single network is limited and shows a difference between different layers. Thus, the authors adopt diverse bit widths for the weights and activations according to their layers. Researchers attempt to maintain a large data width for only the first and last layers, and the middle layers would be quantized to 3 or 2 bits [97]. However, it is proved in [98] that selecting the same data width for all of the values in the network, even the width differs in each layer, would not be a good choice. The authors propose Shape Shifter, which suggests groupings of weights and activations and uses a specific data width to encode each group, the sizes of which vary from 16 to 256 values. In addition to shortening the bits of weights and activations, we can scale the data on the basis of the exponent sign or a logarithm. REQ-YOLO implements a heterogeneous weight quantization using the alternating direction method of multipliers (ADMM) [99]. For some convolutional layers, they adopt isometric quantization; for other convolutional layers, they use the mixed powers-of-two-based quantization. The mixed weight representation based on two powers consists of symbol bit part and amplitude bit part. The first bit represents symbol bit, and the last five bits represent amplitude bit. The product of the input and the weight is the sum of the two shifted values. In [100], they adopt a logarithmic data representation in the data quantization. This method reduces the bit width of both the weights and the activations, and it further reduces the complexity of the processing elements (PEs) and the energy consumption.
Different from linear quantization, nonlinear quantization introduces a look-up table, assigning weights and activations to binary codes in the table. In [101], researchers use a hash function to create a look-up table and train the values in it. Weight sharing and clusters are adopted in data quantization in [102]. Table-based neural units [103] propose to quantify all parts of the neural networks and replace the activation-weight-multiply step with a simple table-based look-up. In [104], the authors formulate the weights and activations quantization operation as a differentiable nonlinear function and explore a simple and uniform method for quantization. This method achieves lossless results with only 3 bits of weight quantization on ResNet-18 and obtains slightly better results than ADMM on object detection tasks.
The binary neural network (BNN) is the result of extreme quantization, where data can have only two possible values, namely, −1(0) or +1 [105]. The benefit of this method is that it has less memory storage because all weights and activations are represented by 1 bit. Furthermore, the multiplication operations can be simplified as XNOR or Bit count operations which shows hardware-friendly properties. The challenge is to optimize the strategies of model binarization to maintain the model accuracy. In [106], Bi-Real Net is proposed, which focuses on reducing the information loss by adding a strategy called Bi-Real. Researchers show that the number of neurons is more influential to the BNN and improves the performance of a BNN by ensemble methods [107]. Liu et al. propose that the binarization of the activations is the main cause of the large performance loss of binary networks and suggest applying multiple binarizations to the activations [108]. In [109], authors takes advantage of BNNs, combining parallel SRAM arrays with BNNs and proposing HBNN and XNOR-BNN. Binary convolutions are also adopted on SRAM to accelerate the BNN [110]. The authors present two proposals, namely, one based on a charge sharing approach to perform vector XNOR and approximate pop count and another based on bitwise XNOR followed by a digital bit-tree adder for an accurate pop count.

Technique Discussion
After discussing various techniques in Section 4, we summarize how to design a technical route for practical application in this part. As shown in Figure 14, starting from model and deployment platform (M-P), all techniques are around the base circle and stratified according to their optimization targets. The inner layers include multiplication optimization, approximate computing, stochastic computing and Winograd algorithm, all of which centers on reducing computing complexity, namely, remove or replace multiplications, and cannot be employed together. The second layer is quantization. After the computing format is decided, the length and range of parameters are taken into account. Afterwards, pruning becomes the third layer, which change the inner connection structure of network. The mutual effect of quantization and pruning need to be paid attention since they both influence the precision of network. The fourth layer is IMP and NMP. These technologies are adopted in hardware designs as optional choices. IMP is a potential way to parallel computing. NMP can reduce the memory access latency through shortening the data route. The outermost is about efficient computing and memory access, including loop tiling, loop unrolling and pipeline. The three strategies can be united. Loop unrolling and pipeline are basic optimizations but prominently improve the speed of network inference. Loop tiling is the foundation of data reuse. Reconfigurable convolutional kernels are a special and optional approach which reuses the data by retaining the state of LUTs concerned to former computing, at the cost of on-chip resource. Parameter reordering is an optional optimization method when loop tiling is used.
The technical route can be drawn on the figure from inside out. The innermost and outermost layers are necessary as the start and end. The innermost layer determines the model and platform in the acceleration task. The outermost layer is essential because without at least one of these strategies, acceleration is impossible on the edge platforms especially FPGAs. Other layers can be skipped because they are all optional methods. L 1 describes a ordinary design of accelerator which only adopts quantization, pruning. L 2 utilizes multiplication optimizations such shift operations. Quantization and pruning are also taken into consideration. This design requires a extra hardware platform supporting NMP. The outermost methods are all in use to realize the optimal accelerator structure.
There are still challenges in the path. The quantization approaches we introduced in Section 4 are various and all reach pretty good performance as well as precision. It is hard to differentiate these methods and provide a guideline for quantization strategy selection. Furthermore, though shorter bit width of parameters is normally regarded as better choice for faster computing and transmitting, the latency of each layer is determined by the most time-consuming operation which means low-bit computing sometimes may not be able to obviously reduce the delay. Namely, longer bit width can be explored and precision may be improved. Challenges also exist in pruning. The struggle between structured and unstructured pruning is presented in Section 4. Besides the ambiguous effects to precision and inference efficiency of both kinds of methods, the role of pruning in hardware accelerator design is more confusing. Different from quantization which directly improve throughput, pruning brings more complicated memory access but same computing format, which means no hardware efficiency improved. When it comes to segmentation networks, the encode-decode connections make sparseness of network difficult to analyze. In addition to common detailed convolution structure, there are still heteroideus operators such as atrous convolution, depthwise separable convolution, deformable convolution. It is necessary to redesign memory structure and data reuse scheme. The generalization of acceleration frameworks is another topic worth studying. Most implementations are based on specific platforms and algorithms. Xilinx has proposed a framework called DPU with Vitis-AI library, which has officially supported Resnet50, Inception V1/V2/V3, Mobilenet V2, Yolov1, Yolov2, Yolov3, FPN, SP-net, and so forth. However, segmentation algorithms like DeepLab, SetNet still need developers' future work.  Figure 14. Technical route design. The normal acceleration scheme usually involves different gradations from the bit width of weights to the selection of different platforms and models. In this process, several techniques which are listed on the figure will be introduced. They can be chosen individually but tight connections will be built in the final realization.

Conclusions
In this paper, we review resource-limited platforms and neural networks about acceleration. We analyze the architecture of networks and characteristics of hardware platforms, generalizing their effects on acceleration strategies. Potential techniques for acceleration are divided into five topics-computing complexity, computing parallelism, data reuse, pruning and quantization. We illustrate their technical details and benefits as well as disadvantages. Our research provides a feasible method to design technical routes for acceleration tasks. The research also shows that there are still challenges about method selection/evaluation. There are many special structures waiting for optimization. The generalization of acceleration frameworks is another issue. More research is needed in both hardware and software to make neural networks practical in production.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: