HCM: Hardware-Aware Complexity Metric for Neural Network Architectures

Convolutional Neural Networks (CNNs) have become common in many fields including computer vision, speech recognition, and natural language processing. Although CNN hardware accelerators are already included as part of many SoC architectures, the task of achieving high accuracy on resource-restricted devices is still considered challenging, mainly due to the vast number of design parameters that need to be balanced to achieve an efficient solution. Quantization techniques, when applied to the network parameters, lead to a reduction of power and area and may also change the ratio between communication and computation. As a result, some algorithmic solutions may suffer from a lack of memory bandwidth or computational resources and fail to achieve the expected performance due to hardware constraints. Thus, the system designer and the micro-architect need to understand at early development stages the impact of their high-level decisions (e.g., the architecture of the CNN and the number of bits used to represent its parameters) on the final product (e.g., the expected power saving, area, and accuracy). Unfortunately, existing tools fall short of supporting such decisions. This paper introduces a hardware-aware complexity metric that aims to assist the system designer of neural network architectures throughout the entire project lifetime (especially at its early stages) by predicting the impact of architectural and micro-architectural decisions on the final product. We demonstrate how the proposed metric can help evaluate different design alternatives of neural network models on resource-restricted devices such as real-time embedded systems, and avoid design mistakes at early stages.


INTRODUCTION
Domain-specific systems were found to be very efficient in general, and when developing constrained devices such as IoT, in particular. A system architect of such devices must consider hardware limitations (e.g., bandwidth and local memory capacity), algorithmic factors (e.g., accuracy and representation of data), and system aspects (e.g., cost, power envelope, battery life, and more). Many IoT and other resource-constrained devices provide support for applications that use convolutional neural networks (CNNs). Such algorithms can achieve spectacular performance in various tasks covering a wide range of domains such as computer vision, medicine, autonomous vehicles, etc. Notwithstanding, CNNs contain a vast number of parameters and require a significant amount of computation during inference, thus monopolizing hardware resources and demanding massively parallel computation engines; see the example shown in Fig. 1.

Figure 1: Our 3×3 kernel 8-bit processing engine (PE) layout using the TSMC 28nm technology. The carry-save adder can fit 12-bit numbers, which is large enough to store the output of the convolution.

* Equal contribution.
These requirements have led to great interest in using custom-designed hardware for efficient inference of CNNs, which would allow the promise of neural networks to be realized in real-life applications by deploying them on low-power edge devices. Developing such systems requires a new set of design tools due to the tight entanglement between the algorithmic aspects, the chip architecture, and the constraints the end product needs to meet. In particular, great efforts were made to develop low-resource CNN architectures [14,24,27,33]. One example of such architectural changes is the splitting of regular 3 × 3 convolutions into a channel-wise 3 × 3 convolution followed by a 1 × 1 one. Another way to reduce the computational burden is to quantize the CNN parameters, weights and activations, employing a low-bit integer representation of the data instead of the expensive floating-point representation. Recent quantization-aware training schemes [8,10,16,34,35] achieve near-baseline accuracy for as low as 2-bit quantization. The benefit of quantizing the CNN is twofold: both the number of gates required for each multiply-accumulate (MAC) operation and the amount of routing are reduced. The decision regarding which algorithm to choose may depend on the architecture (e.g., FPGA or ASIC), the accuracy requirements, and their impact on performance and power. Thus, the architect needs to make these fundamental decisions early in the development process, and no existing tool can help predict these design factors ahead of time.
The impact of the high-level structure of the accelerator, e.g., the type of CNN layers and the representation of the operands, on the power, area and performance of the final product needs to be defined and predicted at an early stage of the project. Recent research has shown that ASIC-based architectures are the most efficient solution for CNN accelerators both in datacenters [6,17,22] and in real-time platforms [5,11,25]. Accordingly, we demonstrate the proposed metric and design tool on an implementation of a streaming [23] ASIC-based convolutional engine. Nevertheless, our methodology can be applied to the evaluation of other types of architectures, such as FPGA-based accelerators [1,2,29]. In both cases, the development process involves an important trade-off between the logic gate area, the routing on the silicon, and the performance of the resulting system. Unfortunately, all these parameters also depend on the representation of the data and its impact on both communication and computation. To date, there is no quantitative metric for this trade-off available at the design stage of a CNN accelerator, and no tool exists that can assist the architect in predicting the impact of high-level decisions on the important design parameters. Ideally, the designer would like to have an early estimation of the chip resources required by the accelerator, as well as the performance, accuracy and power it can achieve.
A critical difficulty in trying to predict the design parameters of CNN-based systems is the lack of a proper complexity metric. Currently, the most common metric for calculating the computational complexity of CNN algorithms is the number of MAC operations, denoted as OPS (or FLOPS in the case of floating-point operations). This metric, however, does not take into account the data format or additional operations performed during the inference, such as memory accesses and communication. For that reason, the number of FLOPS does not necessarily correlate with the runtime [18] or the required amount of computational resources. This paper proposes a different metric for assessing the complexity of CNN-based architectures: the number of bit operations (BOPS), as presented by Baskin et al. [3]. We show that BOPS is well-suited to the task of comparing different architectures with different weight and activation bitwidths.

Contribution.
This paper makes the following contributions: Firstly, we study the impact of CNN quantization on the hardware implementation in terms of computational resources and memory bandwidth considerations. Specifically, we study a single layer in the neural network.
Secondly, we extend the previously proposed computation complexity for quantized CNNs, termed BOPS [3], with a communication complexity analysis to identify the performance bottlenecks that may arise from the data movement.
Thirdly, we extend the roofline model [32] to accommodate this new notion of complexity. We also demonstrate how this tool can be used to assist architecture-level decisions at the early design stages.
Lastly, we implement a basic quantized convolution block with various bitwidths in a 28nm process to demonstrate an accurate estimation of the power/area of the hardware accelerator. This allows changing high-level design decisions at early stages and saves the designer from major mistakes that otherwise would be discovered too late. We compare our estimations with previous approaches and show a significant improvement in the accuracy of translating algorithmic complexity into hardware resource utilization.
The rest of the paper is organized as follows: Section 2 reviews the related work; Section 3 describes the proposed hardware-aware complexity metric; Section 4 provides a roofline analysis of CNN layer design using the proposed metric; Section 5 provides experimental results using a common CNN architecture; and Section 6 concludes the paper.

RELATED WORK
In this section, we provide an overview of prior work that proposed metrics for estimating the complexity and power/energy consumption of different workloads, focusing on neural networks. The most commonly used metric for evaluating computational complexity is FLOPS [19]: the number of floating-point operations required to perform the computation. In the case of integer operations, the obvious generalization of FLOPS is OPS, which is simply the number of operations. A fundamental limitation of these metrics is the assumption that the same data representation is used for all operations; otherwise, the calculated complexity does not reflect the real one. Wang et al. [31] claim that FLOPS is an inappropriate metric for estimating the performance of workloads executed in datacenters and propose a basic-operations metric that uses a roofline-based model, taking into account the computational and communication bottlenecks for a more accurate estimation of the total performance.
In addition to general-purpose metrics, other metrics were developed specifically for the evaluation of neural network complexity. Mishra et al. [20] define the "compute cost" as the product of the number of fused multiply-add (FMA) operations and the sum of the widths of the activation and weight operands, without distinguishing between floating- and fixed-point operations. Using this metric, the authors claimed to have reached a 32× "compute cost" reduction by switching from FP32 to binary representation. Still, as we show further in our paper, this is a rather poor estimate of the hardware resources/area needed to implement the computational element. Jiang et al. [15] note that a single metric cannot comprehensively reflect the performance of deep learning (DL) accelerators. They investigate the impact of various frequently used hardware optimizations on a typical DL accelerator and quantify their effects on accuracy and throughput under representative DL inference workloads. Their major conclusion is that high hardware throughput is not necessarily highly correlated with high end-to-end inference throughput, due to the data feeding between host CPUs and AI accelerators. Finally, Baskin et al. [3] propose to generalize FLOPS and OPS by taking into account the bitwidth of each operand as well as the operation type. The resulting metric, named BOPS (binary operations), allows area estimation of quantized neural networks, including cases of mixed quantization.
The aforementioned metrics do not provide any insight on the amount of silicon resources needed to implement them. Our work, accordingly, functions as a bridge between the CNN workload complexity and the real power/area estimation.

COMPLEXITY METRIC
In this section, we describe our hardware-aware complexity metric (HCM), which takes into account the CNN topology, and define design rules for the efficient implementation of quantized neural networks. The HCM metric assesses two elements: the computation complexity, which quantifies the hardware resources needed to implement the CNN on silicon, and the communication complexity, which defines the memory access pattern and bandwidth. We describe the changes resulting from switching from a floating-point representation to a fixed-point one, and then present our computation and communication complexity metrics. All results for fixed-point multiplication presented in this section are based on the Synopsys standard library multiplier using TSMC's 28nm process.

The impact of quantization on hardware implementation
Currently, the most common representation of weights and activations for training and inference of CNNs is either 32-bit or 16-bit floating-point numbers. The fixed-point MAC operation, however, requires significantly fewer hardware resources, even for the same input bitwidth. To illustrate this fact, we generated two multipliers: one for 32-bit floating-point¹ and the other for 32-bit fixed-point operands. The results in Table 1 show that a fixed-point multiplier uses approximately eight times less area, gates, and power than its floating-point counterpart. Next, we generated a convolution with a k × k kernel, a basic operation in CNNs consisting of k² MAC operations per output value. After switching from floating-point to fixed-point, we explored the area of a single processing engine (PE) with variable bitwidth. Note that the accumulator size depends on the network architecture: the maximal bitwidth of the output value is b_w + b_a + log₂(k²) + log₂(n), where n is the number of input features. Since the extreme values are very rare, however, it is often possible to reduce the accumulator width without harming the accuracy of the network [6]. Fig. 2 shows the silicon area of the PE as a function of the bitwidth. We performed a polynomial regression and observed a quadratic dependence of the PE area on the bitwidth, with a coefficient of determination R² = 0.9999877. This nonlinear dependency demonstrates that the impact of quantization on hardware resources is quadratic: reducing the bitwidth of the operands by half reduces the area and, by proxy, the power approximately by a factor of four (contrary to what is assumed by, e.g., Mishra et al. [20]).

¹ FPU100 from https://opencores.org/projects/fpu100
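The accumulator sizing above can be sketched directly; the following minimal illustration (the function name is ours, not from the paper) computes the worst-case output bitwidth:

```python
import math

def accumulator_width(b_w, b_a, k, n):
    """Maximal output bitwidth of a k x k convolution over n input
    features with b_w-bit weights and b_a-bit activations:
    b_w + b_a + ceil(log2(k^2)) + ceil(log2(n))."""
    return b_w + b_a + math.ceil(math.log2(k * k)) + math.ceil(math.log2(n))

# Example: 8-bit weights and activations, 3x3 kernel, 64 input features
print(accumulator_width(8, 8, 3, 64))  # 8 + 8 + 4 + 6 = 26 bits
```

As noted above, a real design can often use a narrower accumulator than this worst case, since the extreme values are rare.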

Computation
We now present the BOPS metric defined in Baskin et al. [3] as our computation complexity metric. In particular, we show that BOPS can be used as an estimator for the area of the accelerator. The area, in turn, is found to be linearly related to the power in case of the PEs.
The computation complexity metric describes the amount of arithmetic "work" needed to calculate the entire network or a single layer. BOPS is defined as the number of bit operations required to perform the calculation: the multiplication of an n-bit number by an m-bit number requires n · m bit operations, while addition requires max(n, m) bit operations. In particular, Baskin et al. [3] show that a k × k convolutional layer with b_a-bit activations and b_w-bit weights requires

BOPS = n · m · k² · (b_a · b_w + b_a + b_w + log₂(n · k²))

bit operations, where n and m are, respectively, the numbers of input and output features of the layer. The formula takes into account the width of the accumulator required to accommodate the intermediate calculations, which depends on n.
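The per-layer BOPS count is straightforward to evaluate; a sketch (names are ours, and the formula follows the reconstruction above, so treat it as illustrative):

```python
import math

def layer_bops(n, m, k, b_w, b_a):
    """BOPS of a k x k convolutional layer: n*m*k^2 MACs, each costing
    b_a*b_w bit operations for the multiply plus (b_a + b_w + log2(n*k^2))
    for the accumulation into the widened accumulator."""
    return n * m * k * k * (b_a * b_w + b_a + b_w + math.log2(n * k * k))

# 3x3 layer with 64 input/output features, 4-bit weights and activations
print(f"{layer_bops(64, 64, 3, 4, 4) / 1e6:.2f} MBOPS")
```

The quadratic b_a · b_w term dominates at larger bitwidths, which matches the quadratic area trend observed in Fig. 2.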
The BOPS of an entire network is calculated as the sum of the BOPS of the individual layers. Creating larger accelerators that can process more layers in parallel involves simply replicating the same individual PE design. In Fig. 3, we calculated BOPS values for the PEs from Fig. 2 and plotted them against the area. We conclude that for a single PE with variable bitwidth, BOPS can be used to predict the PE area with high accuracy. Next, we tested the predictive power of BOPS scaling with the size of the design. We generated several designs with variable bitwidths, b_w = b_a ∈ {4, 6, 8}, and variable numbers of PEs, n = m ∈ {4, 8, 16}, used to accommodate the multidimensional inputs and outputs that typically arise in real CNN layers. Fig. 4 shows that the area depends linearly on the BOPS over a range of two orders of magnitude of total area, with goodness of fit R² = 0.9980. We conclude that since BOPS provides a high-accuracy approximation of the area and power required by the hardware, it can be used as an early estimator. While the area of the accelerator depends on the particular design of the PE, this only affects the slope of the linear fit, since the area is still linearly dependent on the number of PEs. Using only algorithm-level information such as the number of input and output features and the kernel size, an architect can immediately assess the silicon area the PEs will occupy and obtain an early estimate of the power needed to run the network, without any prior knowledge of VLSI constraints.

Communication
Another important aspect of hardware implementation of CNN accelerators is memory communication. The transmission of data from the memory and back is often overlooked by hardware implementation papers [1,5,28] that focus on the raw calculation ability to determine the performance of their hardware. In many cases, there is a difference between the calculated performance and real-life performance, since real-life implementations of accelerators are often memory-bound [17,21,30].
For each layer, the total memory bandwidth is the sum of the activation and weight sizes read and written from memory. In typical CNNs used, e.g., in vision tasks, the first layers consume most of their bandwidth for activations, whereas in deeper layers that have smaller but higher-dimensional feature maps (and, consequently, a larger number of kernels), weights are the main source of memory communication.
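This decomposition of per-layer traffic can be sketched as follows (the layer shapes below are illustrative, and the function name is ours):

```python
def layer_memory_bits(h, w, n, m, k, b_w, b_a):
    """Bits moved for one h x w layer: read b_w-bit weights and b_a-bit
    input activations, write b_a-bit output activations (same spatial
    size assumed for input and output)."""
    weights = m * n * k * k * b_w
    acts = h * w * n * b_a + h * w * m * b_a  # read inputs + write outputs
    return weights, acts

# Early layer: activations dominate; deep layer: weights dominate
w_early, a_early = layer_memory_bits(56, 56, 64, 64, 3, 4, 4)
w_deep, a_deep = layer_memory_bits(14, 14, 256, 256, 3, 4, 4)
print(w_early < a_early, w_deep > a_deep)  # True True
```

The two example shapes mirror the trend described above: early, large feature maps are activation-bound, while deep, many-channel layers are weight-bound.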
We assume that each PE can calculate one convolution result per clock cycle and that the resulting partial sum is saved in the cache. In Fig. 5, we show typical memory access progress at the beginning of the convolutional layer calculation. In the first stage, the weights and the first k rows of the activations are read from memory at the maximal possible speed so that the calculations can start as soon as possible. After the initial data are loaded, the unit reaches a "steady state", in which it needs to read only one new input value from memory per clock cycle (the other values are already in the cache). We assume the processed signals to be two-dimensional (images), which additionally requires k new values to be loaded at the beginning of each new row.
Note that until the weights and the first activations are loaded, no calculations are performed. The overhead bandwidth of the pre-fetch stage can be mitigated by working with larger batch sizes, loading the weights once and reading several inputs for the same weights. By doing this, we minimize the penalty of reading the weights relative to reading the actual input data. In the case of real-time processing, however, larger batches are not possible. We focus on the latter real-time streaming regime in this paper because of its great importance in a range of applications including automotive, security, and finance. The memory access pattern depicted in Fig. 5 must be kept in mind when designing the hardware, since it may limit the performance of the accelerator and decrease its power efficiency.
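Under these assumptions, the pre-fetch overhead before the first output can be estimated as follows (a sketch with our own naming; `row_width` is the width of the input feature map):

```python
def prefetch_bits(n, m, k, row_width, b_w, b_a):
    """Bits that must be read before any calculation starts: all
    m*n*k*k weights plus the first k rows of the n input feature maps."""
    return m * n * k * k * b_w + k * row_width * n * b_a

# 3x3 kernel, 64 input/output features, 56-wide rows, 4-bit data
print(prefetch_bits(64, 64, 3, 56, 4, 4))  # 147456 + 43008 = 190464 bits
```

In the streaming regime this cost is paid once per layer invocation, which is why it cannot be amortized over a batch.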

ROOFLINE ANALYSIS
So far, we discussed the use of BOPS for the prediction of the physical parameters of the final product, such as the expected power and area. In this section, we extend the BOPS model to a system level, by introducing the OPS-based roofline model. The traditional roofline model, as introduced by Williams et al. [32], suggests depicting the dependencies between the performance (e.g., GFLOPS/second) and the operation density (the average number of operations per information unit transferred over the memory bus). Now, for each machine we can draw "roofs": the horizontal line that represents its computational bounds and the diagonal line that represents its maximal memory bandwidth. An example of the roofline for three applications assuming infinite compute resources and memory bandwidth is shown in Fig. 6. The maximum performance a machine can achieve for any application is visualized by the area below both bounds, shaded in green.
Since, as indicated in Section 3.1, FLOPS cannot be used for efficient estimation of the complexity of quantized CNNs, we introduce a new model that is based on the BOPS metric presented in Section 3.2. This model, to which we refer as the OPS-based roofline model, replaces the GFLOPS/s axis of the roofline plot with a performance metric more adequate for neural networks, the number of operations per second (OPS/s), and replaces the operation-density axis with a metric that measures computational density in operations per bit (OPS/bit). Using generic operations and bits allows plotting quantized accelerators with different bitwidths on the same plot.
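The OPS-based roofline reduces to taking the minimum of the two roofs; a minimal sketch (the numbers below are hypothetical, not from our designs):

```python
def attainable_ops(peak_ops_per_s, mem_bw_bits_per_s, ops_per_bit):
    """Attainable performance is capped either by the compute roof or
    by memory bandwidth times operational density (the diagonal roof)."""
    return min(peak_ops_per_s, mem_bw_bits_per_s * ops_per_bit)

# A 1 TOPS/s engine on a 100 Gbit/s bus: memory-bound below 10 OPS/bit
print(attainable_ops(1e12, 100e9, 5))   # 5e11, memory-bound
print(attainable_ops(1e12, 100e9, 20))  # 1e12, compute-bound
```

The crossover point (here 10 OPS/bit) is where the diagonal memory roof meets the horizontal compute roof, as in Fig. 6.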
As an example of the proposed approach, we use two different ResNet-18 layers (a deep layer, which is computationally intensive, and an early one, which is memory-intensive) on four different accelerator designs: 32-bit floating-point, 32-bit fixed-point, and quantized 8-bit and 4-bit fixed-point. The accelerators were implemented using standard ASIC design tools, as detailed in Section 5, and were built using the TSMC 28nm technology with standard 2.4GHz DDR-4 memory with a 64-bit data bus.

Figure 6: Roofline example. In the case of App1, memory bandwidth prevents the program from achieving its expected performance. In the case of App2, the same happens due to limited computational resources. Finally, App3 represents a program that could achieve its maximum performance on a given system.
The first example employs an accelerator with a silicon area of 1mm² and an 800MHz clock speed. The task is the 11th layer of ResNet-18, which has a 3 × 3 kernel and 256 input and output features of dimension 14 × 14 each. Looking at Table 1, it is possible to fit only 85 32-bit floating-point multipliers in 1mm². That allows the installation of 9 PEs (without taking into account the area required for the accumulators of the partial sums) and the calculation of convolutions with a 3 × 3 × 3 × 3 kernel in a single clock. Using the known areas of the 4-bit, 8-bit and 16-bit PEs, we extrapolate the area of the 32-bit fixed-point PE to be 16676µm². From these data, we can place 60 PEs with 7 × 7 × 3 × 3 kernels, 220 PEs with 14 × 14 × 3 × 3 kernels and 683 PEs with 26 × 26 × 3 × 3 kernels, for 32-bit, 16-bit and 8-bit fixed-point PEs, respectively, on the given area.
To calculate the amount of OPS/s required by the layer, under the assumption that a full single pixel is produced every clock, we need to calculate the number of MAC operations required to produce one output pixel (n × m × (k² + 1)) and multiply it by the accelerator frequency. To calculate the OPS/bit for each design, we divide the number of MAC operations in the layer by the total number of bits transferred over the memory bus, which includes the weights, the input and the output activations. The layer requires 524.288 TOPS/s to be calculated without stalling for memory access and computation. The available performance of the accelerators is summarized in Table 2 and visualised using the proposed OPS-based roofline analysis in Fig. 7.
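The per-layer requirement worked out above can be reproduced directly (a sketch; the (k² + 1) term follows the MAC count per output pixel stated in the text):

```python
def required_tops(n, m, k, freq_hz):
    """TOPS/s needed to emit one full output pixel per clock:
    n*m*(k^2 + 1) MAC operations per pixel times the clock rate."""
    return n * m * (k * k + 1) * freq_hz / 1e12

# 11th layer of ResNet-18: 3x3 kernel, 256 input/output features, 800 MHz
print(required_tops(256, 256, 3, 800e6))  # 524.288
```

This reproduces the 524.288 TOPS/s figure quoted for the first example.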
In this example, the application's requirements are out of the scope of the product definition. On one hand, all accelerators are computationally bound (all horizontal lines are below the application's requirements), indicating that we do not have enough PEs to calculate the layer in one run. On the other hand, even if we decide to increase the computational density by using stronger quantization or by increasing the silicon area (and the cost of the accelerator), we would still hit the memory bound (represented by the diagonal line). In this case, the solution should be found at the algorithmic level or by changing the product's targets; e.g., we can calculate the layer in parts, increase the silicon area while decreasing the frequency in order not to hit the memory wall, or decide to use another algorithm. Our second example explores the feasibility of implementing the second layer of ResNet-18, which has a 3 × 3 kernel and 64 input and output features of dimension 56 × 56. For this example, we increase the silicon area to 6mm² and lower the frequency to 100MHz, as proposed earlier, and add a 4-bit quantized accelerator for comparison purposes. The layer requires 4.1 GOPS/s. The accelerators' results are summarized in Table 3 and visualised with the OPS-based roofline analysis in Fig. 8.
From Fig. 8 we can see that our 32-bit and 16-bit accelerators are still computationally bound, while the 8-bit and 4-bit quantized accelerators meet the demands of the layer. In particular, the 8-bit accelerator is located at the border of computational ability, meaning this solution has nearly optimal resource allocation, since the hardware is fully utilized. Still, the final choice of the configuration depends on other parameters such as the accuracy of the CNN.
Both examples demonstrate that decisions made at early stages have a critical impact on the quality of the final product. For example, applying aggressive quantization to the network or increasing the silicon size may not improve the overall performance of the chip if its performance is memory-bound. From the architect's point of view, it is important to balance between computation and data transfer. Nonetheless, this balance can be achieved in different ways: at the micro-architecture level, at the algorithmic level, or by changing the data representation. The architect may also consider (1) changing the hardware to provide faster communication (which requires more power and is more expensive), (2) applying communication bandwidth compression algorithms [4,7], (3) using fewer bits to represent weights and activations (using a 3- or 4-bit representation may solve the communication problem, at the cost of reducing the expected accuracy), or (4) changing the algorithm to transfer data more slowly (even though that solves the bandwidth issue, the possible drawback is a reduced throughput of the whole system). The proposed OPS-based roofline model helps the architect choose among these alternatives. After making the major architectural decisions, we can use BOPS to estimate the impact of different design choices on the final product, such as the expected area, power, optimal operational point, etc. The next section examines these design processes from the system design point of view.

HCM METRIC EVALUATION
After introducing the use of BOPS as a metric for the hardware complexity of CNN-based algorithms and the use of the OPS-based roofline model to help the architect understand how decisions at the algorithmic level may impact the characteristics of the final product, this section aims to provide a holistic view of the design process of systems with CNN accelerators. We conducted an extensive evaluation of the design and implementation of a commonly used CNN architecture for ImageNet [26] classification, ResNet-18 [12]. We also compared our metric to prior art [20] in terms of the correspondence between the complexity score and hardware utilization for CNN parameters with various bitwidths.

Experimental methodology
We start the evaluation of the HCM metric with a comprehensive review of the use of BOPS as part of the design and implementation process of a CNN accelerator. This section shows the trade-offs involved in the process and verifies the accuracy of the proposed model. It focuses on the implementation of a single PE, since PEs are directly affected by the quantization process. The area of an individual PE depends on the chosen bitwidth, while a change in the number of input and output features changes both the required number of PEs and the size of the accumulator. The leading example we use implements an all-to-all CNN accelerator that can calculate n input features and m output features in parallel, as depicted in Fig. 9. For simplicity, we choose an equal number of input and output features. In this architecture, all the input features are routed to each of the m blocks of PEs, each calculating a single output feature.

Figure 9: All-to-all topology with n × m processing elements.

The implementation was done for an ASIC using the TSMC 28nm technology library, an 800MHz system clock, and the nominal corner of V_DD = 0.81V. For the power analysis, we used a value of 0.2 for both the input activity factor and the sequential activity factor. The tool versions are listed in Table 4. For brevity, we present only the results of experiments at an 800 MHz clock frequency. We performed additional experiments at 600 MHz and 400 MHz (obviously, neither BOPS nor the area of an accelerator depends on the chip frequency), but do not show these results. As shown in Section 4, lowering the frequency of the design can help to avoid the memory bound, but incurs the penalty of slower solution time.
Our results show a high correlation between the area of the design and BOPS. The choice of the all-to-all topology shown in Fig. 9 was made because of an intuitive understanding of how the accelerator calculates the outputs of the network. This choice, however, imposes a greater routing difficulty on the layout than alternatives such as broadcast or systolic topologies [6]. For example, a systolic topology, a popular choice for high-end NN accelerators [17], eases the routing complexity by using a mesh architecture. Although it reduces the routing effort and improves the flexibility of the input/output feature count, it requires a more complex control for the data movement to the PEs.
To verify the applicability of BOPS to different topologies, we also implemented the systolic array shown in Fig. 10, where each 1 × 1 PE is connected to 4 neighbors with the ability to bypass any input to any output without calculations. The input feature accumulator is located at the input of the PE. This topology generates natural 4 × 1 PEs, but with proper control, it is possible to create flexible accelerators.

System-level design using HCM
In this section, we analyze the acceleration of ResNet-18 using the proposed metrics and show the workflow for early estimation of the hardware cost when designing an   accelerator. We start the discussion by targeting an ASIC that runs at 800MHz, with 16 × 16 PEs and the same 2.4GHz DDR-4 memory with a 64-bit data bus as used in Section 4. The impact of changing these constraints is discussed at the end of the section. For the first layer, we replace the 7 × 7 convolution with three 3 × 3 convolutions, as proposed by He et al. [13]. This allows us to simplify the analysis by employing universal 3 × 3 PEs for all layers.
We start the design process by comparing different alternatives using the newly proposed OPS-based roofline analysis, since it helps to explore the design trade-offs between the multiple solutions. We calculate the amount of OPS/s provided by 16 × 16 PEs at 800MHz and the requirements of each layer. To acquire the roofline, we need to calculate the OPS/bit, which depends on the quantization level. For ResNet-18, the current state of the art [9] achieves 69.56% top-1 accuracy on ImageNet with 4-bit weights and activations, which is only 0.34% less than the 32-bit floating-point baseline (69.9%). Thus, we decided to focus on 2-, 3- and 4-bit quantization of both weights and activations, which achieves 65.17%, 68.66%, and 69.56% top-1 accuracy, respectively.
For a given bitwidth, OPS/bit is calculated by dividing the total number of operations by the total number of bits transferred over the memory bus, consisting of reading the weights and input activations and writing the output activations. Fig. 12 presents the OPS-based roofline for each quantization bitwidth. Note that for each layer we provide two points: the red dots are the performance required by the layer, and the green dots are the equivalent performance using partial-sum computation. Fig. 12 clearly indicates that this accelerator is severely limited by both computational resources and insufficient bandwidth. The system is computationally bound, which can be inferred from the fact that it does not have enough PEs to calculate all the features simultaneously. Nevertheless, the system is also memory-bound at any quantization level, meaning that adding more PE resources would not solve the problem. It is crucial to make this observation at the early stages of the design, since it means that micro-architecture changes alone would not be sufficient to solve the problem.
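The OPS/bit computation can be sketched as follows (a sketch under the stated assumptions, with a uniform bitwidth b for both weights and activations; the function name is ours):

```python
def ops_per_bit(h, w, n, m, k, b):
    """Operational density of an h x w layer: MAC operations divided by
    bits moved (read weights and inputs, write outputs) at bitwidth b."""
    macs = h * w * n * m * (k * k + 1)
    bits = m * n * k * k * b + h * w * (n + m) * b
    return macs / bits

# Second layer of ResNet-18 (56x56, 64 features, 3x3 kernel):
# halving the bitwidth doubles the operational density
print(ops_per_bit(56, 56, 64, 64, 3, 2) / ops_per_bit(56, 56, 64, 64, 3, 4))  # 2.0
```

Since the operation count is fixed while the bits moved scale with b, stronger quantization moves every layer to the right on the roofline plot.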
One possible solution, as presented in Section 4, is to divide the channels of the input and output feature maps into smaller groups and use more than one clock cycle to calculate each pixel. In this way, the effective amount of OPS/s required by the layer is reduced. If the number of feature maps is divisible by the number of available PEs, the layer fully utilizes the computational resources, which is the case for every layer except the first one. Reducing the number of PEs, however, also reduces the data efficiency, and thus the OPS/bit also decreases, shifting the points to the left on the roofline plot. Consequently, some layers still require more bandwidth than the memory can supply. In particular, in the case of 4-bit quantization, most of the layers are memory-bound. The only option that properly utilizes the hardware is 2-bit quantization, for which all the layers except one are within the memory bound of the accelerator. Another option for solving the problem is to either change the neural network topology or add a data compression scheme on the way to and from the memory [4,7]. Adding compression reduces the effective memory bandwidth requirement and allows adding more PEs to meet the performance requirements, at the expense of cost and power.
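The drop in OPS/bit caused by grouping can be sketched with a simple traffic model. The 12-bit partial-sum width follows the carry-save adder of our PE (Fig. 1); the grouping scheme and layer dimensions here are illustrative assumptions:

```python
# Traffic model for channel grouping (sketch). Splitting the input
# channels into g groups means each output pixel is accumulated over g
# passes; the 12-bit partial sums (the carry-save adder width of our PE)
# are written and re-read g-1 times before the final output write.
def grouped_ops_per_bit(h, w, n, m, k, bits, g, psum_bits=12):
    ops = 2 * h * w * n * m * k * k
    weight_bits = n * m * k * k * bits
    in_bits = h * w * n * bits
    # final write at `bits`, plus 2*(g-1) partial-sum transfers per pixel
    out_bits = h * w * m * (bits + 2 * (g - 1) * psum_bits)
    return ops / (weight_bits + in_bits + out_bits)

print(round(grouped_ops_per_bit(56, 56, 64, 64, 3, 4, g=1), 1))  # no grouping
print(round(grouped_ops_per_bit(56, 56, 64, 64, 3, 4, g=4), 1))  # 4 passes
```

The partial-sum traffic dominates quickly, which is why the green dots in Fig. 12 sit far to the left of the red ones.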
At this point, BOPS can be used to estimate the power and the area of each alternative for implementing the accelerator using the PE micro-design. In addition, we can explore other trade-offs, such as the influence of modifying some of the parameters that were fixed at the beginning: lowering the ASIC frequency decreases the computational bound, which reduces the cost and hurts the performance only if the network is not memory-bound. An equivalent alternative is to decrease the number of PEs. Both measures reduce the power consumption of the accelerator as well as its computational performance. The system architect may also consider changing the parameters of the algorithm, e.g., change the feature sizes, use different quantization for the weights and the activations, include pruning, and more.
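The equivalence of the two knobs is easy to see in a sketch of the compute roof (the PE counts and clocks are example values from the running configuration):

```python
# Two equivalent ways to lower the compute roof (sketch): halving the
# clock or halving the PE count. Per-PE throughput assumptions as before.
def compute_roof(pes, freq_hz, macs_per_pe=9, ops_per_mac=2):
    return pes * freq_hz * macs_per_pe * ops_per_mac

# halving either knob gives the same roof
assert compute_roof(256, 400e6) == compute_roof(128, 800e6)
```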
It is also possible to reverse the design order: start with a BOPS-based estimate of the number of PEs that can fit into a given area, and then calculate the ASIC frequency and memory bandwidth that would allow full utilization of the accelerator. This can be especially useful if the designer has a specific area or power goal.
To summarize this section, from the architecture point of view it is extremely important to be able to predict, at the early stages of the design, whether the proposed (micro)architecture is going to meet the project targets. At the project exploration stage, the system architect has plenty of alternatives to choose from to make the right trade-offs (or even negotiate a change of the product definition and requirements). Introducing such alternatives later may be painful or even nearly impossible.

Figure 13: Comparison of the predictive power of BOPS and "compute cost" [20]. BOPS: 5% error; "compute cost": 15%.

Comparison with prior metrics
In this section, we compare the BOPS [3] metric to another complexity metric, introduced by Mishra et al. [20]. A good complexity metric should have a number of properties. First, it should reflect the real cost of the design. Second, it should be possible to calculate it from micro-designs or prior design results, without needing to generate complete designs. Last, it should generalize well, providing meaningful predictions for a wide spectrum of possible design parameter values. We compare our choice of computational complexity assessment, BOPS, with the "compute cost" proposed by Mishra et al. [20]. To analyze the metrics, we use our real accelerator area results from Section 5 and error bands of a linear extrapolation of the measured values. To remind the reader, for a convolutional layer with n input features, m output features, and k × k kernels, BOPS = mnk²(b_a·b_w + b_a + b_w + log₂(nk²)), where b_a and b_w are the activation and weight bitwidths, while "compute cost" [20] scales the number of operations by the operand bitwidths alone, without an accumulator term. The error of predicting a new point with "compute cost" is 15% within two orders of magnitude, whereas with BOPS it is only 5%. As shown in Fig. 13, "compute cost" introduces a systematic error: each of the distinguishable groups of three points corresponding to a single value of the number of input and output features creates a separate prediction line. This may lead to higher errors in the case of extrapolation from a single value of the number of input and output features or over a wide range of the considered bitwidths.
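A minimal sketch of the BOPS computation for a single layer (the layer sizes are example values; the formula follows the BOPS definition of [3]):

```python
# BOPS of a convolutional layer, following the definition in [3]: the
# log2(n * k * k) term models the accumulator width needed to sum
# n*k*k products without overflow.
import math

def bops(n, m, k, b_a, b_w):
    return n * m * k * k * (b_a * b_w + b_a + b_w + math.log2(n * k * k))

# doubling the feature count more than quadruples BOPS, because the
# accumulator must grow by one bit; a metric without the log term
# cannot capture this dependence on n
print(bops(64, 64, 3, 4, 4))
print(bops(128, 128, 3, 4, 4))
```

This accumulator term is precisely what separates the per-feature-count prediction lines in Fig. 13.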

DISCUSSION AND CONCLUSIONS
CNN accelerators are commonly used in very different systems, ranging from IoT and other resource-constrained devices to datacenters and high-performance computers. Designing accelerators that meet tight constraints is still a challenging task, since the current EDA and design tools do not provide enough information to the architect. To make the right choices, architects need to understand, at the early stages of the design, the impact of their high-level decisions on the final product, and to be able to make a fair comparison between different design alternatives.
In this paper, we showed that one of the fundamental shortcomings of the current design methodologies and tools is the use of GFLOPS as a metric for estimating the complexity of existing hardware solutions. The first contribution of this paper is the definition of the HCM as a metric for hardware complexity. We demonstrated its application to the prediction of product characteristics such as power, performance, and area.
The second contribution of the paper is the introduction of the OPS-based roofline model as a supporting tool for the architect at the very early stages of the development. We showed that this model allows the comparison of different alternatives of the design and the determination of the optimality and feasibility of the solution.
Lastly, we provided several examples of realistic designs, using an actual implementation with standard design tools and a mainstream process technology. By applying the proposed metric, we could build a better system and indicate to the system architect that certain CNN architectures may better fit the constraints of a specific platform. In particular, our metric confirmed that CNN accelerators are more likely to be memory-bound rather than compute-bound [17,30].
Although this paper is mainly focused on ASIC-based architectures, the same methodology can be applied to many other systems, including FPGA-based implementations and other domain-specific systems that allow trading off accuracy and data representation against physical parameters such as power, performance, and area.