# Quantization-Based Optimization Algorithm for Hardware Implementation of Convolution Neural Networks


## Abstract


## 1. Introduction

- FPGAs are more energy-efficient than GPUs and CPUs.
- FPGAs have parallel computing resources with high performance.
- Reconfigurability in FPGAs provides significant flexibility to explore CNNs’ design options and alternatives.
- FPGAs provide high security [31].

- Designed and validated an algorithm to quantize the full CNN model. The novelty of the algorithm is that it combines quantization-aware training (QAT) and post-training quantization (PTQ) techniques, and it provides full model quantization (weights and activations) without increasing the number of neurons.
- Explored, designed, and verified multiple hardware designs of the quantized CNN; each design with different quantization bits to fit the FPGA capacity and resources.
- Performed compilation and synthesis of the hardware designs in the FPGA device, which is Altera Stratix IV.
- Analyzed and modeled the resources, timing, throughput, power, and energy results.
- Estimated the performance metrics for a sample DNN design using the derived models.

## 2. Related Work

#### 2.1. Parameter Quantization

#### 2.2. Quantize the Entire Model: Weights and Activations

#### 2.3. Discussion

- Parameter quantization efficiently reduces CNN complexity but misses the additional reductions available from activation quantization. Full-model quantization targets both computation and memory but requires careful attention to accuracy.
- Some quantization methods add extra neuron operations to mitigate the accuracy drop, which in turn increases model complexity [17].
- Implementation techniques for large NNs (e.g., tiling and pipelining) differ from those for small NNs [6].
- Surprisingly, most research on quantized designs fails to model key performance metrics such as power and energy.

- Proposing a full-model quantization algorithm that requires no extra neurons;
- Modeling the impact of quantization on performance metrics (e.g., energy, power, and area);
- Scaling the approach up to DNNs.

## 3. Research Methodology

#### 3.1. Research Objectives

#### 3.2. Selection of CNN

- The model must fit in the FPGA device used in this research (i.e., Altera® Stratix® IV), and hence should be a small (i.e., lightweight) model;
- The model should be known to the research community.

- The rectified linear unit (ReLU) activation function in C1 and C2;
- The sigmoid activation function in C3 and FC1;
- Softmax in the output layer.

#### 3.3. Algorithm Development and Simulations

- The primary program implements the LeNet model and can quantize the weights and/or activations of any layer with the desired bit width. The model is trained and tested with grayscale images of size $28\times 28$ for the 10 digits (i.e., 0, 1, …, 9). The training and testing data are chosen from the MNIST dataset [59]. The training parameters and hyper-parameters are listed in Table 2. One important constraint for our algorithm, discussed in Section 4, is to maintain the original LeNet architecture, which means no layers or neurons may be added to the quantized CNN.
- Another program was written to generate the files containing quantized weights to be used in the hardware design. The weights must be expressed in binary or hexadecimal format.
- To debug and verify the hardware design, one more program was written to compare the model layer outputs with the hardware results and to determine the mismatching layer.
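For illustration, the weight-file generation step can be sketched as follows. The function name and the default Q4.4 format are ours, not the paper's actual tooling; the idea is only to show how a float weight becomes a two's-complement hexadecimal word for the HDL ROMs.

```python
def to_fixed_hex(w, n_bits=8, f_bits=4):
    """Quantize a float weight to a signed N-bit fixed-point word
    (two's complement, F fractional bits) and return it as a hex string
    suitable for a memory-initialization file."""
    raw = round(w * (1 << f_bits))                # scale by 2^F and round
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    raw = max(lo, min(hi, raw))                   # saturate to the N-bit range
    width = (n_bits + 3) // 4                     # hex digits needed
    return format(raw & ((1 << n_bits) - 1), f"0{width}x")
```

For example, `to_fixed_hex(0.5)` yields `"08"` and `to_fixed_hex(-0.5)` yields `"f8"` in Q4.4.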

#### 3.4. Hardware Design

The CNN is implemented in Verilog™ HDL. Our HDL design includes the CNN HDL implementation and the testbench to validate the results. The simulations are performed using the ModelSim simulator [60]. The HDL design incorporates the quantized parameters computed by our algorithm and is verified against the testbench simulations.

Next, the HDL is processed by the Quartus® Prime™ software tool. Table 3 provides the versions of the FPGA device, tools, and simulator. Technical details regarding the software tool are available in [62]. The HDL design then goes through a flow similar to the flows discussed in published research, such as [63,64]. Finally, the compiled design is analyzed using the following performance metrics, which are thoroughly discussed next: resource utilization, timing, throughput, power, and energy.

#### 3.4.1. Resource Utilization Performance Metric

#### 3.4.2. Timing and Throughput Performance Metrics

#### 3.4.3. Power and Energy Performance Metrics

The Power Analyzer™ tool in Quartus Prime [62] computes the power dissipation of our implementation. The power is computed based on the resources, routing information, and node activity. The node activity is extracted from the waveform files (i.e., value-change dump files), which are produced by ModelSim during design simulation. The Power Analyzer™ reports core dynamic power (in mW), which includes four components: DSP block, combinational, register, and clock. Finally, the energy consumed during the processing of an image is the product of the power consumption and the processing time per image.
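These two relations can be stated as a short calculation; the helper names and numeric values below are ours, chosen only to illustrate the unit arithmetic (mW × ms = μJ):

```python
def core_dynamic_power_mw(dsp, combinational, register, clock):
    """Core dynamic power is the sum of its four reported components (mW)."""
    return dsp + combinational + register + clock

def energy_per_image_uj(power_mw, time_per_image_ms):
    """Energy per image = core dynamic power x per-image processing time.
    Milliwatts times milliseconds yields microjoules."""
    return power_mw * time_per_image_ms
```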

#### 3.5. Performance Evaluation

## 4. The Proposed Algorithm

- Phase-I: Quantization training, which quantizes weights through dedicated training procedures.
- Phase-II: Post-training quantization, which further refines the model by optimally quantizing activations. No retraining is performed after the quantization because our experiments showed that it offered few advantages and could even have drawbacks.

#### 4.1. Phase-I: Quantizing Weights

- Model is initially trained with a single precision floating point.
- Set N = ${N}_{W\_MAX}$.
- Quantize weights (and biases) to N-bit values, where $I=F=\frac{N}{2}$.
- Train the model:
- Perform forward pass with N-bit quantization on weights and biases.
- Perform back-propagation with floating points for the entire model.

- Save the quantized weights (and biases) as Model_WQ_{N}.
- Decrement N by ${\delta}_{W}$.
- Check if iterations are completed:
- If $N\ge {N}_{W\_MIN}$: Go to step 3.
- Otherwise: Terminate Phase-I.
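The Phase-I sweep can be sketched as follows. This is a minimal illustration of the loop structure only: the QAT training step (quantized forward pass, floating-point back-propagation) is omitted, and the function and variable names are ours.

```python
def quantize_value(x, i_bits, f_bits):
    """Round x to fixed point with I integer and F fractional bits,
    saturating at the signed representable range."""
    step = 2.0 ** -f_bits
    lo = -(2.0 ** (i_bits - 1))
    hi = 2.0 ** (i_bits - 1) - step
    return max(lo, min(hi, round(x / step) * step))

def phase1_weight_sweep(weights, n_max=16, n_min=8, delta_w=2):
    """Phase-I loop: for N = N_W_MAX down to N_W_MIN (step delta_W),
    quantize the weights to N bits with I = F = N/2 and save Model_WQ_N."""
    models = {}
    n = n_max
    while n >= n_min:
        half = n // 2                      # I = F = N/2
        models[n] = [quantize_value(w, half, half) for w in weights]
        n -= delta_w                       # decrement N by delta_W
    return models
```

With the defaults above, the sweep produces Model_WQ_16 through Model_WQ_8 in steps of 2 bits.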

#### 4.2. Phase-II: Quantizing Activations

- Set $N={N}_{A\_MAX}$.
- Load Model_WQ_{N}, which is generated by Phase-I.
- For weights that have a small integer part (i.e., integer part < ${2}^{I-1}$), assign more bits to the fractional part to achieve better accuracy.
- Run regression images through the model.
- For each layer_{i}, $i\in \{1,\dots ,7\}$, perform funnel bit assignment:
- (a) Compute the maximum and minimum values of the activation outputs.
- (b) Determine the number of bits I required to store the integer part. This is done by computing the number of bits to store the maximum and minimum activation values, ${I}_{1}$ and ${I}_{2}$:
- ${I}_{1}={\log}_{2}\left(\mathrm{Max\ Activation\ Value}\right)$;
- ${I}_{2}={\log}_{2}\left(\mathrm{abs}\left(\mathrm{Min\ Activation\ Value}\right)\right)$;
- Set I to the larger of ${I}_{1}$ and ${I}_{2}$.
- (c) Perform bit assignment:
- If $I<N$: assign I bits to the integer part and F bits to the fraction part, where $F=N-I$.
- If $I\ge N$: assign all N bits to the upper bits of the integer part and no bits to the fraction part. Effectively, $I=N$ and $F=0$.
- (d) Run regression images and record the accuracy. When computing an activation output:
- If the activation is above $MAX\_VALU{E}_{I.F}$, saturate the output to $MAX\_VALU{E}_{I.F}$.
- If the activation is below $MIN\_VALU{E}_{I.F}$, saturate the output to $MIN\_VALU{E}_{I.F}$.
- (e) Decrement I by one and repeat steps (c)–(e) to find the optimum assignment.
- Save the model as Model_Q_{N}.
- Decrement N by ${\delta}_{A}$.
- Check if iterations are completed:
- If $N\ge {N}_{A\_MIN}$: Go to step 2.
- Otherwise: Terminate Phase-II.
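Steps (b)–(c) of the funnel bit assignment can be sketched as follows. This is our own illustrative rendering: we use a ceiling on the logarithm so the integer part always fits, whereas the paper states the bound simply as the log2 of the extreme activation values.

```python
import math

def funnel_bit_assignment(max_act, min_act, n):
    """Derive I from the activation range, then F = N - I.
    If I >= N, all N bits go to the integer part (I = N, F = 0)."""
    i1 = math.ceil(math.log2(max_act)) if max_act > 1 else 0
    i2 = math.ceil(math.log2(abs(min_act))) if abs(min_act) > 1 else 0
    i = max(i1, i2)                 # I covers both extremes
    if i >= n:
        return n, 0                 # (I, F) with no fraction bits
    return i, n - i                 # (I, F)
```

For example, an activation range of roughly [-3.1, 6.2] at N = 8 gives I = 3 and F = 5.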

## 5. Hardware Design

- Layer units realize the model layers corresponding to Figure 2 and Table 1. The units perform the computations of the LeNet model. They include:
- CONV 1, CONV 2, and CONV 3, which implement the convolutional layers C1, C2, and C3, respectively.
- POOL 1 and POOL 2, which implement the S1 and S2 layers, respectively.
- FC 1 and FC 2, which implement the fully connected layers.

- Memory units implement the data memory and the interfacing logic. The interface logic enables continuous data access by using two data memories, as explained below. The units include:
- Memory interface (Mem I/F), a dedicated unit that facilitates communication between the CNN model and the memory system.
- Two data memories, ${\mathrm{Mem}}_{A}$ and ${\mathrm{Mem}}_{B}$, which store the model’s input data and activation values.
- ROMs, which store the weights and biases used by the model.

#### 5.1. Design Configurability

#### 5.2. Layer Units

- If the addition result > $MAX\_VALU{E}_{I.F}$, then the output is set to $MAX\_VALU{E}_{I.F}$.
- If the addition result < $MIN\_VALU{E}_{I.F}$, then the output is set to $MIN\_VALU{E}_{I.F}$.
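The saturation behavior above can be modeled in software; the sketch below (our own, operating on raw two's-complement fixed-point words rather than HDL signals) shows the clamping rule an adder output follows:

```python
def sat_add(a, b, n_bits):
    """N-bit two's-complement addition with saturation: results above
    MAX_VALUE_I.F clamp to the maximum representable word, results
    below MIN_VALUE_I.F clamp to the minimum."""
    hi = (1 << (n_bits - 1)) - 1    # MAX_VALUE_I.F as a raw word
    lo = -(1 << (n_bits - 1))       # MIN_VALUE_I.F as a raw word
    return max(lo, min(hi, a + b))
```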

#### 5.3. Memory Units

#### 5.4. Execution Flow

## 6. Implementation Results

#### 6.1. Resource Utilization

- The second and third columns list the LUTs and register utilization.
- The fourth through seventh columns list the 9-bit, 12-bit, 18-bit, and 36-bit multipliers used in the design.
- The eighth column computes the total LUTs, with multipliers estimated as discussed in Section 3.
- The ninth column normalizes the results with respect to the 16-bit design.
- The last column lists the logic utilization, calculated as the ratio of the used resources to the total available resources. The trend of logic utilization versus N in this column clearly shows that a design wider than 16 bits cannot fit in the FPGA.

#### 6.2. Timing Analysis

- $Tim{e}_{ProcessOneImage}$ is the time required to process one image.
- $Cycle{s}_{ProcessOneImage}$ represents the total number of clock cycles needed to process a single image.
- $CycleTime$ refers to the duration of one clock cycle.
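The relation among these quantities is $Time_{ProcessOneImage} = Cycles_{ProcessOneImage} \times CycleTime$, with throughput as its reciprocal. A small illustration (the numbers are ours, not the paper's measured results):

```python
def timing_metrics(cycles_per_image, clock_mhz):
    """Per-image time and throughput from cycle count and clock frequency."""
    cycle_time_us = 1.0 / clock_mhz               # CycleTime in microseconds
    time_us = cycles_per_image * cycle_time_us    # Time_ProcessOneImage
    throughput_ips = 1e6 / time_us                # images per second
    return time_us, throughput_ips
```

For example, 100,000 cycles at 100 MHz gives 1 ms per image, i.e., 1000 images per second.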

#### 6.3. Power and Energy Consumption

- The combined power consumption of the clock and registers, known as sequential power, accounts for only 2% of the total power, playing a minor role in overall power consumption. This is primarily because these sequential circuits occupy a smaller area compared to the more power-hungry data path logic.
- The power consumption of the DSP blocks is around 5% of the total power. This indicates that the datapath resources like multipliers and adders are well-designed for low power.
- The combinational cells are the primary source of power consumption, accounting for approximately 93% of the total power. This is primarily due to the presence of random logic, high routing overhead, and large data selection components like multiplexers and demultiplexers.
- The final column of Table 7 showcases the superior power saving of the 8-bit design, which consumes around 41% less power than the 16-bit design, whereas the 12-bit design offers only a modest 2% saving over the 16-bit option. This unexpectedly low saving in the 12-bit design prompted further investigation. We believe the root cause might be the combined nature of combinational-cell power: it includes both block power and routing power. Routing power is the power consumed by the metal wires and routing resources that connect the logic blocks; it increases with wire length and the complexity of the routing paths. Interestingly, the 12-bit design displayed minimal change, even a slight increase, in routing power compared to the 16-bit design. We suspect this anomaly might indicate an issue with the software tool’s routing algorithm, which potentially favors byte-aligned sizes (8-bit and 16-bit) and hinders efficiency for non-aligned designs like the 12-bit one. While other factors could contribute to the 12-bit power consumption, Table 7 clearly demonstrates the scalability of power consumption, with a 41% reduction achieved by halving the bit width from 16 to 8. This saving comprises a 41% reduction in combinational-cell power and a 35% reduction in DSP power.

## 7. Discussion

#### 7.1. Modeling Performance Metrics

#### 7.2. Applying the Proposed Algorithm to DNN

- $M=3$. This number is chosen as a rough estimate and can easily be changed; it means that the design is partitioned into three partitions, each fitting in an FPGA device, where Partition_{i} is quantized with ${N}_{i}$ bits, as captured in Figure 7.
- The complexity of each DNN layer is the same as in the LeNet model; otherwise, the models should be scaled up.
- The design is fitted on the same FPGA device utilized in this research. If another device is used, then the models must be updated.
- The DNN model is quantized using the same N-bit values as before (i.e., each ${N}_{i}$ is chosen in the range from 16 bits down to 8 bits).

The quantization proceeds from Partition_{1} to Partition_{M}. For a Partition_{i}, we perform the following steps:

- Run Phase-I of the algorithm on Partition_{i} to Partition_{M};
- Run Phase-II of the algorithm to quantize Partition_{i} and select ${N}_{i}$;
- For the rest of the steps, Partition_{i} is quantized at ${N}_{i}$.

- Quantize Partition_{1}:
- Run Phase-I of the algorithm on the entire DNN (i.e., Partition_{1}, Partition_{2}, and Partition_{3});
- Run Phase-II of the algorithm to quantize Partition_{1} and select ${N}_{1}$;
- For the rest of the steps, Partition_{1} is quantized at ${N}_{1}$.

- Quantize Partition_{2}:
- Run Phase-I of the algorithm on Partition_{2} and Partition_{3};
- Run Phase-II of the algorithm to quantize Partition_{2} and select ${N}_{2}$, where ${N}_{2}\le {N}_{1}$;
- For the rest of the steps, Partition_{2} is quantized at ${N}_{2}$.

- Quantize Partition_{3}:
- Run Phase-I of the algorithm on Partition_{3};
- Run Phase-II of the algorithm to quantize Partition_{3} and select ${N}_{3}$, where ${N}_{3}\le {N}_{2}$.
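The partition-by-partition flow above reduces to a monotone bit-width assignment, sketched below. Here `phase2_picks[i]` is a stand-in for the bit width Phase-II would select for partition i in isolation; the constraint ${N}_{i}\le {N}_{i-1}$ is enforced explicitly.

```python
def assign_partition_bits(phase2_picks, n_min=8):
    """Quantize Partition_1..M in order, enforcing N_i <= N_(i-1)
    and N_i >= N_A_MIN. Returns the chosen bit width per partition."""
    chosen, n_prev = [], None
    for n in phase2_picks:
        n_i = max(n_min, n if n_prev is None else min(n, n_prev))
        chosen.append(n_i)
        n_prev = n_i                 # later partitions cannot exceed this
    return chosen
```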

## 8. Conclusions and Future Works

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

CNN | Convolutional Neural Network
---|---
NN | Neural Network
DNN | Deep Neural Network
RCD | Resource-Constrained Devices
FC | Fully Connected
SVD | Singular Value Decomposition
FPGA | Field-Programmable Gate Array
QAT | Quantization-Aware Training
PTQ | Post-Training Quantization
BNN | Binarized Neural Network
Q$n.m$ | Fixed-point representation with n bits for the integer part and m bits for the fractional part
NN parameters | Weights/synapses and biases
Feature map | Output of a convolutional layer
FSM | Finite State Machine
STE | Straight-Through Estimator
MNIST | Modified National Institute of Standards and Technology
TIMIT | Texas Instruments/Massachusetts Institute of Technology

## References

- Zhang, W. A Survey of Field Programmable Gate Array-Based Convolutional Neural Network Accelerators. Int. J. Electron. Commun. Eng. **2020**, 14, 419–427.
- Mittal, S. A survey of FPGA-based accelerators for convolutional neural networks. Neural Comput. Appl. **2020**, 32, 1109–1139.
- Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A survey of quantization methods for efficient neural network inference. arXiv **2021**, arXiv:2103.13630.
- Kim, Y.D.; Park, E.; Yoo, S.; Choi, T.; Yang, L.; Shin, D. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv **2015**, arXiv:1511.06530.
- Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 26–35.
- Chen, T.; Du, Z.; Sun, N.; Wang, J.; Wu, C.; Chen, Y.; Temam, O. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Comput. Archit. News **2014**, 42, 269–284.
- Mohd, B.J.; Hayajneh, T.; Vasilakos, A.V. A survey on lightweight block ciphers for low-resource devices: Comparative study and open issues. J. Netw. Comput. Appl. **2015**, 58, 73–93.
- Zhuang, B.; Shen, C.; Tan, M.; Liu, L.; Reid, I. Towards effective low-bitwidth convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7920–7928.
- Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained ternary quantization. arXiv **2016**, arXiv:1612.01064.
- Miyashita, D.; Lee, E.H.; Murmann, B. Convolutional neural networks using logarithmic data representation. arXiv **2016**, arXiv:1603.01025.
- Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits **2016**, 52, 127–138.
- Ma, Y.; Suda, N.; Cao, Y.; Seo, J.s.; Vrudhula, S. Scalable and modularized RTL compilation of convolutional neural networks onto FPGA. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 29 August–2 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–8.
- Suda, N.; Chandra, V.; Dasika, G.; Mohanty, A.; Ma, Y.; Vrudhula, S.; Seo, J.s.; Cao, Y. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 16–25.
- Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv **2016**, arXiv:1606.06160.
- Novac, P.E.; Boukli Hacene, G.; Pegatoquet, A.; Miramond, B.; Gripon, V. Quantization and deployment of deep neural networks on microcontrollers. Sensors **2021**, 21, 2984.
- Guo, K.; Zeng, S.; Yu, J.; Wang, Y.; Yang, H. [DL] A survey of FPGA-based neural network inference accelerators. ACM Trans. Reconfig. Technol. Syst. (TRETS) **2019**, 12, 1–26.
- Abdelouahab, K.; Bourrasset, C.; Pelcat, M.; Berry, F.; Quinton, J.C.; Sérot, J. A Holistic Approach for Optimizing DSP Block Utilization of a CNN implementation on FPGA. In Proceedings of the 10th International Conference on Distributed Smart Camera, Paris, France, 12–15 September 2016; pp. 69–75.
- Zhang, X.; Zou, J.; He, K.; Sun, J. Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell. **2015**, 38, 1943–1955.
- Novikov, A.; Podoprikhin, D.; Osokin, A.; Vetrov, D.P. Tensorizing neural networks. Adv. Neural Inf. Process. Syst. **2015**, 28.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv **2016**, arXiv:1602.07360.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv **2017**, arXiv:1704.04861.
- Huang, P.; Wu, H.; Yang, Y.; Daukantas, I.; Wu, M.; Zhang, Y.; Barrett, C. Towards Efficient Verification of Quantized Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Stanford, CA, USA, 25–27 March 2024; Volume 38, pp. 21152–21160.
- Cheng, L.; Gu, Y.; Liu, Q.; Yang, L.; Liu, C.; Wang, Y. Advancements in Accelerating Deep Neural Network Inference on AIoT Devices: A Survey. IEEE Trans. Sustain. Comput. **2024**, 1–18.
- Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv **2015**, arXiv:1510.00149.
- Liu, B.; Wang, M.; Foroosh, H.; Tappen, M.; Pensky, M. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 806–814.
- Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 5–10 December 2016; pp. 2082–2090.
- Chang, Z.; Liu, S.; Xiong, X.; Cai, Z.; Tu, G. A survey of recent advances in edge-computing-powered artificial intelligence of things. IEEE Internet Things J. **2021**, 8, 13849–13875.
- Weng, O. Neural Network Quantization for Efficient Inference: A Survey. arXiv **2021**, arXiv:2112.06126.
- Bengio, Y.; Léonard, N.; Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv **2013**, arXiv:1308.3432.
- Ducasse, Q.; Cotret, P.; Lagadec, L.; Stewart, R. Benchmarking quantized neural networks on FPGAs with FINN. arXiv **2021**, arXiv:2102.01341.
- Umuroglu, Y.; Fraser, N.J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; Vissers, K. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 65–74.
- Fahim, F.; Hawks, B.; Herwig, C.; Hirschauer, J.; Jindariani, S.; Tran, N.; Carloni, L.P.; Di Guglielmo, G.; Harris, P.; Krupa, J.; et al. hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices. arXiv **2021**, arXiv:2103.05579.
- Pappalardo, A.; Franco, G.; Nickfraser. Xilinx/Brevitas: CNV Test Reference Vectors r0. Available online: https://zenodo.org/records/3824904 (accessed on 16 April 2024).
- Fan, H.; Ferianc, M.; Que, Z.; Li, H.; Liu, S.; Niu, X.; Luk, W. Algorithm and hardware co-design for reconfigurable cnn accelerator. In Proceedings of the 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC), Taipei, Taiwan, 17–20 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 250–255.
- Wang, Z.; Xu, K.; Wu, S.; Liu, L.; Liu, L.; Wang, D. Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOv2. IEEE Access **2020**, 8, 116569–116585.
- Haris, J.; Gibson, P.; Cano, J.; Agostini, N.B.; Kaeli, D. Secda: Efficient hardware/software co-design of fpga-based dnn accelerators for edge inference. In Proceedings of the 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Belo Horizonte, Brazil, 26–28 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 33–43.
- Li, F.; Zhang, B.; Liu, B. Ternary weight networks. arXiv **2016**, arXiv:1605.04711.
- Zhou, A.; Yao, A.; Guo, Y.; Xu, L.; Chen, Y. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv **2017**, arXiv:1702.03044.
- Holt, J.L.; Baker, T.E. Back propagation simulations using limited precision calculations. In Proceedings of the IJCNN-91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 8–12 July 1991; IEEE: Piscataway, NJ, USA, 1991; Volume 2, pp. 121–126.
- Courbariaux, M.; Bengio, Y.; David, J.P. Binaryconnect: Training deep neural networks with binary weights during propagations. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 8–14 December 2015; pp. 3123–3131.
- Park, J.; Sung, W. FPGA Based Implementation of Deep Neural Networks Using On-chip Memory Only. arXiv **2016**, arXiv:1602.01616.
- Larkin, D.; Kinane, A.; O’Connor, N. Towards hardware acceleration of neuroevolution for multimedia processing applications on mobile devices. In Proceedings of the International Conference on Neural Information Processing, Hong Kong, China, 3–6 October 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1178–1188.
- Farabet, C.; LeCun, Y.; Kavukcuoglu, K.; Culurciello, E.; Martini, B.; Akselrod, P.; Talay, S. Large-scale FPGA-based convolutional networks. Scaling Mach. Learn. Parallel Distrib. Approaches **2011**, 13, 399–419.
- Chen, Y.; Luo, T.; Liu, S.; Zhang, S.; He, L.; Wang, J.; Li, L.; Chen, T.; Xu, Z.; Sun, N.; et al. Dadiannao: A machine-learning supercomputer. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, 13–17 December 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 609–622.
- Courbariaux, M.; Bengio, Y.; David, J.P. Training deep neural networks with low precision multiplications. arXiv **2014**, arXiv:1412.7024.
- Vanhoucke, V.; Senior, A.; Mao, M.Z. Improving the speed of neural networks on CPUs. In Proceedings of the Deep Learning and Unsupervised Feature Learning Workshop (NIPS 2011), Granada, Spain, 12–15 December 2011; pp. 1–8.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2324.
- Mohd, B.J.; Abed, S.; Hayajneh, T.; Alshayeji, M.H. Run-time monitoring and validation using reverse function (RMVRF) for hardware trojans detection. IEEE Trans. Dependable Secur. Comput. **2019**, 18, 2689–2704.
- Zhang, X.; Heys, H.M.; Li, C. FPGA implementation and energy cost analysis of two light-weight involutional block ciphers targeted to wireless sensor networks. Mob. Netw. Appl. **2013**, 18, 222–234.
- Mohd, B.J.; Hayajneh, T.; Yousef, K.M.A.; Khalaf, Z.A.; Bhuiyan, M.Z.A. Hardware design and modeling of lightweight block ciphers for secure communications. Future Gener. Comput. Syst. **2018**, 83, 510–521.
- Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data **2021**, 8, 1–74.
- Ghaffari, S.; Sharifian, S. FPGA-based convolutional neural network accelerator design using high level synthesize. In Proceedings of the 2016 2nd International Conference of Signal Processing and Intelligent Systems (ICSPIS), Tehran, Iran, 14–15 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6.
- Rongshi, D.; Yongming, T. Accelerator implementation of lenet-5 convolution neural network based on fpga with hls. In Proceedings of the 2019 3rd International Conference on Circuits, System and Simulation (ICCSS), Nanjing, China, 13–15 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 64–67.
- Paul, D.; Singh, J.; Mathew, J. Hardware-software co-design approach for deep learning inference. In Proceedings of the 2019 7th International Conference on Smart Computing & Communications (ICSCC), Sarawak, Malaysia, 28–30 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5.
- Deng, L. The MNIST database of handwritten digit images for machine learning research [Best of the Web]. IEEE Signal Process. Mag. **2012**, 29, 141–142.
- Intel. ModelSim-Intel FPGA Edition Software. Available online: https://www.intel.com/content/www/us/en/software-kit/750368/modelsim-intel-fpgas-standard-edition-software-version-18-1.html (accessed on 24 January 2024).
- Intel. Stratix IV FPGAs Support. Available online: https://www.intel.com/content/www/us/en/support/programmable/support-resources/devices/stratix-iv-support.html (accessed on 20 January 2024).
- Intel. Quartus Prime Software. Available online: https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/download.html (accessed on 24 January 2024).
- Mohd, B.J.; Hayajneh, T.; Khalaf, Z.A.; Vasilakos, A.V. A comparative study of steganography designs based on multiple FPGA platforms. Int. J. Electron. Secur. Digit. Forensics **2016**, 8, 164–190.
- Mohd, B.J.; Hayajneh, T.; Khalaf, Z.A. Optimization and modeling of FPGA implementation of the Katan Cipher. In Proceedings of the 2015 6th International Conference on Information and Communication Systems (ICICS), Amman, Jordan, 7–9 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 68–72.
- Ullah, S.; Rehman, S.; Shafique, M.; Kumar, A. High-performance accurate and approximate multipliers for FPGA-based hardware accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2021**, 41, 211–224.
- Weste, N.H.; Harris, D. CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed.; Pearson Education: London, UK, 2022.
- Henriksson, M.; Gustafsson, O. Streaming Matrix Transposition on FPGAs Using Distributed Memories. In Proceedings of the 2023 IEEE Nordic Circuits and Systems Conference (NorCAS), Aalborg, Denmark, 31 October–1 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
- Chu, T.; Luo, Q.; Yang, J.; Huang, X. Mixed-precision quantized neural networks with progressively decreasing bitwidth. Pattern Recognit. **2021**, 111, 107647.

Layer | Description | Feature Map Size ($W\times L\times D$) | Filter Size | Stride | Parameters (Weights and Biases)
---|---|---|---|---|---
Input | input image | $32\times 32\times 1$ | - | - | -
C1 | convolutional | $28\times 28\times 6$ | $5\times 5$ | 1 | 156
S1 | pooling | $14\times 14\times 6$ | $2\times 2$ | 2 | -
C2 | convolutional | $10\times 10\times 16$ | $5\times 5$ | 1 | 2416
S2 | pooling | $5\times 5\times 16$ | $2\times 2$ | 2 | -
C3/FC0 | convolutional | $1\times 1\times 120$ | $5\times 5$ | 1 | 48,120
FC1 | fully connected | 84 | - | - | 10,164
FC2 (Output) | fully connected | 10 | - | - | 850

Name | Value
---|---
Dataset | MNIST [59]
Batch Size | 50
Epochs | 200
Optimizer | ADAM
Learning Rate | 0.05
Loss Function | Cross Entropy

Name | Value
---|---
FPGA Device | Altera Stratix IV EP4SGX230
FPGA Software Tool | Quartus Prime Standard Edition version 17.0
Verilog Simulator | ModelSim-Intel FPGA Standard Edition version 10.5b

Parameter | Description |
---|---|
N | Number of bits in the quantized number, which consists of an integer part and a fraction part. |
I | Number of bits in the integer part of the quantized number. |
F | Number of bits in the fraction part of the quantized number. |
$VALU{E}_{I.F}$ | The signed value of the N-bit fixed-point number, with I bits representing the integer part and F bits representing the fraction part. |
$MAX\_VALU{E}_{I.F}$ | The maximum signed value of an N-bit fixed-point number with I integer bits and F fraction bits. |
$MIN\_VALU{E}_{I.F}$ | The minimum signed value of an N-bit fixed-point number with I integer bits and F fraction bits. |
${N}_{W\_MAX}$ | The initial number of bits used to quantize weights in Phase-I of the algorithm. It is recommended that ${N}_{W\_MAX}=32$ bits, which is the integer size. |
${N}_{W\_MIN}$ | The final number of bits used to quantize weights in Phase-I of the algorithm. To preserve accuracy, we set ${N}_{W\_MIN}=8$. |
${\delta}_{W}$ | The number of bits by which the quantization is reduced in each iteration of Phase-I. The typical value is ${\delta}_{W}=2$, which decrements both I and F by one. |
Model_WQ_N | The CNN model with N-bit quantized weights. |
${N}_{A\_MAX}$ | The initial number of bits used to quantize activations in Phase-II of the algorithm. Typically, ${N}_{A\_MAX}\le {N}_{W\_MAX}$. |
${N}_{A\_MIN}$ | The final number of bits used to quantize activations in Phase-II of the algorithm. To preserve accuracy, ${N}_{A\_MIN}=8$. |
${\delta}_{A}$ | The number of bits by which the quantization is reduced in each iteration of Phase-II. The typical value is ${\delta}_{A}=4$. |
Model_Q_N | The CNN model with N-bit quantized weights and activations. |
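To make these definitions concrete, the sketch below is an illustration under stated assumptions, not the authors' implementation: it computes the signed Q(I.F) range assuming the sign bit is counted within the I integer bits, quantizes a value by round-and-saturate, and runs a Phase-I-style sweep from ${N}_{W\_MAX}$ down to ${N}_{W\_MIN}$ in steps of ${\delta}_{W}$. The `evaluate` callback stands in for the retraining/validation step of the algorithm, and the even I/F split is a simplification.

```python
def q_range(I, F):
    """Signed Q(I.F) fixed point, N = I + F bits; sign bit assumed inside I."""
    step = 2.0 ** -F
    return -(2.0 ** (I - 1)), 2.0 ** (I - 1) - step

def quantize(x, I, F):
    """Round x to the nearest representable Q(I.F) value, saturating at the range."""
    lo, hi = q_range(I, F)
    step = 2.0 ** -F
    return min(max(round(x / step) * step, lo), hi)

def phase_one_sweep(weights, evaluate, N_W_MAX=32, N_W_MIN=8, delta_W=2,
                    target=0.98):
    """Phase-I-style sweep: shrink the weight bit-width while accuracy holds.

    `evaluate` is a placeholder for validating (or retraining) the model with
    the quantized weights; it returns an accuracy in [0, 1].
    """
    best_N = N_W_MAX
    for N in range(N_W_MAX, N_W_MIN - 1, -delta_W):
        I, F = N // 2, N - N // 2          # illustrative even split
        q_weights = [quantize(w, I, F) for w in weights]
        if evaluate(q_weights) >= target:
            best_N = N                     # this bit-width is still acceptable
        else:
            break                          # accuracy dropped; stop shrinking
    return best_N
```

For example, with I = F = 4 the representable range is [-8, 7.9375] and 0.3 quantizes to 0.3125; Phase-II would apply the same loop to activations with step ${\delta}_{A}$.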

Design | LUTs | Registers | 9-bit Mult. | 12-bit Mult. | 18-bit Mult. | 36-bit Mult. | Total LUT | Norm. Resources | Logic Util. |
---|---|---|---|---|---|---|---|---|---|
16-bit | 76,217 | 821 | 0 | 0 | 20 | 5 | 93,778 | 1 | 82% |
12-bit | 60,683 | 741 | 0 | 20 | 0 | 5 | 73,544 | 0.78 | 65% |
8-bit | 45,866 | 659 | 20 | 0 | 5 | 0 | 50,840 | 0.54 | 49% |

Design | Fmax (MHz) | Norm. Fmax |
---|---|---|
16-bit | 43.33 | 1.000 |
12-bit | 44.06 | 1.017 |
8-bit | 44.11 | 1.018 |

Design | Comb. Cells (mW) | Registers (mW) | Clock (mW) | DSP (mW) | Total Power (mW) | Norm. Total Power |
---|---|---|---|---|---|---|
16-bit | 857.77 | 3.79 | 9.94 | 52.84 | 924.34 | 1.0 |
12-bit | 855.56 | 2.26 | 9.04 | 37.97 | 904.83 | 0.98 |
8-bit | 501.26 | 3.23 | 10.32 | 34.61 | 549.42 | 0.59 |

Design | Total Energy (mJ) | Norm. Energy |
---|---|---|
16-bit | 2.82 | 1.0 |
12-bit | 2.72 | 0.96 |
8-bit | 1.65 | 0.58 |
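The power and energy tables can be cross-checked against each other (our own arithmetic, with mW and mJ assumed as the units): since energy is power multiplied by execution time for a fixed workload, the implied run time t = E/P should be nearly identical across the three designs, which is consistent with their nearly identical Fmax.

```python
# Implied execution time per design, t = E / P (units assumed: mJ, mW).
designs = {
    "16-bit": (924.34, 2.82),
    "12-bit": (904.83, 2.72),
    "8-bit":  (549.42, 1.65),
}
for name, (p_mw, e_mj) in designs.items():
    t_ms = e_mj / p_mw * 1000.0  # mJ / mW = s; convert to ms
    print(f"{name}: ~{t_ms:.2f} ms")
```

All three come out at roughly 3 ms, so the energy savings of the 8-bit design track its power savings rather than a change in run time.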

Total LUT | Ave. Logic Util. % | Power (W) | Energy (mJ) |
---|---|---|---|
218,154 | 65.4% | 2.38 | 7.18 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mohd, B.J.; Ahmad Yousef, K.M.; AlMajali, A.; Hayajneh, T.
Quantization-Based Optimization Algorithm for Hardware Implementation of Convolution Neural Networks. *Electronics* **2024**, *13*, 1727.
https://doi.org/10.3390/electronics13091727
