# Optimal Architecture of Floating-Point Arithmetic for Neural Network Training Processors


## Abstract

…^{2} and energy consumption of 4.445 µJ per training of one image. Compared with the 32-bit architecture, the size and the energy are reduced by 4.7 and 3.91 times, respectively. Therefore, a CNN structure using floating-point numbers with an optimized data path can contribute significantly to the AIoT field, which requires a small area, low energy, and high accuracy.

## 1. Introduction

- Designing optimized floating-point operators, i.e., adder, multiplier, and divider, in different precisions.
- Proposing two custom floating-point formats for evaluation purposes.
- Designing an inference engine and a training engine of the CNN to measure the effect of precision on the energy, area, and accuracy of the CNN accelerator.
- Designing a mixed-precision accelerator in which the convolutional block is implemented in higher precision to obtain better results.

## 2. Architecture of the CNN Training Accelerator

#### 2.1. CNN Architecture

#### 2.1.1. SoftMax Module

Here, $A_1$ to $A_N$ are the representations of the output layer's values $Z_1$ to $Z_N$ as probabilities between 0 and 1.
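The mapping from the output values to probabilities can be illustrated with a minimal Python sketch of the standard SoftMax definition, $A_i = e^{Z_i} / \sum_j e^{Z_j}$. The function name `softmax` and the subtraction of the maximum value for numerical stability are assumptions made for this illustration; they are not taken from the hardware module described in this paper.

```python
import math


def softmax(z):
    """Map output-layer values Z_1..Z_N to probabilities A_1..A_N."""
    # Subtracting the maximum does not change the result, but it keeps
    # exp() from overflowing (an assumption of this sketch, not the paper).
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    # The final step is a division, which is why a SoftMax unit needs a
    # floating-point divider in hardware.
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
```

Each entry of `probabilities` lies between 0 and 1, the entries sum to 1, and the largest input receives the largest probability.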

#### 2.1.2. Gradient Descent Generator Module

## 3. Architecture of Floating-Point Arithmetic for CNN Training

#### 3.1. General Floating-Point Number and Arithmetic

- (1) Sign (positive/negative).
- (2) Precision (significant digits of the real number, i.e., the mantissa).
- (3) Number of digits (index range, i.e., the exponent).

- Before performing the actual computation, the original floating-point numbers A and B are partitioned into {sign A, exponent A, mantissa A} and {sign B, exponent B, mantissa B}.
- For each separated element, a calculation suitable for the operation is performed:
  - Sign: In addition/subtraction, the output sign is determined by comparing the mantissa and exponent of both inputs. A NOT gate and a multiplexer are placed at the sign of input B to invert it, so that the same module can be used for subtraction. For multiplication/division, the output sign is calculated by an XOR operation on the two input signs.
  - Exponent: If the exponent values differ, the larger exponent of the two inputs is selected. For the input with the smaller exponent, the mantissa bits are shifted to the right to align the two numbers to the same binary point. The difference between the two exponents determines the number of right shifts to perform.
  - Mantissa: The mantissa value is calculated through an unsigned operation. The result of the mantissa addition/subtraction can be 1 bit wider than the mantissa of either input. Therefore, to get precise results, we doubled the mantissa bit width of both inputs before performing the addition/subtraction. The MSB of the mantissa result then determines whether normalization is needed: if the MSB is 0, a normalizer is not required; if the MSB is 1, the normalizer adjusts the previously calculated exponent and mantissa bits to obtain the final merged result.
- Finally, the calculated elements are combined in the floating-point former block to produce the resulting floating-point output.
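The steps above can be sketched in software as follows. This is a minimal Python model, not the RTL of the proposed adder: it assumes the (1, 8, 7) brain floating-point layout, handles only normal, non-zero inputs, truncates instead of rounding, and ignores exponent overflow and underflow. The helper names (`unpack`, `pack`, `fp_add`, `to_bf16`, `from_bf16`) and the choice of `man_bits + 1` guard bits as the "doubled" mantissa width are assumptions made for this illustration.

```python
import struct


def unpack(bits, exp_bits, man_bits):
    """Split a raw bit pattern into (sign, exponent, mantissa) fields."""
    sign = (bits >> (exp_bits + man_bits)) & 1
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    return sign, exponent, mantissa


def pack(sign, exponent, mantissa, exp_bits, man_bits):
    """Floating-point former: merge the three fields into one word."""
    return (sign << (exp_bits + man_bits)) | (exponent << man_bits) | mantissa


def fp_add(a, b, exp_bits=8, man_bits=7, subtract=False):
    """Add (or subtract) two normal, non-zero numbers given as bit patterns."""
    sa, ea, ma = unpack(a, exp_bits, man_bits)
    sb, eb, mb = unpack(b, exp_bits, man_bits)
    if subtract:
        sb ^= 1  # NOT gate + multiplexer on the sign of input B
    # Restore the hidden leading 1, then widen both significands to twice
    # their width so that right-shifted bits are not lost immediately.
    guard = man_bits + 1
    ma = (ma | (1 << man_bits)) << guard
    mb = (mb | (1 << man_bits)) << guard
    # Keep the operand with the larger magnitude in A; it supplies the
    # output sign and the selected (larger) exponent.
    if (ea, ma) < (eb, mb):
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma
    mb >>= ea - eb  # align the smaller operand to the same binary point
    m = ma + mb if sa == sb else ma - mb  # unsigned mantissa operation
    if m == 0:
        return 0  # exact cancellation, returned as +0
    exponent = ea
    top = man_bits + guard  # bit position of the hidden 1
    while m >> (top + 1):  # carry out of the MSB: shift right
        m >>= 1
        exponent += 1
    while not (m >> top) & 1:  # leading zeros after subtraction: shift left
        m <<= 1
        exponent -= 1
    mantissa = (m >> guard) & ((1 << man_bits) - 1)  # truncate guard bits
    return pack(sa, exponent, mantissa, exp_bits, man_bits)


def to_bf16(x):
    """Upper 16 bits of the IEEE single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def from_bf16(bits):
    """Decode a 16-bit (1, 8, 7) pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]
```

For example, `from_bf16(fp_add(to_bf16(1.5), to_bf16(2.25)))` evaluates to 3.75, and passing `subtract=True` with the operands swapped gives 0.75. Both values are exactly representable with a 7-bit mantissa, so truncation does not affect them.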

#### 3.2. Variants of Floating-Point Number Formats

#### 3.3. Division Calculation Using Reciprocal

## 4. Proposed Architecture

#### 4.1. Division Calculation Using Signed Array

#### 4.2. Floating Point Multiplier

#### 4.3. Overall Architecture of the Proposed CNN Accelerator

#### 4.4. CNN Structure Optimization

…^{th} iteration is increased while the overall data width (DW) remains constant, so the mantissa bit width decreases accordingly. After fixing the exponent bit width, the algorithm calculates the performance metrics (accuracy and power) using the new floating-point data format. In our experiment, the new floating-point format before mantissa optimization was found to be (Sign, Exp, DW − Exp − 1) with a DW of 16 bits, Exp = 8, and Mantissa = 16 − 8 − 1 = 7 bits. Then, the algorithm optimizes each layer's precision format by increasing the mantissa by 1 bit at a time until the target accuracy is met using the current DW. When all layers are optimized for minimal power consumption while meeting the target accuracy, it stores the combination of optimal formats for all layers. It then increases the data width DW by 1 bit for all layers and repeats the above procedure to search for other optimal formats, which can offer a trade-off between accuracy and area/power consumption. The procedure is repeated until DW reaches the maximum data width (MAX DW), which is set to 32 bits in our experiment. Once the search is completed, the final step compares the accuracy and power of all search results and selects the combination of formats with minimum power that still maintains the target accuracy.
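The mantissa search described above can be sketched as a small Python loop. This is one possible reading of the procedure, not the authors' implementation: it covers only the mantissa phase with a fixed exponent width, and it assumes that a layer's total width may grow up to MAX DW. The names `search_formats`, `evaluate`, and `toy_evaluate` are hypothetical, and `toy_evaluate` returns made-up numbers in place of real training and power estimation.

```python
def search_formats(layers, evaluate, target_acc,
                   exp_bits=8, start_dw=16, max_dw=32):
    """Return (power, accuracy, dw, mantissa_bits_per_layer) with the
    lowest power among all results that meet target_acc, or None."""
    candidates = []
    for dw in range(start_dw, max_dw + 1):
        # Starting point for this data width: (Sign, Exp, DW - Exp - 1).
        mantissa = {name: dw - exp_bits - 1 for name in layers}
        acc, power = evaluate(mantissa, exp_bits)
        for name in layers:
            # Widen this layer's mantissa 1 bit at a time until the target
            # accuracy is met or the layer reaches the maximum width.
            while acc < target_acc and 1 + exp_bits + mantissa[name] < max_dw:
                mantissa[name] += 1
                acc, power = evaluate(mantissa, exp_bits)
        if acc >= target_acc:
            candidates.append((power, acc, dw, dict(mantissa)))
    # Final step: compare all stored results and keep the lowest power.
    return min(candidates, key=lambda c: c[0]) if candidates else None


def toy_evaluate(mantissa, exp_bits):
    """Made-up stand-in for training and power estimation (not measured data)."""
    acc = min(100, 80 + 2 * (mantissa["conv"] - 7) + (mantissa["fc"] - 7))
    power = sum(1 + exp_bits + bits for bits in mantissa.values())
    return acc, power


best = search_formats(["conv", "fc"], toy_evaluate, target_acc=86)
```

With this toy model, the search settles on a 16-bit base format in which only the `conv` layer is widened, which mirrors the mixed-precision idea of giving the convolutional block more mantissa bits than the remaining layers.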

## 5. Results and Analysis

#### 5.1. Comparison of Floating-Point Arithmetic Operators

#### 5.2. Evaluation of the Proposed CNN Training Accelerator

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Liu, Z.; Liu, Z.; Ren, E.; Luo, L.; Wei, Q.; Wu, X.; Li, X.; Qiao, F.; Liu, X.J. A 1.8mW Perception Chip with Near-Sensor Processing Scheme for Low-Power AIoT Applications. In Proceedings of the 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Miami, FL, USA, 15–17 July 2019; pp. 447–452.
2. Hassija, V.; Chamola, V.; Saxena, V.; Jain, D.; Goyal, P.; Sikdar, B. A Survey on IoT Security: Application Areas, Security Threats, and Solution Architectures. IEEE Access **2019**, 7, 82721–82743.
3. Dong, B.; Shi, Q.; Yang, Y.; Wen, F.; Zhang, Z.; Lee, C. Technology evolution from self-powered sensors to AIoT enabled smart homes. Nano Energy **2020**, 79, 105414.
4. Tan, F.; Wang, Y.; Li, L.; Wang, T.; Zhang, F.; Wang, X.; Gao, J.; Liu, Y. A ReRAM-Based Computing-in-Memory Convolutional-Macro With Customized 2T2R Bit-Cell for AIoT Chip IP Applications. IEEE Trans. Circuits Syst. II: Express Briefs **2020**, 67, 1534–1538.
5. Wang, Z.; Le, Y.; Liu, Y.; Zhou, P.; Tan, Z.; Fan, H.; Zhang, Y.; Ru, J.; Wang, Y.; Huang, R. 12.1 A 148nW General-Purpose Event-Driven Intelligent Wake-Up Chip for AIoT Devices Using Asynchronous Spike-Based Feature Extractor and Convolutional Neural Network. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 436–438.
6. Imteaj, A.; Thakker, U.; Wang, S.; Li, J.; Amini, M.H. A Survey on Federated Learning for Resource-Constrained IoT Devices. IEEE Internet Things J. **2021**, 9, 1–24.
7. Lane, N.D.; Bhattacharya, S.; Georgiev, P.; Forlivesi, C.; Jiao, L.; Qendro, L.; Kawsar, F. DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices. In Proceedings of the 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), Vienna, Austria, 11–14 April 2016; pp. 1–12.
8. Venkataramanaiah, S.K.; Ma, Y.; Yin, S.; Nurvitadhi, E.; Dasu, A.; Cao, Y.; Seo, J.-S. Automatic Compiler Based FPGA Accelerator for CNN Training. In Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 8–12 September 2019; pp. 166–172.
9. Lu, J.; Lin, J.; Wang, Z. A Reconfigurable DNN Training Accelerator on FPGA. In Proceedings of the 2020 IEEE Workshop on Signal Processing Systems (SiPS), Coimbra, Portugal, 20–22 October 2020; pp. 1–6.
10. Narayanan, D.; Harlap, A.; Phanishayee, A.; Seshadri, V.; Devanur, N.R.; Ganger, G.R.; Gibbons, P.B.; Zaharia, M. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, Huntsville, ON, Canada, 27–30 October 2019; pp. 1–15.
11. Jeremy, F.O.; Kalin, P.; Michael, M.; Todd, L.; Ming, L.; Danial, A.; Shlomi, H.; Michael, A.; Logan, G.; Mahdi, H.; et al. A Configurable Cloud-Scale DNN Processor for Real-Time AI. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, USA, 1–6 June 2018; pp. 1–14.
12. Asghar, M.S.; Arslan, S.; Kim, H. A Low-Power Spiking Neural Network Chip Based on a Compact LIF Neuron and Binary Exponential Charge Injector Synapse Circuits. Sensors **2021**, 21, 4462.
13. Diehl, P.U.; Cook, M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Front. Comput. Neurosci. **2015**, 9, 99.
14. Kim, S.; Choi, B.; Lim, M.; Yoon, J.; Lee, J.; Kim, H.D.; Choi, S.J. Pattern recognition using carbon nanotube synaptic transistors with an adjustable weight update protocol. ACS Nano **2017**, 11, 2814–2822.
15. Merrikh-Bayat, F.; Guo, X.; Klachko, M.; Prezioso, M.; Likharev, K.K.; Strukov, D.B. High-performance mixed-signal neurocomputing with nanoscale floating-gate memory cell arrays. IEEE Trans. Neural Netw. Learn. Syst. **2018**, 29, 4782–4790.
16. Woo, J.; Padovani, A.; Moon, K.; Kwak, M.; Larcher, L.; Hwang, H. Linking conductive filament properties and evolution to synaptic behavior of RRAM devices for neuromorphic applications. IEEE Electron. Device Lett. **2017**, 38, 1220–1223.
17. Sun, Q.; Zhang, H.; Li, Z.; Wang, C.; Du, K. ADAS Acceptability Improvement Based on Self-Learning of Individual Driving Characteristics: A Case Study of Lane Change Warning System. IEEE Access **2019**, 7, 81370–81381.
18. Park, D.; Kim, S.; An, Y.; Jung, J.-Y. LiReD: A Light-Weight Real-Time Fault Detection System for Edge Computing Using LSTM Recurrent Neural Networks. Sensors **2018**, 18, 2110.
19. Kumar, A.; Goyal, S.; Varma, M. Resource-efficient machine learning in 2 KB RAM for the Internet of Things. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1935–1944.
20. Truong, N.D.; Nguyen, A.D.; Kuhlmann, L.; Bonyadi, M.R.; Yang, J.; Ippolito, S.; Kavehei, O. Integer Convolutional Neural Network for Seizure Detection. IEEE J. Emerg. Sel. Top. Circuits Syst. **2018**, 8, 849–857.
21. Sim, J.; Lee, S.; Kim, L.-S. An Energy-Efficient Deep Convolutional Neural Network Inference Processor With Enhanced Output Stationary Dataflow in 65-Nm CMOS. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2020**, 28, 87–100.
22. Das, D.; Mellempudi, N.; Mudigere, D.; Kalamkar, D.; Avancha, S.; Banerjee, K.; Sridharan, S.; Vaidyanathan, K.; Kaul, B.; Georganas, E.; et al. Mixed precision training of convolutional neural networks using integer operations. arXiv **2018**, arXiv:1802.00930.
23. Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1737–1746.
24. Fleischer, B.; Shukla, S.; Ziegler, M.; Silberman, J.; Oh, J.; Srinivasan, V.; Choi, J.; Mueller, S.; Agrawal, A.; Babinsky, T.; et al. A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference. In Proceedings of the 2018 IEEE Symposium on VLSI Circuits, Honolulu, HI, USA, 18–22 June 2018; pp. 35–36.
25. Lian, X.; Liu, Z.; Song, Z.; Dai, J.; Zhou, W.; Ji, X. High-Performance FPGA-Based CNN Accelerator With Block-Floating-Point Arithmetic. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2019**, 27, 1874–1885.
26. Iwata, A.; Yoshida, Y.; Matsuda, S.; Sato, Y.; Suzumura, N. An artificial neural network accelerator using general purpose 24 bit floating point digital signal processors. In Proceedings of the International 1989 Joint Conference on Neural Networks, Washington, DC, USA, 18–22 June 1989; Volume 2, pp. 171–175.
27. Zhang, X.; Liu, S.; Zhang, R.; Liu, C.; Huang, D.; Zhou, S.; Guo, J.; Guo, Q.; Du, Z.; Zhi, T.; et al. Fixed-Point Back-Propagation Training. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2327–2335.
28. Mujawar, S.; Kiran, D.; Ramasangu, H. An Efficient CNN Architecture for Image Classification on FPGA Accelerator. In Proceedings of the 2018 Second International Conference on Advances in Electronics, Computers and Communications (ICAECC), Bengaluru, India, 9–10 February 2018; pp. 1–4.
29. Chen, C.-Y.; Choi, J.; Gopalakrishnan, K.; Srinivasan, V.; Venkataramani, S. Exploiting approximate computing for deep learning acceleration. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 19–23 March 2018; pp. 821–826.
30. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaev, O.; Venkatesh, G.; et al. Mixed precision training. arXiv **2017**, arXiv:1710.03740.
31. Christopher, B.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
32. IEEE Std 754-2019 (Revision of IEEE 754-2008); IEEE Standard for Floating-Point Arithmetic. IEEE: New York, NY, USA, 2019; pp. 1–84.
33. Hong, J.; Arslan, S.; Lee, T.; Kim, H. Design of Power-Efficient Training Accelerator for Convolution Neural Networks. Electronics **2021**, 10, 787.
34. Zhao, W.; Fu, H.; Luk, W.; Yu, T.; Wang, S.; Feng, B.; Ma, Y.; Yang, G. F-CNN: An FPGA-Based Framework for Training Convolutional Neural Networks. In Proceedings of the 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP), London, UK, 6–8 July 2016; pp. 107–114.
35. Neil, D.; Liu, S.-C. Minitaur, an Event-Driven FPGA-Based Spiking Network Accelerator. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2014**, 22, 2621–2628.

**Figure 4.** The representation of floating-point formats: (**a**) 16-bit custom floating point; (**b**) 16-bit brain floating point; (**c**) 24-bit custom floating point; (**d**) 32-bit single precision.

Total Bits | Common Name | Significand Bits | Exponent Bits | Exponent Bias |
---|---|---|---|---|
16 | Custom | 10 | 6 | $2^{5}-1=31$ |
16 | Brain Floating | 8 | 8 | $2^{7}-1=127$ |
24 | Custom | 16 | 8 | $2^{7}-1=127$ |
32 | Single-Precision | 24 | 8 | $2^{7}-1=127$ |
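The bias column follows the usual rule bias $= 2^{e-1}-1$ for an exponent field of $e$ bits. The short Python sketch below recomputes it for the four formats and also derives the stored field widths, assuming that the significand count in the table includes the hidden leading bit (which is consistent with the (1, 8, 7), (1, 8, 15), and (1, 8, 23) notation used for the operators). The names `exponent_bias`, `field_widths`, and `FORMATS` are illustrative only.

```python
def exponent_bias(exp_bits):
    """Bias of an exponent field that is exp_bits wide: 2^(e-1) - 1."""
    return 2 ** (exp_bits - 1) - 1


def field_widths(total_bits, exp_bits):
    """Stored (sign, exponent, mantissa) widths; the hidden bit is not stored."""
    return 1, exp_bits, total_bits - exp_bits - 1


# (total bits, exponent bits) for the formats listed in the table
FORMATS = {
    "Custom-16": (16, 6),
    "Brain-16": (16, 8),
    "Custom-24": (24, 8),
    "Single-32": (32, 8),
}

summary = {
    name: (field_widths(total, exp), exponent_bias(exp))
    for name, (total, exp) in FORMATS.items()
}
```

For instance, `summary["Custom-16"]` is `((1, 6, 9), 31)`: one sign bit, six exponent bits, nine stored mantissa bits, and a bias of 31.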

Metric | 50 MHz: Reciprocal | 50 MHz: Signed Array | 100 MHz: Reciprocal | 100 MHz: Signed Array |
---|---|---|---|---|
Area (µm^{2}) | 38,018.24 | 6253.19 | 38,039.84 | 8254.21 |
Processing delay (ns) | 70.38 | 21.36 | 64.23 | 10.79 |
Total Energy (pJ) ^{a} | 78.505 | 4.486 | 112.927 | 5.019 |

^{a} Total energy is the energy per division operation.

S. No | Precision Format | Formats for Individual Layers | Mantissa Bits | Exponent Bits | Training Accuracy | Test Accuracy | Dynamic Power |
---|---|---|---|---|---|---|---|
1 | IEEE-32 | All 32-bits | 24 | 8 | 96.42% | 96.18% | 36 mW |
2 | Custom-24 | All 24-bits | 16 | 8 | 94.26% | 93.15% | 30 mW |
3 | IEEE-16 | All 16-bits | 11 | 5 | 12.78% | 11.30% | 19 mW |

N-Bits | Common Name | Area (µm^{2}) | Processing Delay (ns) | Total Energy (pJ) |
---|---|---|---|---|
16 (1,8,7) | Brain Floating | 1749.96 | 10.79 | 0.402 |
24 (1,8,15) | Custom | 2610.44 | 10.80 | 0.635 |
32 (1,8,23) | Single-Precision | 3895.16 | 10.75 | 1.023 |

N-Bits | Common Name | Area (µm^{2}) | Processing Delay (ns) | Total Energy (pJ) |
---|---|---|---|---|
16 (1,8,7) | Brain Floating | 1989.32 | 10.80 | 0.8751 |
24 (1,8,15) | Custom | 2963.16 | 10.74 | 1.5766 |
32 (1,8,23) | Single-Precision | 5958.07 | 10.76 | 3.3998 |

N-Bits | Common Name | Area (µm^{2}) | Processing Delay (ns) | Total Energy (pJ) |
---|---|---|---|---|
16 (1,8,7) | Brain Floating | 1442.16 | 10.80 | 0.6236 |
24 (1,8,15) | Custom | 3624.12 | 10.79 | 1.9125 |
32 (1,8,23) | Single-Precision | 8254.21 | 10.85 | 5.019 |

S. No | Precision Format | Formats for Individual Layers | Mantissa Bits | Exponent Bits | Training Accuracy | Test Accuracy | Dynamic Power |
---|---|---|---|---|---|---|---|
1 | IEEE-16 | All 16-bits | 11 | 5 | 11.52% | 10.24% | 19 mW |
2 | Custom-16 | All 16-bits | 10 | 6 | 15.78% | 13.40% | 19 mW |
3 | Custom-16 | All 16-bits | 9 | 7 | 45.72% | 32.54% | 19 mW |
4 | Brain-16 | All 16-bits | 8 | 8 | 91.85% | 90.73% | 20 mW |
5 | CONV Mixed-18 | Conv/BackConv-18 Rest 16-bits ^{a} | 10/8 | 8 | 92.16% | 91.29% | 21 mW |
6 | CONV Mixed-20 | Conv/BackConv-20 Rest 16-bits ^{a} | 12/8 | 8 | 92.48% | 91.86% | 22 mW |
7 | CONV Mixed-23 | Conv/BackConv-23 Rest 16-bits ^{a} | 15/8 | 8 | 92.91% | 92.75% | 22 mW |
8 | CONV Mixed-24 | Conv/BackConv-24 Rest 16-bits ^{a} | 16/8 | 8 | 93.32% | 93.12% | 23 mW |
9 | FC1 Mixed-32 | FC1/BackFC1-32 Rest 20-bits ^{b} | 24/12 | 8 | 93.01% | 92.53% | 26 mW |
10 | FC2 Mixed-32 | FC2/BackFC2-32 Rest 22-bits ^{c} | 24/14 | 8 | 93.14% | 92.71% | 27 mW |

^{a} Rest 16-bit modules are Pooling, FC1, FC2, Softmax, Back FC1, Back FC2 and Back Pooling.

^{b} Rest 20-bit modules are Convolution, Pooling, FC2, Softmax, Back FC2, Back Pooling and Back Conv.

^{c} Rest 22-bit modules are Convolution, Pooling, FC1, Softmax, Back FC1, Back Pooling and Back Conv.

Criteria | [34] | [28] | [35] | [33] | Proposed |
---|---|---|---|---|---|
Precision | FP 32 | FP 32 | Fixed Point 16 | FP 32 | Mixed |
Training dataset | MNIST | MNIST | MNIST | MNIST | MNIST |
Device | Maxeler MPC-X | Artix 7 | Spartan-6 LX150 | Xilinx XCZU7EV | Xilinx XCZU9EG |
Accuracy | - | 90% | 92% | 96% | 93.32% |
LUT | 69,510 | 7986 | - | 169,143 | 33,404 |
FF | 87,580 | 3297 | - | 219,372 | 61,532 |
DSP | 23 | 199 | - | 12 | 0 |
BRAM | 510 | 8 | 200 | 304 | 7.5 |
Operations (OPs) | 14,149,798 | - | 16,780,000 | 114,824 | 114,824 |
Time Per Image (µs) | 355 | 58 | 236 | 26.17 | 13.398 |
Power (W) | 27.3 | 12 | 20 | 0.67 | 0.635 ^{a} |
Energy Per Image (µJ) | 9691.5 | 696 | 4720 | 17.4 | 8.5077 |

^{a} Calculated by Xilinx Vivado (Power = Static power + Dynamic power).

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Junaid, M.; Arslan, S.; Lee, T.; Kim, H.
Optimal Architecture of Floating-Point Arithmetic for Neural Network Training Processors. *Sensors* **2022**, *22*, 1230.
https://doi.org/10.3390/s22031230
