# Templatized Fused Vector Floating-Point Dot Product for High-Level Synthesis


## Abstract


## 1. Introduction

- A templatized fused vector FP dot product C++ model is presented, which brings the efficiency of fused FP architectures to High-Level Synthesis for the first time, enabling the design of efficient and customized architectures.
- Experimental results demonstrate that the proposed designs yield area and latency savings at the same clock-frequency target. For 32-bit standard floats, this benefit comes with a power increase, while for reduced-precision 16-bit bfloats [5], power is, in fact, reduced with the proposed architecture.

## 2. Fused Vector Dot Product for HLS

The proposed design is built on top of the FastFloat4HLS library [21], which defines the templatized datatype `fast_float<M, E>`, where constants `M` and `E` refer to the size of the mantissa and the exponent fields, respectively. Single-precision floats correspond to `fast_float<23, 8>`, while bfloat16 is equivalent to `fast_float<7, 8>`. FastFloat4HLS contains type-cast functions that allow the conversion from standard C++ floating-point datatypes to `fast_float`. Also, similar to all other FP libraries available for HLS [17,18,19], FastFloat4HLS implements primitive arithmetic operators, allowing the designer to implement any algorithm in hardware using typical C++ behavioral modeling.

#### 2.1. Using the Dot Product in C++

The fused vector dot product of FastFloat4HLS is offered through a templatized function `dot`. The operation is supported for any `fast_float` configuration.

#### 2.2. Architecture of the Fused FP Dot Product

#### 2.2.1. Multiplication of Fractions

The `hls_unroll` pragma guides the HLS tool to generate `N` parallel instances, equal to the number of individual multiplications defined by the template parameter `N`. The two versions of the exponent and the multiplication of the two fractions are computed in parallel, as all three operations are independent of each other. The correct exponent is selected depending on the value of the product, once it becomes available.

#### 2.2.2. Alignment of Products

#### 2.2.3. Addition

#### 2.2.4. Normalization and Rounding

The leading-zero count is computed by the `LZC` function, which receives the input A and starts the recursion by calling the `lzc_s` function. In each recursive step of `lzc_s`, `lzc_reduce` decides whether the number of leading zeros is an odd or an even number. Initially, `lzc_reduce` is applied to the whole input and, in each of the following steps, the input is reduced to half its width by computing the logic OR of neighboring bits. When only one bit remains, the recursion stops. If the input is the all-zero vector from the beginning, the flag `ZF` is asserted; in this case, the implementation of [26] treats the remaining bits of the leading-zero count as “don’t care”. Otherwise, the complementary value of the inverted sequence of the intermediate results, which is returned by the top function `LZC` at the end of the operation, indicates the number of leading zeros in A.

In hardware, each level of the recursion corresponds to a `lzc_reduction` unit and, as the computation moves to the next level, the input width is reduced to half, until a single bit remains. At each level, the produced output is inverted before its value is used.

## 3. Evaluation

#### 3.1. Identifying State-of-the-Art Non-Fused FP Vector Dot Product Configurations

#### 3.2. Comparisons with the Proposed Fused Vector FP Dot Product Architecture

#### 3.3. Performance Summary of Fused Dot Product Architectures

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976.
2. Jouppi, N.P.; Hyun Yoon, D.; Ashcraft, M.; Gottscho, M.; Jablin, T.B.; Kurian, G.; Laudon, J.; Li, S.; Ma, P.; Ma, X.; et al. Ten lessons from three generations shaped Google’s TPUv4i: Industrial product. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 1–14.
3. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE **2017**, 105, 2295–2329.
4. Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A survey of quantization methods for efficient neural network inference. arXiv **2021**, arXiv:2103.13630.
5. Wang, S.; Kanwar, P. BFloat16: The secret to high performance on Cloud TPUs. Google Cloud Blog **2019**. Available online: https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus (accessed on 10 September 2022).
6. Andersch, M.; Palmer, G.; Krashinsky, R.; Stam, N.; Mehta, V.; Brito, G.; Ramaswamy, S. NVIDIA Hopper Architecture. Available online: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth (accessed on 11 September 2022).
7. Agrawal, A.; Mueller, S.M.; Fleischer, B.M.; Sun, X.; Wang, N.; Choi, J.; Gopalakrishnan, K. DLFloat: A 16-b Floating Point format designed for Deep Learning Training and Inference. In Proceedings of the International Symposium on Computer Arithmetic (ARITH), Kyoto, Japan, 10–12 June 2019.
8. Micikevicius, P.; Stosic, D.; Burgess, N.; Cornea, M.; Dubey, P.; Grisenthwaite, R.; Ha, S.; Heinecke, A.; Judd, P.; Kamalu, J.; et al. FP8 Formats for Deep Learning. arXiv **2022**, arXiv:2209.05433.
9. Tambe, T.; Yang, E.Y.; Wan, Z.; Deng, Y.; Janapa Reddi, V.; Rush, A.; Brooks, D.; Wei, G.Y. Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference. In Proceedings of the Design Automation Conference (DAC), San Francisco, CA, USA, 20–24 July 2020; pp. 1–6.
10. Kim, D.; Kim, L.S. A floating-point unit for 4D vector inner product with reduced latency. IEEE Trans. Comput. **2008**, 58, 890–901.
11. Saleh, H.H.; Swartzlander, E.E. A floating-point fused dot-product unit. In Proceedings of the 2008 IEEE International Conference on Computer Design, Lake Tahoe, CA, USA, 12–15 October 2008; pp. 427–431.
12. Sohn, J.; Swartzlander, E.E. Improved architectures for a floating-point fused dot product unit. In Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH), Austin, TX, USA, 7–10 April 2013; pp. 41–48.
13. Kaul, H.; Anders, M.; Mathew, S.; Kim, S.; Krishnamurthy, R. Optimized fused floating-point many-term dot-product hardware for machine learning accelerators. In Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH), Kyoto, Japan, 10–12 June 2019; pp. 84–87.
14. Hickmann, B.; Chen, J.; Rotzin, M.; Yang, A.; Urbanski, M.; Avancha, S. Intel Nervana Neural Network Processor-T (NNP-T) Fused Floating Point Many-Term Dot Product. In Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH), Portland, OR, USA, 7–10 June 2020; pp. 133–136.
15. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed precision training. arXiv **2017**, arXiv:1710.03740.
16. Burgess, N.; Milanovic, J.; Stephens, N.; Monachopoulos, K.; Mansell, D. Bfloat16 processing for neural networks. In Proceedings of the 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), Kyoto, Japan, 10–12 June 2019; pp. 88–91.
17. Siemens EDA. Algorithmic C (AC) Datatypes Reference Manual. Available online: https://github.com/hlslibs/ac_types (accessed on 11 September 2022).
18. Thomas, D.B. Templatised soft floating-point for high-level synthesis. In Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), San Diego, CA, USA, 28 April–1 May 2019; pp. 227–235.
19. Xilinx. Vitis HLS Hardware Design Methodology—Arbitrary Precision Datatypes—Floats and Doubles. Available online: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Floats-and-Doubles (accessed on 11 September 2022).
20. Uguen, Y.; Dinechin, F.D.; Lezaud, V.; Derrien, S. Application-specific arithmetic in high-level synthesis tools. ACM Trans. Archit. Code Optim. (TACO) **2020**, 17, 1–23.
21. IC-Lab-DUTH Repository. FastFloat4HLS C++ Library. Available online: https://github.com/ic-lab-duth/Fast-Float4HLS (accessed on 11 September 2022).
22. De Dinechin, F.; Pasca, B. Custom arithmetic datapath design for FPGAs using the FloPoCo core generator. IEEE Design Test Comput. **2011**, 28, 18–27.
23. Hickmann, B.; Bradford, D. Experimental Analysis of Matrix Multiplication Functional Units. In Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH), Kyoto, Japan, 10–12 June 2019; pp. 116–119.
24. Käsgen, P.; Weinhardt, M. Using Template Metaprogramming for Hardware Description; Universität Tübingen: Tübingen, Germany, 2018.
25. Fingeroff, M. High-Level Synthesis: Blue Book; Xlibris Corporation: Bloomington, IN, USA, 2010.
26. Dimitrakopoulos, G.; Galanopoulos, K.; Mavrokefalidis, C.; Nikolos, D. Low-Power Leading-Zero Counting and Anticipation Logic for High-Speed Floating Point Units. IEEE Trans. Very Large Scale Integr. (VLSI) **2008**, 16, 837–850.
27. Siemens EDA. Questa Advanced Simulator. Available online: https://eda.sw.siemens.com/en-US/ic/questa/simulation/advanced-simulator/ (accessed on 11 September 2022).
28. Cadence. Genus Synthesis Solution. Available online: https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/synthesis/genus-synthesis-solution.html (accessed on 11 September 2022).
29. Cadence. Innovus Implementation System. Available online: https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/soc-implementation-and-floorplanning/innovus-implementation-system.html (accessed on 11 September 2022).
30. Galal, S.; Horowitz, M. Latency Sensitive FMA Design. In Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH), Tübingen, Germany, 25–27 July 2011; pp. 129–138.
31. Seidel, P.M.; Even, G. On the design of fast IEEE floating-point adders. In Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH), Vail, CO, USA, 11–13 June 2001; pp. 184–194.

**Figure 2.** An implementation of a matrix-vector multiplication using the dot product operator of Fast-Float4HLS.

**Figure 5.** A recursive template meta-programming approach for the design of a maximum-element identification hardware unit with logarithmic depth.

**Figure 6.** The C++ description that generates the addition tree by adding the positive or the negative value of the intermediate shifted product, depending on its sign.

**Figure 9.** Various pipelined organizations enabled by the FloPoCo RTL generator for a 4-term dot product unit built from efficient FP multipliers and adders.

**Figure 10.** The design-space exploration of 4- and 8-term dot products for the bfloat16 FP format generated with FloPoCo [22].

**Figure 11.** The final layout for (**a**) the proposed and (**b**) the state-of-the-art non-fused 4-term dot product architectures, assuming a bfloat16 representation.

**Figure 12.** The error introduced by the fused and non-fused architectures for two different FP formats when computing dot products with an increasing number of terms, relative to the same computation implemented with double-precision (64-bit) FP arithmetic.

**Table 1.** Comparison of the proposed 4- and 8-term fused FP dot product units relative to state-of-the-art non-fused designs.

**4-Term**

| Freq. | Format | Proposed: Area (um²) | Proposed: Power (mW) | Proposed: Lat. | Non-Fused: Area (um²) | Non-Fused: Power (mW) | Non-Fused: Lat. |
|---|---|---|---|---|---|---|---|
| 500 MHz | FP32 | 21,124 | 8.50 | 3 | 22,778 | 5.90 | 5 |
| 500 MHz | BF16 | 5057 | 1.83 | 3 | 5151 | 2.47 | 3 |
| 1 GHz | FP32 | 31,518 | 13.26 | 6 | 32,568 | 11.32 | 12 |
| 1 GHz | BF16 | 6750 | 4.62 | 6 | 7803 | 4.42 | 10 |

**8-Term**

| Freq. | Format | Proposed: Area (um²) | Proposed: Power (mW) | Proposed: Lat. | Non-Fused: Area (um²) | Non-Fused: Power (mW) | Non-Fused: Lat. |
|---|---|---|---|---|---|---|---|
| 500 MHz | FP32 | 50,847 | 14.17 | 3 | 51,304 | 13.66 | 7 |
| 500 MHz | BF16 | 10,096 | 4.11 | 3 | 11,422 | 5.78 | 4 |
| 1 GHz | FP32 | 60,863 | 25.07 | 7 | 67,953 | 25.62 | 19 |
| 1 GHz | BF16 | 14,405 | 9.39 | 6 | 17,614 | 9.44 | 14 |

| Design | Templatized | Open Source | #Terms | FP Format | Technology | Frequency (GHz) | Area (um²) ×1000 | Latency (Cycles) |
|---|---|---|---|---|---|---|---|---|
| [10] | No | No | 4 | single | 180 nm | 0.08 | ∼620 | 1 |
| [11] | No | No | 2 | single | 45 nm | 0.37 | 16.10 | 1 |
| [12] | No | No | 2 | single | 45 nm | 1.50 | 33.29 | 3 |
| [13] | No | No | 32 | bfloat16 | 10 nm | 1.11 | ∼2.75 | 5 |
| [14] | No | No | 32 | bfloat16 | 45 nm | N/A | N/A | 10 |
| Proposed | Yes | Yes * | 4 | single | 45 nm | 1.00 | 31.52 | 6 |
| Proposed | Yes | Yes * | 8 | bfloat16 | 45 nm | 1.00 | 14.41 | 6 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Filippas, D.; Nicopoulos, C.; Dimitrakopoulos, G.
Templatized Fused Vector Floating-Point Dot Product for High-Level Synthesis. *J. Low Power Electron. Appl.* **2022**, *12*, 56.
https://doi.org/10.3390/jlpea12040056
