Area- and Power-Efficient Reconfigurable Architecture for Multifunction Evaluation
Abstract
1. Introduction
- The proposed segmentor reduces the bit width of the slope after quantification. With a similar number of segments and a narrower slope, it achieves the same precision as [30].
- The hardware architecture for a single function was optimized by means of Booth encoding to reduce the number of partial products. Additionally, compressors were introduced to shorten the critical path.
- The proposed architecture for a single function improves on state-of-the-art approaches in area, power, and latency; compared with the PWL design in [30], it reduces area by 9.16–55.64% and power by 17.90–54.85%.
- Reconfigurable technology was applied for multifunction implementation by reusing computing resources to improve the computing density.
2. Theoretical Background
2.1. PWL Method
2.2. Minimization of MAE
2.3. Software-Based Segmentor
3. Proposed Approach
3.1. Proposed Segmentor
- (1) Calculate the slope and y-intercept of the current segment from its starting and ending points.

  The slope was quantified by rounding to the nearest value with a fractional bit width of qw: if the first truncated bit is 0, the lower bits are simply truncated; otherwise, a carry of one is added after truncation. For example, with three fractional bits, the binary number 1.10101 is rounded to 1.101, whereas 1.10110 is rounded to 1.110. Rounding loses little accuracy but costs more hardware than truncation, so it suits data that are prepared by a software platform and stored on the chip under design. In our design, both the slope and the y-intercept were quantified by rounding. In the segmentor, quantification by rounding was simulated on decimal values according to (16). For the two binary numbers above, 1.65625 (binary 1.10101) was quantified as 1.625 (binary 1.101), whereas 1.6875 (binary 1.10110) was quantified as 1.75 (binary 1.110). Thus, the decimal simulation in (16) agrees with the rounding operation on binary numbers.

  The multiplication output, m, was simulated accordingly. This output is an input to the addition, so it must also be quantified to reduce the width of the adder. To avoid truncating the output after MAE is minimized by shifting the linear function in the y-direction, m was quantified to a fractional bit width of iw, the same as that of the input. Because the quantification of m is executed by the hardware circuit, m was quantified by truncation to avoid additional hardware overhead: the lower bits are dropped without any carry. For example, the binary numbers 1.10101 and 1.10110 are both truncated to 1.101 with three fractional bits. Quantification by truncation was simulated according to (18): 1.65625 (binary 1.10101) and 1.6875 (binary 1.10110) were both quantified as 1.625 (binary 1.101). Therefore, neglecting the quantification and optimization of the y-intercept, the output of the linear function is the truncated multiplication output plus the y-intercept.
- (2)
- (3) Simulate the multiplication by means of (14)–(18). The y-intercept must also be quantified before the addition. Like the slope, it is quantified by rounding, to the same fractional bit width as the input (iw). The linear function is then updated to use the quantified y-intercept bq in place of b.
- (4) Calculate MAE with (2).
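The two quantification modes described in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' segmentor: the function names, the `floor(x * 2**n + 0.5)` form of rounding, and the use of `floor` for truncation (which matches dropping low bits in two's complement) are assumptions chosen to reproduce the worked examples in the text.

```python
import math


def quantize_round(x: float, frac_bits: int) -> float:
    """Round x to frac_bits fractional bits.

    Assumed model: adding half an LSB and flooring is the same as
    'add one carry if the first truncated bit is 1'.
    """
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def quantize_trunc(x: float, frac_bits: int) -> float:
    """Drop every bit below 2**-frac_bits without any carry."""
    scale = 1 << frac_bits
    return math.floor(x * scale) / scale


def simulate_linear_output(x: float, k: float, b: float,
                           slope_frac_bits: int, io_frac_bits: int) -> float:
    """Hypothetical end-to-end model of one segment's output.

    Slope and y-intercept are rounded (prepared offline), the product is
    truncated (done in hardware), then the y-intercept is added.
    """
    kq = quantize_round(k, slope_frac_bits)
    bq = quantize_round(b, io_frac_bits)
    m = quantize_trunc(kq * x, io_frac_bits)
    return m + bq
```

With three fractional bits, `quantize_round` maps 1.65625 to 1.625 and 1.6875 to 1.75, while `quantize_trunc` maps both to 1.625, which agrees with the binary examples given in step (1).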
3.2. Performance Analysis and Parameter Selection
3.3. Design Flow of the Proposed Approach
4. Hardware Architecture
4.1. Single-Function Implementation
4.2. Multifunction Implementation without Reuse of Computing Resources
4.3. Multifunction Implementation with Reuse of Computing Resources
5. Implementation Results and Comparison
5.1. Results of Single-Function Architecture Implementation
5.2. Results of Multifunction Architecture Implementation
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
MAE | Maximum absolute error |
Predefined MAE in the segmentor | |
MXE | Maximum error value |
MIE | Minimum error value |
MAE before optimization with the method in Section 2.2 | |
MAE after optimization with the method in Section 2.2 | |
k | Slope of the linear function |
kq | Slope of the linear function after quantification |
b | Y-intercept of the linear function |
bq | Y-intercept of the linear function after quantification |
sp | Starting point of the current segment |
ep | Ending point of the current segment |
lp | Leftmost point of the bisection window |
rp | Rightmost point of the bisection window |
leg | Number of inputs |
qw | Fractional bit width of the intermediate data |
sw | Fractional bit width of the slope |
bw | Bit width of the slope (including integral and fractional parts) |
iw | Fractional bit width of the inputs and outputs |
PWL | Piecewise linear |
FP16 | Half-precision floating-point format |
ulp | Unit in the last place |
DNNs | Deep neural networks |
pp | Partial product |
RTL | Register-transfer level |
ASIC | Application-specific integrated circuit |
NR | Newton–Raphson |
HPA | High-degree polynomial approximation |
APF | Area per function |
PPF | Power per function |
References
1. Harris, D. A powering unit for an OpenGL lighting engine. In Proceedings of the Conference Record of Thirty-Fifth Asilomar Conference on Signals, Systems and Computers (Cat. No. 01CH37256), Pacific Grove, CA, USA, 4–7 November 2001; Volume 2, pp. 1641–1645.
2. Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J. Survey of Machine Learning Accelerators. In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 22–24 September 2020; pp. 1–12.
3. Ellaithy, D.M.; El-Moursy, M.A.; Ibrahim, G.H.; Zaki, A.; Zekry, A. Double Logarithmic Arithmetic Technique for Low-Power 3-D Graphics Applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 2144–2152.
4. Wang, Z.; Lin, J.; Wang, Z. Accelerating Recurrent Neural Networks: A Memory-Efficient Approach. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 2763–2775.
5. Luo, Y.; Wang, Y.; Ha, Y.; Wang, Z.; Chen, S.; Pan, H. Generalized Hyperbolic CORDIC and Its Logarithmic and Exponential Computation with Arbitrary Fixed Base. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 2156–2169.
6. Mopuri, S.; Acharyya, A. Low Complexity Generic VLSI Architecture Design Methodology for Nth Root and Nth Power Computations. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66, 4673–4686.
7. Wang, Y.; Luo, Y.; Wang, Z.; Shen, Q.; Pan, H. GH CORDIC-Based Architecture for Computing Nth Root of Single-Precision Floating-Point Number. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 864–875.
8. Chen, H.; Cheng, K.; Lu, Z.; Fu, Y.; Li, L. Hyperbolic CORDIC-Based Architecture for Computing Logarithm and Its Implementation. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 2652–2656.
9. Wu, R.; Chen, H.; He, G.; Fu, Y.; Li, L. Low-Latency Low-Complexity Method and Architecture for Computing Arbitrary Nth Root of Complex Numbers. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 2529–2541.
10. Kornerup, P.; Muller, J.M. Choosing starting values for certain Newton–Raphson iterations. Theor. Comput. Sci. 2006, 351, 101–110.
11. Aslan, S.; Oruklu, E.; Saniie, J. Realization of area efficient QR factorization using unified division, square root, and inverse square root hardware. In Proceedings of the 2009 IEEE International Conference on Electro/Information Technology, Windsor, ON, Canada, 7–9 June 2009; pp. 245–250.
12. Vestias, M.P.; Neto, H.C. Revisiting the Newton-Raphson Iterative Method for Decimal Division. In Proceedings of the 2011 21st International Conference on Field Programmable Logic and Applications, Chania, Greece, 5–7 September 2011; pp. 138–143.
13. Rodriguez-Garcia, A.; Pizano-Escalante, L.; Parra-Michel, R.; Longoria-Gandara, O.; Cortez, J. Fast fixed-point divider based on Newton-Raphson method and piecewise polynomial approximation. In Proceedings of the 2013 International Conference on Reconfigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 9–11 December 2013; pp. 1–6.
14. Jain, R.; Pandey, N. Realization of Regula-Falsi Iteration based Double Precision Floating Point Division. In Proceedings of the 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184), Tirunelveli, India, 15–17 June 2020; pp. 88–92.
15. de Dinechin, F.; Tisserand, A. Multipartite table methods. IEEE Trans. Comput. 2005, 54, 319–330.
16. De Caro, D.; Petra, N.; Strollo, A.G.M. Reducing Lookup-Table Size in Direct Digital Frequency Synthesizers Using Optimized Multipartite Table Method. IEEE Trans. Circuits Syst. I Regul. Pap. 2008, 55, 2116–2127.
17. Low, J.Y.L.; Jong, C.C. A Memory-Efficient Tables-and-Additions Method for Accurate Computation of Elementary Functions. IEEE Trans. Comput. 2013, 62, 858–872.
18. Hsiao, S.F.; Wu, P.H.; Wen, C.S.; Meher, P.K. Table Size Reduction Methods for Faithfully Rounded Lookup-Table-Based Multiplierless Function Evaluation. IEEE Trans. Circuits Syst. II Express Briefs 2015, 62, 466–470.
19. Hsiao, S.F.; Wen, C.S.; Chen, Y.H.; Huang, K.C. Hierarchical Multipartite Function Evaluation. IEEE Trans. Comput. 2017, 66, 89–99.
20. Chen, H.; Yang, H.; Song, W.; Lu, Z.; Fu, Y.; Li, L.; Yu, Z. Symmetric-Mapping LUT-Based Method and Architecture for Computing XY-Like Functions. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 1231–1244.
21. Lee, D.U.; Cheung, R.; Luk, W.; Villasenor, J. Hardware Implementation Trade-Offs of Polynomial Approximations and Interpolations. IEEE Trans. Comput. 2008, 57, 686–701.
22. Strollo, A.G.; De Caro, D.; Petra, N. Elementary Functions Hardware Implementation Using Constrained Piecewise-Polynomial Approximations. IEEE Trans. Comput. 2011, 60, 418–432.
23. De Caro, D.; Napoli, E.; Esposito, D.; Castellano, G.; Petra, N.; Strollo, A.G.M. Minimizing Coefficients Wordlength for Piecewise-Polynomial Hardware Function Evaluation With Exact or Faithful Rounding. IEEE Trans. Circuits Syst. I Regul. Pap. 2017, 64, 1187–1200.
24. Ellaithy, D.M.; El-Moursy, M.A.; Zaki, A.; Zekry, A. Dual-Channel Multiplier for Piecewise-Polynomial Function Evaluation for Low-Power 3-D Graphics. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 790–798.
25. An, M.; Luo, Y.; Zheng, M.; Wang, Y.; Dong, H.; Wang, Z.; Peng, C.; Pan, H. Piecewise Parabolic Approximate Computation Based on an Error-Flattened Segmenter and a Novel Quantizer. Electronics 2021, 10, 2704.
26. Liu, C.W.; Ou, S.H.; Chang, K.C.; Lin, T.C.; Chen, S.K. A Low-Error, Cost-Efficient Design Procedure for Evaluating Logarithms to Be Used in a Logarithmic Arithmetic Processor. IEEE Trans. Comput. 2016, 65, 1158–1164.
27. Loukrakpam, M.; Choudhury, M. Error-aware design procedure to implement hardware-efficient antilogarithmic converters. Circuits Syst. Signal Process. 2019, 38, 4266–4279.
28. Sun, H.; Luo, Y.; Ha, Y.; Shi, Y.; Gao, Y.; Shen, Q.; Pan, H. A Universal Method of Linear Approximation With Controllable Error for the Efficient Implementation of Transcendental Functions. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 177–188.
29. Dong, H.; Wang, M.; Luo, Y.; Zheng, M.; An, M.; Ha, Y.; Pan, H. PLAC: Piecewise Linear Approximation Computation for All Nonlinear Unary Functions. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2014–2027.
30. Lyu, F.; Mao, Z.; Zhang, J.; Wang, Y.; Luo, Y. PWL-Based Architecture for the Logarithmic Computation of Floating-Point Numbers. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 1470–1474.
31. Liu, W.; Liao, Q.; Qiao, F.; Xia, W.; Wang, C.; Lombardi, F. Approximate Designs for Fast Fourier Transform (FFT) With Application to Speech Recognition. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66, 4727–4739.
32. Mittal, S. A survey of techniques for approximate computing. ACM Comput. Surv. (CSUR) 2016, 48, 1–33.
33. Lyu, F.; Xu, X.; Wang, Y.; Luo, Y.; Wang, Y.; Pan, H. Ultralow-Latency VLSI Architecture Based on a Linear Approximation Method for Computing Nth Roots of Floating-Point Numbers. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 715–727.
34. Goldberg, D. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. (CSUR) 1991, 23, 5–48.
35. Shukla, S.; Fleischer, B.; Ziegler, M.; Silberman, J.; Oh, J.; Srinivasan, V.; Choi, J.; Mueller, S.; Agrawal, A.; Babinsky, T.; et al. A Scalable Multi-TeraOPS Core for AI Training and Inference. IEEE Solid-State Circuits Lett. 2018, 1, 217–220.
36. Choi, S.; Sim, J.; Kang, M.; Choi, Y.; Kim, H.; Kim, L.S. An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on Smart Devices. IEEE J. Solid-State Circuits 2020, 55, 2691–2702.
37. Kuang, S.R.; Wang, J.P.; Guo, C.Y. Modified Booth multipliers with a regular partial product array. IEEE Trans. Circuits Syst. II Express Briefs 2009, 56, 404–408.
38. Li, B.; Fang, L.; Xie, Y.; Chen, H.; Chen, L. A unified reconfigurable floating-point arithmetic architecture based on CORDIC algorithm. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, Australia, 11–13 December 2017; pp. 301–302.
39. Chen, H.; Jiang, L.; Yang, H.; Lu, Z.; Fu, Y.; Li, L.; Yu, Z. An Efficient Hardware Architecture with Adjustable Precision and Extensible Range to Implement Sigmoid and Tanh Functions. Electronics 2020, 9, 1739.
| Function | sw | Number of Segments | Range of kq | kq Sign Bits | kq Integral Bits | kq Fractional Bits | kq Total Bits | Number of Partial Products |
|---|---|---|---|---|---|---|---|---|
| | 5 | 15 | (0.7, 1.4) | 1 | 1 | 5 | 7 | 4 |
| | 5 | 14 | (0.7, 1.5) | 1 | 1 | 5 | 7 | 4 |
| | 5 | 14 | (−1, −0.2) | 1 | 0 | 5 | 6 | 3 |
| | 5 | 11 | (−0.5, −0.1) | 1 | 0 | 4 | 5 | 3 |
| | 6 | 7 | (0.3, 0.5) | 1 | 0 | 5 | 6 | 3 |
| Multi-function | 6 | 57 | (−1, 1.5) | 1 | 1 | 6 | 8 | 4 |
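The partial-product counts in the table above (3, 3, 4, and 4 for slope widths of 5, 6, 7, and 8 bits) match what radix-4 Booth recoding produces, namely one partial product per pair of multiplier bits. The sketch below assumes radix-4 (modified) Booth recoding, since the text only says "Booth encoding"; the function names are hypothetical and the code models the recoding arithmetic, not the authors' circuit.

```python
from typing import List


def booth_radix4_digits(value: int, bits: int) -> List[int]:
    """Recode a signed `bits`-wide integer into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2} and selects one partial product.
    Digit i is -2*x[2i+1] + x[2i] + x[2i-1], with x[-1] = 0.
    """
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError("value does not fit in the given signed width")

    def bit(i: int) -> int:
        # Python's >> on negative ints is arithmetic, so bits above the
        # sign bit come out sign-extended automatically.
        return 0 if i < 0 else (value >> i) & 1

    n_digits = (bits + 1) // 2  # ceil(bits / 2) partial products
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(n_digits)]


def booth_value(digits: List[int]) -> int:
    """Reassemble the integer that a list of radix-4 digits represents."""
    return sum(d * 4 ** i for i, d in enumerate(digits))
```

For instance, `booth_radix4_digits(45, 7)` yields four digits whose weighted sum is 45, so a 7-bit slope needs four partial products instead of seven.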
| Function | Range of bq | bq Integral Bits | bq Fractional Bits | bq Total Bits | Range of Output | Output Integral Bits | Output Fractional Bits | Output Total Bits |
|---|---|---|---|---|---|---|---|---|
| | (0.6, 1] | 1 | 10 | 11 | [1, 2) | 1 | 10 | 11 |
| | (0, 0.3) | 0 | 10 | 10 | [0, 1) | 0 | 10 | 10 |
| | (0.7, 1] | 1 | 10 | 11 | (0.5, 1] | 1 | 10 | 11 |
| | (0.8, 1] | 1 | 10 | 11 | (0.7, 1] | 1 | 10 | 11 |
| | (1, 1.1) | 1 | 10 | 11 | [0, 1.5) | 1 | 10 | 11 |
| Multi-function | (0, 1.1) | 1 | 10 | 11 | [0, 2) | 1 | 10 | 11 |
| Function | Method | Freq. (GHz) | Delay (ns) | Area (µm²) | Power (mW) | MAE (×10⁻³) |
|---|---|---|---|---|---|---|
| | Proposed | 1.64 | 0.61 | 2576.16 | 1.280 | 0.97 |
| | [30] PWL | 1.52 | 0.66 | 2836.08 | 1.559 | 0.97 |
| | Proposed vs. [30] | +7.89% | −7.58% | −9.16% | −17.90% | −0% |
| | [5] CORDIC | 1.25 | 9.6 | 12,841.20 | 2.651 | 1.23 |
| | Proposed vs. [5] | +31.20% | −93.65% | −79.94% | −51.72% | −21.14% |
| | Proposed | 1.56 | 0.64 | 1556.28 | 0.754 | 0.97 |
| | [30] PWL | 1.49 | 0.67 | 3508.56 | 1.670 | 0.97 |
| | Proposed vs. [30] | +4.70% | −4.48% | −55.64% | −54.85% | −0% |
| | [8] CORDIC | 1.25 | 9.6 | 11,590.20 | 2.721 | 1.28 |
| | Proposed vs. [8] | +24.80% | −93.33% | −86.57% | −72.29% | −24.22% |
| | Proposed | 1.92 | 0.52 | 1994.40 | 1.049 | 0.97 |
| | [30] PWL | 1.49 | 0.67 | 3039.12 | 1.470 | 0.96 |
| | Proposed vs. [30] | +28.86% | −22.39% | −34.38% | −28.64% | – |
| | [11] NR | 1.25 | 4.8 | 13,334.00 | 5.735 | 1.22 |
| | Proposed vs. [11] | +53.60% | −89.17% | −85.04% | −81.71% | −20.49% |
| | Proposed | 2.13 | 0.47 | 1575.00 | 0.855 | 0.96 |
| | [30] PWL | 1.45 | 0.69 | 2817.36 | 1.349 | 0.95 |
| | Proposed vs. [30] | +46.90% | −31.88% | −44.10% | −36.62% | – |
| | [11] NR | 1.25 | 7.2 | 16,566.48 | 6.016 | 1.21 |
| | Proposed vs. [11] | +70.40% | −93.47% | −90.49% | −85.79% | −20.66% |
| | Proposed | 2.17 | 0.46 | 1413.36 | 0.845 | 0.98 |
| | [30] PWL | 1.61 | 0.62 | 2706.12 | 1.498 | 0.98 |
| | Proposed vs. [30] | +34.78% | −25.81% | −47.77% | −43.59% | −0% |
| | [38] CORDIC | 1.11 | 10.8 | 10,386.00 | 2.006 | 1.35 |
| | Proposed vs. [38] | +95.50% | −95.74% | −86.39% | −57.88% | −27.41% |
| Multi-function | Without reusing | 1.25 | 0.8 | 5329.80 | 1.965 | 0.98 |
| Multi-function | With reusing | 1.25 | 0.8 | 3332.16 | 1.069 | 0.98 |
| Multi-function | With vs. without reusing | – | – | −37.48% | −45.60% | −0% |
| Design | CMOS Technology | Frequency (GHz) | Area (µm²) | Power (mW) | MAE | Number of Functions | APF (µm²) | PPF (mW) |
|---|---|---|---|---|---|---|---|---|
| Proposed | TSMC 65-nm | 1.25 | 3332.16 | 1.069 | 9.8 × 10⁻⁴ | 5 | 666.43 | 0.21 |
| [30] PWL | TSMC 65-nm | 1 | 3669.84 | 1.100 | 9.8 × 10⁻⁴ | 5 | 733.97 | 0.22 |
| Proposed vs. [30] | | +25% | −9.20% | −2.82% | −0% | – | −9.20% | −2.82% |
| Proposed | TSMC 40-nm | 1.5 | 4827.95 | 1.630 | 9.8 × 10⁻⁴ | 5 | 965.59 | 0.33 |
| [39] CORDIC | TSMC 40-nm | 1.5 | 4364.57 | 1.89 | 10⁻⁴ a | 2 | 2182.29 | 0.95 |
| Proposed vs. [39] | | +0% | – | −13.76% | – | – | −55.75% | −65.26% |
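The per-function metrics in the table above follow from simple arithmetic: APF and PPF divide the total area and power by the number of functions, and each percentage row is the relative change of the proposed design against the reference. The short sketch below reproduces the 65-nm figures; the helper names are hypothetical, and the published 40-nm PPF percentage appears to be computed from the rounded PPF values, so it is not re-derived here.

```python
def per_function(total: float, n_functions: int) -> float:
    """Area per function (APF) or power per function (PPF)."""
    return total / n_functions


def relative_change(proposed: float, reference: float) -> float:
    """Signed percentage change of `proposed` relative to `reference`."""
    return (proposed - reference) / reference * 100.0


# 65-nm figures taken from the table: total area in µm², five functions each.
apf_proposed = per_function(3332.16, 5)   # about 666.43 µm² per function
apf_reference = per_function(3669.84, 5)  # about 733.97 µm² per function
apf_change = relative_change(apf_proposed, apf_reference)  # about -9.20 %
```

Because both designs implement five functions, the APF change equals the change in total area, which is why the same −9.20% appears in both columns.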
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zheng, S.; Zhao, G.; Wang, Y.; Lyu, F.; Wang, Y.; Pan, H.; Luo, Y. Area- and Power-Efficient Reconfigurable Architecture for Multifunction Evaluation. Electronics 2022, 11, 3391. https://doi.org/10.3390/electronics11203391