# Designing Energy-Efficient Approximate Multipliers


## Abstract


## 1. Introduction

## 2. Background and Related Works

The n-bit multiplicand $A_{[n-1:0]} = a_{n-1}, \ldots, a_{0}$ and the m-bit multiplier $B_{[m-1:0]} = b_{m-1}, \ldots, b_{0}$ are 2's complement numbers represented as given in (1). As is well known, the basic multiplication algorithm first computes the bitwise ANDs between the operand A and the bits of B. Then, to obtain the generic partial product $PP_{j}$, with j = 0, …, m−1, the j-th result, produced by the AND operation related to the bit $b_{j}$, is left shifted by j bit positions and sign extended to (n + m) bits. Finally, as shown by (2), the exact product $Pe_{[n+m-1:0]}$ is calculated by accumulating the m computed PPs. Note that the simpler behavior of a multiplier processing unsigned operands can be derived from (1) and (2) by just removing the initial minus sign.
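The shift-and-accumulate scheme described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the function names (`to_signed`, `partial_products`, `exact_product`) are hypothetical, and the negative weight of the most significant bit of B is modeled by subtracting the last partial product, which is the standard 2's complement expansion rather than a transcription of (1) and (2).

```python
def to_signed(pattern, width):
    """Interpret the low `width` bits of `pattern` as a 2's complement number."""
    pattern &= (1 << width) - 1
    return pattern - (1 << width) if pattern >> (width - 1) else pattern


def partial_products(a, b, n, m):
    """Return the m partial products as (n + m)-bit patterns.

    PP_j is (A AND b_j), left shifted by j positions and sign extended to
    (n + m) bits. `a` and `b` are Python integers that fit in n and m bits.
    """
    mask = (1 << (n + m)) - 1
    pps = []
    for j in range(m):
        b_j = (b >> j) & 1            # j-th bit of B (2's complement view)
        row = a if b_j else 0         # bitwise AND of A with b_j
        pps.append((row << j) & mask)  # masking a negative value sign extends it
    return pps


def exact_product(a, b, n, m, signed=True):
    """Accumulate the partial products to obtain the exact product."""
    mask = (1 << (n + m)) - 1
    pps = partial_products(a, b, n, m)
    if signed:
        # The bit b_{m-1} carries a negative weight, so its row is subtracted.
        acc = sum(pps[:-1]) - pps[-1]
        return to_signed(acc, n + m)
    # Unsigned operands: every row is simply added (no minus sign).
    return sum(pps) & mask


print(exact_product(-7, 5, n=4, m=4))
print(exact_product(13, 11, n=4, m=4, signed=False))
```

The `signed=False` path mirrors the remark about unsigned operands: the only difference is that the last row is added instead of subtracted.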

… $Pe_{[n+m-1:0]}$ as given in (3). In this case, to treat unsigned inputs correctly, A and B must be zero extended to (n + 1) and (m + 1) bits, respectively.
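A quick numeric check illustrates why the extra bit is needed when unsigned inputs are fed to signed multiplier logic. The helper `to_signed` and the 8-bit example value are assumptions made only for this illustration.

```python
def to_signed(pattern, width):
    """Interpret the low `width` bits of `pattern` as a 2's complement number."""
    pattern &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return pattern - (1 << width) if pattern & sign_bit else pattern


n = 8
a_unsigned = 200  # 0b11001000, a valid 8-bit unsigned operand

# Read directly as an n-bit signed word, the MSB is taken as a sign bit.
misread = to_signed(a_unsigned, n)

# Zero extended to (n + 1) bits, the MSB is 0 and the value is preserved.
extended = to_signed(a_unsigned, n + 1)

print(misread, extended)  # -56 200
```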

While $A_{M}$ and $B_{M}$ still represent 2's complement numbers, the sub-words $A_{L}$ and $B_{L}$ are unsigned numbers. This makes the management of the sign information needed to compute the sub-products $P_{ML}$, $P_{LM}$, and $P_{LL}$ much simpler than what is required for calculating $P_{MM}$. The overall computation is even easier when unsigned operands are processed. Furthermore, independent of the adopted algorithm, the modular approach can be applied recursively to compute the sub-products, as shown, for example, in [27].
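The modular decomposition can be checked numerically with a short sketch. It assumes that A is split after its ka least significant bits and B after its kb least significant bits, so that A = A_M·2^ka + A_L and B = B_M·2^kb + B_L; the function names are hypothetical and the exact splitting rule used by the authors is not reproduced here.

```python
def split(x, k):
    """Split x into (x_M, x_L) such that x == x_M * 2**k + x_L.

    x_M keeps the sign (2's complement), x_L is the unsigned k-bit sub-word.
    """
    return x >> k, x & ((1 << k) - 1)


def modular_product(a, b, ka, kb):
    """Rebuild the exact product from the four sub-products."""
    a_m, a_l = split(a, ka)
    b_m, b_l = split(b, kb)
    p_mm = a_m * b_m  # signed x signed
    p_ml = a_m * b_l  # signed x unsigned
    p_lm = a_l * b_m  # unsigned x signed
    p_ll = a_l * b_l  # unsigned x unsigned
    return (p_mm << (ka + kb)) + (p_ml << ka) + (p_lm << kb) + p_ll


print(split(-100, 4))                  # (-7, 12), since -7 * 16 + 12 == -100
print(modular_product(-100, 77, 2, 6))  # equals -100 * 77
```

Only `p_mm` involves two signed factors, which is why its sign handling is the most demanding of the four.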

## 3. The Novel Approximation Strategy

#### 3.1. The New 3-Bit Encoding Logic for Least Significant Sub-Words

… $P_{in}$ and $P_{out}$. As shown in the following, the coded digits CDx are then aligned and OR-ed to finally furnish ${A}_{La}$ and ${B}_{La}$.

#### 3.2. The NR4EL Multiplication

## 4. Accuracy and Implementation Results

The notation New_{ka_kb} indicates a multiplier designed as described here that approximates the ka LSBs of A and the kb LSBs of B. This section presents results obtained for both symmetric and asymmetric designs. The performance achieved by our proposal is discussed and compared with that of competitors. All quality measures, namely average error (AE), error rate (ER), normalized mean error distance (NMED), and mean relative error distance (MRED), defined as reported in [30], and number of effective bits (NoEB), introduced in [8], have been obtained through exhaustive C++ simulations. It is worth noting that accuracy tests for multipliers with operand word lengths greater than 16 bits are excessively time consuming. Therefore, as in all the previous works [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19], only the hardware characteristics are provided for such cases.
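An exhaustive accuracy evaluation of this kind can be sketched as follows. The sketch is in Python rather than C++ and rests on several assumptions: `truncating_multiplier` is a placeholder that simply zeroes operand LSBs (it is not the proposed design), AE is taken as the mean signed error, MRED skips the pairs whose exact product is zero, and NMED is normalized to the largest exact product magnitude. NoEB is left out because its definition from [8] is not given in this text.

```python
def truncating_multiplier(a, b, ka, kb):
    """Placeholder approximate multiplier: zero the ka LSBs of a and kb LSBs of b."""
    return ((a >> ka) << ka) * ((b >> kb) << kb)


def exhaustive_metrics(approx, n, m):
    """Sweep all signed n-bit x m-bit operand pairs and collect error metrics."""
    total = wrong = nonzero = 0
    sum_err = sum_ed = max_abs = 0
    sum_red = 0.0
    for a in range(-(1 << (n - 1)), 1 << (n - 1)):
        for b in range(-(1 << (m - 1)), 1 << (m - 1)):
            exact = a * b
            err = approx(a, b) - exact
            ed = abs(err)              # error distance
            total += 1
            wrong += ed != 0
            sum_err += err
            sum_ed += ed
            max_abs = max(max_abs, abs(exact))
            if exact != 0:             # relative error is undefined for exact == 0
                sum_red += ed / abs(exact)
                nonzero += 1
    med = sum_ed / total
    return {
        "AE": sum_err / total,
        "ER": wrong / total,
        "MED": med,
        "NMED": med / max_abs,
        "MRED": sum_red / nonzero,
    }


metrics = exhaustive_metrics(lambda a, b: truncating_multiplier(a, b, 2, 2), 8, 8)
for name, value in metrics.items():
    print(f"{name}: {value:.6g}")
```

The sweep visits 2^(n+m) operand pairs, which is consistent with the remark that exhaustive tests become impractical for word lengths above 16 bits.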

#### 4.1. Design Space Exploration

#### 4.2. ASIC Implementations

The New_{2_6} signed design achieves an energy saving higher than 80%, with a negligible impact on speed. The 2StepTrunc signed architecture [8] shows an energy saving of ~76% with respect to its baseline and, even though it reaches an interesting delay reduction, its quality level is considerably lower than that of the New_{2_6}. On the other hand, while the C-Full circuit [9] dissipates the same energy as the proposed one, it shows a much lower gain with respect to its baseline and achieves a NoEB lower than that of the New_{2_6}. Furthermore, the architectures in [9] operate only on unsigned operands. The above analysis confirms the effectiveness of the proposed approach in reducing the number of non-zero bits within the tree of partial products in favor of energy efficiency. Indeed, the strategies proposed in [8,9], being based on LSB truncation and approximate compressors, respectively, only partially simplify the adder circuits responsible for the accumulation of the partial products.
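The savings quoted above follow directly from the energy figures reported in this article. As a small worked check, the snippet below recomputes them from the E (pJ) column of the ASIC comparison table that lists New_{2_6}; the helper name `energy_saving` is hypothetical, and each design is compared against the baseline reported alongside it.

```python
def energy_saving(e_baseline_pj, e_approx_pj):
    """Energy saving, in percent, of an approximate design over its own baseline."""
    return 100.0 * (1.0 - e_approx_pj / e_baseline_pj)


# (baseline energy, approximate design energy) in pJ, copied from the table.
designs = {
    "New_{2_6} (40 nm)": (0.24, 0.04),
    "New_{2_6} (28 nm)": (0.16, 0.031),
    "2StepTrunc [8] (40 nm)": (1.29, 0.3),
    "C-Full [9] (28 nm)": (0.046, 0.031),
}

for name, (baseline, approx) in designs.items():
    print(f"{name}: {energy_saving(baseline, approx):.1f}% saving")
```

The computed savings are about 83% and 81% for New_{2_6}, about 77% for 2StepTrunc, and about 33% for C-Full, in line with the discussion: C-Full matches the 28 nm New_{2_6} in absolute energy (0.031 pJ) but gains much less over its own baseline.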

The New_{8_8} 16 × 16 signed multiplier saves ~75% of the energy, whereas [8] saves at most ~63%. Surprisingly, [9] shows a ~8% improvement in this figure. However, the quality level of the 16 × 16 New_{8_8} multiplier still surpasses that of the competitors. On the other hand, [8,9] achieve area and delay reductions remarkably higher than those of the new designs.

… $FM_{ASIC}$ (NFM) and $CF_{ASIC}$ (NCF) and shows that the $FM_{ASIC}$ achieved by the New_{2_6} circuit is 12% and 34% higher than that of 1StepTrunc [8] and CSSM [7], respectively. Indeed, at a comparable NoEB, the signed 8 × 8 architectures demonstrated in [7] reach a power saving ~20% lower. The graceful behavior of the proposed multiplier is confirmed by the $CF_{ASIC}$, which is up to 13 times lower than that of the competitors.

#### 4.3. FPGA Implementations

New_{4_4} achieves the lowest MRED. Results in Table 3 show that the New_{4_4} and New_{2_6} architectures achieve the best energy-quality-delay trade-off, significantly outperforming their counterparts.

The New_{11_11} architecture is ~12% faster than [14] and reaches a more than acceptable energy-quality behavior. As a final remark, it is worth noting that none of the competitors evaluated in Tables 1, 2, 3 and 4 performs well on both ASIC and FPGA platforms.

## 5. Case Study: Image Processing Applications

… the New_{2_6} multiplier and receive the kernel values as external inputs. Therefore, they can support different edge detectors and filters. However, for comparison with previous works, the Sobel operator and the 2D Gaussian smoothing filters have been taken as references. The energy consumption of the complete systems was analyzed with 100,000 random vectors at the maximum toggle rates, whereas the accuracy was examined using images from the USC-SIPI dataset [31] as test benches. The accuracy results discussed in the following are calculated by averaging those obtained for all the 256 × 256 and 512 × 512 images available in [31]. Sample images reported in Figure 7 show that the new approximate multipliers work well in both image processing applications.
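The evaluation flow described above (a filter whose kernel is supplied externally, a pluggable multiplier, and a PSNR comparison against the exact result) can be sketched in plain Python. Everything here is an assumption made for illustration: `truncated_mul` is a placeholder that drops pixel LSBs and is not the New_{2_6} design, the filter slides the kernel without flipping it, responses are reduced to clamped magnitudes, and the tiny synthetic image stands in for the USC-SIPI test images.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def exact_mul(pixel, coeff):
    return pixel * coeff


def truncated_mul(pixel, coeff, k=2):
    """Placeholder approximate multiplier: drops the k LSBs of the pixel."""
    return ((pixel >> k) << k) * coeff


def filter2d(image, kernel, mul):
    """Slide `kernel` over `image` (valid region only) using multiplier `mul`."""
    size = len(kernel)
    out = []
    for r in range(len(image) - size + 1):
        row = []
        for c in range(len(image[0]) - size + 1):
            acc = 0
            for i in range(size):
                for j in range(size):
                    acc += mul(image[r + i][c + j], kernel[i][j])
            row.append(min(255, abs(acc)))  # clamp the magnitude to 8 bits
        out.append(row)
    return out


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    count = sum(len(row) for row in reference)
    mse = sum((x - y) ** 2
              for ref_row, test_row in zip(reference, test)
              for x, y in zip(ref_row, test_row)) / count
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


# Small synthetic 8-bit image used only to exercise the functions.
image = [[(r * 37 + c * 91) % 256 for c in range(8)] for r in range(8)]
exact_edges = filter2d(image, SOBEL_X, exact_mul)
approx_edges = filter2d(image, SOBEL_X, truncated_mul)
print(psnr(exact_edges, exact_edges), psnr(exact_edges, approx_edges))
```

Passing the kernel as an argument mirrors the idea that the same datapath can serve different edge detectors and smoothing filters, and the infinite PSNR of the exact path corresponds to the ∞ entries reported for the accurate IP core.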

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. Alioto, M. Ultra-Low Power VLSI Circuit Design Demystified and Explained: A Tutorial. IEEE Trans. Circuits Syst. I Regul. Pap. **2012**, 59, 3–29.
2. Jiang, H.; Santiago, F.J.H.; Mo, H.; Liu, L.; Han, J. Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications. Proc. IEEE **2020**, 108, 2108–2135.
3. Chang, C.-H.; Satzoda, R.K. A low error and high performance multiplexer-based truncated multiplier. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2010**, 18, 1767–1771.
4. Frustaci, F.; Perri, S.; Corsonello, P.; Alioto, M. Approximate Multipliers with Dynamic Truncation for Energy Reduction via Graceful Quality Degradation. IEEE Trans. Circuits Syst. II Express Briefs **2020**, 67, 3427–3431.
5. Hashemi, S.; Bahar, R.I.; Reda, S. DRUM: A Dynamic Range Unbiased Multiplier for approximate applications. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, USA, 2–6 November 2015.
6. Narayanamoorthy, S.; Moghaddam, H.A.; Liu, Z.; Park, T.; Kim, N.S. Energy-Efficient Approximate Multiplication for Digital Signal Processing and Classification Applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2015**, 23, 1180–1184.
7. Strollo, A.G.M.; Napoli, E.; De Caro, D.; Petra, N.; Saggese, G.; Di Meo, G. Approximate Multipliers Using Static Segmentation: Error Analysis and Improvements. IEEE Trans. Circuits Syst. I Regul. Pap. **2022**, 69, 2449–2462.
8. Esposito, D.; Strollo, A.G.M.; Napoli, E.; De Caro, D. Approximate Multipliers Based on New Approximate Compressors. IEEE Trans. Circuits Syst. I Regul. Pap. **2018**, 65, 4169–4182.
9. Strollo, A.G.M.; Napoli, E.; De Caro, D.; Petra, N.; Di Meo, G. Comparison and Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers. IEEE Trans. Circuits Syst. I Regul. Pap. **2020**, 67, 3021–3034.
10. Venkatachalam, S.; Adams, E.; Lee, H.J.; Ko, S.-B. Design and analysis of area and power efficient approximate booth multipliers. IEEE Trans. Comput. **2019**, 68, 1697–1703.
11. Waris, H.; Wang, C.; Liu, W. Hybrid low radix encoding-based approximate booth multipliers. IEEE Trans. Circuits Syst. II Express Briefs **2020**, 67, 3367–3371.
12. Kulkarni, P.; Gupta, P.; Ercegovac, M. Trading accuracy for power with an underdesigned multiplier architecture. In Proceedings of the 24th International Conference on VLSI Design, Chennai, India, 2–7 January 2011.
13. Qiqieh, I.; Shafik, R.; Tarawneh, G.; Sokolov, D.; Yakovlev, A. Energy-efficient approximate multiplier design using bit significance-driven logic compression. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017.
14. Waris, H.; Wang, C.; Liu, W.; Lombardi, F. AxBMs: Approximate Radix-8 Booth Multipliers for High-Performance FPGA-Based Accelerators. IEEE Trans. Circuits Syst. II Express Briefs **2021**, 68, 1566–1570.
15. Ullah, S.; Schmidl, H.; Sahoo, S.S.; Rehman, S.; Kumar, A. Area-Optimized Accurate and Approximate Softcore Signed Multiplier Architecture. IEEE Trans. Comput. **2021**, 70, 384–392.
16. Ullah, S.; Rehman, S.; Shafique, M.; Kumar, A. High-Performance Accurate and Approximate Multipliers for FPGA-based Hardware Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2022**, 41, 211–224.
17. Rehman, S.; El-Harouni, W.; Shafique, M.; Kumar, A.; Henkel, J. Architectural-space exploration of approximate multipliers. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, USA, 7–10 November 2016.
18. Ullah, S.; Rehman, S.; Prabakaran, B.S.; Kriebel, F.; Hanif, M.A.; Shafique, M.; Kumar, A. Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators. In Proceedings of the 55th Annual Design Automation Conference, San Francisco, CA, USA, 24–28 June 2018.
19. Mrazek, V.; Hrbacek, R.; Vasicek, Z.; Sekanina, L. EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017.
20. Imed, B.D. Implementation of a Fuel Estimation Algorithm Using Approximated Computing. J. Low Power Electron. Appl. **2022**, 12, 17.
21. Preatto, S.; Giannini, A.; Valente, L.; Masera, G.; Martina, M. Optimized VLSI Architecture of HEVC Fractional Pixel Interpolators with Approximate Computing. J. Low Power Electron. Appl. **2020**, 10, 24.
22. Coelho, D.F.G.; Cintra, R.J.; Bayer, F.M.; Kulasekera, S.; Madanayake, A.; Martinez, P.; Silveira, T.L.T.; Oliveira, R.S.; Dimitrov, V.S. Low-Complexity Loeffler DCT Approximations for Image and Video Coding. J. Low Power Electron. Appl. **2018**, 8, 46.
23. Balasubramanian, P.; Maskell, D.L. Hardware Optimized and Error Reduced Approximate Adder. Electronics **2019**, 8, 1212.
24. Tastan, I.; Karaca, M.; Yurdakul, A. Approximate CPU Design for IoT End-Devices with Learning Capabilities. Electronics **2020**, 9, 125.
25. Perri, S.; Spagnolo, F.; Frustaci, F.; Corsonello, P. Efficient Approximate Adders for FPGA-Based Data-Paths. Electronics **2020**, 9, 1529.
26. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. **2004**, 13, 600–612.
27. Perri, S.; Corsonello, P.; Cocorullo, G. Efficient recursive multiply architecture for FPGAs. Electron. Lett. **2005**, 41, 1314–1316.
28. 7 Series FPGAs Configurable Logic Block User Guide, UG474 (v1.8), September 27, 2016. Available online: https://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf (accessed on 22 July 2022).
29. Intel® Stratix® 10 Logic Array Blocks and Adaptive Logic Modules User Guide, UG-S10LA, April 4, 2020. Available online: https://www.intel.com/content/dam/www/programmable/us/en/pdf/literature/hb/stratix-10/ug-s10-lab.pdf (accessed on 22 July 2022).
30. Liang, J.; Han, J.; Lombardi, F. New Metrics for the Reliability of Approximate and Probabilistic Adders. IEEE Trans. Comput. **2013**, 62, 1760–1771.
31. SIPI Image Database. 2019. Available online: http://sipi.usc.edu/database/database.php?volume=misc (accessed on 22 July 2022).

**Figure 4.** An example of multiplication through the proposed approximate approach: (**a**) approximate A and B; (**b**) compute ${P}_{MM}$, ${P}_{MLa}$, ${P}_{LMa}$, ${P}_{LLa}$; (**c**) compute the approximate product Pa.

| Circuit | Process | D (ps) | A (µm^{2}) | E (pJ) | AE | ER (%) | NoEB |
|---|---|---|---|---|---|---|---|
| Baseline [8] | 40 nm | 564 | 986 | 1.29 | PRECISE | | |
| 1StepFull [8] | 40 nm | 500 | 524 | 0.81 | 42.3 | 30 | 9.47 |
| 1StepTrunc [8] | 40 nm | 500 | 310 | 0.47 | 2.3 × 10^{2} | 96 | 7.89 |
| 2StepFull [8] | 40 nm | 419 | 428 | 0.72 | 8.7 × 10^{2} | 84 | 5.59 |
| 2StepTrunc [8] | 40 nm | 375 | 171 | 0.3 | 1.0 × 10^{3} | 99 | 5.46 |
| Our Baseline | 40 nm | 506 | 780.4 | 0.24 | PRECISE | | |
| New_{2_6} | 40 nm | 519 | 529.6 | 0.04 | 0.324 | 91.4 | 6.79 |
| Baseline [9] | 28 nm | 260 | 196 | 0.046 | PRECISE | | |
| C-N [9] | 28 nm | 248.6 | 175 | 0.041 | n.a. | 9 | 10.8 |
| C-Full [9] | 28 nm | 216 | 155 | 0.031 | n.a. | 40 | 5.44 |
| Our Baseline | 28 nm | 280 | 370 | 0.16 | PRECISE | | |
| New_{2_6} | 28 nm | 284 | 360 | 0.031 | 0.324 | 91.4 | 6.79 |

| Circuit | Process | D (ps) | A (µm^{2}) | E (pJ) | AE | ER (%) | NoEB |
|---|---|---|---|---|---|---|---|
| Baseline [8] | 40 nm | 800 | 2595 | 3.58 | PRECISE | | |
| 1StepFull [8] | 40 nm | 746 | 1859 | 2.94 | 3.57 × 10^{4} | 61 | 16.04 |
| 1StepTrunc [8] | 40 nm | 730 | 1002 | 1.56 | 1.45 × 10^{5} | 100 | 14.66 |
| 2StepFull [8] | 40 nm | 667 | 1147 | 2.01 | 3.77 × 10^{6} | 97 | 9.36 |
| 2StepTrunc [8] | 40 nm | 650 | 700 | 1.29 | 3.86 × 10^{6} | 100 | 9.35 |
| Our Baseline | 40 nm | 720 | 2362 | 7.2 | PRECISE | | |
| New_{8_8} | 40 nm | 737 | 1814 | 1.62 | 8867.18 | 99.87 | 10.12 |
| Baseline [9] | 28 nm | 375 | 920 | 3.52 | PRECISE | | |
| C-N [9] | 28 nm | 363 | 821 | 2.94 | n.a. | 47 | 17.53 |
| C-Full [9] | 28 nm | 318 | 727 | 2.11 | n.a. | 88 | 5.44 |
| Our Baseline | 28 nm | 445 | 1016 | 4.68 | PRECISE | | |
| New_{8_8} | 28 nm | 446 | 849 | 1.2 | 8867.18 | 99.87 | 10.12 |

| Configuration | #LUTs | D (ns) | E (pJ) | AE | ER (%) | MRED |
|---|---|---|---|---|---|---|
| BA [15] | 37 | 3.41 | 4.22 | 85.01 | 90.56 | 0.091 |
| Trunc ^{1} [15] | 43 | 2.15 | 3.06 | 149.78 | 93 | 0.121 |
| S2 [15] | 86 | 4.89 | 7.42 | 118.875 | 34.19 | 0.0223 |
| CA [16] | 57 | 3.13 | 4.73 | 54.19 | 8.36 | 0.0029 |
| CC [16] | 56 | 1.98 | 3.55 | 1592.26 | 80.46 | 0.13 |
| S1 [17] | 92 | 4.99 | 7.1 | 1842.44 | 86.46 | 0.362 |
| S3 [18] | 81 | 5.19 | 7.41 | 101.94 | 8.42 | 0.0121 |
| S5 [19] | 110 | 4.43 | 9.75 | 127.11 | 84.43 | 0.049 |
| New_{4_4} | 82 | 2.4 | 7.2 | 0.0664 | 89.69 | 1.5 × 10^{−4} |
| New_{2_6} | 68 | 2.13 | 3.8 | 0.324 | 91.35 | 2.1 × 10^{−3} |

^{1} The two LSBs of each input operand are truncated [15].

| Configuration | #LUTs | D (ns) | E (pJ) | MRED | MED | NMED |
|---|---|---|---|---|---|---|
| AxBM1 [14] | 194 | 3.68 | 18.03 | 5.0 × 10^{−4} | 9233.62 | 8.6 × 10^{−6} |
| AxBM2 [14] | 161 | 3.45 | 14.21 | 3.0 × 10^{−4} | 7623.1 | 7.1 × 10^{−6} |
| New_{11_11} | 183 | 3.03 | 15.15 | 4.0 × 10^{−3} | 13,194.9 | 1.2 × 10^{−5} |

| Multiplier | Configuration | AE | MRED | NMED | NoEB |
|---|---|---|---|---|---|
| 8 × 8 | New_{4_4} | 0.0664 | 1.50 × 10^{−4} | 0.009 | 8.343 |
| | New_{5_3} | 0.476 | 5.70 × 10^{−4} | 0.0127 | 7.78 |
| | New_{2_6} | 0.324 | 2.10 × 10^{−3} | 0.024 | 6.79 |
| 12 × 12 | New_{8_8} | 82.556 | 2.50 × 10^{−3} | 0.00943 | 8.306 |
| 16 × 16 | New_{8_8} | 8867.18 | 8.38 × 10^{−3} | 1.53 × 10^{−5} | 10.118 |
| | New_{11_11} | 82,031.17 | 3.97 × 10^{−3} | 1.22 × 10^{−5} | 7.1 |

| n × m | Configuration | #LUTs | D (ns) | E (pJ) |
|---|---|---|---|---|
| 8 × 8 | New_{4_4} | 82 | 2.4 | 7.2 |
| | New_{5_3} | 78 | 2.21 | 4.42 |
| 12 × 12 | New_{8_8} | 177 | 2.8 | 11.2 |
| | BA [15] | 79 | 5.3 | 10.17 |
| | Trunc [15] | 102 | 3.52 | 8.97 |
| | S1 [17] | 228 | 6.98 | 20.8 |
| | S2 [15] | 189 | 6.37 | 20.77 |
| | S3 [18] | 185 | 7.11 | 20.39 |
| | Accurate IP core | 162 | 4.2 | 19.79 |
| 16 × 16 | New_{8_8} | 270 | 3.4 | 20.4 |
| | New_{11_11} | 183 | 3.03 | 15.15 |
| | BA [15] | 144 | 7.64 | 21.15 |
| | Trunc [15] | 214 | 4.1 | 14.76 |
| | S1 [17] | 228 | 6.98 | 20.8 |
| | S2 [15] | 330 | 6.59 | 20.39 |
| | S3 [18] | 296 | 7.33 | 18.58 |
| | CA [16] | 245 | 4.98 | 26.5 |
| | CC [16] | 240 | 2.38 | 16.16 |
| | Accurate IP core | 286 | 4.27 | 34.35 |
| 24 × 24 | New_{16_16} | 565 | 3.6 | 28.8 |
| | BA [15] | 301 | 10.99 | 48.26 |
| | Trunc [15] | 514 | 6.07 | 53.97 |
| | S1 [17] | 895 | 9.43 | 101.63 |
| | S2 [15] | 777 | 9.45 | 97.48 |
| | S3 [18] | 697 | 9.69 | 92.35 |
| | Accurate IP core | 627 | 5.98 | 77.25 |
| 32 × 32 | New_{16_16} | 937 | 4.95 | 64.35 |
| | CA [16] | 1013 | 6.98 | 58.84 |
| | CC [16] | 992 | 3.02 | 33.04 |
| | Accurate IP core | 1037 | 7.23 | 151.83 |

| Multiplier Used | Device | Filter Size | Hardware Characteristics | | | | Quality Metrics | |
|---|---|---|---|---|---|---|---|---|
| | | | #LUT/LE | #FFs | D (ns) | E (pJ) | PSNR | SSIM |
| 8 × 8 New_{2_6} | VIRTEX 7 XC7VX485 | 3 × 3 | 664 | 164 | 4.9 | 98 | 52.9 | 1 |
| | | 5 × 5 | 1935 | 420 | 5.87 | 299.3 | 54.68 | 1 |
| | | 7 × 7 | 3781 | 804 | 6.8 | 632.4 | 60.75 | 1 |
| | CYCLONE10LP 006YE144A7G | 3 × 3 | 1118 | 164 | 14.4 | 89.57 | 52.9 | 1 |
| 8 × 8 BA [15] | VIRTEX 7 XC7VX485 | 3 × 3 | 398 | 163 | 6.3 | 100.8 | 50.5 | 0.98 |
| | | 5 × 5 | 1221 | 419 | 6.8 | 326.5 | 51.85 | 0.99 |
| | | 7 × 7 | 2411 | 803 | 7.5 | 682.5 | 52.36 | 0.99 |
| 8 × 8 Accurate IP | VIRTEX 7 XC7VX485 | 3 × 3 | 722 | 164 | 5.5 | 143 | ∞ | 1 |
| | | 5 × 5 | 2025 | 420 | 6.9 | 414 | ∞ | 1 |
| | | 7 × 7 | 3976 | 804 | 8.6 | 842.8 | ∞ | 1 |
| | CYCLONE10LP 006YE144A7G | 3 × 3 | 1010 | 164 | 14 | 204.7 | ∞ | 1 |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Perri, S.; Spagnolo, F.; Frustaci, F.; Corsonello, P.
Designing Energy-Efficient Approximate Multipliers. *J. Low Power Electron. Appl.* **2022**, *12*, 49.
https://doi.org/10.3390/jlpea12040049
