Accelerating Pattern Recognition with a High-Precision Hardware Divider Using Binary Logarithms and Regional Error Corrections
Abstract
1. Introduction
2. Related Work
2.1. Digit Recurrence Division Methods
2.1.1. Restoring Division
Algorithm 1 Restoring Division
Input: Dividend and divisor
Output: Quotient Q and remainder P
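The restoring scheme named in Algorithm 1 can be illustrated in software. The following is a minimal sketch for unsigned operands, assuming the textbook shift, trial-subtract, and restore formulation; the function name `restoring_divide` and the wordlength parameter `n` are illustrative and not taken from the source.

```python
def restoring_divide(dividend: int, divisor: int, n: int) -> tuple[int, int]:
    """Divide two unsigned n-bit integers; return (quotient Q, remainder P)."""
    if divisor <= 0:
        raise ValueError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << n):
        raise ValueError("dividend does not fit in n bits")
    p = 0  # partial remainder P
    q = 0  # quotient Q, built one bit per iteration
    for i in range(n - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        p = (p << 1) | ((dividend >> i) & 1)
        trial = p - divisor  # trial subtraction
        if trial >= 0:
            p = trial          # keep the difference, quotient bit is 1
            q = (q << 1) | 1
        else:
            q <<= 1            # "restore": discard the difference, quotient bit is 0
    return q, p


print(restoring_divide(200, 7, 8))  # (28, 4)
```

One quotient bit is produced per iteration, so the loop runs once per dividend bit.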
2.1.2. Non-Restoring Division
Algorithm 2 Non-Restoring Division
Input: Dividend and divisor
Output: Quotient Q and remainder P
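For comparison, the non-restoring scheme named in Algorithm 2 can be sketched as follows. This assumes the textbook formulation, in which a negative partial remainder is not restored but compensated by adding the divisor in the next step, with a single correction at the end; the function and parameter names are illustrative and not taken from the source.

```python
def non_restoring_divide(dividend: int, divisor: int, n: int) -> tuple[int, int]:
    """Divide two unsigned n-bit integers; return (quotient Q, remainder P)."""
    if divisor <= 0:
        raise ValueError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << n):
        raise ValueError("dividend does not fit in n bits")
    p = 0  # partial remainder P, allowed to become negative
    q = 0  # quotient Q
    for i in range(n - 1, -1, -1):
        bit = (dividend >> i) & 1
        if p >= 0:
            p = 2 * p + bit - divisor  # subtract when the remainder is non-negative
        else:
            p = 2 * p + bit + divisor  # add back instead of restoring
        q = (q << 1) | (1 if p >= 0 else 0)
    if p < 0:
        p += divisor  # single final correction of the remainder
    return q, p


print(non_restoring_divide(200, 7, 8))  # (28, 4)
```

Each iteration performs exactly one addition or subtraction, which is the usual motivation for this variant over the restoring one.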
2.1.3. SRT Division
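- Selecting the next quotient digit from the redundant digit set.
- Computing the product of the selected quotient digit and the divisor.
- Updating the remainder: the shifted partial remainder minus this product.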
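The three steps above can be sketched for the radix-2 case. This is a minimal illustration, assuming a divisor normalized to [1/2, 1), a dividend smaller than the divisor, the digit set {-1, 0, 1}, and comparison constants of plus and minus 1/2; the function name `srt_radix2` and all parameter names are hypothetical, and the source may use a different radix or selection table.

```python
from fractions import Fraction


def srt_radix2(x: Fraction, d: Fraction, n: int):
    """Run n radix-2 SRT steps; return (digits, quotient, final remainder)."""
    half = Fraction(1, 2)
    if not (half <= d < 1 and 0 <= x < d):
        raise ValueError("expected 1/2 <= d < 1 and 0 <= x < d")
    p = Fraction(x)   # partial remainder
    q = Fraction(0)   # accumulated quotient
    digits = []
    for j in range(1, n + 1):
        shifted = 2 * p
        # Step 1: select the quotient digit from a coarse comparison.
        if shifted >= half:
            digit = 1
        elif shifted < -half:
            digit = -1
        else:
            digit = 0
        # Steps 2 and 3: form digit * d and update the remainder.
        p = shifted - digit * d
        q += Fraction(digit, 2 ** j)
        digits.append(digit)
    return digits, q, p


digits, q, p = srt_radix2(Fraction(3, 8), Fraction(5, 8), 10)
print(digits, float(q))
```

Under these assumptions the result satisfies x = d * q + p / 2**n exactly, so the quotient error is at most 2**-n.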
2.2. Functional Iteration Division Methods
2.2.1. Newton–Raphson Division
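- Computing an initial estimate of the reciprocal of the divisor.
- Refining the estimate iteratively: each iteration multiplies the current estimate by two minus the product of the divisor and the estimate.
- Computing the quotient: the product of the dividend and the refined reciprocal.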
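The steps above can be sketched in floating point. This is a minimal illustration, assuming the divisor is first scaled into [0.5, 1) and that the initial estimate comes from the commonly used linear approximation 48/17 - (32/17) * m; the function name, the iteration count, and that choice of initial estimate are assumptions, not details from the source.

```python
import math


def newton_raphson_divide(n: float, d: float, iterations: int = 4) -> float:
    """Approximate n / d by refining an estimate of 1 / d."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if d < 0 else 1.0
    m, e = math.frexp(abs(d))  # abs(d) = m * 2**e with 0.5 <= m < 1
    # Step 1: initial estimate of 1 / m (assumed linear approximation).
    x = 48.0 / 17.0 - 32.0 / 17.0 * m
    # Step 2: refine; the relative error is roughly squared each iteration.
    for _ in range(iterations):
        x = x * (2.0 - m * x)
    # Undo the scaling to obtain an estimate of 1 / d.
    reciprocal = sign * math.ldexp(x, -e)
    # Step 3: the quotient is the dividend times the reciprocal.
    return n * reciprocal


print(newton_raphson_divide(355.0, 113.0))
```

Because the error is roughly squared per iteration, a handful of iterations suffices here; a hardware design would pick the count from the target wordlength.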
2.2.2. Goldschmidt Division
2.3. Logarithm-Based Division Methods
- Determining the index k of the most significant nonzero bit of N.
- Computing log2(N) = k + log2(1 + x), where x represents the fractional component of N, that is, N = 2^k (1 + x).
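The two steps above can be sketched as follows, together with the linear approximation log2(1 + x) ≈ x attributed to Mitchell in the next subsection. The function names are illustrative and not taken from the source.

```python
import math


def leading_one_and_fraction(n: int) -> tuple[int, float]:
    """Return (k, x) such that n = 2**k * (1 + x) with 0 <= x < 1."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1   # index of the most significant nonzero bit
    x = n / (1 << k) - 1.0   # fractional component: the bits below the leading one
    return k, x


def mitchell_log2(n: int) -> float:
    """Approximate log2(n) as k + x, i.e. replace log2(1 + x) by x."""
    k, x = leading_one_and_fraction(n)
    return k + x


print(leading_one_and_fraction(12))  # (3, 0.5)
worst = max(math.log2(n) - mitchell_log2(n) for n in range(1, 1 << 12))
print(round(worst, 4))  # 0.0861
```

The approximation never overestimates, and its worst-case error of about 0.0861 agrees with the maximum reported for Mitchell in the error table.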
2.3.1. Mitchell’s Algorithm
2.3.2. Discontinuous Piecewise Linear Approximation
2.3.3. Non-Uniform Multi-Region Constant Adder Correction
3. Proposed Method
Algorithm 3 Regional Error Correction For Binary Logarithm Calculation
Input: Integer N
Parameter(s): Number of regions M
Output: Logarithm value
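The steps of Algorithm 3 are not reproduced here, so the following is only a hedged sketch of what a regional correction of the binary logarithm could look like. It assumes M equal-width regions over the fractional component x and one constant per region, chosen as the midrange of the error log2(1 + x) - x in that region; the region layout, the choice of constant, and all names are assumptions rather than the authors' design.

```python
import math


def build_log_corrections(m: int, samples: int = 64) -> list[float]:
    """One constant per region: midrange of log2(1 + x) - x over that region."""
    table = []
    for r in range(m):
        errors = []
        for s in range(samples + 1):
            x = (r + s / samples) / m  # sample point inside region r
            errors.append(math.log2(1.0 + x) - x)
        table.append((min(errors) + max(errors)) / 2.0)
    return table


def log2_regional(n: int, table: list[float]) -> float:
    """Approximate log2(n) as k + x + correction[region of x]."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1
    x = n / (1 << k) - 1.0
    region = int(x * len(table))  # x < 1, so the index stays below M
    return k + x + table[region]


table = build_log_corrections(8)
worst = max(abs(log2_regional(n, table) - math.log2(n)) for n in range(1, 1 << 12))
print(round(worst, 4))  # 0.0225
```

With this choice of constants the residual error becomes roughly symmetric around zero and shrinks as M grows, since each region then covers a smaller swing of the error curve.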
Algorithm 4 Regional Error Correction For Antilogarithm Calculation
Input: Logarithm value
Parameter(s): Number of regions M
Output: Antilogarithm value N
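The steps of Algorithm 4 are likewise not reproduced here. As a hedged sketch of the inverse direction, the code below splits a logarithm value into integer part k and fraction f, approximates 2^f by 1 + f, and adds one constant per region of f. The equal-width regions, the midrange choice of constant, and all names are assumptions rather than the authors' design.

```python
import math


def build_antilog_corrections(m: int, samples: int = 64) -> list[float]:
    """One constant per region: midrange of 2**f - (1 + f) over that region."""
    table = []
    for r in range(m):
        errors = []
        for s in range(samples + 1):
            f = (r + s / samples) / m  # sample point inside region r
            errors.append(2.0 ** f - (1.0 + f))
        table.append((min(errors) + max(errors)) / 2.0)
    return table


def antilog2_regional(y: float, table: list[float]) -> float:
    """Approximate 2**y as 2**k * (1 + f + correction[region of f])."""
    k = math.floor(y)             # integer part, also valid for negative y
    f = y - k                     # fractional part, 0 <= f < 1
    region = int(f * len(table))  # f < 1, so the index stays below M
    return (1.0 + f + table[region]) * 2.0 ** k


table = build_antilog_corrections(8)
print(antilog2_regional(3.5, table), 2.0 ** 3.5)
```

A logarithm-based quotient is then the antilogarithm of the difference between the logarithms of the dividend and the divisor, so the errors of both conversions contribute to the final result.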
4. Error Analysis
5. Hardware Implementation
5.1. Hardware Architecture
5.2. FPGA Implementation and Performance Evaluation
- Number of registers;
- Number of LUTs;
- Maximum operating frequency;
- Power consumption;
- Latency.
- Restoring and Radix-2 dividers exhibit similar resource utilization, power consumption, and latency. However, Radix-2 dividers require more registers, which yields a higher maximum operating frequency for large wordlengths (24/48 and 32/64), albeit with higher power consumption.
- High-Radix dividers significantly reduce the latency for large wordlengths (35 clock cycles at 32/64, compared with 65 and 66 for the restoring and Radix-2 dividers). They rely on block RAMs and DSP slices, which results in lower register and LUT usage than restoring and Radix-2 dividers at these wordlengths.
- LutMultA dividers are more suitable for small wordlengths. For input wordlengths below 12 bits, they have a latency of only 8 clock cycles. Although their maximum operating frequency is lower than that of restoring and Radix-2 dividers, it remains sufficient for real-time processing. However, because Xilinx’s divider generator does not support LutMultA division for wordlengths of 12 bits or more [11], implementation results for these cases are marked as NA (not available).
5.3. Practical Application in Image Processing
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zhang, G.; Chen, Y.; Zheng, Y.; Martin, G.; Wang, R. Local-enhanced representation for text-based person search. Pattern Recognit. 2025, 161, 111247.
- Wang, Y.; Wei, W. Local and global feature attention fusion network for face recognition. Pattern Recognit. 2025, 161, 111227.
- Zhang, Z.; Yang, L.; Wang, K.; Xi, X.; Nie, X.; Yang, G.; Yin, Y. Consistency and label constrained transfer low-rank representation for cross-light finger vein recognition. Pattern Recognit. 2025, 161, 111208.
- Fog, A. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. Available online: https://www.agner.org/optimize/instruction_tables.pdf (accessed on 26 December 2024).
- NVIDIA. CUDA Binary Utilities. Available online: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#maxwell-pascal (accessed on 26 December 2024).
- Rodeheffer, T. Software Integer Division. Available online: https://www.microsoft.com/en-us/research/wp-content/uploads/2008/08/tr-2008-141.pdf (accessed on 16 December 2024).
- Mitchell, J.N. Computer Multiplication and Division Using Binary Logarithms. IRE Trans. Electron. Comput. 1962, EC-11, 512–517.
- Shaw, R. Arithmetic Operations in a Binary Computer. Rev. Sci. Instrum. 1950, 21, 690.
- McCann, M.; Pippenger, N. SRT Division Algorithms as Dynamical Systems. SIAM J. Comput. 2005, 34, 1279–1301.
- Lee, S.; Ngo, D.; Kang, B. Design of an FPGA-Based High-Quality Real-Time Autonomous Dehazing System. Remote Sens. 2022, 14, 1852.
- Xilinx. Divider Generator v5.1 Product Guide (PG151). Available online: https://docs.amd.com/v/u/en-US/pg151-div-gen (accessed on 26 August 2024).
- Oberman, S.; Flynn, M. Measuring the Complexity of SRT Tables. Available online: http://i.stanford.edu/pub/cstr/reports/csl/tr/95/679/CSL-TR-95-679.pdf (accessed on 19 February 2025).
- Rodriguez-Garcia, A.; Pizano-Escalante, L.; Parra-Michel, R.; Longoria-Gandara, O.; Cortez, J. Fast fixed-point divider based on Newton-Raphson method and piecewise polynomial approximation. In Proceedings of the 2013 International Conference on Reconfigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 9–11 December 2013; pp. 1–6.
- Goldschmidt, R. Applications of Division by Convergence. Master’s Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1964.
- Soderquist, P.; Leeser, M. Division and square root: Choosing the right implementation. IEEE Micro 1997, 17, 56–66.
- Chaudhary, M.; Lee, P. An Improved Two-Step Binary Logarithmic Converter for FPGAs. IEEE Trans. Circuits Syst. II Express Briefs 2015, 62, 476–480.
- Ngo, D.; Kang, B. Taylor-Series-Based Reconfigurability of Gamma Correction in Hardware Designs. Electronics 2021, 10, 1959.
- Arnold, M.G.; Collange, C. A Real/Complex Logarithmic Number System ALU. IEEE Trans. Comput. 2011, 60, 202–213.
- Ha, M.; Lee, S. Accurate Hardware-Efficient Logarithm Circuit. IEEE Trans. Circuits Syst. II Express Briefs 2017, 64, 967–971.
- Kuo, C. Design and realization of high performance logarithmic converters using non-uniform multi-regions constant adder correction schemes. Microsyst. Technol. 2018, 24, 4237–4245.
- 1364-2005; IEEE Standard for Verilog Hardware Description Language. IEEE (Institute of Electrical and Electronics Engineers): Piscataway, NJ, USA, 2006; pp. 1–590.
- Xilinx. Vivado Design Suite User Guide: Designing with IP (UG896). Available online: https://docs.amd.com/viewer/book-attachment/21Juiels_eENy0SgK2kr7g/3ocj~oULvr~9S5RyFlBM3g-21Juiels_eENy0SgK2kr7g (accessed on 22 February 2025).
- Ngo, D.; Lee, S.; Nguyen, Q.H.; Ngo, M.; Lee, G.D.; Kang, B. Single Image Haze Removal from Image Enhancement Perspective for Real-Time Vision-Based Systems. Sensors 2020, 20, 5170.
- Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.

Method | Minimum | Maximum | Mean | Std.
---|---|---|---|---
Mitchell | 0.0000 | 0.0861 | 0.0573 | 0.0257
Kuo | 0.0000 | 0.0249 | 0.0148 | 0.0065
Ha and Lee | −0.0212 | 0.0070 | −0.0024 | 0.0055
Proposed | −0.0225 | 0.0222 | 0.0009 | 0.0072
Proposed | −0.0125 | 0.0121 | 0.0002 | 0.0036
Proposed | −0.0066 | 0.0062 | 0.0001 | 0.0018
Proposed | −0.0002 | 0.0001 | 0.0000 | 0.0001

Column headings 8/16 to 16/32 denote the input/output wordlength (bits).

Metric | M | 8/16 | 9/18 | 10/20 | 11/22 | 12/24 | 13/26 | 14/28 | 15/30 | 16/32
---|---|---|---|---|---|---|---|---|---|---
E (%) | 8 | 3.493 | 3.657 | 3.657 | 3.698 | 3.719 | 3.729 | 3.724 | 3.722 | 3.721
E (%) | 16 | 1.774 | 1.951 | 1.989 | 1.994 | 2.035 | 2.034 | 2.040 | 2.042 | 2.045
E (%) | 32 | 0.971 | 0.859 | 0.952 | 0.997 | 1.010 | 1.021 | 1.021 | 1.024 | 1.023
E (%) | 1024 | 0.103 | 0.103 | 0.110 | 0.112 | 0.111 | 0.112 | 0.112 | 0.108 | 0.111
E (%) | 2048 | 0.120 | 0.100 | 0.120 | 0.112 | 0.102 | 0.102 | 0.098 | 0.103 | 0.103
E (%) | 4096 | 0.098 | 0.100 | 0.094 | 0.112 | 0.103 | 0.098 | 0.103 | 0.112 | 0.112

Column headings 8/16 to 16/32 denote the input/output wordlength (bits).

Metric | W | 8/16 | 9/18 | 10/20 | 11/22 | 12/24 | 13/26 | 14/28 | 15/30 | 16/32
---|---|---|---|---|---|---|---|---|---|---
E (%) | 8 | 0.452 | 0.395 | 0.452 | 0.452 | 0.452 | 0.452 | 0.403 | 0.398 | 0.417
E (%) | 10 | 0.103 | 0.103 | 0.110 | 0.112 | 0.111 | 0.112 | 0.112 | 0.108 | 0.111
E (%) | 12 | 0.044 | 0.044 | 0.045 | 0.037 | 0.041 | 0.040 | 0.042 | 0.039 | 0.042
E (%) | 14 | 0.034 | 0.032 | 0.037 | 0.034 | 0.034 | 0.034 | 0.031 | 0.028 | 0.031
E (%) | 16 | 0.031 | 0.034 | 0.033 | 0.032 | 0.034 | 0.034 | 0.028 | 0.028 | 0.030
E (%) | 18 | 0.032 | 0.033 | 0.034 | 0.032 | 0.034 | 0.032 | 0.028 | 0.027 | 0.031

Column headings 8/16 to 32/64 denote the input/output wordlength (bits).

Method | Metric * | 8/16 | 9/18 | 10/20 | 11/22 | 12/24 | 13/26 | 14/28 | 15/30 | 16/32 | 24/48 | 32/64
---|---|---|---|---|---|---|---|---|---|---|---|---
Restoring | Registers | 350 | 433 | 521 | 616 | 722 | 834 | 951 | 1081 | 1216 | 2573 | 4447
Restoring | LUTs | 320 | 403 | 485 | 582 | 675 | 761 | 865 | 1037 | 1148 | 2505 | 4404
Restoring | Maximum operating frequency | 775.194 | 775.194 | 775.194 | 775.194 | 775.194 | 775.194 | 775.194 | 775.194 | 775.194 | 672.043 | 627.746
Restoring | Power consumption | 0.750 | 0.764 | 0.964 | 0.889 | 0.921 | 1.075 | 1.117 | 1.160 | 1.198 | 1.452 | 1.726
Restoring | Latency | 17 | 19 | 21 | 23 | 25 | 27 | 29 | 31 | 33 | 49 | 65
Radix-2 | Registers | 438 | 557 | 681 | 882 | 968 | 1139 | 1315 | 1504 | 1706 | 3806 | 6738
Radix-2 | LUTs | 175 | 215 | 260 | 330 | 359 | 415 | 475 | 540 | 608 | 1297 | 2241
Radix-2 | Maximum operating frequency | 775.194 | 775.194 | 771.605 | 771.605 | 775.194 | 775.194 | 771.605 | 775.194 | 769.823 | 775.194 | 771.605
Radix-2 | Power consumption | 0.721 | 0.739 | 0.760 | 0.932 | 0.964 | 1.012 | 1.048 | 1.082 | 1.129 | 1.492 | 1.919
Radix-2 | Latency | 18 | 20 | 22 | 24 | 26 | 28 | 30 | 32 | 34 | 50 | 66
High-Radix | Registers | 558 | 656 | 689 | 724 | 724 | 795 | 873 | 908 | 1017 | 888 | 1136
High-Radix | LUTs | 386 | 397 | 431 | 442 | 459 | 473 | 533 | 543 | 727 | 554 | 710
High-Radix | BRAMs | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
High-Radix | DSP48s | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 7 | 9 | 11
High-Radix | Maximum operating frequency | 673.854 | 632.111 | 627.746 | 628.931 | 628.931 | 626.566 | 591.716 | 588.582 | 592.066 | 591.716 | 592.417
High-Radix | Power consumption | 0.763 | 0.799 | 0.801 | 0.807 | 0.810 | 1.041 | 1.025 | 1.027 | 1.054 | 1.192 | 1.383
High-Radix | Latency | 20 | 21 | 21 | 21 | 21 | 21 | 25 | 25 | 25 | 31 | 35
LutMultA | Registers | 170 | 202 | 218 | 225 | NA | NA | NA | NA | NA | NA | NA
LutMultA | LUTs | 300 | 308 | 467 | 437 | NA | NA | NA | NA | NA | NA | NA
LutMultA | BRAMs | 0.5 | 0.5 | 1 | 2 | NA | NA | NA | NA | NA | NA | NA
LutMultA | DSP48s | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA
LutMultA | Maximum operating frequency | 532.198 | 528.541 | 504.796 | 437.828 | NA | NA | NA | NA | NA | NA | NA
LutMultA | Power consumption | 0.694 | 0.903 | 0.920 | 0.880 | NA | NA | NA | NA | NA | NA | NA
LutMultA | Latency | 8 | 8 | 8 | 8 | NA | NA | NA | NA | NA | NA | NA
Proposed | Registers | 140 | 152 | 158 | 164 | 174 | 184 | 194 | 204 | 214 | 300 | 380
Proposed | LUTs | 305 | 391 | 429 | 461 | 469 | 679 | 677 | 717 | 581 | 846 | 1166
Proposed | Maximum operating frequency | 724.638 | 676.590 | 685.401 | 672.043 | 645.161 | 529.381 | 534.474 | 537.634 | 573.723 | 531.915 | 480.077
Proposed | Power consumption | 0.689 | 0.692 | 0.699 | 0.713 | 0.784 | 0.787 | 0.794 | 0.810 | 0.825 | 0.902 | 0.967
Proposed | Latency | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6

Metric * | Available | Standard (Used) | Proposed (Used)
---|---|---|---
Registers | 460,800 | 34,566 | 31,413
LUTs | 230,400 | 28,718 | 26,785
BRAMs | 312 | 66 | 66
Maximum operating frequency | - | 373.276 | 371.747
Power consumption | - | 2.757 | 2.576

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ngo, D.; Ahn, S.; Son, J.; Kang, B. Accelerating Pattern Recognition with a High-Precision Hardware Divider Using Binary Logarithms and Regional Error Corrections. Electronics 2025, 14, 1066. https://doi.org/10.3390/electronics14061066