Article

Power Function Algorithms Implemented in Microcontrollers and FPGAs

1
Faculty of Mechanical and Industrial Engineering, Warsaw University of Technology, 00-661 Warszawa, Poland
2
Faculty of Electrical and Computer Engineering, Cracow University of Technology, Warszawska 24 Str., 31-155 Cracow, Poland
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(16), 3399; https://doi.org/10.3390/electronics12163399
Submission received: 25 June 2023 / Revised: 5 August 2023 / Accepted: 7 August 2023 / Published: 10 August 2023
(This article belongs to the Section Circuit and Signal Processing)

Abstract:
The exponential function a^x is widespread in many fields of science. Its calculation is a complicated issue for Central Processing Units (CPUs) and Graphics Processing Units (GPUs), as well as for specialised Digital Signal Processing (DSP) processors, such as Intelligent Processor Units (IPUs), for the needs of neural networks. This article presents simple and accurate exponential function calculation algorithms in half, single, and double precision that can be prototyped in Field-Programmable Gate Arrays (FPGAs). Notably, in most cases the approximation uses efficient polynomials of the first degree. The characteristic feature of such algorithms is that they contain only fast ‘bithack’ operations (bit manipulation techniques) and Floating-Point (FP) addition, multiplication, and, if necessary, Fused Multiply-Add (FMA) operations. We recently published an article on algorithms for this class of functions, but it focused on approximations by polynomials of the second degree and higher, requiring two or more multiplications and additions, which complicates FPGA implementation. This article considers algorithms based on piecewise linear approximation, with one multiplication and one addition. Such low-complexity algorithms provide decent accuracy and speed, sufficient for practical applications such as accelerators for neural networks, power electronics, machine learning, computer vision, and intelligent robotic systems. These are FP-oriented algorithms; therefore, we briefly describe the characteristic parameters of such numbers.

1. Introduction

The available literature in this field addresses the need to develop newer and newer methods of calculating exponential and power functions [1,2,3,4,5,6,7,8,9,10,11]. Numerical methods are needed for scientific calculations with increased accuracy and for the implementation of neural networks in hardware. The implementation of advanced applications in FPGAs and Application Specific Integrated Circuits (ASIC) requires high algorithm speeds and a reduction of hardware resources and power consumption [2,12]. The starting points of the improvements in the algorithms described in the literature are:
-
Piecewise Linear Approximation Computation (PLAC) [2,4,9,12,13,14,15,16]
-
Look-Up Table methods (LUT) [1,2,5,6,12,16]
-
Taylor’s methods [1,6]
-
a partitioning method that splits the 32-bit format into four eight-bit numbers, which are used in LUT multiplications [1]
-
CORDIC methods [4,17].
Refs. [4,13] concern the method of PLAC. The PLAC algorithm runs in two stages. The first stage is optimisation by a segmenter, to find the minimum number of segments within a software-predefined Maximum Absolute Error (MAE). The second stage is quantisation. The hardware architecture has also been improved, simplifying the indexing logic and leading to a reduction in hardware redundancy. Ref. [4] takes into account the number of segments needed in a piecewise linear approximation, depending on the interval and the desired approximation error.
The initial solution presented in [13] had several flaws: the endpoints of recorded segments were reused; the search order was wrong; the indexing logic in the index generator was redundant; only the MAEsoft was controlled in the segmenter; and the quantisation of the circuit was not solved before hardware implementation.
Ref. [1] concerns the approximation of the exponential function by Taylor series expansion. The ‘divide and conquer’ technique was applied to the mantissa area. The accuracy, number of operations, and LUT size were taken into consideration. Here, the mantissa range was divided into several regions, and cases from one to eight ranges were analysed. For a larger number of ranges, the number of terms in the Taylor expansion was clearly reduced while maintaining the assumed accuracy. The number of terms in the Taylor expansion is closely related to the number of mathematical operations. For example, an eight-term Taylor expansion requires ten multiplications and additions/subtractions. The expansion coefficients are stored in the LUT in advance. This requires a large amount of LUT memory. For example, storing the coefficients for four-range splitting required a four-word LUT.
Refs. [16,18] presented an FPGA implementation of single-precision FP arithmetic based on an indirect method that transforms x^y into a chain of operations involving a logarithm, a multiplication, an exponential function, and dedicated logic for the case of a negative base. Speed is increased by systematically introducing pipeline stages into the data paths of the exponential and logarithmic units and their subunits. In [16], an innovative method of piecewise linear (PWL) approximation was proposed for the approximate representation of the non-linear logarithmic and antilogarithmic functions (log₂x, 2^x).
A slightly different solution was presented in [6], concerning the implementation of the double-precision exponential function in an FPGA. The most significant 27 bits of the fractional part of the input x are processed using LUT-based methods. The remaining least significant bits are computed using the Taylor–Maclaurin expansion, because the LUT approach is inefficient for them. The relative error is approximately 2^{−55}.
Refs. [2,12] analysed the problem of cost-effective inference for non-linear operations in Deep Neural Networks (DNN). The focus was on the exponential function ‘exp’ in the Softmax layer of the DNN for object detection. The goal here was to minimise hardware resources and reduce power consumption without losing application accuracy.
Similarly to [16], a Piecewise Linear Function (PLF) was applied to approximate exp. In this case, the number of elements required to maintain detection accuracy was reduced, and the PLF was then constrained to a smaller domain in order to minimise the bit width of the segments in the LUT, resulting in lower energy and resource costs. The calculation costs of each element were stored in the LUT. Then, based on the values in the LUTs of these segments, a decision was made as to which segment was selected for the calculation of the approximate value of the affine function. DNNs were trained with Softmax, and exp was replaced by the PLF in the Softmax layer during the inference phase. The way in which the size of the LUT depends on the number of segments, affecting the accuracy of object detection with the DNN, was also analysed.
Ref. [5] focused on the issue of computing execution time for exponential functions in neural networks. This function is used to compute most of the activation functions and probability distributions used in neural network models. There is a need to develop algorithms faster than those from mathematical libraries. A method of approximating the exponential function by manipulating the components of the standard (IEEE-754) FP representation is presented. The exponential function is modelled as a LUT with linear interpolation in many software packages. The basis of the innovation here is that the components of the FP representation, according to the IEEE-754 standard, can be manipulated by accessing the same memory location [5].
On average, the integer EXP macro was 6 nanoseconds faster than the integer-to-float conversion that takes place in the control program. An additional trick is applied, namely, if the input argument is an integer, the EXP macro does not perform any FP operations. The implementation of this macro depends on the byte order of the machine. The use of a global static variable is problematic in multi-threaded environments because each thread must have a private copy of the eco data structure. There is no overflow or error handling. The user must ensure that the argument is in the valid range (approximately −700 to 700). This only approximates the exponential function. Some numerical methods may amplify the approximation error; each algorithm that uses EXP should therefore be tested against the original version first.
Ref. [17] presented a fixed-point architecture. Very High-Speed Integrated Circuit Hardware Description Language (VHDL) source code is provided for power function computations. A fully customised architecture was based on the extensive CORDIC hyperbolic algorithm. Each stage utilised two-barrel shifters, a LUT, and multiplexers. The master stage required five adders, and the slave stage required three adders. The state machine controlled the iteration counter for the add/sub inputs of the adders. The customised hyperbolic CORDIC architecture allowed the user to modify the design parameters: the number of bits (B), number of fractional bits (FW), number of positive iterations (N), and number of negative iterations (M + 1).
The use of fixed-point arithmetic optimised resource usage, although the entire input domain of the algorithm could not be used because the numerical range is limited. For each function, 13 × 9 = 117 different hardware profiles were generated.
CORDIC and the digital recursive method do not meet the hard real-time constraints, taking into consideration the fact that both operations are clock-based and require many clock cycles to compute the results. Up to 824 iterations may be required for the calculation. The output register of each stage requires two additional cycles.
Traditional activation functions of convolutional neural networks face problems such as gradient decay, neuron death, and output offset. To overcome these problems, a new activation function, the Fast Exponentially Linear Unit (FELU), was proposed in [9] to accelerate exponential linear computations and reduce network running time. FELU combines the advantages of the Rectified Linear Unit (RELU) and the Exponential Linear Unit (ELU). The contribution of FELU is the use of a fast exponential function that approximates the natural gradient in the negative part, which speeds up the calculation of the exponent and thereby reduces network running time. Simple bitwise shifts and integer algebraic operations were applied to achieve a fast exponential approximation based on the IEEE-754 FP representation.
The remainder of this paper is organised as follows: Section 2 describes the Schraudolph algorithm and provides its numerical evaluation; the formulas for approximating the exponential function are also presented. Section 3 is devoted to testing the execution time of FP exponential functions on STM32 microcontrollers. Section 4 presents the implementation of the exponential functions on FPGAs. Finally, conclusions are drawn in Section 5.

2. Floating-Point Numbers

Assume the FP argument is presented as:
x = (−1)^{S_x} · M_x · 2^{w_x − bias} = (−1)^{S_x} · (1 + m_x) · 2^{w_x − bias},(1)
where:
  • S_x — sign, S_x ∈ {0, 1}: 0 for positive numbers, 1 for negative numbers;
  • w_x = E_x + bias — shifted exponent;
  • E_x — exponent, calculated according to the equation:
E_x = ⌊log₂|x|⌋,(2)
  • M_x — the mantissa, calculated as:
M_x = |x| · 2^{−E_x},(3)
which lies within the range M_x ∈ [1, 2). It is given in the form M_x = 1 + m_x, where m_x ∈ [0, 1) is the fractional part of the mantissa.
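As an illustration of this decomposition, the fields S_x, w_x, and m_x of a single-precision number can be read out directly with the kind of bit operations used throughout this paper (a sketch; the helper and struct names are ours):

```c
#include <stdint.h>
#include <string.h>

/* Extract the IEEE-754 single-precision fields of Equation (1):
   sign S_x, shifted exponent w_x = E_x + bias (bias = 127), and the
   fractional part of the mantissa m_x in [0, 1). */
typedef struct { int sign; int w; double m; } fp_fields;

static fp_fields fp_decompose(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);                   /* bit-level view of x */
    fp_fields f;
    f.sign = (int)(bits >> 31);                       /* S_x */
    f.w    = (int)((bits >> 23) & 0xFFu);             /* w_x */
    f.m    = (double)(bits & 0x7FFFFFu) / 8388608.0;  /* m_x = mantissa / 2^23 */
    return f;
}
```

For example, 6.5 = 1.625 · 2^2, so S_x = 0, w_x = 2 + 127 = 129, and m_x = 0.625.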

2.1. Scheme of the Algorithm for Approximating the Function a^x

1.
We use a familiar approach: calculating a^x via the function 2^z. In this case we write [5,15]:
a^x = 2^z = 2^{z_i + z_f} = 2^{z_i} · 2^{z_f},(4)
z = x · log₂a = x · ln a / ln 2 = z_i + z_f.
2.
Now we need to calculate two functions: 2^{z_i} and 2^{z_f}. For FP numbers, we calculate the value of the variable z according to the formula:
int z = (x · log₂a + bias) · N_m,(5)
where bias = 15, 127, and 1023 (for the half, single, and double precision formats, respectively);
  • N_m = 2^10 — for half precision,
  • N_m = 2^23 — for single precision, and
  • N_m = 2^52 — for double precision.
Here a type conversion takes place: all calculations in (5) are carried out in FP (half, single, double), and the result is stored as an integer. The float–integer type conversion is performed using the bit manipulation technique.
If we need to calculate the function a^{−x}, then Equation (5) takes the form:
int z = (bias − x · log₂a) · N_m.(6)
3.
We divide the variable int z into z_i and z_f using the exponent and mantissa masks.
4.
We divide the mantissa z_f ∈ [0, 1) into 2^j equal parts (2, 4, 8, 16, 32, … segments). We approximate the function 2^{z_f} on each part by the piecewise linear approximation 2^{z_f} ≈ β_i + α_i · z_f, i = 1, 2, …, 2^j.
5.
Using the bitwise OR operator, we combine the exponent and the linearised mantissa back together.
6.
We convert this number from an integer to a float. It should be noted that all the algorithms described in this paper work over the full range of normalised numbers. For single precision, these are numbers from ±1.18 · 10^{−38} to ±3.40 · 10^{38}.

2.2. The General Approach of N. Schraudolph, Applied to Calculating the Function a^x

It is known that N. Schraudolph proposed a very simple approach to reducing the calculation error of the e^x function with the approximation 2^{z_f} ≅ 1 + z_f [5,15]. This technique can also be applied to functions of the form a^x. In this case, the constant c, which serves to reduce the relative error, should be subtracted from the number int z obtained from Equation (5). It is determined by the formula c = ⌊(γ / ln 2) · N_m⌋ / N_m, where the coefficient γ is calculated according to the equation γ = ln(ln 2 + 2/e) − ln 2 − ln(ln 2). In this case, (5) takes the form:
int z = (x · log₂a + bias − c) · N_m = (x · log₂a) · N_m + (bias − c) · N_m.(7)
Next, the inverse conversion z (integer) → y (float) is performed. As an example, the code for calculating the function 10^x in single precision is presented below, where N_m = 2^23, c = 0.04367749, log₂10 · N_m ≈ 27866353 = 0x01a934f1, and (bias − c) · N_m ≈ 1064986823 = 0x3f7a68c7:
float a10_x(float x)
{
    int z = x * 0x01a934f1 + 0x3f7a68c7;
    float y = *(float*)&z;
    return y;
}
Note that the accuracy of the algorithm only constitutes δ_max = 2.98 · 10^{−2}, or −log₂δ_max = 5.06 correct bits. Here, δ_max is the maximum value of the relative error:
δ = |y / 10^x − 1|.(8)

2.3. New Piecewise Linear Approximation Algorithms

To increase accuracy, we first use the method described in the scheme above, but with one feature (as described below).
We use the following equations to calculate the alpha[] and beta[] coefficients for the function 2^{z_f} on the interval [a, b):
α_i = (2^b − 2^a) / (b − a); β_i = (b · 2^a − a · 2^b) / (b − a).(9)
As an example, we present the code of the algorithm for calculating the function e^x with the mantissa divided into two parts:

const float alpha[2] = {0.8284271247f, 1.171572875f};
const int beta[2] = {0, -1439258};

float exp_12f(float x)
{
    float y;
    int j;
    int z = x*0xb8aa3b + 0x3f800000;
    int zii = z&0x7f800000;
    int zif = z&0x007fffff;
    j = zif >> 22;
    zif = (int)(zif*alpha[j] + beta[j]);
    zii |= zif;
    y = *(float*)&zii;
    y *= 0.9925613f; //Figure 1
    return y;
}
A very important detail of all the codes should be noted: the tables of alpha[] and beta[] coefficients are identical for any function a^x; only the operators implementing Equation (5) (marked in green) change. Note that, instead of pointers, the code can use a union type. A characteristic feature of this type of algorithm is the additional FP multiplication at the output, which gives a uniformly distributed relative error (see Figure 1). Black lines represent the exp_12f code without the row y *= 0.9925613f, and red lines represent the whole exp_12f function.
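The union alternative mentioned above can look like this (a minimal sketch; the helper names are ours):

```c
/* Reinterpret bits between int and float via a union instead of a
   pointer cast; on common compilers (e.g., GCC/Clang) this is the
   usual idiom for type punning in C. */
union fi { float f; int i; };

static float int_bits_as_float(int z)
{
    union fi u;
    u.i = z;        /* store the integer bit pattern */
    return u.f;     /* read the same bits back as a float */
}

static int float_bits_as_int(float x)
{
    union fi u;
    u.f = x;
    return u.i;
}
```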
The accuracy of the algorithm is δ_max = 7.44 · 10^{−3}, or 7.07 correct bits. The FP multiplication at the output is a drawback that complicates hardware implementation. Therefore, we continue with a simplified approach that avoids this multiplication.
To reduce errors, we use the following technique. On the first subinterval, we choose the linear approximation coefficients so that at the point 0 the linear approximation and the function 2^{z_f} have the same value (i.e., equal to 1, so that β_1 = 1), the maximum of the relative error occurs at the point b, and within the subinterval the error extrema are located symmetrically about zero. On all other subintervals i = 2, 3, …, 2^j, the coefficients minimise the maximum relative error
ρ(z_f) = (α_i · z_f + β_i) · 2^{−z_f} − 1,
which leads to the equal-ripple conditions ρ(a) = ρ(b) = −ρ(z*), where z* = 1/ln 2 − β_i/α_i is the interior error extremum. Closed-form expressions for the coefficients α_i and β_i on [a, b) follow from these conditions by means of the Lambert W function.
An example of the code of such an algorithm, calculating the function e^x with the mantissa divided into two parts, is given below.
#include <stdint.h>

const int alpha[2] = {6714081, 9754211};
const int beta[2] = {0, -1491339};

float exp_12int(float x)
{
    float y;
    int j;
    int z = x*0xb8aa3b + 0x3f800000;
    int zii = z&0x7f800000;
    int zif = z&0x007fffff;
    j = zif >> 22;
    zif = (int)((zif*(int64_t)alpha[j]) >> 23) + beta[j];
    zii |= zif;
    y = *(float*)&zii; //Figure 2
    return y;
}
Here we use a linear approximation with minimal relative error in each section except the first. In the first section, we use a linear approximation with a slightly larger relative error in order to stay within the range of the z_i field. It should be noted that, in this code, apart from the row int z = x*0xb8aa3b + 0x3f800000 (a float multiplication), only integer operations are used. The accuracy of the algorithm is δ_max = 9.92 · 10^{−3}, or 6.65 correct bits.
The relative error is shown in Figure 2. For other numbers of mantissa parts, the test results of the algorithms are collected in Table 1. The code structure of such algorithms is invariant; only the sizes of the coefficient arrays alpha[j] and beta[j] and the number of right-shift bits in j = zif >> (23 − j) change, where the number of mantissa parts is 2^j.
We have also developed algorithms for computing the function with different numbers of mantissa division intervals (see Table 2). As can be seen from a comparison of Table 1 and Table 2, the maximum errors are the same, which confirms the validity of the previously formulated theory.

3. Execution Time Testing of Floating-Point Functions expf(x) and powf(2,x) on Microcontrollers of the STM32 Family

The goal of the next experiment was to compare the execution times of the newly developed exponential and power-of-two functions.
Our case study in the field of microcontrollers focused on the STM32F746, STM32H747, and STM32H750 models using the Discovery evaluation platforms. For our research, we utilized the Keil uVision software environment (version 5.3). The microcontroller was connected via the basic USB/ST-Link debugging interface.
The results are depicted in Table 2, Table 3 and Table 4 for three different microcontrollers. We also measured the execution times of the FP powf(2,x) and expf(x) functions from the math.h library. An interesting observation is that these times, measured by the same method on the STM32H747 at a main clock speed of 400 MHz, are 50.0 ns and 25.4 ns, respectively. Thus, in every case, the execution time of our new expf(x) and powf(2,x) functions is much shorter than that of the functions from the ‘math.h’ library. The fastest power function was the FP pow2_16p_int (8.33 ns). The fastest exponential function, the sixteen-part version without multiplication P1_16p_wM_int, required 9.85 ns. The designations of our functions used in Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9 are as follows: EXP_P1_16p_wM_int — exponential function, 16 parts, without Multiplication, integer approximation; EXP_P1_4pM — exponential function, 4 parts, with Multiplication.
Table 3 presents the execution times of the same functions on the microcontroller STM32H750 with the main clock of 480 MHz.
The fastest power function was the float pow2_16p_int (time 8.02 ns). The fastest exponential functions performed on this microcontroller turned out to be EXP_P1_16p_wM_int (time 8.43 ns).
Table 4 collects the execution time results of the same functions on the microcontroller STM32F746 with the main clock at 216 MHz. The fastest exponential functions were the float EXP_P1_16p_wM_int and float exp_18p_wM_int (37.55 ns). The fastest power functions executed on this microcontroller turned out to be float pow2_P1_32p_wM_int and pow2_18int (38.58 ns).
From the console of the Keil uVision tool, we extracted the memory occupancy data for each microcontroller within the STM32 families, which were utilized in the experiments. Table 5, Table 6 and Table 7 compile the memory usage information for all functions presented in the paper.
In the cases of the STM32H750 and STM32H747 microcontrollers (Table 5 and Table 6), it is clearly visible that the ‘powf’ function from the ‘math.h’ library consumes at least eight times more memory than our functions. Similarly, for the ‘expf’ function, this ratio is approximately 2.6 in favor of our exponential functions.
In the case of the STM32F746 microcontroller (Table 7), it is clearly visible that the ‘powf’ function from the ‘math.h’ library consumes at least 3.6 times more memory than our functions. Conversely, the ‘expf’ function from the ‘math.h’ library occupies 50% more memory than our exponential functions.

4. FPGA Implementation of expf(x) and pow2f(x) Functions

All of the above functions were implemented on a few selected families of FPGAs. For this purpose, the Vitis automatic synthesis tool was applied, generating Verilog code at the RTL level. In this way, the project was implemented using ready-made Xilinx IP blocks, such as the FP operator (FP7.1), together with DSP slices and a certain amount of LUT-based logic. The Xilinx FP7.1 IP implements the basic FP operations inside our functions, such as ‘+’ and ‘−’, while the DSP blocks execute the multiplications. The C functions presented in this paper are easily implementable on FPGAs and occupy a small amount of FPGA resources.
The FPGA 7 Series from Xilinx is optimised for low-power applications requiring serial transceivers and high DSP and logic throughput. These devices provide the lowest total material cost for high-throughput, cost-sensitive applications. They are built on 28 nm High-Performance, Low-power (HPL), High-K Metal Gate (HKMG) process technology. The logic is based on true 6-input LUTs, configurable as distributed memory. DSP slices are constructed with a 25 × 18 multiplier, a 48-bit accumulator, and a pre-adder for high-performance filtering, including optimised symmetric coefficient filtering.
DSP applications use many binary multipliers and accumulators, which are best implemented in dedicated DSP slices. All 7 series FPGAs have many dedicated, full custom, low-power DSP slices, combining high speed with small size, while retaining system design flexibility. Each DSP slice fundamentally consists of a dedicated 25 × 18 bit two’s complement multiplier and a 48-bit accumulator, both capable of operating up to 741 MHz. The multiplier can be dynamically bypassed and two 48-bit inputs can feed a Single-Instruction-Multiple-Data (SIMD) arithmetic unit (dual 24-bit add/subtract/accumulate or quad 12-bit add/subtract/accumulate), or a logic unit that can generate any one of ten different logic functions of the two operands. The DSP includes an additional pre-adder, typically used in symmetrical filters. This pre-adder improves performance in densely packed designs and reduces the DSP slice count by up to 50%. The DSP also includes a 48-bit-wide Pattern Detector that can be used for convergent or symmetric rounding. The pattern detector is also capable of implementing 96-bit-wide logic functions when used in conjunction with the logic unit. The DSP slice provides extensive pipelining and extension capabilities that enhance the speed and efficiency of many applications beyond DSP, such as wide dynamic bus shifters, memory address generators, wide bus multiplexers, and memory-mapped I/O register files. The accumulator can also be used as a synchronous up/down counter. Powerful Clock Management Tiles (CMT), combining Phase-Locked Loop (PLL) and mixed mode clock manager (MMCM) blocks for high precision and low jitter (Xilinx 7 Series Data Sheet, v2.6.1, Sept’20).
The Artix 7 family includes up to 215 k Logic Cells, 13 Mb of Block RAM, and 740 DSP slices. The more sophisticated families, Kintex 7 and Virtex 7, include the following: Kintex 7 — up to 478 k Logic Cells, 34 Mb of Block RAM, and 1920 DSP slices; Virtex 7 — up to 1955 k Logic Cells, 68 Mb of Block RAM, and 3600 DSP slices.
For the first case study, one of the widespread FPGA Artix 7 chips was chosen. The clock frequency was set to the default value of 100 MHz, because most widespread FPGAs can operate at this clock; this makes it easier to compare the achievements of FPGAs from different families. The execution times and resources used for Artix 7 are presented in Table 8. The fastest exponential function turned out to be exp(x) P1 16 parts, which required 12 clock cycles to complete (120 ns). All of the powf(2,x) functions were executed in the same timespan of 12 clock cycles (120 ns). Passing local or global variables to functions has no effect on the execution speed of those functions, unlike on STM32 microcontrollers. The implementations of the individual functions differ slightly in the amount of FPGA logic resources used.
The implementation of our innovative functions in the smallest chip of the Artix family (xc7a25t-csg325-3) required between 6.25 and 7.50% of the built-in DSP blocks. The first three exponential functions used only 5 DSP blocks, whereas only the P1 16 parts without multiplication function was built with 6 DSPs. All the power functions required 6 DSPs. Exponential functions occupied from 7.7% (P1 16 parts without multiplication) up to 13.3% (for the P1 4 parts multiplication function). All of our functions occupied no more than 2% of the available flip-flops inside the chip. All of the results obtained with the Xilinx Vitis Automated Synthesis Tool are shown in Table 8.
We also investigated the performance of the expf(x) function realised by the Xilinx FP7.1 operator. It requires 7 clock cycles to complete and occupies 7 DSP blocks, 943 LUTs, and 332 flip-flops, corresponding to 8.8% of the available DSPs, 6.5% of the LUTs, and 1.1% of the flip-flops. The powf(2,x) function does not exist in the set of functions realised by the FP7.1 operator. Hence, our functions can complement the implementation of high-precision functions, which easily utilise the resources of FP7.1 and the DSPs.
FPGA Kintex®-7 was then tested. This family is optimised for the best price-performance ratio, with a two-fold improvement compared to previous generations, enabling a new class of FPGAs. All the powf(2,x) functions were executed in the same time of 8 clock cycles (80 ns). The fastest exponential function, as in the case of Artix 7, turned out to be P1 16 parts without multiplication, executed in 8 clock cycles (80 ns).
Implementation of our innovative functions in the smallest chip of the Kintex family (xc7k70t-fbv676-3) required between 2.1 and 2.5% of the built-in DSP blocks. The first three exponential functions collected in Table 9 used 5 DSP blocks, whereas only the P1 16 parts without multiplication function was built with 6 DSPs. All the power functions required 6 DSPs. This is the same as in the Artix 7 case. Exponential functions occupied from 2.8% (P1 16 parts without multiplication) up to 4.7% (for the P1 4 parts multiplication function). Each of our functions occupied less than 7% of the available flip-flops inside the FPGA. All the results obtained with the Xilinx Vitis Automated Synthesis Tool are shown in Table 9 below.
The achievements and required FPGA resources for the function expf(x) implemented by the FP 7.1 operator have also been checked. It requires 5 clock cycles to complete, uses 7 DSP blocks, 912 LUTs and 201 flip-flops. It requires 8.8% of available DSPs, 2.2% LUTs, and 0.25% of flip-flops.
The Virtex®-7 family is optimised for the highest system performance and capacity, offering a two-fold improvement in system performance over the previous generation. The highest-capability devices are enabled by Stacked Silicon Interconnect (SSI) technology. The xqvu37p-fsqh2892-2-e chip chosen by us is characterised by rich resources and top performance in the Virtex family.
The fastest exponential function, as in the cases of Artix 7 and Kintex 7, turned out to be P1 16 parts without multiplication, which was completed in 6 clock cycles (60 ns). All of the powf(2,x) functions were executed in the same time of 6 clock cycles (60 ns). Implementation of our innovative functions in this chip required only between 0.05 and 0.08% of the built-in DSP blocks. The first three exponential functions used 5 DSP blocks, and only the P1 16 parts without multiplication function was built with 7 DSPs. All the power functions required 7 DSPs. Exponential functions occupied from 0.09% (P1 16 parts without multiplication) up to 0.14% (for the P1 2 parts and P1 4 parts multiplication functions). All our functions occupied only about 0.02% of the available flip-flops inside the chip. All the results are given in Table 10. The achievements and required FPGA resources for the function expf(x) implemented by the FP7.1 operator for the Virtex Ultra Plus are as follows: it requires 3 clock cycles to complete, and uses 7 DSP blocks, 760 LUTs, and 144 flip-flops, i.e., 8.8% of the available DSPs and insignificant numbers of LUTs (0.0058%) and flip-flops (0.0052%).
Figure 3 below presents a screenshot from the Xilinx Vitis tool showing the implementation report of the ‘pow2_16p_int’ function on the Virtex Ultra Plus FPGA; the latency and required FPGA resources are visible in the figure.
Figure 4 shows an example decomposition of the pow2_16p_int function into individual instructions and their execution times.
Figures for all functions presented in the tables are available on GitHub within projects under the ‘scheduler’ tab for each individual function.
Finally, we implemented our innovative algorithms in one of Xilinx’s most technologically advanced families—Versal.
Versal® devices are the industry’s first adaptive compute acceleration platform (ACAP), combining adaptable processing and acceleration engines with programmable logic and configurable connectivity to enable customised, heterogeneous hardware solutions for a wide variety of applications in Data Center, Automotive, 5G Wireless, Wired, and Defense. Versal ACAPs feature transformational capabilities such as an integrated silicon host interconnect shell and Intelligent Engines (AI and DSP), Adaptable Engines, and Scalar Engines, providing superior performance per watt over conventional FPGAs, CPUs, and GPUs. Versal ACAPs are built around an integrated shell composed of a programmable Network on Chip (NoC), which enables seamless memory-mapped access to the full height and width of the device. ACAPs comprise a multicore scalar processing system, an integrated block for PCIe® with DMA and cache-coherent interconnect (CPM), SIMD VLIW AI Engine accelerators (for artificial intelligence and complex signal processing), and Adaptable Engines in the Programmable Logic (PL).
The fastest exponential functions, unlike in all previous experiments, turned out to be the functions occupying the first three positions in Table 11. These functions were executed in only 2 clock cycles (20 ns). The execution of the P1 16 parts without multiplication function needs 4 clock cycles (40 ns).
All of the powf(2,x) functions were executed in the same time of 4 clock cycles (40 ns). Implementation of our innovative functions in this chip required only between 1 and 2 built-in DSP blocks; the chip provides up to 1968 such blocks. The first exponential function in Table 11 used only 1 DSP block; all the remaining exponential and powf(2,x) functions, including P1 16 parts without multiplication, required 2 DSP blocks.
Exponential functions occupied from 0.08% of the available LUTs (the P1 16 parts without multiplication function, like all the powf(2,x) functions) up to 0.17% (the P1 4 parts multiplication function); the functions in the first two positions in the table occupied 0.16% of the LUTs. The first three exponential functions occupied only about 0.004% of the available flip-flops inside the chip, while the P1 16 parts without multiplication function and all the powf(2,x) functions occupied 0.007%. All of the results achieved using the Versal chip are presented in Table 11.
For comparison, the performance and FPGA resources required for the expf(x) function, implemented by the FP 7.1 operator inside the Versal chip, were checked in the Vitis implementation reports. The expf(x) function completes in 2 clock cycles and uses 1 DSP block, 930 LUTs and 99 flip-flops. These are insignificant numbers of LUTs (0.10%) and flip-flops (0.0055%).
Our research reveals that the functions we propose are well suited to implementation in the newest FPGA families. They occupy a negligible amount of resources and make effective use of the solutions and resources available in these chips. With reference to the available literature in the field, we implemented our pow2f(x) functions in the same FPGA Zynq7000 chip under the same timing condition (clock frequency of 125 MHz) as in [17]. The execution times of our functions, collected in Table 12, are better than those presented in [17] in every case (see Table 13).
All C codes and FPGA projects are available on: https://github.com/pawelgepner/Power_Function/tree/main (accessed on 6 August 2023).

5. Conclusions

In this article, algorithms for calculating the exponential and power functions on FP numbers (half, single, and double precision) were developed. Our contribution to the field can be formulated as follows:
  • To increase the accuracy, we proposed a modification of the Schraudolph method applied to calculating the function a^x;
  • Our novel algorithms are based on piecewise linear approximation, float-to-integer (and inverse) type conversion, and bit manipulation techniques;
  • To speed up the algorithms and ease their hardware implementation, we constrained their architecture to one multiplication and one addition.
Such low-complexity algorithms provide decent accuracy and speed, sufficient for practical applications such as accelerators for neural networks, power electronics, machine learning, computer vision, and intelligent robotic systems.
Our innovative algorithms execute in a very short time on microcontrollers of the STM32H7 family, which features extensive DSP blocks. Moreover, they are easily executable on older families of microcontrollers (STM32F).
Our research reveals that the functions proposed by us are well suited for implementation in the newest FPGA families too. They occupy a negligible amount of resources and effectively use the solutions and resources available in these chips.

Author Contributions

Conceptualisation, L.M.; Formal analysis, V.S.; Investigation, P.G. and M.W.; Methodology, P.G. and G.N.; Visualisation, L.M.; Software, M.W.; Writing—original draft, L.M. and M.W.; Writing—review and editing, V.S. and G.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Faculty of Electrical and Computer Engineering, Cracow University of Technology and the Ministry of Science and Higher Education, Republic of Poland (grant no. E-1/2023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wei, J.; Kuwana, A.; Kobayashi, H.; Kubo, K. Divide and Conquer: Floating-Point Exponential Calculation Based on Taylor-Series Expansion. In Proceedings of the IEEE 14th International Conference on ASIC (ASICON), Kunming, China, 26–29 October 2021.
  2. Eissa, S.; Stuijk, S.; Corporaal, H. Hardware Approximation of Exponential Decay for Spiking Neural Networks. In Proceedings of the IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), Washington, DC, USA, 6–9 June 2021.
  3. Geng, X.; Lin, J.; Zhao, B.; Wang, Z.; Aly, M.M.S.; Chandrasekhar, V. Hardware-Aware Exponential Approximation for Deep Neural Networks. In Proceedings of the 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018.
  4. Schraudolph, N.N. A Fast, Compact Approximation of the Exponential Function. Neural Comput. 1999, 11, 853–862.
  5. Moroz, L.; Samotyy, V.; Kokosiński, Z.; Gepner, P. Simple Multiple Precision Algorithms for Exponential Functions [Tips & Tricks]. IEEE Signal Process. Mag. 2022, 39, 130–137.
  6. Jamro, E.; Wiatr, K.; Wielgosz, M. FPGA Implementation of 64-Bit Exponential Function for HPC. In Proceedings of the International Conference on Field Programmable Logic and Applications, Amsterdam, The Netherlands, 27–29 August 2007.
  7. Perini, F.; Reitz, R.D. Fast approximations of exponential and logarithm functions combined with efficient storage/retrieval for combustion kinetics calculations. Combust. Flame 2018, 194, 37–51.
  8. Malossi, A.C.I.; Ineichen, Y.; Bekas, C.; Curioni, A. Fast exponential computation on SIMD architectures. In Proceedings of HiPEAC 2015—1st Workshop on Approximate Computing (WAPCO), Amsterdam, The Netherlands, 19–21 January 2015.
  9. Qiumei, Z.; Dan, T.; Fenghua, W. Improved Convolutional Neural Network Based on Fast Exponentially Linear Unit Activation Function. IEEE Access 2019, 7, 151359–151367.
  10. Pineiro, J.-A.; Ercegovac, M.D.; Bruguera, J.D. Algorithm and architecture for logarithm, exponential, and powering computation. IEEE Trans. Comput. 2004, 53, 1085–1096.
  11. De Dinechin, F.; Pasca, B. Floating-point exponential functions for DSP-enabled FPGAs. In Proceedings of the IEEE International Conference on Field-Programmable Technology, Beijing, China, 8–10 December 2010.
  12. Geng, X.; Lin, J.; Zhao, B.; Kong, A.; Aly, M.M.S.; Chandrasekhar, V. Hardware-Aware Softmax Approximation for Deep Neural Networks. In Proceedings of the 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018.
  13. Dong, H.; Wang, M.; Luo, Y.; Zheng, M.; An, M.; Ha, Y.; Pan, H. PLAC: Piecewise Linear Approximation Computation for All Nonlinear Unary Functions. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2014–2027.
  14. Frenzen, C.L.; Sasao, T.; Butler, J.T. On the Number of Segments Needed in a Piecewise Linear Approximation. J. Comput. Appl. Math. 2010, 234, 437–446.
  15. Nandagopal, R.; Rajashree, V.; Madhav, R. Accelerated Piece-Wise-Linear Implementation of Floating-Point Power Function. In Proceedings of the 29th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Glasgow, UK, 24–26 October 2022.
  16. Weling, N. Efficient hardware implementation of power-line transfer functions using FPGA’s for the purpose of channel emulation. In Proceedings of the IEEE International Symposium on Power Line Communications and Its Applications, Udine, Italy, 3–6 April 2011.
  17. Simmonds, N.; Mack, J.; Bellestri, S.; Llamocca, D. CORDIC-based Architecture for Powering Computation in Fixed Point Arithmetic. arXiv 2016, arXiv:1605.03229.
  18. Echeverría, P.; López-Vallejo, M. An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic. Available online: https://oa.upm.es/4339/1/INVE_MEM_2008_59918.pdf (accessed on 6 August 2023).
Figure 1. Relative error for code exp_12f (y = δ·10³).
Figure 2. Relative error for code exp_12int (y = δ·10³).
Figure 3. Latency and FPGA resources of function pow2_16p_int.
Figure 4. Decomposition of pow2_16p_int function into individual instructions.
Table 1. Maximal relative error.

j     Max Relative Error     Correct Bits
2     9.92·10⁻³              6.65
4     2.53·10⁻³              8.62
8     6.43·10⁻⁴              10.65
16    1.65·10⁻⁴              12.56
32    4.02·10⁻⁵              14.60
Table 2. Time of execution of expf(x) and powf(2,x) functions on STM32H747.

Function                 T [ns]   Processor Cycles No.
EXP_P1_2p                11.59    5
EXP_P1_2pM               12.87    5
EXP_P1_4pM               12.87    5
EXP_P1_16p_wM_int        9.85     4
pow2_P1_8p_wM_int        8.60     4
pow2_16p_int             8.33     4
pow2_P1_32p_wM_int       8.64     4
powf(2,x) from math.h    50.00    -
expf(x) from math.h      25.40    -
Table 3. Time of execution of expf(x) and powf(2,x) functions on STM32H750.

Function                 T [ns]   Processor Cycles No.
EXP_P1_2p                12.76    6
EXP_P1_2pM               9.63     5
EXP_P1_4pM               9.63     5
EXP_P1_16p_wM_int        8.43     4
pow2_P1_8p_wM_int        8.43     4
pow2_16p_int             8.02     4
pow2_P1_32p_wM_int       8.64     4
powf(2,x) from math.h    43.80    21
expf(x) from math.h      22.65    11
Table 4. Time of execution of expf(x) and powf(2,x) functions on STM32F746.

Function                 T [ns]   Processor Cycles No.
EXP_P1_2p                39.87    155
EXP_P1_2pM               43.72    170
EXP_P1_4pM               44.24    172
EXP_P1_16p_wM_int        37.55    146
pow2_P1_8p_wM_int        38.58    150
pow2_16p_int             42.18    164
pow2_P1_32p_wM_int       38.58    150
powf(2,x) from math.h    12.99    3
expf(x) from math.h      28.25    6
Table 5. Memory usage in STM32H750 by expf(x) and powf(2,x) functions.

Function                 Function Code Size [No. Words]
EXP_P1_2p                136
EXP_P1_2pM               144
EXP_P1_4pM               160
EXP_P1_16p_wM_int        112
pow2_P1_8p_wM_int        170
pow2_16p_int             234
pow2_P1_32p_wM_int       116
powf(2,x) from math.h    1852
expf(x) from math.h      624
Table 6. Memory usage in STM32H747 by expf(x) and powf(2,x) functions.

Function                 Function Code Size [No. Words]
EXP_P1_2p                140
EXP_P1_2pM               148
EXP_P1_4pM               164
EXP_P1_16p_wM_int        236
pow2_P1_8p_wM_int        170
pow2_16p_int             234
pow2_P1_32p_wM_int       114
powf(2,x) from math.h    1848
expf(x) from math.h      624
Table 7. Memory usage in STM32F746 by expf(x) and powf(2,x) functions.

Function                 Function Code Size [No. Words]
EXP_P1_2p                414
EXP_P1_2pM               424
EXP_P1_4pM               424
EXP_P1_16p_wM_int        390
pow2_P1_8p_wM_int        386
pow2_16p_int             506
pow2_P1_32p_wM_int       388
powf(2,x) from math.h    1848
expf(x) from math.h      624
Table 8. Results of the implementation in Artix 7 FPGA.

Function              T [ns]  No. Cycles  No. DSP  % DSP  No. LUT  % LUT  No. FFs  % FFs
EXP_P1_2p             230     23          5        6.25   1835     12.50  552      1.9
EXP_P1_2pM            270     27          5        6.25   1857     12.70  588      2.0
EXP_P1_4pM            270     27          5        6.25   1945     13.30  590      2.0
EXP_P1_16p_wM_int     120     12          6        7.50   1126     7.70   582      2.0
pow2_P1_8p_wM_int     120     12          6        7.50   1120     7.67   582      2.0
pow2_16p_int          120     12          6        7.50   1126     7.70   582      2.0
pow2_P1_32p_wM_int    120     12          6        7.50   1138     7.80   582      2.0
Table 9. Results of the implementation in Kintex 7 FPGA.

Function              T [ns]  No. Cycles  No. DSP  % DSP  No. LUT  % LUT  No. FFs  % FFs
EXP_P1_2p             180     18          5        2.1    1837     4.5    502      6.1
EXP_P1_2pM            210     21          5        2.1    1858     4.5    537      6.5
EXP_P1_4pM            210     21          5        2.1    1946     4.7    542      6.6
EXP_P1_16p_wM_int     80      8           6        2.5    1135     2.8    506      6.2
pow2_P1_8p_wM_int     80      8           6        2.5    1129     2.8    506      6.2
pow2_16p_int          80      8           6        2.5    1135     2.8    506      6.2
pow2_P1_32p_wM_int    80      8           6        2.5    1147     4.7    506      6.2
Table 10. Results of the implementation in Virtexuplus FPGA.

Function              T [ns]  No. Cycles  No. DSP  % DSP  No. LUT  % LUT  No. FFs  % FFs
EXP_P1_2p             120     12          5        0.05   1753     0.13   457      0.02
EXP_P1_2pM            130     13          5        0.05   1765     0.14   460      0.02
EXP_P1_4pM            130     13          5        0.05   1853     0.14   462      0.02
EXP_P1_16p_wM_int     60      6           7        0.08   1141     0.09   471      0.02
pow2_P1_8p_wM_int     60      6           7        0.08   1141     0.09   471      0.02
pow2_16p_int          60      6           7        0.08   1141     0.09   471      0.02
pow2_P1_32p_wM_int    60      6           7        0.08   1141     0.09   471      0.02
Table 11. Results of the implementation in Versal FPGA.

Function              T [ns]  No. Cycles  No. DSP  % DSP  No. LUT  % LUT  No. FFs  % FFs
EXP_P1_2p             20      2           1        0.05   1472     0.16   67       0.004
EXP_P1_2pM            20      2           2        0.10   1472     0.16   67       0.004
EXP_P1_4pM            20      2           2        0.10   1535     0.17   69       0.004
EXP_P1_16p_wM_int     40      4           2        0.10   736      0.08   123      0.007
pow2_P1_8p_wM_int     40      4           2        0.10   736      0.08   123      0.007
pow2_16p_int          40      4           2        0.10   736      0.08   123      0.007
pow2_P1_32p_wM_int    40      4           2        0.10   736      0.08   123      0.007
Table 12. Achievements of pow2f(x) functions on Zynq7000.

Function              T [ns]  No. Cycles  No. DSP  No. LUT  No. FFs
pow2_P1_8p_wM_int     104     13          6        1512     607
pow2_16p_int          104     13          6        1518     607
pow2_P1_32p_wM_int    104     13          6        1530     607
Table 13. Execution time (ns) for e^x/ln(x) and x^y on Zynq7000.

Function       N (Number of Positive Iterations), M = 5
               8     12    16    20    24    32    36    40
e^x / ln(x)    136   168   208   240   272   336   368   408
x^y            280   344   424   488   552   680   744   824
