Fast Calculation of Cube and Inverse Cube Roots Using a Magic Constant and Its Implementation on Microcontrollers

Abstract: We develop a bit manipulation technique for single precision floating point numbers which leads to new algorithms for fast computation of the cube root and inverse cube root. It uses the modified iterative Newton-Raphson method (first order of convergence) and the Householder method (second order of convergence) to increase the accuracy of the results. The proposed algorithms demonstrate high efficiency and, in the first iteration, reduce the error several times in comparison with known algorithms. After two iterations, 22.84 correct bits are obtained for single precision. Experimental tests showed that our novel algorithm is faster and more accurate than the corresponding library functions for microcontrollers.


Introduction
The operation of extracting a cube or inverse cube root is not as common as addition, subtraction, multiplication, division [1,2], or taking a square root [1,3,4], but it is basic in many applications [1,3,5-13]. A large number of hardware units and software libraries support this operation, including math.h, the Short Vector Math Library (SVML) [14], and Integrated Performance Primitives (Intel® IPP) [15]. In most modern central processing units (CPUs), the cube root is computed by the fully IEEE-compliant function cbrt (24 bits of mantissa). However, this approach has a drawback: it requires a significant number of CPU clock cycles [16], and it is tied to specific hardware platforms. In practice, different methods are used to calculate the cube root [1,11,17]. There, the exponent and mantissa of the input argument are processed separately after being extracted by frexp(). The exponent obtains the required value after integer division by three. The mantissa is reduced to the range (0.125, 1), after which an initial approximation of the mantissa of the cube root is obtained with the help of a quadratic polynomial. The obtained exponent value and the initial mantissa approximation are then combined by ldexp() into a floating point number, which serves as the initial approximation of the cube root over the whole range of floating-point numbers. This approximation is then refined using two Newton-Raphson iterations to give full accuracy (24 bits for float numbers). The weakness of this approach is the use of the very slow frexp() and ldexp() operations. Moreover, it relies either on Newton-Raphson iterations with division operations (which are algorithmically much more complex than multiplication) or on fractional-rational approximations of the cube root function.
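As an illustration, the exponent-mantissa scheme just described can be sketched as follows. The quadratic seed coefficients below are simple interpolation values chosen for this sketch, not the ones used in the cited implementations:

```c
#include <math.h>

/* Illustrative exponent-mantissa cube root: split x with frexp(),
   make the exponent divisible by three, seed cbrt of the mantissa with a
   quadratic polynomial on [0.125, 1), recombine with ldexp(), and refine
   with two Newton-Raphson steps. The polynomial coefficients are rough
   interpolation values, not the ones from the cited libraries. */
static double cbrt_frexp(double x)
{
    if (x == 0.0) return 0.0;
    double sign = x < 0.0 ? -1.0 : 1.0;
    x = fabs(x);

    int e;
    double m = frexp(x, &e);              /* x = m * 2^e, m in [0.5, 1)    */
    while (e % 3 != 0) { m *= 0.5; e++; } /* reduce m into [0.125, 1)      */

    /* quadratic initial approximation of cbrt(m) on [0.125, 1) */
    double y = 0.375626 + 1.047924 * m - 0.423550 * m * m;

    /* two Newton-Raphson refinements: y <- (2y + m/y^2) / 3 */
    y = (2.0 * y + m / (y * y)) / 3.0;
    y = (2.0 * y + m / (y * y)) / 3.0;

    return sign * ldexp(y, e / 3);        /* reassemble: cbrt(m) * 2^(e/3) */
}
```

Note that each refinement step contains a division, which is exactly the cost the magic-constant approach below avoids.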
Most modern low-cost microcontrollers that support floating-point computing only use slow (but accurate) C math.h library functions such as cbrt and 1.0f/cbrt, see [18]. In this article, to increase the speed of computing the inverse cube root, we propose algorithms using the bit manipulation technique [19]. These are based on the fast inverse cube root of the method [3,5,6,13] in modified versions; these are quite accurate and relatively fast. In this method, we first obtain the initial approximation of the cube root for the whole number (and not separately for the mantissa and the exponent) through integer operations, which is refined using Newton-Raphson or Householder iterations. The aim of the work is to theoretically substantiate the operation of algorithms to calculate the inverse cube root using the magic constant for floating-point numbers in the single precision format of the IEEE-754 standard. Selecting such a constant provides the minimum possible relative errors for the entire range of float values. Thus, we derive optimal offsets for additive correction of Newton-Raphson formulas (order 1) which reduce the relative errors. The values of the coefficients of the Householder formula (order 2) are also optimized, which makes it possible to increase accuracy with only a slight complication of the algorithms.
In Section 4 we present the results of experimental tests using several processors and microcontrollers. Comparing our novel algorithm with the library function for microcontrollers, we can see that our algorithm is two times faster and two times more accurate.

Review of Known Algorithms
To our knowledge, the idea of calculating the inverse cube root using a magic constant for floating point numbers was first presented by Professor W. Kahan [5]. The inverse cube root is computed in the following way: if the argument x is given as a floating-point number, it can be read as an integer Ix corresponding to the binary representation of x in the IEEE-754 standard. This integer is divided by 3, the sign is changed to the opposite, and a fixed integer (known as a magic constant) is added. The result can be read back as a floating-point number. It turns out that this represents an initial (albeit inaccurate) approximation, which can be refined using iterative formulas. Such an algorithm for calculating the cube and inverse cube root, based on the use of the magic constant, is presented in [6]:

Algorithm 1
1: float InvCbrt1(float x) {
2:     float thirdx = 0.333333333f * x;
3:     float fourthirds = 1.333333333f;
4:     union { float fx; int ix; } z;
5:     // read the bits of x as an integer
6:     z.fx = x;
7:     z.ix = 0x54a21d2a - z.ix/3;              // initial guess for inverse cube root
8:     float y = z.fx;                          // convert integer type back to floating-point type
9:     y = y * (fourthirds - thirdx * y*y*y);   // 1st Newton's iteration
10:    y = y * (fourthirds - thirdx * y*y*y);   // 2nd Newton's iteration
11:    return y;
12: }

The above algorithm can be briefly described as follows. The input argument x is a floating point number of type float (line 1). The content of the z.fx buffer can be read as an integer z.ix, so subtracting the integer Ix/3 from the magic constant R = 0x54a21d2a yields the integer Iy0 (z.ix, line 7). Line 7 contains a division by 3, which can be replaced by multiplying the numerator by the binary expansion 0.010101010101... of 1/3 in fixed point arithmetic. For 32 bit numerators, we multiply by the hexadecimal constant 0x55555556 and shift the (64 bit) result down by 32 binary places. Therefore, line 7 of the code can be written in the form [6]:

z.ix = 0x54a21d2a - (int) ((z.ix * (int64_t) 0x55555556) >> 32);

The contents of the z.ix buffer can then be read as a float number.
This is the initial approximation y0 of the function y = 1/∛x (line 8). Then, the value of the function is refined by the classical Newton-Raphson formula (lines 9 and 10); in the general case,

y_{n+1} = y_n (4/3 − (x/3)·y_n³), n = 0, 1, . . .

The relative error of the calculation results after the two Newton iterations is approximately |δ2max| = 1.09·10^−5, or −log2(|δ2max|) = 16.47 correct bits, see [6]. According to the author of [6], this algorithm works 20 times faster than the library function powf(x, −1.0f/3.0f) of the C language.
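For reference, the complete function with the multiply-shift replacement of the division can be written out as a self-contained sketch (same constants as Algorithm 1; valid for positive normalized arguments):

```c
#include <stdint.h>

/* Fast inverse cube root in the style of Algorithm 1: magic constant
   0x54a21d2a, division by 3 replaced by the 0x55555556 multiply-shift,
   and two classical Newton-Raphson refinements. */
static float inv_cbrt1(float x)
{
    float thirdx = 0.333333333f * x;
    float fourthirds = 1.333333333f;
    union { float f; int32_t i; } z;

    z.f = x;
    /* initial guess: (z.i * 0x55555556) >> 32 equals z.i / 3 here */
    z.i = 0x54a21d2a - (int32_t)((z.i * (int64_t)0x55555556) >> 32);
    float y = z.f;                              /* back to floating point */
    y = y * (fourthirds - thirdx * y * y * y);  /* 1st Newton iteration   */
    y = y * (fourthirds - thirdx * y * y * y);  /* 2nd Newton iteration   */
    return y;
}
```

For any positive float bit pattern the multiply-shift reproduces the truncating division exactly, so the two variants of line 7 are interchangeable.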
The disadvantage of Algorithm 1 is its relatively low accuracy (only 16.47 correct bits). The lack of an analytical description of the transformation processes that take place in this algorithm has hindered progress in increasing the accuracy of calculations of both the inverse cube root and the cube root. This article eliminates this shortcoming. Below we give a brief analytical description of the transformations that occur in algorithms of this type, specify ways to increase the accuracy of calculations, and present the corresponding working codes of the algorithms.

The Theory of the Fast Inverse Cube Root Method and Proposed Algorithms
The main challenge in the theory of the fast inverse cube root method is obtaining an analytical description of the initial approximations. We acquire these approximations through integer operations on the floating-point argument x converted to an integer Ix. A magic constant is used, and the final result of these integer operations is converted back into floating-point form (see, for example, lines 7 and 8 of Algorithm 1). To do this, we can use the basics of the theory of the fast inverse square root method described in [4].
To correctly apply this theory, it is sufficient to replace the integer division Ix/2 with Ix/3 and to expand the range of values of the input argument from x ∈ [1, 4) to x ∈ [1, 8). It is known that the behaviour of the relative error in calculating y0 over the whole range of normalized floating point numbers can be described by its behaviour for x ∈ [0.125, 1) or x ∈ [1, 8) [6,17]. According to [4], adapted to the inverse cube root, in this range there are four piecewise linear approximations y0 = {y01, y02, y03, y04} of the function, given by Equations (4)-(7), where t = 4 + 12mR + 6/Nm. Here, mR is the fractional part of the mantissa of the magic constant R, i.e., mR = Nm^−1·R − ⌊Nm^−1·R⌋, with mR > 0. The maximum relative error of these analytical approximations does not exceed 1/(2Nm) [4,20]. To verify their correctness, we compared the values of y0 obtained in line 8 of Algorithm 1 with the values obtained from Equations (4)-(7) in the corresponding ranges. The maximum relative error of the results does not exceed 5.96·10^−8 for any normalized numbers of type float with x ∈ [1, 8).
To further improve the accuracy of Algorithm 1, we can apply an additive correction of the parameter k2 for each iteration, as proposed in [10,20,21]. In other words, the first iteration should be performed according to the modified Newton-Raphson formula

y1 = y0 (k2 − k1·x·y0·y0·y0),

where the product y0·y0·y0 is used instead of the third power and we assume k1 = 1/3. The optimal values of k2 and t (which give minimum relative errors of both signs) have to be found. In order to find them, we use the algorithm proposed in [10]. We write the expressions for the relative errors of the first iteration in the range x ∈ [1, 8) for the four piecewise linear approximations y0 = {y01, y02, y03, y04}:

δ1i(x) = y0i (k2 − k1·ci(x))·∛x − 1,

where ci(x) = x·y0i·y0i·y0i (i = 1, 2, 3, 4) and y01, y02, y03, y04 are given by Equations (4)-(7). Next, we find the values of x at which the functions δ1i reach their maximum values (both positive, x⁺ij, and negative, x⁻ij). To do this, we solve the four equations dδ1i/dx = 0 with respect to x. The solutions x⁺ij and x⁻ij carry two indices: the first denotes the interval ([1,2), [2,4), [4,t) or [t,8)) to which the solution belongs; the second numerates the solutions within the corresponding interval (in this paper we consider cases with at most j = 1). One has to remember that the boundary points of the intervals can also be taken into account as possible positive or negative maxima. Then, applying the alignment algorithm proposed in [10], we form two linear combinations of selected positive and negative maximum values δ⁺1ij, δ⁻1ij, δ⁺b and δ⁻b (including suitable boundary values) and equate them to zero. It is important that the number of positive and negative terms in each equation is the same. Finally, we solve this system of two equations, obtaining new values of t and k2 which yield a more accurate algorithm. To increase the accuracy further, one can repeat this procedure, starting with the redefined values of t and k2.
Now, we start from the code of Algorithm 1, i.e., we set the following zeroth approximations of the parameters t, k1, k2: t = 7.19856286048889160156250, corresponding to the magic constant R = 0x54a21d2a (see [2]), k1 = 0.33333333333333, k2 = 1.334.
We selected the following five maximum points. For the first interval [1, 2): x⁺11 = 1.2988420843950769435381686797159343; for the second interval [2, 4): x⁻21 = 2.8998203575611114501953125000000057; for the third interval [4, t): x⁺31 = 5.9503830121409486442947591098675369; and two boundary points: x = 4 and x = t. At these points, we form five expressions for the relative errors according to formulas (9). Next, we form two equations from different combinations of positive and negative errors (the number of positive and negative errors in each equation should be the same) [10]. We do not provide analytical expressions for these equations because they are quite cumbersome. The best option found, giving the lowest maxima of positive and negative errors, is as follows: t = 7.2086406427967456 (corresponding to the magic constant R = 0x54a239b9) and k2 = 1.3345156576807351.
Next, we perform one more iteration of this procedure. The second approximation gives the following results: t = 7.2005756147691136 (corresponding to the magic constant R = 0x54a223b4) and k2 = 1.3345157539616962.
Similarly, analytical equations for two modified Newton-Raphson iterations can be written, and the corresponding values of the coefficients of the formula for the second iteration can be found.
The best results are given by Algorithm 3, with the released (no longer fixed to 1/3) value of k1. The five points of maxima on the interval x ∈ [1, 8) are also controlled here, but they are located differently than for Algorithm 2 (Figure 2), where error = y·10^−4.
The algorithm for finding the unknown parameters t, k1, k2 is similar to the one described above. At the maximum points of the relative errors, we form five expressions for these errors according to Equation (8). Then, we form not two but three equations (to find the three unknown parameters t, k1, k2) from different combinations of positive and negative errors (as before, the number of positive and negative errors in each equation must be the same). Solving the resulting system of equations, we get the first approximation of the parameters. We then use the found values as initial approximations and repeat the iteration once again, etc. As an example, we set the initial approximations of the parameters as follows: t = 5.0013587474822998046875 (corresponding to the magic constant R = 0x548aae60), k1 = 0.55, k2 = 1.512. We select five maximum points (see Figure 2); for the first interval [1, 2): x⁺11 = 1.0643720067216318545117069225172418. The code of the proposed algorithm is as follows:

Algorithm 3
1: float InvCbrt3(float x) {
2:     int i = *(int*) &x;
3:     i = R - i/3;                               // R - the optimized magic constant
4:     float y = *(float*) &i;
5:     y = y * (1.5015480449f - 0.534850249f * x*y*y*y);
6:     y = y * (1.333333985f - 0.33333333f * x*y*y*y);
7:     return y;
8: }

The graph of the relative error for the first iteration of this code is shown in Figure 3, where error = y·10^−4. As can be seen, at the selected five points the values of the relative errors are equal to each other in modulus. The maximum relative errors of Algorithm 3 for the second iteration are |δ2max| = 8.0803·10^−7, or −log2(|δ2max|) = 20.23 correct bits, with δ⁺2max = 7.698·10^−7 and δ⁻2max = −8.0803·10^−7. However, more accurate results are given by algorithms using formulas of higher orders of convergence, for example the Householder method of order 2, which has cubic convergence [1,10].
For the inverse cube root, the classical Householder method of order 2 has the following form:

y_{n+1} = y_n (1 + (1/3)(1 − c) + (2/9)(1 − c)²), c = x·y_n³,

or, equivalently,

y_{n+1} = y_n (k1 − c(k2 − k3·c)) with k1 = 14/9, k2 = 7/9, k3 = 2/9.

Here is a simplified Algorithm 4 with the classical Householder method of order 2 and with the magic constant from Algorithm 1:

Algorithm 4
1: float InvCbrt4(float x) {
2:     float k1 = 1.5555556f;             // 14/9
3:     float k2 = 0.7777778f;             // 7/9
4:     float k3 = 0.2222222f;             // 2/9
5:     int i = *(int*) &x;
6:     i = 0x54a21d2a - i/3;
7:     float y = *(float*) &i;
8:     float c = x*y*y*y;
9:     y = y*(k1 - c*(k2 - k3*c));
10:    c = 1.0f - x*y*y*y;                //fmaf
11:    y = y*(1.0f + 0.3333333333f*c);    //fmaf
12:    return y;
13: }

Here and in the subsequent codes, the comment "fmaf" means that the fused multiply-add function can be applied, reducing the calculation error. We point out that this function is implemented in many modern processors in the form of fast hardware instructions.
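The classical coefficients 14/9, 7/9, 2/9 (for c = x·y³) follow from expanding y(1 + ε) to third order, and the cubic contraction of the error is easy to verify numerically:

```c
#include <math.h>

/* One classical Householder step of order 2 for y ~ x^(-1/3):
   with c = x*y^3, y <- y*(14 - 7c + 2c^2)/9. A relative error eps in y
   is contracted to O(eps^3) by a single step. */
static double householder2_step(double x, double y)
{
    double c = x * y * y * y;
    return y * (14.0 - c * (7.0 - 2.0 * c)) / 9.0;
}
```

Starting from the exact root perturbed by 1%, one step leaves a residual error on the order of 10^−6, consistent with cubic convergence.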
However, Algorithm 5, with the modified Householder method of order 2, has much higher accuracy: it consists of choosing optimal values of the coefficients k1, k2, k3 of the iterative formula and gives much smaller errors. The first iteration (as in Algorithm 4) has the form

y1 = y0 (k1 − c(k2 − k3·c)), c = x·y0³,

and contains six multiplication operations and two subtraction operations. The algorithm for finding the unknown parameters t, k1, k2, k3 is similar to the above. First, we set the zeroth approximations of the parameters: t = 5.0680007152557373046875 (corresponding to the magic constant R = 0x548b645a), k1 = 1.75183, k2 = 1.25, k3 = 0.509. In this case, the graph of the relative error for the first iteration has the form shown in Figure 4, where error = y·10^−5. Then, a system of four equations is constructed for the nine points of maxima on the interval x ∈ [1, 8) (see Figure 4) and the values of the parameters t, k1, k2, k3 are refined. After two approximations, we obtain the following values: t = 5.1408540319298113766739657997215219 (corresponding to the magic constant R = 0x548c2b4a), k1 = 1.7523196763699390234751750023038468, k2 = 1.2509524245066599988510127816507199, k3 = 0.50938182920440939104272244099570921. The algorithm code is as follows:

Algorithm 5
1: float InvCbrt5(float x) {
2:     float k1 = 1.7523196764f;
3:     float k2 = 1.2509524245f;
4:     float k3 = 0.5093818292f;
5:     int i = *(int*) &x;
6:     i = 0x548c2b4b - i/3;
7:     float y = *(float*) &i;
8:     float c = x*y*y*y;
9:     y = y*(k1 - c*(k2 - k3*c));
10:    c = 1.0f - x*y*y*y;                //fmaf
11:    y = y*(1.0f + 0.333333333333f*c);  //fmaf
12:    return y;
13: }

Figure 5 shows the graphs of the relative errors after the first iteration for Algorithm 4 (red) and Algorithm 5 (blue), where error = y·10^−5. From the analysis of Figure 5 it follows that Algorithm 5 achieves a significant reduction of the error in the first iteration (7 times).
This results in a second-iteration improvement of 1.5 times. Note that the maximum relative errors of Algorithm 5 for the first iteration are |δ1max| = 2.686·10^−5, or −log2(|δ1max|) = 15.18 correct bits. The maximum relative errors for the second iteration are |δ2max| = 1.3301·10^−7, or −log2(|δ2max|) = 22.84 correct bits [1]. In order to calculate the cube root (Algorithm 6), it is enough to modify the last lines of the algorithm codes related to the second iteration (in Algorithm 5, as well as in any of the above algorithms), using the identity ∛x = x·(1/∛x)².
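A minimal sketch of such a cube-root variant, assuming the identity ∛x = x·(1/∛x)² and, for simplicity, the constants of Algorithm 1:

```c
#include <stdint.h>

/* Cube root via the inverse cube root: cbrt(x) = x * y * y for
   y ~ 1/cbrt(x), since x * x^(-2/3) = x^(1/3). Seed and iterations
   as in Algorithm 1. */
static float fast_cbrt(float x)
{
    union { float f; int32_t i; } z;
    z.f = x;
    z.i = 0x54a21d2a - z.i / 3;
    float y = z.f;
    y = y * (1.333333333f - 0.333333333f * x * y * y * y);
    y = y * (1.333333333f - 0.333333333f * x * y * y * y);
    return x * y * y;   /* final lines changed: return x*y*y instead of y */
}
```

Squaring y roughly doubles its relative error, so the cube-root variant is marginally less accurate than the inverse cube root it is built on.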

Evaluation of Speed and Accuracy for Different Types of Microcontrollers
We investigated the relative error of the aforementioned algorithms using an STM32F767ZIT6 microcontroller (manufacturer: STMicroelectronics) [22]. We determined the maximum negative and positive errors after one and after two iterations for the different algorithms (Table 1). These errors were calculated according to the formula InvCbrtK(x)*pow((double)x,1./3.)-1. for all floating point numbers in [1, 8], where K ∈ {1, 10, 11, 20, 21}. The maximum error was determined by the formula δmax = max(δ⁺max, |δ⁻max|). Analysing the results in Table 1, we can see that Algorithm 5 gives the most accurate results in both iterations. In the last column, we present the errors of the standard library function 1.f/cbrtf(). According to the results (Table 1), our Algorithm 5 (1.3301·10^−7) has less than half the error of the library function (2.7947·10^−7). Table 1. The maximum negative and positive errors of two iterations for different algorithms.

In addition, the operating time of the algorithms was investigated. Testing was performed on four 32-bit microcontrollers with FPU.
We used two types of compiler optimization for the microcontrollers: -Os (smallest) and -O3 (fastest). In all cases, the best result was obtained with -O3; therefore, we do not cite the -Os results. Table 2 shows the average calculation time of the function InvCbrt(x) for Iteration I over 10,000,000 tests. Analysis of the results shows that Algorithm 3 has the shortest calculation time for all types of microcontrollers. Similar results are shown in Table 3 for Iteration II. In this table, we have added one more column with the results of the library function 1.f/cbrtf(). For the first three microcontrollers, Algorithm 5 is more than twice as fast as the library function, and nine times faster on the ESP32-D0WDQ6 microcontroller. Comparing the microcontrollers, we can see that the STM32F767ZIT6 is the fastest. Table 4 shows the performance of the algorithms on the different microcontrollers for Iteration I, defined as the product of the inverse cube root calculation time and the clock frequency of the microcontroller (cycles). Table 5 shows the corresponding results for Iteration II; here we have also added one more column for the library function 1.f/cbrtf(). According to this table, Algorithm 5 is twice as productive as the library function, and on the ESP32-D0WDQ6 microcontroller it performs nine times better.
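On a desktop host, a rough version of the same timing comparison can be sketched with the portable clock() function; the absolute numbers naturally differ from the microcontroller measurements in Tables 2-5, which were obtained with hardware timers:

```c
#include <stdint.h>
#include <time.h>

/* Function under test: the fast inverse cube root from this article
   (Algorithm 1 constants). */
static float inv_cbrt(float x)
{
    union { float f; int32_t i; } z;
    z.f = x;
    z.i = 0x54a21d2a - z.i / 3;
    float y = z.f;
    y = y * (1.333333333f - 0.333333333f * x * y * y * y);
    y = y * (1.333333333f - 0.333333333f * x * y * y * y);
    return y;
}

/* Average time per call in nanoseconds over n calls; the volatile sink
   keeps the compiler from optimizing the loop away. */
static double time_per_call_ns(int n)
{
    volatile float sink = 0.0f;
    clock_t t0 = clock();
    for (int k = 0; k < n; ++k)
        sink += inv_cbrt(1.0f + (float)(k & 1023) * 0.006835f);
    clock_t t1 = clock();
    (void)sink;
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / n;
}
```

Note that clock() has coarse granularity, so n should be large (millions of calls) for a stable average, and -O3 should be used to mirror the microcontroller setup.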

Conclusions
This paper presents the results of a systematic analytical approach to the problem of increasing the accuracy of inverse cube root codes, including the optimal choice of magic constants. These provide minimum values of the relative error of the initial approximations used by the Newton-Raphson formula when calculating the function for float numbers. The optimal values of the offsets for the additive correction of the Newton-Raphson formulas are also determined; these make it possible to reduce the relative errors of the calculations. Thus, comparing Algorithm 1 and Algorithm 2, the gain in the first iteration is approximately 1.97 times, and six times in the second iteration. The values of the coefficients of the Householder formula (order 2) were also optimized, making it possible to increase the calculation accuracy seven times (after the first iteration) without complicating the algorithm. In summary, we can conclude that the STM32F767ZIT6 microcontroller with -O3 optimization showed the best performance, and that Algorithm 3 was the fastest. Comparing Algorithm 5 with the library function for microcontrollers, we can see that our algorithm is not only twice as accurate, but can also calculate the function 1/∛x more than twice as fast. These algorithms can also be implemented on an Intel Cyclone 10 GX (C10GX51001) FPGA [24].
In addition, the proposed algorithm can also be used to calculate the cube root and can be implemented on Intel FPGA devices, as is done in [25].