Modified Fast Inverse Square Root and Square Root Approximation Algorithms: The Method of Switching Magic Constants

: Many low-cost platforms that support floating-point arithmetic, such as microcontrollers and field-programmable gate arrays, do not include fast hardware or software methods for calculating the square root and/or reciprocal square root. Typically, such functions are implemented using direct lookup tables or polynomial approximations, with a subsequent application of the Newton–Raphson method. Other, more complex solutions include high-radix digit-recurrence and bipartite or multipartite table-based methods. In contrast, this article proposes a simple modification of the fast inverse square root method that has high accuracy and relatively low latency. Algorithms are given in C/C++ for single- and double-precision numbers in the IEEE 754 format for both square root and reciprocal square root functions. These are based on the switching of magic constants in the initial approximation, depending on the input interval of the normalized float-ing-point numbers, in order to minimize the maximum relative error on each subinterval after the first iteration—giving 13 correct bits of the result. Our experimental results show that the proposed algorithms provide a fairly good trade-off between accuracy and latency after two iterations for numbers of type float, and after three iterations for numbers of type double when using fused multiply–add instructions—giving almost complete accuracy.


Introduction
The square root ( x ) and reciprocal of the square root ( x / 1 ), also known as the inverse square root, are two relatively common functions used in digital signal processing [1][2][3][4], and are often found in many computer graphics, multimedia, and scientific applications [1]. In particular, they are important for data analysis and processing, solving systems of linear equations, computer vision, and object detection tasks. In view of this, most current processors allow the use of the appropriate software (SW) functions and multimedia hardware (HW) instructions in both single (SP) and double (DP) precision.
Let us formalize the problem of calculating the function in floating-point (FP) arithmetic. We consider an input argument x to be a normalized n -bit value with p -bit mantissa, which satisfies the condition 0 ≥ x . Similarly, for the func- . For many practical applications, where it is assumed that the input data are obtained with some error, e.g., read from sensors, or where computa-tion speed is preferable to accuracy, e.g., in 3D graphics and real-time computer vision, approximate square root calculation may be sufficient. However, on the other side, there is a strong discussion whether the x function should be correctly-rounded for all input FP numbers according to the IEEE 754 standard.
Compilers offer a built-in ) ( sqrt x function for various types of FP data (float, double, and long double) [5]. Although the common SW implementation of this function provides high accuracy, it is very slow. On the other hand, HW instruction sets are specific to particular processors or microprocessors. Computers and microcontrollers can use HW floating-point units (FPUs) in order to work effectively with FP numbers and for the fast calculation of some standard mathematical functions [6,7]. Typically, instructions are available for a square root with SP or DP (float and double), and estimation instructions are available for the reciprocal square root with various accuracies (usually 12 bits in SP, although there are also intrinsics that provide 8, 14, 23, or 28 bits). For example, for SP numbers, Intel SSE HW instruction RSQRTSS has an accuracy of 11.42 bits, Intel AVX instructions VRSQRT14SS and VRSQRT28SS provide 14 and 23.4 correct bits), respectively, and ARM NEON instruction FRSQRTE gives 8.25 correct bits of the result. An overview of the basic characteristics of these instructions on different modern machines can be found in [8][9][10][11]. However, similar instructions are not available for many low-cost platforms, such as microcontrollers and field-programmable gate arrays (FPGAs) [12,13]. For such devices, we require simple and efficient SW/HW methods of approximating square root functions.
Usually, the HW-implemented operation of calculating the square root in FP arithmetic requires 3-10 times more processor cycles than multiplication and addition [9]. Therefore, for such applications as video games, complex matrix operations, and real-time control systems, these functions may become a bottleneck that hampers application performance. According to the analysis of all CPU performance bottlenecks (the second table in [14]), the square root calculation function is high on the list and first among the FP operations. As a result, a large number of competing implementations of such algorithms, which make various trade-offs in terms of the accuracy and speed of the approximation, have been developed.
Digit-recurrence, table-based, and bit-manipulation methods have the advantage of using only simple and fast operations, such as addition/subtraction, bit shift, and table lookup. Compared to polynomial and iterative methods, they are therefore more suitable for implementation on devices without HW FP multiplication support (e.g., calculators, some microcontrollers, and FPGAs). However, since digit-recurrence algorithms have linear convergence, they are very slow in terms of SW implementation. Tabular methods are very fast but require a large area (memory), since the size of the table grows exponentially with an increase in the precision required. This may pose a problem not only for small devices such as microcontrollers, but also to a lesser extent for FPGAs. The use of direct lookup tables (LUTs) is less practical than combining several smaller LUTs with addition and multiplication operations. Bit-manipulation techniques are fast in both HW and SW, but have very limited accuracy-about 5-6 bits [15]. They are based on the peculiarities of the in-memory binary representation formats of integer and FP numbers.
In computer systems with fast HW multiplication instructions, iterative and polynomial methods can be efficient. Iterative approaches such as Newton-Raphson (NR) and Goldschmidt's method have quadratic convergence but require a good initial approximation-in general, polynomial, table-based, or bit-manipulation techniques are used. Polynomial methods of high order rely heavily on multiplications and need to store polynomial coefficients in memory; they also require a range reduction step. This makes them less suitable for the calculation of x and x / 1 than iterative methods, especially for HW implementation. Compared to the polynomial method, a rational approximation is not efficient if there is no fast FP division operation; however, for some elementary functions, it can give more accurate results, e.g., the ) ( tan x and x functions. Most existing research studies on calculating the x / 1 and x functions have focused on the HW implementation in FPGAs. They use LUTs or polynomial approximation, and if more accurate results are required, iterative methods are subsequently applied. In this paper, we consider a modification of a bit-manipulation technique called the fast inverse square root (FISR) method for the approximate calculation of these functions with high accuracy, without using large LUTs or divisions. This work proposes a novel approach that combines the FISR method and a modified NR method and uses two different magic constants for an initial approximation depending on the input subinterval. On each of the two subintervals, such values of the magic constants should be chosen that minimize the maximum relative error after the first iteration. This can be considered a fairly accurate and fast initial approximation for other iterative methods such as NR or modified NR. This method can be effectively implemented on microcontrollers and FPGAs that support FP calculations in HW. However, in this paper, we focus mostly on the fast SW implementation of the method, in particular on microcontrollers. In a HW implementation, a kind of tiny 1-bit LUT can be used to determine a magic constant and two other parameters of the basic algorithm.
Among the better-known methods of calculating the square root and reciprocal square root [1,2,15,20], the FISR method [20,[25][26][27][28][29] has recently gained increasing popularity in SW [8,20,27,[29][30][31][32] and HW [3,[33][34][35][36][37] applications. The algorithm was proposed for the first time in [24] but gained wider popularity through its use in the computer game Quake III Arena [27]. Its attraction lies in its very simple and rapid way of obtaining a fairly accurate initial approximation of the function x y / 1 0 ≈ -almost 5 bits-without using multiplications or a LUT. It uses integer subtraction and bit shifting, and combines these with switching between two different ways of interpreting the binary data: as an FP or an integer number. In addition, fewer hardware resources are used when this is implemented with an FPGA.
We denote by ) ( ι x an integer that has the same binary representation as an FP number x and by ) ( φ i an FP number that has the same binary representation as an integer i . The main idea behind FISR is as follows. If the FP number x , given in the IEEE 754 standard, is represented as an integer ) ( ι x , then it can be considered a coarse, biased, and scaled approximation of the binary logarithm ) ( log 2 x . This integer is divided by two, its sign is changed, and it is then translated into the IEEE 754 format as an FP number 0 y with the same binary representation The method introduces a magic constant R to take into account the bias and reduce the approximation error. This gives an initial approximation for the function , which is then further refined with the help of NR iterations.
The NR method is the most commonly used iterative method; it is characterized by a quadratic convergence rate and has the property of self-correcting errors [19]. Quadratic convergence in an iterative method means that the method roughly doubles the number of exact bits in the result after each iteration. If we apply the NR method directly to find the square root, we obtain the formula known as Heron's formula. The disadvantage of this approach is the need to perform an FP division operation at each iteration, which is rather complex and has high latency [1,9,18]. An alternative way of calculating the square root is to use the NR method to find the root of the equation This formula has the form: (1) Using this approach, it is necessary to multiply the final result of the iteration method in Equation (1) by x to get the approximate square root of the input number x . The main feature of iterative methods is the need to select an initial value 0 y ; as a rule, the better this initial approximation, the lower the number of subsequent iterations needed to obtain the required accuracy for the calculations. However, we will show later that this is not always the case. The purpose of this paper is to present a modified FISR method based on the switching of magic constants. This method is characterized by increased accuracy of calculations-13.71 correct bits after the first iteration-with low overhead compared to the known FISR-based approximation algorithms described below in Section 2. We also provide the code for the corresponding optimized algorithms for calculating the square root and reciprocal square root, which use this method for the initial approximation. These algorithms work for normalized single-and double-precision numbers in the IEEE 754 format and provide different accuracy depending on the number of iterations used.
Functions that comply with the IEEE 754 standard-assuming that the chosen rounding mode is round-to-nearest-return the FP value that is closest to the exact result (the error is less than 0.5 units in the last place, or ulp). By contrast, the proposed method allows a numerical error of the algorithms for the float and double types to be obtained that does not exceed the least significant bit (1 ulp) when using the fused multiply-add function.
The rest of the paper is organized as follows: Section 2 introduces the well-known FISR-based algorithms and basic theoretical concepts of the FISR method. In Section 3, the main idea of the proposed method of switching magic constants is presented, and the corresponding algorithms are given. Section 4 contains the experimental results on microcontrollers and a discussion. Finally, the conclusions are presented in Section 5.

State-of-the-Art FISR Algorithms
FISR is most commonly used in its classic version. In this case, the initial approximation 0 y is calculated using magic constants "0x5f3759df" [27] or "0x5f375a86" [25] and one or two clarifying NR iterations are performed using the standard equation (1) [8,20,25,27,28,[33][34][35][36]. As Lomont noted, the best initial guess "0x5f37642f" [25] does not guarantee maximum accuracy for the subsequent NR iterations. The theoretical analysis of Algorithm 1 in [28] mathematically confirmed the optimal values of the magic constants "0x5f37642f" and "0x5f375a86". The code for this classic FISR algorithm in C/C++ is given below in Algorithm 1. The maximum relative error of this algorithm with two NR iterations is less than  float y = *(float*)&i; // represent integer as a float ) (i φ -// initial approximation 0 y 7: y = y*(1.5f -xhalf *y*y); // first NR iteration 8: y = y*(1.5f -xhalf *y*y); // second NR iteration 9: return y; 10: } In [29], an algorithm with increased accuracy and a maximum relative error of (20.37 correct bits) was proposed. In this approach, a simple additive modification of the iterative NR formula is used, and the magic constant of the algorithm is changed, making it possible to reduce the maximum relative error by a factor of more than seven. Algorithm 2 gives the code for this algorithm. The accuracy of the RcpSqrt2 algorithm after the first iteration is 10.15 bits. Another similar algorithm from [29] has even better accuracy (20.97 bits) but requires one extra multiplication compared to Algorithms 1 and 2. int i = *(int*)&x; 5: i = 0x5f376908 -(i >> 1); 6: float y = *(float*)&i; 7: y = y*(1.50087896f -xhalf *y*y); 8: y = y*(1.50000057f -xhalf *y*y); 9: return y; 10: } Our other recent modifications of the FISR algorithm are discussed in [26] and [32]. In both papers, besides the NR method, a modified second-order Householder method is used to improve the accuracy of the algorithms-in the former case, in the last iteration of the algorithm, and in the latter case, in the first iteration. This method has cubic convergence and requires one additional multiplication and subtraction compared to the NR method. As a result, these algorithms can already provide accuracy up to the last bit (more than 23 bits in SP) when using fused multiply-add functions. However, such extra FP operations are very expensive, especially in HW. Therefore, in this paper, we investigate an alternative method to further increase the accuracy of the FISR-based algorithms by using only modified constants and NR iterations-splitting the input interval. Moreover, in [26], we suggested a method for reducing the number of FP multiplications in the algorithm. It allows performing some operations with an exponent using integer subtractions.

Brief Theory of the FISR Method
Here, we briefly outline the main concepts of the basic FISR method used in the paper, which are defined in [28,29]. Suppose that we have a positive normalized FP number (2) We consider numbers of single (binary32/type float) and double (binary64/type double) precision, according to the IEEE 754 standard. In this standard, an SP FP number x is encoded by 32 bits, for DP). The first bit corresponds to a sign (in our case, this bit is equal to zero), while the next eight bits (11 bits for DP) correspond to an exponent which is an integer stored in a biased form. The last 23 bits, for DP), encode a fractional part of the mantissa . The integer representation of this value ) ( ι x (see Algorithm 1, line 4), denoted by x I , is given by Lines 7 and 8 of Algorithm 1 (RcpSqrt1) define the NR iterations [28,29], in order to find the behavior of the relative error when calculating 0 y over the whole range of normalized FP numbers, it is sufficient to describe it in the range . The initial approximation 0 y has three piecewise linear subintervals in this range. According to [28], the analytical approximations that define 0 y can be written as: Here, R m is the fractional part of the mantissa of the magic constant R , defined as: and (13) As proved in [28], the relative error of such analytic model of the FISR method does not exceed

Method of Switching Magic Constants
The main idea of the proposed method is to split the interval of the initial approximation 0 y (see Figure 1a), on which it has different error values, into two parts- -and to perform an approximation of the reciprocal square root function separately in these subintervals. The variable x , as defined in Equation (2), then has different values in the last bit of the exponent x E -zero in the first case and one in the second. We split the interval only for the initial approximation 0 y -where the magic constant R is used-and for the corresponding modified first iteration of the FISR method. As shown below, this technique allows us to reduce the maximum relative error of the algorithm after the first iteration by an order of magnitude compared to Algorithm 2 (RcpSqrt2). Let us now consider this method in more detail. We require that, on the first and second parts of the interval , the relative errors of two adjacent piecewise linear initial approximations 01 y and 02 y have the same scope-the same difference between the maximum and minimum values-as shown in Figure 2a. From this, it follows that the relative errors of the approximations 01 y , have a similar symmetrical nature with respect to some common value at a point where, in this case, (16) Note also that the third linear approximation, which is not used in the result, has the form Hence, the expressions for the relative errors are Having found the points of maxima and contact, we can determine the value of R m : (20) and the corresponding value of the magic constant R for SP numbers: . Here, we ignore the relative errors on other intervals.
Let us now examine the interval . In the same way, we require that, for the second and third parts, the relative errors of two adjacent piecewise linear initial approximations 02 y and 03 y have the same scope (see Figure 2b). In this case, Note that, at the same time, Hence, we find that 6875 4249877929 As a result, the combined approach with magic constants 1 R , (21), and 2 R , (27), on two different subintervals gives us an opportunity to obtain the piecewise linear initial approximation 0 y on the interval , as shown in Figure 1b. This can potentially offer a better approximation of the function x y / 1 = , but only after some alignment (bias at each subinterval). For comparison, both of the magic constants used in the algorithms RcpSqrt1 and RcpSqrt2 give the initial approximation, as described in Equations (9)-(13) (see Figure 1a). Although the basic FISR method gives a much more accurate approximation at this stage, our method has four piecewise linear sections of the approximation 0 y rather than three, providing higher accuracy after the first iteration (see Figure 3). This trick is possible due to the modification of the first NR iteration at each of the subintervals. In other words, we align the corresponding errors of the initial approximation as described below. For each subinterval, we modify the first NR iteration according to Equation (1) as follows: where 1 k and 2 k are some FP constants that minimize the maximum relative error of the algorithm and depend on the value of the magic constant R . This modified iteration also involves four multiplications. In order to determine the subinterval in the IEEE 754 format to which x belongs- (29) In the SW implementation, we apply a bit mask to x . Then, we choose the magic constant and the first modified NR iteration that correspond to this value. We call this technique the method of switching magic constants or the dynamic constants (DC) method.
It should be noted that this method can be generalized to more constants. In this case, each of the indicated subintervals is further divided into two, four, eight, etc., equal parts depending on the value of one or more most significant bits of the fractional part of the mantissa x m , and the bit mask is changed accordingly. However, in general, it cannot be guaranteed that such a partition will significantly improve the accuracy of the algorithm in all the parts; therefore, some parts should be divided further.
The general structure of the proposed SW RcpSqrt3 algorithms for two magic constants, with the modified FISR initial approximation 0 y , the first modified iteration (28), and a branching statement (comparison with zero) using bit masking (bitwise AND), is shown in Template 1. Here, the input argument x has type <fp_type>, which can be float, double, etc., and <int_type> is the corresponding integer type. The names of the algorithms for the reciprocal square root are constructed according to the template, as follows: RcpSqrt3<iter><version><fp_type_abbr>, where <iter> is the number of iterations used in the algorithm, <version> is an optional index, and <fp_type_abbr> indicates the required FP data type. This also applies to the square root calculation algorithms, which we denote as Sqrt3. The proposed method can be thought of as a relatively accurate and fast initial guess 1 y -the DC initial approximation-for other iterative algorithms (see Template 1, lines [14][15][16]. In this paper, we consider modified NR iterations written in a special form, with combined multiply-add operations. Furthermore, in this section, we present final (ready-to-use) codes for the proposed algorithms in C/C++ and give their errors (accuracy results).
Note that-when determining the relative errors of these algorithms-giving here an example for the reciprocal square root function-we used the following notation for the upper and lower limits of the maximum relative error, respectively: Alternatively, without taking into account the sign of the error, the maximum relative error was determined using the formula To iterate over all possible float values in the interval , we used the (33) Error measurements for the algorithms were performed on a quad-core Intel Core i7-7700HQ processor using a GNU compiler (GCC 4.9.2) for C++ on a Windows 10 (64-bit) operating system with options as follows: -std = c++11 -Os -ffp-contract = on -mfma.  (37) On some platforms, when implemented in HW, this function can increase both the speed and the accuracy of the algorithms. The fma operation has fewer roundings and much higher precision for the internal calculations. In the remainder of this section, unless otherwise specified, we use the fma function in all algorithms.
Taking into account the rounding errors and the issue of the best practical representation of the theoretical parameters in the target SP FP format, we can improve our theoretical parameters 1 R , 2 R , 11 k , 12 k , 21 k , and 22 k (given in (21), (27), (34), (35)). A brute force optimization method was used in a certain neighborhood of the defined theoretical parameters to minimize the maximum relative error of the algorithm on each of the subintervals. This approach also includes elements of randomized multidimensional greedy optimization for coarse search. In this case, three parameters R , 1 k , and 2 k are optimized simultaneously. The method is described in more detail in [38]. Algorithm 3 below gives the proposed improved RcpSqrt31f algorithm for SP numbers with one iteration. This algorithm provides slightly lower values for the maximum relative errors:  13.71 correct bits. Consequently, the error is reduced by a factor of more than 11.7.

Two Iterations
To increase the accuracy of the RcpSqrt31f algorithm described above, it is possible to apply an additional NR iteration over the entire range . The second iteration is common to both subintervals, uses the fma function, and has the specific form where, in this case, for the SP version, which corresponds to the classical Equation (1). The use of the fma function in the second iteration in the form given in (39) is important, since it greatly improves the accuracy of the algorithm at the final stage of the calculations. However, compared to the RcpSqrt1 and RcpSqrt2 algorithms, it has four multiplications rather than three in the second iteration (a further addition is also hidden inside fma ). If we write the second iteration in the same way as in Equation (37), we only get 22.68 bits of accuracy (

SP Square Root (Sqrt3 for Float)
We now turn to the square root calculation algorithms to find an approximation for x y = in SP. As noted in Section 1, these algorithms can easily be obtained from those previously described, simply by multiplying the result by the value of the input argument x . However, this involves an additional multiplication operation, and in our algorithms, in most cases, this can be avoided by modifying the last iteration.

One Iteration-The DC Initial Approximation
For one iteration, we make a substitution

Two Iterations
When we use two NR iterations to calculate the square root function (Sqrt32f), the structure of the algorithm does not change compared to RcpSqrt32f, and we need only modify the second iteration (see Algorithm 6, line 6). This algorithm has slightly lower accuracy than RcpSqrt32f, with float r = fmaf(y, -c, 1.0f); 6: y = fmaf(0.5f*c, r, c); // modified 7: return y; 8: }

One Iteration-The DC Initial Approximation
Similarly, this method can be applied to FP numbers of DP. In this case, the theoretically determined magic constants based on (20) and (26)

Two Iterations
The RcpSqrt32d algorithm for finding the DP reciprocal square root with two iterations is shown in Algorithm 8. Note that, here, we use the same constants for the DC initial approximation as in the RcpSqrt31d algorithm. This algorithm has a second iteration in the form of (39), with changes in the following two constants: double c = x*y; 5: double r = fma(y, -c, 1.000000008298416); 6: y = fma(0.50000000057372*y, r, y); 7: return y; 8: }

Three Iterations
For three iterations in DP, we present two versions of the algorithm: one with fewer multiplication operations (RcpSqrt331d) and one with higher accuracy (RcpSqrt332d). The complete RcpSqrt331d algorithm is given in Algorithm 9. The errors of this algorithm have the following boundaries: double r = fma(mxhalf, y*y, 0.5); 7: y = fma(y, r, y); 8: return y; 9: } On the other hand, if we do not make this substitution and write the last iteration using 2 2 xy c = , we obtain an algorithm RcpSqrt332d that contains one more multiplication and an additional coefficient in the last iteration. In this case, the last two iterations of the algorithm are (see Algorithm 10, lines 4-7) ) , , ( fma where 00724769 0.50000000 . (56) Compared to RcpSqrt331d, the RcpSqrt332d algorithm has lower maximum relative errors after the third iteration,

One and Two Iterations
In the same way as for the type float and reciprocal square root, we construct algorithms for the square root in DP using one and two iterations. These are based on the RcpSqrt31d and RcpSqrt32d algorithms. The errors of these algorithms are close to those of the reciprocal square root (see (47) and (49)).

Three Iterations
For the algorithm with three iterations, we cannot avoid the additional multiplication, as we did in the RcpSqrt331d algorithm described above. Hence, we present only an algorithm that has four multiplications in the third iteration-including fma operations. It is based on RcpSqrt332d, in which the last iteration (55) is modified for the square root calculation, and the corresponding parameters are optimized, as shown in Algorithm 11. After the third iteration, this algorithm has errors of

Experimental Results and Discussion
Performance testing of these algorithms was conducted on a Raspberry Pi 3 Model B mini-computer and an ESP-WROOM-32 microcontroller. The Raspberry Pi is based on a quad-core 64-bit SoC Broadcom BCM2837 (1.2 GHz, 1 Gb RAM) with an ARM Cortex-A53 processor [6]. We used the GNU compiler (GCC 6.3.0) for Raspbian OS (32-bit) with the following compilation options: -std = c++11 -Os -ffp-contract = on -mfpu = neon-fp-armv8 -mcpu = cortex-a53. The 32-bit Wi-Fi module ESP-WROOM-32 (ESP-32) from Espressif Systems has two low-power Xtensa microprocessors (240 MHz, 520 Kb RAM) [39]. The microcontroller was programmed via the Arduino IDE (GCC 5.2.0) with the following compilation parameters: -std = gnu++11 -Os -ffp-contract = fast. The speed (latency) of the algorithms was measured using the chrono C++ library. Depending on the platform, at least 200 tests were run in which functions were called sequentially in a single thread (core), a million or more times. The average results of these performance tests are given here.
It should be noted that, although we chose C++ to implement the algorithms, it is worth using an inline assembly code for more efficient and better performance optimization on each specific platform. However, the chosen compilation options gave a fairly effective fast code optimization and allowed us to automatically translate the ) , , ( fmaf c b a and ) , , ( fma c b a C++ functions into the corresponding HW instructions; for the microcontroller, the compiler may even automatically replace successive multiplication and addition/subtraction operations of SP with the corresponding fma HW instructions (the -ffp-contract = fast option, which enables FP expression contraction).
The accuracy and latency measurements for the reciprocal square root ( and square root ( x y = ) functions, in both SP and DP, are summarized in Table 1. In this table, we consider the various methods available on the mini-computer and microcontroller, including the cmath SW library functions ( ) ( sqrtf x and ) ( sqrt x ) [5] and the built-in NEON instructions (FRSQRTE and FRSQRTS) [11] for an approximate calculation of the reciprocal square root in Raspberry Pi using NR iterations. We also compare the method of Walczyk et al. (the RcpSqrt2 algorithm) [29], its modification for calculation of the square root, and the proposed method of switching magic constants (the DC algorithms from Section 3). Here, both the Walczyk et al. and the DC algorithms are implemented using the fma function.
Even on platforms that do not have special HW instructions for the square root and reciprocal square root (either approximate or with full accuracy), such as ESP-32, the C++ function ) ( sqrt x is available for both SP and DP numbers. Modern platforms, such as Intel or ARM, may also have the appropriate hardware FSQRT instructions. These are IEEE-compliant and ensure the full accuracy of the result (see the ) ( sqrtf x and ) ( sqrt x functions in Table 1).
Looking at the results from Table 1, it becomes obvious that the main feature of our proposed DC algorithms for the float and double types is that the algorithms RcpSqrt32f, Sqrt32f, RcpSqrt331d, RcpSqrt332d, and Sqrt33d allow the result to be obtained up to the last bit-although the 24th and 53rd bits may be wrong. At the same time, the RcpSqrt32f and RcpSqrt332d algorithms for the reciprocal square root have somewhat higher accuracy than the naive method using division.
All the platforms tested have HW-implemented multiplication, addition, and fma operations for FP numbers of SP and, except for ESP-32, DP [10][11][12]. Since the ESP microcontroller is a 32-bit system, all DP operations are performed by SW. Note that it also does not have a division instruction in SP [12], meaning that the latency of the corresponding operations is much higher.
As shown in Table 1, the proposed algorithms give significantly better performance than the library functions on the Raspberry Pi, from 3.17 to 3.62 times faster, and for SP numbers on ESP-32, 2.34 times faster for the reciprocal square root and approximately 1.78 times faster than the ) ( sqrtf x function. At the same time, on the microcontroller with the SW implementation of DP fma , the RcpSqrt331d algorithm is a little faster than the naive method using the ) ( sqrt x function, but has slightly lower accuracy. The Rcp-Sqrt332d algorithm, in contrast, has higher accuracy, but worse performance. The proposed algorithms are also not efficient on ESP-32 for calculating the square root of DP numbers (in contrast to the Raspberry Pi). However, in some cases, it is possible to improve the performance of the algorithms in DP if fewer fma functions are used in the code, as shown later (see also [26] for more details).
In ARM with NEON technology [11], the HW SP instructions FRSQRTE and FRSQRTS, for which the corresponding intrinsics are 32 vrsqrte_f and 32 vrsqrts_f , can be used to calculate a fast approximation of the reciprocal square root function and to perform the classical NR iteration (step), respectively. The FRSQRTE instruction is based on a LUT and gives 8.25 correct bits of the result (our DC initial approximation gives 13.71 bits). A combination of these instructions in the Raspberry Pi gives poorer accuracy and latency results than the RcpSqrt32f algorithm (see Table 1). It should also be noted that the HW FSQRT and 64-bit FRSQRTE instructions of the ARMv8 (AArch64) architecture are not available on the Raspberry Pi for the specified official 32-bit OS [11].
In order to ensure a fair comparison between the proposed algorithms and other advanced FISR-based methods, we also implemented an algorithm proposed by Walczyk et al. [29] in a specific form using the fma functions. This allowed us to strike a better compromise between accuracy and speed compared to the original RcpSqrt2 algorithm (see Table 1). For the square root calculation, we used the same method that we suggest for the DC algorithms. The results show that, although these algorithms are faster, their accuracy is much lower. A comparison of the FISR-based algorithms that also provide 23 exact bits for a float and 52 exact bits for a double is given in Table 2 for SP numbers. Note that, here, TMC denotes the method of two magic constants [26] and Ho2 denotes the approach based on the second-order Householder's method [32] for the reciprocal square root. The relative performance of the proposed algorithms depends on the operations used, their sequence, and the characteristics of the platform. For example, the RcpSqrt32f (Algorithm 4) and InvSqrt5 [32] (Section VI) algorithms are fairly efficient on ESP-32 for SP numbers. However, to the best of our knowledge, the DC initial approximation is the only FISR-based method with one NR iteration-four multiplications and one subtraction-that provides 13 correct bits of the result for the square root and reciprocal square root functions. It can be implemented on FPGAs using a small LUT, and only one bit of the input argument (LSB of the exponent) needs to be controlled.  Figure 4 shows the results of a comparison of the Lomont [8,25], Walczyk et al. [29], and switching magic constants (DC) methods after each iteration for DP numbers-with and without the use of the fma operations. Here, we consider different ways of implementing NR iterations using the fma functions. As shown by the graphs, the accuracy is almost the same in both cases, with the sole exception of three iterations. The proposed DC algorithm (RcpSqrt331d), in most cases, has slower performance on the platforms considered here than the Lomont and Walczyk et al. algorithms (except perhaps one iteration), but is significantly superior in terms of accuracy. It allows us to obtain highly accurate results for the square root and reciprocal square root calculations for DP numbers by the third iteration. Note that, when the fma function is used, we obtain 52.28 correct bits of the result (see RcpSqrt331d in Table 1), and, otherwise, we have an accuracy of 51.52 bits (see Figure 4a). For the Raspberry Pi, the latency of the algorithms with and without fma is similar, but is slightly smaller for some algorithms when using the fma function (after the first and second iterations). We obtain similar performance results for the ESP-32 microcontroller. The latency of all the algorithms is almost the same for one iteration. However, given the above comments on HW support for FP operations of DP, it should be noted that the algorithms that do not use fma are faster in this case. Figure 4 shows that the speed of the DC algorithm for three iterations is 6092.1 ns without fma and 6828.2 ns with SW fma functions. For ESP-32, we recommend using the DP fma function only in the third iteration of the DC algorithms, in order to obtain a better compromise between accuracy and speed.  It should also be noted that the disadvantages of the FISR methods and the approximate FRSQRTE instructions in comparison with the cmath library functions and fast HW FSQRT instructions are that they generally do not work correctly with subnormal numbers and do not handle other exceptional situations (e.g., 0 ± and Inf ± ), although this does not apply to numbers in the NaN (not a number) range. However, as described in [29], FISR-based methods can be modified to support subnormal numbers.

Conclusions
This article proposes a set of algorithms for calculating the square root and reciprocal square root of normalized FP numbers of SP and DP, using the method of switching magic constants for the initial approximation. The proposed DC initial approximation provides about 13.71 correct bits of the result; compared to the RcpSqrt2 algorithm for one iteration, the maximum relative error is reduced by a factor of 11.7. The main feature of this method is the modification of a magic constant and subsequent NR iteration, depending on the input subinterval. It uses fast HW fma instructions, and allows us to obtain results with fairly good accuracy after two iterations for numbers of type float and 52.27 bits for the x function). To achieve correct rounding, you must additionally apply a rounding-error adjustment step, e.g., using the method described in [4] for the square root. As a result, the proposed method reduces the number of iterations required without using large LUTs. It has a low overhead compared to the baseline FISR, which is widely used in many scientific and commercial applications [8,30,34,35,37], and provides a better compromise between latency and accuracy than other known algorithms that use a magic constant, particularly those of Lomont [25] and Walczyk et al. [29]. It should be noted that the proposed DC algorithms can be extended to other data formats, such as extended, quadruple, and octuple formats [26,29].
The algorithms described here can be most useful on microcontrollers and other computer platforms that support FP computations but do not have HW-implemented FPUs or fast HW instructions available for the square root or reciprocal square root calculation, such as ESP-WROOM-32 [12,39] or Raspberry Pi [11]. As it was shown for these platforms, the proposed approximation algorithms in certain cases give a performance gain of 1.5-3.5 times compared to the library functions. They can also be straightforwardly implemented on modern FPGA platforms such as Intel Cyclone [13] and Intel Stratix [40], which have SP FP blocks for add, multiply, and fma operations.