Fast and Accurate Approximation Methods for Trigonometric and Arctangent Calculations for Low-Performance Computers

In modern computers, complicated signal processing is highly optimized through compilers and high-speed floating-point units (FPUs); therefore, programmers have little need to consider the cost of each individual operation. However, a highly accurate approximation that runs in a small number of computation cycles may be useful when embedded in a field-programmable gate array (FPGA) or microcontroller unit (MCU), or when performing many large-scale operations on a graphics processing unit (GPU). Without an accelerator or compiler assistance, algorithms must be devised to obtain the desired calculated values directly. The residual correction method (RCM) developed here produces simple and accurate approximations of certain nonlinear functions with minimal multiply-add operations. In this study, we designed algorithms for the approximate computation of trigonometric and inverse trigonometric functions, which are nonlinear elementary functions, to achieve their fast and accurate computation. A fast first approximation and a more accurate second approximation of each function were created using the RCM, with an error of less than 0.001, using multiply-add operations only. This result is particularly useful for MCUs, which have low power consumption but limited computational power, and the proposed approximations are candidate algorithms for stabilizing the attitude control of robots and drones, which require real-time processing.


Introduction
The approximation of elementary functions is a critical issue in science and engineering, including computer graphics, scientific computing, and signal processing [1][2][3][4]. Central processing units (CPUs) have evolved to include instructions for efficient computation, but electronic circuits with limited resources, such as field-programmable gate arrays (FPGAs) and microcontroller units (MCUs), may lack accelerators such as a floating-point unit (FPU) or digital signal processor (DSP). Therefore, techniques that avoid division and other operations requiring many cycles are important for large-scale calculations and embedded systems [5]. Division and square root are frequently used even though they are up to ten times slower than addition and multiplication, and this trend has not changed in the latest architectures [6,7].
In particular, MCUs have low computational power, with an instruction set limited to basic arithmetic and logic operations, but they are small, consume little power, and are used in a wide variety of products, including smartphones, audio accessories, video game peripherals, and advanced medical equipment [8][9][10][11][12][13]. Furthermore, small size and low power consumption are required in areas such as IoT devices [14,15] and drone control [16][17][18][19].
Even in the rapidly developing field of machine learning, fast approximate computation is critical for shortening learning times and achieving real-time inference [20][21][22][23]. GPUs can process a large number of multiply-add operations in parallel, and replacing the nonlinear calculations used in machine learning with approximations built from multiply-add operations may yield higher speeds and smaller models.
When compiler optimization was immature, implementation techniques that avoid division were very useful in the field of embedded electronic circuits [24]. Even today, many MCUs do not have an extended instruction set, so such techniques remain very important. For example, replacing division by multiplication with a reciprocal constant and using bit shifts to multiply and divide by powers of two are well-known techniques [25]. Computational tricks exploiting the IEEE754 floating-point structure are also effective, and useful techniques are known for obtaining fast approximations of exponential and logarithmic functions [21,26] and inverse square roots (reciprocal sqrt) [25,[27][28][29][30][31][32]. The latter in particular, the well-known fast inverse square root (FISR), is a useful technique for finding both the reciprocal and the square root.
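As a concrete illustration, the classic FISR can be sketched in a few lines of C. This is a minimal sketch using the original magic constant 0x5f3759df and one Newton-Raphson refinement step; production implementations vary in constant choice and iteration count.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Classic fast inverse square root (minimal sketch): a bit-level
   initial guess followed by one Newton-Raphson refinement step. */
static float fisr(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);      /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);     /* magic-constant initial guess */
    memcpy(&x, &i, sizeof x);
    x = x * (1.5f - half * x * x); /* one Newton-Raphson step */
    return x;
}
```

With one refinement step, the relative error stays below roughly 0.2%, which is why the same routine also serves as a cheap normalizer (1/&#8730;(x&#178;+y&#178;)) later in the paper.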
As mentioned, it is important to obtain elementary functions at high speed and low cost when considering integration into electronic circuits or machine learning on GPUs. Among the elementary functions, addition and multiplication can generally be processed by CPUs at high speed, and division can be calculated using the fast inverse square root described above. Moreover, the exponential and logarithmic functions, which are nonlinear elementary functions, can be obtained using IEEE754 calculation tricks. The remaining elementary functions are the trigonometric and inverse trigonometric functions, for which some methods have been proposed.
In general, approximations of trigonometric functions are often obtained using Taylor expansions, which converge slowly and cannot cover a wide domain without terms beyond the seventh order. Inverse trigonometric functions are similarly computed using polynomial approximation [33], rational approximation [34][35][36][37], lookup tables (LUTs), and coordinate rotation digital computer (CORDIC) algorithms [38][39][40], but each of these has disadvantages, such as requiring a large amount of memory or iterative calculations.
To solve this problem, we proposed an algorithm that approximates trigonometric functions (sin and cos) and inverse trigonometric functions (atan2) using only multiply-add operations [41]. In the proposed method, the approximation is achieved with one addition and two multiplications. The proposed residual correction method (RCM) minimizes the error over the entire domain, rather than being locally precise like the Taylor expansion. The atan2 function created by the proposed method is faster than the previously used DSP-trick [42] and yields a smaller error. The DSP-trick is well suited to MCUs, but it must always perform division; it is therefore tightly coupled with division. In contrast, our proposed method can be executed with only multiply-add operations if the norm is known, so it does not necessarily require division and is only loosely coupled with it. Since the norm of the measured values is known for most posture-measurement sensors, the proposed method, which can skip normalization, is suitable for embedded devices. If normalization is required, the algorithm can be combined with any normalization method, such as the fast inverse square root for MCUs or single instruction, multiple data (SIMD) instructions for advanced computing devices.
In this study, we extend our proposed method to compute even more accurate approximations of trigonometric and inverse trigonometric functions at a similar computational cost.

Concept of Approximation by Residual Correction Method
We have been developing wearable posture and fatigue measurement devices [43,44] and assistive suits [45][46][47] using small, low-power MCUs. For this purpose, it is necessary to measure human posture in real time from sensor data using only the simple arithmetic and logical operations available on MCUs. Therefore, in our previous research, we developed multiply-add-based approximate formulas for trigonometric and inverse trigonometric functions to realize posture calculation on MCUs [41].
We proposed the concept of designing approximate expressions by the residual correction method: simple algebraic expressions using only multiply-add operations, without division or iterative convergence algorithms, to achieve efficient calculation on embedded systems.
In the previous study, as an example of approximation by the residual correction method, approximation formulas based on multiply-add operations were created for the trigonometric functions (sin and cos) and an inverse trigonometric function (atan2).
The following procedure was used to design the approximate formula.
(1). Linear approximation with a fixed origin and fixed domain endpoints;
(2). Evaluation of the residual against the target function;
(3). Design of a residual correction function (RCF), r(x) ≈ e_res;
(4). Approximate formula with residual correction.
The concept of the residual correction method is to express the RCF using only multiply-add operations, which must be explored heuristically. Owing to this restriction, the order of the candidate functions for the RCF is limited to three or less. The RCF candidates were evaluated in terms of both computational cost and approximation error, and it was shown that, for the nonlinear functions mentioned above, parabolic functions provide efficient approximation formulas that express each function using only multiply-add operations.
Although the above method easily yields a quadratic approximation formula for a nonlinear function, some applications require an even smaller error. In the next section, we propose a method that reduces the approximation error while keeping the computational cost low by extending the residual-corrected approximation formula.

Second Approximation by the Residual Correction Method for Trigonometric Functions
In a previous study, the approximations made by the residual correction method for trigonometric functions were as follows [41]: These are of the same form as the famous quadratic approximation of the sine function [48] on [−π, π], but the error is larger around θ = π/4, as shown in Figure 1. Here, the above formula is taken as the first approximation of the residual correction method, and a method to further reduce the error, again by residual correction and under the constraint of using only multiply-add operations, is examined.

Exploring RCF Candidates
Here, we consider the sine function and evaluate the residuals of the first approximation. The residuals have the shape of an odd function, as shown in Figure 2a. The following RCF candidates, computable with multiply-add operations, can be considered for this residual: (1) an RCF with a signed quadratic function; (2) an RCF with a cubic function; (3) an RCF with a co-function. Although a square root appears in the RCF using the co-function, a built-in implementation would use the fast inverse square root. Figure 2b shows the evaluation of the RCF candidate functions. Here, the coefficients of the candidates are designed as α_sq = 0.224, α_cubic = 0.145, and α_cofunc = 0.5 so that the residuals are well corrected. As can be seen from the graph, the signed quadratic function is closest to the residuals of the first approximation. In addition, evaluating a sign is faster than a multiplication or a fast inverse square root, indicating that the signed quadratic function is the most suitable RCF.

Second Approximation of Trigonometric Functions
Using r_sq(θ) as the RCF, the second approximation of the sine function is obtained; a similar argument yields the second approximation of the cosine function. The shape of the final second approximation is shown in Figure 3. To evaluate its performance, the final residuals and computation times are shown in Figure 4 and Table 1. The final residual is, at most, approximately 0.00092, which is sufficiently small compared with the first approximation. The computation time was evaluated on an Intel Xeon E3 (2500 MHz) CPU. The processing time was 1.4 ns for the sine approximation and 2.04 ns for the cosine approximation, approximately twice as long as the first approximation but more than 30 times faster than the built-in functions in math.h.
Comparing the first approximation, Equation (3), with the second approximation, Equation (4), we observe that they have the same parabolic form. The second approximation keeps the result of the first approximation in memory and then feeds it into a parabolic function of the same shape. The first approximation thus costs one parabolic evaluation and the second costs two. The computational cost is therefore approximately doubled, but the approximation error is reduced to approximately 1/60.

Second Approximation by the Residual Correction Method for Inverse Trigonometric Functions
Approximate formulas based on the residual correction method for inverse trigonometric functions were also proposed in our previous study [41].
Note that this equation is designed assuming that the norm is known, x² + y² = 1, and that the domain is [−π/2, π/2]. In many embedded systems, the norm is known for sensors that measure direction cosines, but when the norm varies, the input is assumed to be normalized in advance using the fast inverse square root. The computational cost of this normalization is discussed later. Next, we consider achieving higher accuracy through a second approximation of the inverse trigonometric functions, making use of the trigonometric approximations above.

Domain Expansion
Since the domain of Equation (5) is [−π/2, π/2], it must be extended to cover the full range of atan2(y, x). Two methods are evaluated in this paper: one based on a conditional branch and the other based on the half-angle identity.
The method using the conditional branch switches the formula according to the signs of the variables.
Another way to double the domain is to use the half-angle identity. This can be expressed in a single algebraic expression, eliminating the need to switch equations. The half-angle identity of the arctangent, atan2(y, x) = 2 atan(y/(1 + x)), together with the constraint x² + y² = 1, can be organized into a single formula. Here, 1/√(·) appears in the equation, and the fast inverse square root G(·) is used to evaluate it. This extends the domain from [−π/2, π/2] to [−π, π], so that atan2 can be approximated with a single algebraic expression.
The results of each domain extension are shown in Figure 5a, and the residuals are shown in Figure 5b.

Exploring RCF Candidates
Next, we consider candidates for the RCF; however, as atan2 is a function of two variables, it is difficult to find candidates among simple multiply-add expressions such as parabolic functions. Therefore, we consider Newton's method, referring to the refinement step of the fast inverse square root.
From the constraints on the trigonometric functions, the evaluation function f(θ) used in Newton's method and its derivative can be written in terms of x, y, and θ as f(θ) = x sin θ − y cos θ and f′(θ) = x cos θ + y sin θ. Using these, the error converges to zero by iterating θ_{n+1} = θ_n − f(θ_n)/f′(θ_n). The second term on the right-hand side, which is the correction term, can therefore be used as a candidate for the RCF.
Thus, the sin and cos functions are needed in the RCF to cancel the approximation error of atan2. From the evaluation in the previous chapter, the trigonometric functions in the math library are computationally expensive. Therefore, we replace them with s2(θ) and c2(θ), the fast and highly accurate approximations derived above.
Therefore, let the candidate function of the RCF be as follows: There are two problems with this candidate. One is that it involves division, violating the design concept of the residual correction method. The other is that the denominator may become zero due to approximation errors in the trigonometric functions. Therefore, r_atan2(y, x) is not suitable as an RCF. However, analyzing its behavior reveals a new RCF. When the trigonometric functions carry no approximation error, the denominator f′(θ) is close to one near the true angle (since x² + y² = 1) and the numerator f(θ) vanishes there, so the correction term disappears exactly when it should. The essential requirement for the RCF is thus that it vanish together with the residual, a property that f(θ) possesses by itself. Therefore, f(θ) itself is adopted here as the RCF candidate.
Although the coefficient offers a degree of design freedom, optimization yields a value of approximately 1.0, as shown next, which eliminates one multiplication.
The same result is obtained whether at2_if(y, x) or at2_ex(y, x) is used as at2*(y, x). A plot of the new RCF is shown in Figure 6a; here, the domain expansion at2_ex(y, x) based on the half-angle identity is used. It can be seen that the RCF, which can be computed by multiply-add operations, agrees very well with the residual of at2_ex(y, x) and is therefore appropriate. Figure 6b shows the case where at2_if(y, x) is evaluated with the same RCF; this RCF can be applied whenever the first approximation is sufficiently good. Therefore, other atan2 approximation methods, such as the DSP-trick, can also be refined to high accuracy with this RCF.

Second Approximation of Inverse Trigonometric Functions
Having obtained the RCF, the second approximation of atan2 can be computed using the second approximations of the trigonometric functions as follows: Figure 7 shows the residuals of at2_if(y, x) and at2_ex(y, x). Both can be used for at2*(y, x) in terms of error, but at2_if(y, x) is preferable in terms of computational cost. In addition, the constraint x² + y² = 1 has been assumed in the design so far, but normalization may be necessary in some applications. Our proposed atan2 approximation can be combined with any normalization algorithm. Recent CPUs can normalize quickly with SIMD instruction sets, but for MCUs with only a simple instruction set, normalization is a major problem [49]. Therefore, the fast inverse square root, which can be executed using only multiply-add and logic operations, is a useful technique for MCUs. Any normalization algorithm or SSE instruction can be used in general; here, we assume an MCU and verify the use of the fast inverse square root. The normalized second approximation, in which the variables are scaled by the fast inverse square root, is defined as at2_2n(y, x) and can be calculated using G(·) as follows: The error between the calculation results and the Math library is shown in Figure 8. The increase in computational cost due to normalization is shown as at2_2n in Table 2. Compared with at2_2, it is twice as large, which indicates that the highly accurate approximation at2_2 can be performed with the same computational cost as the fast inverse square root itself. In this study, the original fast inverse square root [32] is used for the evaluation implementation, but faster implementations have been studied, such as the algorithm in [50], which is faster on FPGAs.

Results
Second approximations of the trigonometric and inverse trigonometric functions have been developed, and the results are summarized here. The maximum error indicates the maximum value of the final residuals of the second approximation, and the computational cost indicates the number of multiply-add operations used. Although an example of the computation time has already been shown, only the number of operations is given here because computation time varies with CPU performance and architecture.
For the second approximation of the trigonometric functions, Equations (3) and (4) are considered good approximations, and the error and computational cost are shown in Table 3. The final error is less than 0.001 over the entire domain, and the computational cost is very low. Equation (17) is considered a good second approximation of the inverse trigonometric function; its error and computational cost are shown in Table 4. The approximation error is less than 0.001 rad over the entire domain. The computational cost is somewhat higher because two second approximations of trigonometric functions are required, but the formula still consists of only a finite number of multiply-add operations.

Discussion
In this study, we designed a second approximation using RCFs to achieve higher accuracy in approximating trigonometric and inverse trigonometric functions.

Approximation of Trigonometric Function
As mentioned in the Introduction, the Taylor series converges slowly but can compute trigonometric functions to arbitrary accuracy. We compare the Taylor series and the proposed method on the domain [−π, π]. Taylor series from the first to the ninth order are evaluated. In addition, the computational cost is evaluated both for the defining formula and for the following expansion, which reduces the number of multiplications for embedded use.
The left-hand side of Equation (20) is the definition of the kth-order Taylor expansion, and the right-hand side reduces the number of multiplications by storing x² in memory. The error is evaluated numerically by Romberg integration, and the computational cost is compared by the number of multiplications. Table 5 and Figure 9 show the comparison results. As mentioned above, the Taylor series is mathematically exact in the limit, but its convergence is slow, so a large error remains over a wide domain. Although the proposed method cannot reduce the error to zero, it obtains results with fewer calculations and a smaller error.

Approximation of Inverse Trigonometric Function
Next, we discuss the approximation of the inverse trigonometric function. Depending on the application, the desired accuracy can be selected: the first approximation may be sufficient, the second approximation may be necessary, or normalization may also be required. Table 6 shows a comparison with other methods. All of them trade accuracy against computational cost and should be chosen according to the application. The approximate formula obtained by the residual correction method is algebraic and extremely easy to implement as a program, so it is a practical option for approximate calculation. While exact calculation methods are suitable for obtaining precise values, as in numerical simulations, multiply-add operations are very effective in compact systems such as MCUs and FPGAs and in large-scale machine-learning operations. In such fields, the proposed method can provide highly accurate approximations with a finite number of multiply-add operations and without division.
The following is an example of compilation and execution results on an actual MCU. The program memory occupancy of the proposed algorithm for each MCU grade is shown in Table 7. The well-known Microchip PIC family and the XC8 compiler are considered in these examples. On the lowest-grade model, the atan2 function alone consumes the entire program memory, so the program cannot be compiled; in contrast, the proposed method can be implemented successfully. Other grades of MCUs without an FPU or DSP occupy the same amount of program memory. Since embedded systems usually require not only function calculations but also various other processes, such as measurement and equipment control, the program memory of the MCU must be conserved as much as possible.
The second approximation of the proposed method also uses s2 and c2 in its calculation process; thus, the trigonometric approximations are available simultaneously. Including the corresponding functions from the Math library requires a total of 7.7 KB, so the approximation is advantageous for performing complicated calculations.
Next, the calculation speed was measured on an actual MCU (PIC18F2580). The experimental results are summarized in Table 8. The first approximation of atan2 is fast enough, while the speed of the second approximation approaches that of the Math library as the calculation volume increases. Therefore, if there is sufficient program memory, using the Math library is the most accurate choice. However, when program memory is limited, as in low-grade MCUs, the first approximation can be used for speed and the second approximation when accuracy is required, which increases the freedom of implementation. As mentioned above, MCUs perform not only function calculations but also various other tasks, such as measurement and equipment control, so function calculations should be as fast as possible.
In this study, we developed highly accurate approximation formulas for sin and cos as trigonometric functions and for atan2 as an inverse trigonometric function using the residual correction method. The tan, asin, and acos functions can be derived from the fast inverse square root and the approximation formulas proposed here via trigonometric identities. Furthermore, the log and exp functions can be computed by exploiting the IEEE754 structure, as in the fast inverse square root. Thus, besides the four arithmetic operations, the logarithmic, exponential, trigonometric, and inverse trigonometric functions can all be computed quickly, enabling highly accurate approximations of all elementary functions with a finite number of multiply-add operations and bit shifts in embedded systems.

Conclusions
In this paper, we proposed a second approximation of the residual correction method that approximates trigonometric and inverse trigonometric functions with high accuracy using a small number of calculations. With the proposed algorithm, nonlinear calculations that were previously impractical can now be realized even on small, low-power computers such as MCUs. Although the error cannot be reduced completely to zero, accurate approximations can be computed at low cost. This can contribute to machine downsizing and cost reduction, and the method is a candidate algorithm for MCU-based devices that require real-time performance, such as drones and IoT devices.
To approximate trigonometric and inverse trigonometric functions with high accuracy using finite multiply-add operations, a second approximation based on the residual correction method was proposed. Trigonometric functions can be computed with a maximum error of less than 0.001 using two additions and four multiplications, and atan2 can be computed with a maximum error of less than 0.001 rad using seven additions and fourteen multiplications. The second approximation was designed and evaluated using floating-point arithmetic. In future research, we aim to optimize the formulas for embedded systems by converting them to fixed-point arithmetic. Since the residual correction method is formulated using only multiply-add operations without division, it is well suited to fixed-point arithmetic, and further acceleration can be expected in embedded implementations.