An Efficient Hardware Architecture with Adjustable Precision and Extensible Range to Implement Sigmoid and Tanh Functions

Abstract: The efficient and precise hardware implementation of the tanh and sigmoid functions plays an important role in various neural network algorithms. Different applications have different requirements for accuracy. However, it is difficult for traditional methods to achieve adjustable precision. Therefore, we propose, for the first time, a hardware-efficient, adjustable-precision and high-speed architecture to implement them. First, we present two methods to implement the sigmoid and tanh functions. One is based on the rotation mode of hyperbolic CORDIC and the vector mode of linear CORDIC (called RHC-VLC); the other is based on the carry-save method and the vector mode of linear CORDIC (called CSM-VLC). We validate the two methods by MATLAB and RTL implementations. Synthesized under the TSMC 40 nm CMOS technology, a special case AR | RV (3, 0), based on the RHC-VLC method, has an area of 4290.98 µm² and a power of 1.69 mW at a frequency of 1.5 GHz. Under the same frequency, AR | CV (3) (a special case based on the CSM-VLC method) costs 3196.36 µm² of area and 1.38 mW of power. Both are superior to existing methods for implementing such an architecture with adjustable precision.


Introduction
Artificial neural networks (ANNs) are widely used in the applications of pattern recognition, image classification, biological systems and so on [1]. In the past, ANNs were generally implemented only in software. But in recent years, with the development of artificial intelligence and integrated circuits, their hardware implementations have become more and more important because of the performance gains compared with software implementations. In ANNs, each layer or component is important, mainly including convolution, pooling, full connection and activation function. The activation functions can introduce nonlinear factors to neurons, so that the ANNs can approximate any nonlinear functions and be applied to nonlinear models. Therefore, an efficient and accurate implementation of non-linear activation functions, such as tanh and sigmoid, is of high interest.
The tanh and sigmoid functions both shape like an "S" curve, where the outputs of tanh vary between (−1, 1) and the range of sigmoid outputs is (0, 1). Mathematically, their functions are defined below (S(x) → sigmoid function, T(x) → tanh function).
In fact, the tanh function can be converted to: By dividing numerator and denominator by e −x , Equation (2) can be rewritten as: It follows that they are functionally related, as above, which enables them to be implemented in a unified architecture. From the formulas, we know that the exponential terms are the key factors that generate the nonlinear behavior. Meanwhile, the accuracy and hardware cost of exponential and division operations can influence the performance of whole neural networks [2,3]. Up to now, some methods have been put forward to solve their implementation problems and can be divided into the following types: look-up tables (LUT) or range addressable LUT (RALUT)-based methods [4][5][6][7], piecewise linear (PWL)-based methods [8][9][10][11], piecewise nonlinear (PWNL)-based methods [12,13] and Taylor series (TS)-based methods [14].
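As a worked restatement of the relationship described above, the following block writes out the standard definitions of the two functions and the conversion of tanh into a sigmoid-like form. It is a reconstruction from the textbook definitions and does not assume the equation numbering of the original.

```latex
\begin{align*}
S(x) &= \frac{1}{1+e^{-x}}, \qquad
T(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \\[4pt]
T(x) &= \frac{e^{2x}-1}{e^{2x}+1}
      \quad\text{(numerator and denominator divided by } e^{-x}\text{)} \\[4pt]
     &= 1-\frac{2}{1+e^{2x}} \;=\; 2\,S(2x)-1
\end{align*}
```

Both functions therefore reduce to one exponential ($e^{-x}$ or $e^{2x}$), one addition of 1 and one division, which is why a unified architecture is possible.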
The LUT- or RALUT-based methods use look-up tables to store a large number of points or range values, respectively. Taking the LUT-based method as an example, it approximates the tanh or sigmoid function with a finite number of sample points, which may be evenly or unevenly distributed. The number of points is related to the number of bits used in the hardware implementation, and it determines the maximum or average error of the results. The biggest advantage of this method is its short latency: it usually needs just one clock cycle. However, its biggest disadvantage is that it requires a lot of memory to store the look-up table. The larger the input range of the computation, or the higher the precision of the results, the larger the memory requirement, which is undesirable in hardware implementations.
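The memory-versus-precision trade-off of a LUT can be illustrated with a small software model. This is a minimal sketch, assuming an evenly spaced table with nearest-entry lookup over a hypothetical input range of [−8, 8]; the function names and the range are illustrative and are not taken from the cited LUT designs.

```python
import math

def build_sigmoid_lut(x_min, x_max, addr_bits):
    """Sample sigmoid at 2**addr_bits evenly spaced points on [x_min, x_max]."""
    size = 1 << addr_bits
    step = (x_max - x_min) / (size - 1)
    table = [1.0 / (1.0 + math.exp(-(x_min + i * step))) for i in range(size)]
    return table, step

def lut_sigmoid(x, table, x_min, step):
    """Nearest-entry lookup; inputs outside the table range are clamped."""
    idx = round((x - x_min) / step)
    idx = max(0, min(len(table) - 1, idx))
    return table[idx]

def max_abs_error(addr_bits, x_min=-8.0, x_max=8.0, samples=20001):
    """Worst observed error of the LUT over a dense sweep of the input range."""
    table, step = build_sigmoid_lut(x_min, x_max, addr_bits)
    worst = 0.0
    for j in range(samples):
        x = x_min + (x_max - x_min) * j / (samples - 1)
        exact = 1.0 / (1.0 + math.exp(-x))
        worst = max(worst, abs(exact - lut_sigmoid(x, table, x_min, step)))
    return worst
```

In this model each extra address bit doubles the table size and roughly halves the worst-case error, which is the memory growth the paragraph above refers to.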
Similarly, PWL-based methods use a set of straight lines to approximate the nonlinear curves of the functions, and the accuracy depends on the number of lines. Compared with LUT- or RALUT-based methods, this approach does not store massive data and requires less memory. However, replacing nonlinear behavior with linear segments limits its accuracy. Especially where the curvature of the function is large, the number of segments needs to be increased to obtain better accuracy, and more segments mean more storage and more complex selection logic. Moreover, each linear segment is fitted as $y = k_1 + k_2 \times x$, which uses one multiplier. A multiplier is costly in hardware because of its negative effect on circuit throughput and area utilization, and in many designs it limits the achievable working frequency. Therefore, it is preferable to avoid using multipliers.
By contrast, PWNL-based methods use a nonlinear approximation in each segment. Similar to the PWL-based method, each segment is fitted by the following nonlinear formula: $y = k_1 + k_2 x + k_3 x^2 + \cdots + k_n x^{n-1}$, where the nth order needs $(n + 1)n/2$ multiplication operations. The main advantage of this method is that it can achieve higher precision than the PWL-based method. However, the large number of multiplications results in a low working frequency for the whole architecture [12].
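The multiplication count quoted above can be checked with a short model. This is a sketch under the assumption that every term $k \cdot x^j$ is formed from scratch with j multiplications; the function names are hypothetical.

```python
def naive_poly_eval(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + ... term by term.

    Each power of x is rebuilt from scratch, so the term with x**j costs
    j multiplications. Returns (value, number_of_multiplications).
    """
    total = coeffs[0]
    mults = 0
    for j in range(1, len(coeffs)):
        term = coeffs[j]
        for _ in range(j):  # j multiplications for coeffs[j] * x**j
            term *= x
            mults += 1
        total += term
    return total, mults

def expected_mults(order):
    """Closed form (order + 1) * order / 2 for a polynomial of the given order."""
    return (order + 1) * order // 2
```

For a third-order segment such as `naive_poly_eval([1, 2, 3, 4], 2)`, the model performs 1 + 2 + 3 = 6 multiplications, matching `expected_mults(3)`.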
The TS-based method faces the same challenges as the LUT- or RALUT-based methods. Its accuracy cannot be very high because it is limited by the chosen range.
Therefore, it is difficult for any of the above methods to combine all the desired properties for computing the tanh and sigmoid functions: high hardware efficiency, adjustable precision, extensible input range, and high computation speed. Building such a hardware architecture with these methods would be very costly. However, a universal reconfigurable accelerator (such as RASP [15]) usually includes many operators, such as tanh and sigmoid, and an operator with adjustable precision and the above properties would be very useful there. Thus, our proposed architectures to implement tanh and sigmoid have important application value. On the whole, we make the following contributions:

• For the first time, we propose a hardware architecture with adjustable precision and extensible input range to implement the tanh and sigmoid functions, which is based on the RHC-VLC method. In addition, we propose another architecture with unlimited input range and adjustable accuracy, which is based on the CSM-VLC method.

• Of the two proposed methods, the accuracy of the CSM-VLC method can reach a magnitude of $10^{-4}$ at best, while the accuracy of the RHC-VLC method can be much better, such as $10^{-7}$ or $10^{-8}$. At the same time, the proposed methods can change the accuracy by adjusting the iterations of CORDIC without changing the current computing architecture. The lower the accuracy requirement, the lower the overall latency. Other methods of adjusting the precision require architectural changes, which are extremely unfriendly in hardware implementation.

• In hardware implementation, the RHC-VLC-based method only requires shift-and-add (or subtract) operations. The CSM-VLC-based method requires a constant multiplier, which we optimize to achieve shorter delay and smaller area. Compared with other existing methods to implement a hardware architecture with adjustable precision, our hardware implementation is more efficient. The proposed architecture has dual computing capabilities (tanh and sigmoid), selected by an input selection bit.

• Under TSMC 40 nm CMOS technology, the hardware architecture based on our proposed methods can work at a frequency of 1.5 GHz or even higher. Besides the high frequency, our methods also retain the other advantages: efficient hardware, adjustable precision, and extensible input range. With the same adjustable range of precision, our architecture has lower area and power consumption compared with other methods.
The rest of this paper is organized as follows. Section 2 details the first proposed architecture based on the RHC-VLC method to implement tanh or sigmoid functions. Section 3 presents another proposed architecture based on the CSM-VLC method. Section 4 shows the general hardware implementation and a specific case of these two methods. Then, we make a comparison with other methods. Finally, Section 5 draws the conclusions.

Proposed Architecture Based on RHC-VLC Method
To solve the problem of not being able to build a hardware architecture that has all of the above four advantages, we propose two solutions: The RHC-VLC-based method and CSM-VLC-based method. In this section, we first detail the architecture based on the RHC-VLC-method, including basic theory, method process, software simulation and algorithm construction. Another architecture based on the CSM-VLC method will be presented in the next section.

Overview of RHC and VLC
Before introducing the first method, we need to review what RHC and VLC are. Only based on this theory can we build an architecture with adjustable precision and extensible input range to compute sigmoid and tanh functions.
In 1959, Volder invented CORDIC to evaluate trigonometric functions, multiplication and division [16]. Later, Walther extended its computation capacities to compute exponentials, logarithms and square roots [17]. In this paper, only VLC and RHC are adopted, so we omit the introduction to other types of CORDIC. Their iterative formulas can be found in [18].
For VLC, $\eta = \mathrm{sign}(y_k)$, $k \ge 0$ and $k$ is an integer. After a few iterations, RHC and VLC will converge to some common functions, which are shown in Table 1.

Table 1. Columns: Mode of CORDIC; Outputs.

In Table 1, $\Gamma = \prod_{k=1}^{n} \sqrt{1 - 2^{-2k}}$ is called the scale-factor, where the terms (k = 4, 13, 40, ...) should be multiplied again. When the iteration number of RHC is determined, Γ will be a constant value, so we do not need to compute it in actual applications.
From Table 1, we know that RHC can calculate the hyperbolic sine (sinh) and hyperbolic cosine (cosh) if we initialize the inputs as $x_0 = 1/\Gamma$ and $y_0 = 0$, and VLC can implement the division operation if $z_0$ is initialized to 0. Together, they lay the foundation for computing the sigmoid and tanh functions.
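For reference, the iterations and limits discussed above can be written in their common textbook form. This is a sketch that assumes the usual sign conventions ($\delta_k$ for the rotation direction of RHC is a symbol introduced here, and $x_0 > 0$ is assumed for VLC); the exact notation of [18] may differ.

```latex
\begin{align*}
\text{RHC:}\quad
  & x_{k+1} = x_k + \delta_k\, 2^{-k} y_k, \quad
    y_{k+1} = y_k + \delta_k\, 2^{-k} x_k, \quad
    z_{k+1} = z_k - \delta_k \tanh^{-1}\!\left(2^{-k}\right), \quad
    \delta_k = \operatorname{sign}(z_k) \\[4pt]
\text{VLC:}\quad
  & x_{k+1} = x_k, \quad
    y_{k+1} = y_k - \eta_k\, 2^{-k} x_k, \quad
    z_{k+1} = z_k + \eta_k\, 2^{-k}, \quad
    \eta_k = \operatorname{sign}(y_k) \\[4pt]
\text{RHC limits:}\quad
  & x_n \to \Gamma\,(x_0 \cosh z_0 + y_0 \sinh z_0), \quad
    y_n \to \Gamma\,(y_0 \cosh z_0 + x_0 \sinh z_0), \quad
    z_n \to 0 \\[4pt]
\text{VLC limits:}\quad
  & x_n = x_0, \quad y_n \to 0, \quad z_n \to z_0 + \frac{y_0}{x_0}
\end{align*}
```

With $x_0 = 1/\Gamma$ and $y_0 = 0$, the RHC limits reduce to $x_n \to \cosh z_0$ and $y_n \to \sinh z_0$; with $z_0 = 0$, the VLC limit reduces to the quotient $y_0/x_0$.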

The RHC-VLC-Based Method
This method of computing S(x) contains two steps. The first step deals with the exponential operation $e^{-x}$, which can be implemented by RHC by initializing its inputs as $x_0 = 1/\Gamma$, $y_0 = 0$ and $z_0 = -x$. Then, we can get the result of $e^{-x}$ as $x_n$ plus $y_n$, because $\cosh(z_0) + \sinh(z_0) = e^{z_0}$. The second step handles the division operation, which can be calculated by VLC. To this end, the inputs of VLC are initialized so that $z_0 = 0$ and $y_0/x_0 = \frac{1}{1+e^{-x}}$ (for example, $x_0 = 1 + e^{-x}$ and $y_0 = 1$), so the output $z_n$ will converge to $\frac{1}{1+e^{-x}}$, which is exactly the result of S(x). From the relationship in Equation (1), we can design the architecture shown in Figure 1 to calculate S(x) or T(x). Here, "t" represents the functionality to be implemented by the current architecture: t = 0 means computing S(x), and t = 1 stands for computing T(x).

Since the convergence range of RHC or VLC is limited, we should analyze the range of the input variable x when computing S(x) and T(x) (called S&T). As can be intuitively seen from Figure 1, the convergence range of RHC is affected by VLC and the range of input x is determined by RHC.
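The two-step procedure can be sketched in floating point as follows. This is an illustrative model, not the authors' implementation: it uses only positive RHC iterations (so the input must stay inside the basic convergence range), it starts the VLC index at k = 0, and all function names and default iteration counts are assumptions.

```python
import math

REPEATED = (4, 13, 40)  # hyperbolic iterations that are executed twice

def rhc_exp(z0, n):
    """Rotation-mode hyperbolic CORDIC; returns x_n + y_n, which approximates exp(z0)."""
    ks = []
    for k in range(1, n + 1):
        ks.append(k)
        if k in REPEATED:
            ks.append(k)
    # Scale factor over every executed iteration, including the repeats.
    gamma = math.prod(math.sqrt(1.0 - 2.0 ** (-2 * k)) for k in ks)
    x, y, z = 1.0 / gamma, 0.0, z0
    for k in ks:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -k, y + d * x * 2.0 ** -k
        z -= d * math.atanh(2.0 ** -k)
    return x + y

def vlc_divide(y0, x0, p):
    """Vectoring-mode linear CORDIC; returns z_p, which approximates y0 / x0 (x0 > 0)."""
    y, z = y0, 0.0
    for k in range(p + 1):
        e = 1.0 if y >= 0 else -1.0
        y -= e * x0 * 2.0 ** -k
        z += e * 2.0 ** -k
    return z

def sigmoid_rhc_vlc(x, n=16, p=16):
    """S(x) = 1 / (1 + e^{-x}): RHC for the exponential, VLC for the division."""
    return vlc_divide(1.0, 1.0 + rhc_exp(-x, n), p)

def tanh_rhc_vlc(x, n=16, p=16):
    """T(x) = 1 - 2 / (1 + e^{2x}), reusing the same two modules."""
    return 1.0 - 2.0 * vlc_divide(1.0, 1.0 + rhc_exp(2.0 * x, n), p)
```

Reducing `n` or `p` in this model lowers the accuracy without changing the structure of the computation, which is the property the adjustable-precision architecture relies on.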
For the standard CORDIC, the convergence range of RHC is $|z_0| \le 1.1182$, and that of VLC is $|y_0/x_0| \le 1$ [19]. Because the output of T(x) lies in [−1, 1] and the output of S(x) lies in [0, 1], their ranges happen to be suitable for VLC, and we do not need to expand the convergence range of VLC. Next, we analyze the range of the variable x when we use RHC to compute $e^{-x}$ or $e^{2x}$: x belongs to [−1.1182, 1.1182] for computing S(x) and only to [−0.5591, 0.5591] for computing T(x). Both ranges are very limited and not suitable for practical applications. Therefore, we should expand the range of the input variable $z_0$ of RHC. Hu [19] shows that, if we extend the iteration index set from k = 0, 1, ..., n to k = −m, −m + 1, ..., 0, ..., n, the convergence range of RHC will be expanded. However, the iterative formulas change when k ≤ 0, where the terms containing $2^{-k}$ need to be replaced by $1 - 2^{-2^{-k+1}}$; this gives Equation (9).

The convergence range of the expanded RHC depends on the maximum sum of rotation angles, $\theta_{max}^{(m,n)}$, in which the terms with iteration number k = 4, 13, 40, ... should be accumulated twice. Accordingly, the convergence range of RHC is updated to $|z_0| \le \theta_{max}^{(m,n)}$, and we can enlarge or reduce the range by changing the value of m. For example, if n is assumed to be 15, we can change the value of m to get different convergence ranges of RHC, including the corresponding input ranges for T(x) and S(x), as shown in Table 2.
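The sum of rotation angles can be evaluated numerically. The sketch below is an assumption-laden illustration: the non-positive indices are passed explicitly as a sequence (how m maps onto them follows the original Table 2, which is not reproduced here), and the function names are hypothetical.

```python
import math

REPEATED = (4, 13, 40)  # positive indices whose rotation angle is accumulated twice

def angle(k):
    """Rotation angle of iteration k.

    k >= 1 uses atanh(2**-k); k <= 0 uses atanh(1 - 2**-(2**(1 - k))).
    """
    if k >= 1:
        return math.atanh(2.0 ** -k)
    return math.atanh(1.0 - 2.0 ** -(2 ** (1 - k)))

def theta_max(n, nonpositive=()):
    """Sum of all rotation angles for positive indices 1..n plus the given non-positive indices."""
    total = sum(angle(k) for k in nonpositive)
    for k in range(1, n + 1):
        total += angle(k) * (2 if k in REPEATED else 1)
    return total
```

With positive iterations only, `theta_max(15)` evaluates to about 1.118, in line with the standard range quoted above; each added non-positive index widens the range. Because RHC receives $z_0 = -x$ for S(x) and $z_0 = 2x$ for T(x), the admissible |x| for T(x) is half of that for S(x).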
Since the iterations of the expanded RHC are governed by two parameters (let them be m and n, m ≥ 0, n ≥ 1) and the iterations of VLC by one parameter (let it be p, p ≥ 1), we denote the architecture for computing S&T as Ar RHC VLC (m, n, p).

Software Test for the RHC-VLC-Based Method
Before we go further and design an architecture with adjustable precision, it is necessary to conduct a software simulation for validating and exploring the accuracy of computing S&T. First, we input different data x to verify the correctness of S&T. Then, we explore the accuracy distribution of computing S&T through changing the iterations of VLC and RHC.
In order to evaluate the accuracy of the proposed architecture, we characterize it with the maximum absolute error (MAE) and the average absolute error (AAE), defined as $MAE = \max_{1 \le i \le NUM} |V_{ori}(i) - V_{pro}(i)|$ and $AAE = \frac{1}{NUM} \sum_{i=1}^{NUM} |V_{ori}(i) - V_{pro}(i)|$, where $1 \le i \le NUM$, $V_{ori}$ stands for the value of the original function, $V_{pro}$ represents the value produced by the proposed architecture, and NUM is the total number of test points.
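The two metrics are straightforward to compute in software. The following sketch evaluates them for a stand-in approximation (tanh truncated to 8 fractional bits), which is a hypothetical example chosen only to exercise the metrics and is not one of the proposed architectures.

```python
import math
import random

def mae_aae(reference, approx, points):
    """Return (maximum absolute error, average absolute error) over the test points."""
    errors = [abs(reference(x) - approx(x)) for x in points]
    return max(errors), sum(errors) / len(errors)

def truncated_tanh(x, frac_bits=8):
    """Stand-in approximation: tanh truncated to frac_bits fractional bits."""
    scale = 1 << frac_bits
    return math.floor(math.tanh(x) * scale) / scale

random.seed(0)  # fixed seed so the run is repeatable
points = [random.uniform(-2.0, 2.0) for _ in range(100_000)]
mae, aae = mae_aae(math.tanh, truncated_tanh, points)
```

Here `mae` is bounded by one unit in the last place of the truncated value (2^-8), and `aae` is never larger than `mae`.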
Since the accuracy of RHC and VLC is closely related to the number of iterations, we first explore the accuracy of RHC and VLC separately. All cases set NUM to 100,000 and the input x is generated randomly. (Figures 2 and 3 both use logarithmic ordinates.)

The first case: We set the number of iterations (n) of RHC to range from 1 to 20 and the input range is [−1.118, 1.118]. Since all iteration indices are positive, each iteration of RHC adopts the iterative formulas described by Equation (4). As can be seen from Figure 2, when n is greater than 10, the order of magnitude of AAE changes from −4 to −7. The magnitude of MAE varies from −4 to −6 when n is greater than 11. The more positive iterations, the higher the accuracy.
The second case: We set the number of negative iterations (m) of RHC to vary from 0 to 4 and the input range is [−2.028, 2.028]. The maximum positive iteration number (n) ranges from 1 to 20. The terms with positive iteration numbers still adopt the iterative formulas, shown as Equation (4), but those with negative iterations will use Equation (9). From Figure 3, we can see that the negative iterations basically do not affect the accuracy of RHC and the order of magnitude of AAE or MAE is the same as Figure 2.
The third case: In order to evaluate the computational accuracy of VLC, we set the number of iterations (p) to scale from 1 to 20 and the input range is [1,100]. The iterative formulas used in the simulation are shown in Equation (5). The following conclusions can be drawn from Figure 2: The larger the p value, the higher the calculation accuracy of VLC. When p ≥ 9, the order of magnitude of AAE is smaller than −4. However, when MAE reaches the same magnitude, p must be greater than 10.
Next, based on the accuracy of the above two computation modules, RHC and VLC, we use the control variable method to explore the precision distribution of computing S&T. All cases set NUM to 100,000 and the input x is generated randomly. From Figure 3, we know that the parameter m of RHC has little effect on the accuracy, so we set it to 0 first. In this case, the accuracy of computing S&T is only affected by the iterations n and p. In order to obtain the n and p corresponding to different magnitudes of MAE and AAE, we divide the discussion into three cases.

Case 1: The simulation result is shown in Figures 4 and 5. They tell us that, for the same iterations (n and p), the accuracy of tanh is slightly lower than that of sigmoid. Then, when 3 ≤ p ≤ 7 and 1 ≤ n ≤ 10, the accuracy improves as p increases. Finally, if p is small and n is large, the precision stays pretty much the same.
Case 2: The simulation result is shown in Figures 6 and 7. Similar to Case 1, when 8 ≤ p ≤ 14, no matter how n varies in 1 ≤ n ≤ 10, the accuracy remains basically the same. However, when n is increased in the range of 11 ≤ n ≤ 20, the accuracy is greatly improved. On the whole, the precision under the condition of Case 2 is better than that of Case 1.
Case 3: The simulation result is shown in Figures 8 and 9. Apart from some features common to Case 1 and Case 2, the overall accuracy of Case 3 is higher than that of Case 1 and Case 2 (n in the same range). As p increases, whether n is small or large, the accuracy remains basically the same. As n and p both increase gradually, the precision tends to be of the same magnitude.
Figure panels: (1 ≤ p ≤ 7, 1 ≤ n ≤ 10, AAE); (1 ≤ p ≤ 7, 1 ≤ n ≤ 10, MAE); (1 ≤ p ≤ 7, 11 ≤ n ≤ 20, AAE); (1 ≤ p ≤ 7, 11 ≤ n ≤ 20, MAE).

From the above experimental results, it can be seen that the computation accuracy of S&T is strongly associated with the combination of RHC and VLC iterations. As long as the iterations of RHC and VLC are determined, the output precision of these two activation functions is determined accordingly. In this way, the architecture based on the RHC-VLC method not only offers adjustable precision, but is also easy to implement and extend in hardware.

Proposed Algorithm with Adjustable Precision
Through the simulation verification in the previous subsection, we know the relationship between the computation precision of S&T and the iterations of RHC and VLC. Since each precision magnitude corresponds to multiple combinations of RHC and VLC iterations, we further classify and screen them according to a principle of fewer total iterations (Principle: AAE and MAE are expressed as "$K \times 10^{E}$", where $K \in (1, 5]$ and E is a negative integer), as shown in Tables 3 and 4. Here, only partial results corresponding to the order of magnitude of precision are shown. If the magnitude is less than −6, it can be obtained according to the above method, which will not be detailed in this paper.

Next, according to Sections 2.2 and 2.3, we propose an algorithm (Algorithm 1) to compute S&T with adjustable precision, which is called the RVST algorithm in this paper. It can be seen from Tables 3 and 4 that an accuracy of the same magnitude may correspond to multiple iterative combinations; the selection criterion is to take n smaller and p larger. The reason is that, according to Equations (4) and (5), the hardware implementation of VLC is less complex than that of RHC. Therefore, the combinations of m, n and p in the RVST algorithm are determined, and they are also the best combinations. The input range is shown in Table 2 and it varies with the value of m. However, the value of m has little effect on the accuracy. In the RVST algorithm, MAE is used as the accuracy standard; AAE can also be adopted.

Algorithm 1 (RVST algorithm)
Input: The parameter x that needs to be calculated; the calculated type t (t = 0 ⇒ S(x) and t = 1 ⇒ T(x)); the required order of magnitude (rm) of MAE (all are positive numbers, for example, 2 actually represents "−2"); m_i determines the input range of x

As can be seen from Algorithm 1, in addition to the conditional statements, there are two functions: S() and T() (more on that below). In general, the algorithm is clear and simple to implement in hardware.
Further, we introduce the algorithm of function S(). The principle of the algorithm is based on Figure 1 and it is termed the S algorithm. It should be noted that the values of 1/Γ in Figure 1 differ under different iterations of RHC, as shown in Table 5. Another thing to note is that, when RHC contains non-positive indexed iterations, the scale-factor Γ should be redefined to include them: $\Gamma = \prod_{k=-m}^{0} \sqrt{1 - \left(1 - 2^{-2^{-k+1}}\right)^{2}} \cdot \prod_{k=1}^{n} \sqrt{1 - 2^{-2k}}$, with the repeated terms (k = 4, 13, 40, ...) multiplied twice.

Similarly, we can derive the algorithm for the function T() from Figure 1, which is called the T algorithm. In the hardware implementation, multiplying by 2 is equivalent to shifting one bit to the left. For RHC and VLC, only shift-add (or subtract) operations are required. Therefore, in Algorithms 1-3, there is no complicated hardware logic, which ensures the efficiency of the whole architecture. The specific hardware implementation of the proposed architecture is detailed in Section 4.

Input: The parameter x that needs to be calculated; m, n, and p correspond to the iterations of RHC and VLC, respectively
Output: S() result SR

Proposed Architecture Based on CSM-VLC Method
In this section, we introduce another method based on the CSM (carry-save method) and VLC. The difference between this method and the RHC-VLC-based method is that the latter uses RHC to compute the exponential directly, whereas here the exponential is calculated based on the CSM. Therefore, we focus on how to calculate e −x and e 2x in the following.

Compute $e^{-x}$ and $e^{2x}$
Generally speaking, expanding exponential functions with Taylor's formula requires a large number of multiplication operations, and the hardware implementation of multiplication not only consumes a lot of logic resources, but also leads to a long data path delay. In addition, the input range of the exponential functions should not be large; otherwise, the data bit width becomes too large or the precision cannot be guaranteed in the hardware implementation. Therefore, in order to overcome the above disadvantages, we design a special hardware architecture for $e^{-x}$ and $e^{2x}$ in the tanh and sigmoid functions. First, by means of a mathematical transformation, we convert $e^{-x}$ and $e^{2x}$ into powers of 2: $e^{-x} = 2^{(-x) \cdot \log_2 e} = 2^{r_1}$ and $e^{2x} = 2^{(2x) \cdot \log_2 e} = 2^{r_2}$. Once we have the values of $r_1$ and $r_2$, we can easily separate their integer and fractional parts, assuming $r_1 = I_1 + D_1$ and $r_2 = I_2 + D_2$, where $I_1 = floor(r_1)$, $I_2 = floor(r_2)$, and floor(x) is a function that returns the greatest integer less than or equal to x, which is usually denoted as "floor(x) = [x]"; for example, floor(−1.2) = −2 and floor(1.2) = 1. Further, we can simplify Equation (13) as follows: $2^{r_1} = 2^{D_1} << I_1$ when $I_1 \ge 0$, or $2^{D_1} >> |I_1|$ when $I_1 < 0$, and likewise for $r_2$. Here, "<< X" means shifting X bits to the left and ">> Y" means shifting Y bits to the right. Therefore, the two natural exponential functions are transformed into the same exponential function with base 2, and the computation range is changed from "(−∞, +∞)" to "[0, 1)", which just needs an extra shift operation. For $2^{D_1}$ or $2^{D_2}$ with $D_1, D_2 \in [0, 1)$, we can easily implement them according to the required precision by taking advantage of the LUT-based or PWL-based methods.
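The base-2 decomposition can be modelled in a few lines. This is a minimal sketch assuming a plain LUT for the fractional power of two; the table size, the function names and the use of `math.ldexp` to stand in for the hardware shift are illustrative choices, not details from the original design.

```python
import math

LOG2E = math.log2(math.e)

def build_pow2_lut(frac_bits):
    """Table of 2**D for D = 0, 1/size, 2/size, ... in [0, 1)."""
    size = 1 << frac_bits
    return [2.0 ** (i / size) for i in range(size)]

def exp_base2(arg, lut):
    """Approximate e**arg as (2**D) shifted by I, where arg * log2(e) = I + D."""
    r = arg * LOG2E
    i_part = math.floor(r)            # integer part I (may be negative)
    d_part = r - i_part               # fractional part D in [0, 1)
    idx = min(int(d_part * len(lut)), len(lut) - 1)  # truncate D to a LUT address
    return math.ldexp(lut[idx], i_part)  # lut[idx] * 2**I, i.e. a left or right shift

lut = build_pow2_lut(8)
neg_exp = exp_base2(-1.5, lut)       # e^{-x} with x = 1.5
dbl_exp = exp_base2(2 * 0.4, lut)    # e^{2x} with x = 0.4
```

Whatever the magnitude of the argument, only the fixed-size table for $2^{D}$ is consulted; the integer part costs a shift, which is why the input range of this path is not bounded by the table.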
After introducing the above transformation process, we can obtain the hardware architecture shown in Figure 10, which contains a constant multiplier, an exponential calculator (EC) and a shift unit (SU). As we mentioned earlier, a multiplier not only consumes a lot of resources in hardware implementation, but also causes long data path delay. So we introduce a method of optimizing the constant multiplier used in this paper, which can greatly reduce its hardware complexity and its critical path. The optimization method includes algorithmic level and architectural level.
First, we discuss the optimization at the algorithmic level. We know that $(-x) \cdot \log_2 e$ and $(2x) \cdot \log_2 e$ can be written as $x \cdot \log_2 \frac{1}{e}$ and $x \cdot \log_2 e^2$, where $\log_2 \frac{1}{e}$ is approximately equal to −1.442695040888963 and $\log_2 e^2$ is approximately 2.885390081777927. We can further get $x \cdot (-1.442695040888963) = x \cdot (-2 + 0.557304959111037)$ and $x \cdot 2.885390081777927 = x \cdot (2 + 0.885390081777927)$.

Then, we introduce the optimization at the architectural level. In binary terms, 0.557304959111037 is approximately $(0.1000111)_2$ (= 0.5546875) and 0.885390081777927 is approximately $(0.1110001)_2$ (= 0.8828125). If we design a constant multiplier in the normal way, there will be four shifts and three additions. However, we can simplify it to three shifts or two shifts, one addition and one subtraction, which is called the "3-1-1" or the "2-1-1" method in this paper. The implementation process of this method is as follows.
The above two binary numbers can be further rewritten as $(0.1000111)_2 = (0.1001000)_2 - (0.0000001)_2$ and $(0.1110001)_2 = (1.0000001)_2 - (0.0010000)_2$. In this case, that is, $x \cdot (0.1000111)_2 = (x >> 1) + (x >> 4) - (x >> 7)$ and $x \cdot (0.1110001)_2 = x + (x >> 7) - (x >> 3)$. The top one is "3-1-1", and the bottom one is "2-1-1" in Equation (20). In general, "one plus and one minus" can be converted to "three plus"; that is, subtracting a number is equal to adding its two's complement. Then, we turn Equation (20) into Equation (21), in which each subtraction is replaced by the addition of a two's complement. Supposing that x 1 and x 2 are all M bits, if we use a traditional method to compute the above formula, 3M full adders (FAs) are needed and the critical path time is 3Mτ (τ is the operation time of a full adder). However, we use a method called CSM to reduce both the number of FAs (from 3M to 2M) and the critical path time (from 3Mτ to (M + 1)τ). Of course, a reduction in the number of FAs means that the area and power consumption of the whole architecture are also reduced. In fact, we could also consider using carry select adders [20][21][22] to implement Equation (21), but for the sake of balancing area and speed, we do not adopt them in this paper.
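The shift-add decomposition and the carry-save step can be modelled on integers. The sketch below is an illustration under stated assumptions: the function names, the bit width passed to the adder model and the bitwise 3:2 compressor are hypothetical stand-ins for the hardware blocks, not the authors' RTL.

```python
def mul_3_1_1(x):
    """x * (0.1000111)_2 as three shifts, one addition and one subtraction."""
    return (x >> 1) + (x >> 4) - (x >> 7)

def mul_2_1_1(x):
    """x * (0.1110001)_2 as two shifts, one addition and one subtraction."""
    return x + (x >> 7) - (x >> 3)

def carry_save_add(a, b, c):
    """Bitwise 3:2 compressor: returns (sum_word, carry_word) with a + b + c == sum + carry."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy

def add_add_sub_csa(a, b, c, width):
    """a + b - c modulo 2**width using one carry-save stage and one final adder.

    The subtraction is turned into additions: -c == (~c) + 1 in two's complement.
    """
    mask = (1 << width) - 1
    s, cy = carry_save_add(a & mask, b & mask, ~c & mask)
    return (s + cy + 1) & mask

x = 128 * 37                      # multiple of 2**7, so the shifts lose no bits
scaled = add_add_sub_csa(x >> 1, x >> 4, x >> 7, width=16)
```

For this input, `scaled` equals `mul_3_1_1(x)` and both equal `x * 71 / 128`, that is, x multiplied by 0.5546875. The carry-save stage has no carry propagation between bit positions, so only the final adder contributes a ripple path.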

Software Test for the CSM-VLC-Based Method
Now, we use the accuracy criteria (AAE and MAE) to evaluate the precision of this method. Compared with the method based on RHC-VLC, the CSM-VLC-based method has a wider input range. Although its input range is inherently unlimited, we still set two limited test input ranges: [−2, 2] and [−12, 12].
Since the iterations (p) of VLC can be set flexibly, the precision of the division operation changes accordingly. Therefore, we set the parameter p from 1 to 20 to explore the accuracy of computing S&T with the CSM-VLC-based method. The final simulation results are obtained through MATLAB, as shown in Figure 12. From Figure 12, we can draw the following conclusions: (1) as p increases, the accuracy of computing S&T with this method becomes higher; (2) the greater the p value, the more the accuracy tends to remain unchanged; (3) the larger the input range and the larger the p value, the higher the accuracy; (4) compared with the method based on RHC-VLC, the overall computation accuracy of this method is not dominant; (5) the input range of this method is essentially unlimited, whereas the RHC-VLC-based method needs to adjust the iterations of RHC to expand the input range, so the architecture based on CSM-VLC is superior in this respect.

Proposed Algorithm with Adjustable Precision
Through the previous simulation verification, we obtain the relationship between the computation accuracy of S&T and the iterations of VLC. Since different precision magnitudes correspond to different iterations of VLC, we first classify and screen them according to a principle of fewer iterations (Principle: AAE and MAE are expressed as "$K \times 10^{E}$", where $K \in (1, 5]$ and E is a negative integer), as shown in Table 6. From Figure 12, we know that even if the p value becomes larger, it is difficult to improve the accuracy by another magnitude. Therefore, only the MIV corresponding to three kinds of accuracy magnitude are shown in Table 6, which also lays the foundation for another proposed algorithm based on CSM-VLC with adjustable precision.
According to the final optimization of $e^{-x}$ and $e^{2x}$ and Table 6, we can obtain another new algorithm (based on CSM-VLC) to compute S&T, which is called the CVST algorithm (Algorithm 4) in this paper. The specific implementation of VLC() in this algorithm refers to the previous algorithm. In the CVST algorithm, we use MAE as the accuracy standard and, of course, we can also adopt AAE.

Algorithm 4 (CVST algorithm), surviving steps:
    ...
    else p = 0
    end if
    ...
    if t = 0 then
        Return FR = V_zn
    else
        Return FR = 1 − V_zn · 2
    end if

Hardware Implementation
After proposing two algorithms with adjustable accuracy to calculate S&T functions, and learning their advantages and disadvantages through software simulation, we further compare their pros and cons through specific hardware implementations, including the comparison with other methods.
For the convenience of expression, we denote the hardware architecture of RVST algorithm as AR | R V (rm, m i ). Similarly, the CVST algorithm in hardware implementation is named AR | C V (rm). Here, rm refers to the magnitudes of MAE and m i determines the range of input x.

General Architecture Based on RHC-VLC Method
According to the RVST algorithm, we can design two structures: one is a non-pipelined structure, which can save a lot of hardware area. The other is a pipelined structure, which can obtain high throughput. In this paper, all hardware designs adopt the pipelined structure.
The general architecture AR | R V (rm, m_i) is shown in Figure 13, which mainly includes the PS (precision selector) module, the RHC module, the VLC module and an Adder. When the inputs (x, t, rm, m_i) are valid, the sign bit of x is flipped if t = 0; otherwise, x is shifted left by one bit. The result serves as the initial input z_0 of the RHC module. The PS module selects the corresponding iterations (m, n, p) according to the values of rm, m_i and t, then transmits them to the RHC module and the VLC module, respectively. After a fixed number of iterations, the output of the RHC module is valid, and x_n and y_n of RHC are added up by the Adder and used as the initial input x_0 of the VLC module. Similarly, when the output of the VLC module is valid, if t = 1, the sign bit of z_p is flipped, the value is shifted one bit to the left, and 1 is added to obtain the final result of T(x). However, if t = 0, z_p is directly the final result of S(x).

Next, we detail the architecture of each module, namely the PS module, RHC module and VLC module. First, we introduce the PS module, as shown in Figure 14. Since m_i determines the input range of x, which is independent of the current function to be calculated and does not affect the calculation accuracy, it is directly equal to m. The values of rm and t determine the different combinations of n and p. m and n are transmitted to the RHC module and p to the VLC module. See Tables 3 and 4 for "case of sigmoid" and "case of tanh".

Figure 15 shows that the inputs and outputs of the VLC iterations are cascaded stage by stage. The kth stage includes a shift operation, which shifts x_k by k bits to the right. The constants contained in the look-up table are calculated by $2^{-k}$. In the same way, Figure 16 describes a pipelined architecture of the RHC module. Unlike the VLC module, there are two kinds of architecture in the RHC module. One is the RHC with positive iteration indices (called P-RHC); the other is that with non-positive iteration indices (called NP-RHC). For the P-RHC, the kth stage needs to shift x_k and y_k by k bits to the right and the constants in Lookup Table 2 are computed by $\tanh^{-1}(2^{-k})$. However, the kth stage of NP-RHC should shift x_k and y_k by $2^{1-k}$ bits to the right and the elements in Lookup Table 3 are calculated by $\tanh^{-1}(1 - 2^{-2^{-k+1}})$. There are seven cases of the initial value x_0; the specific constant values are shown in Table 5 and stored in Lookup Table 1. ...
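The stage structure of the VLC module can be mimicked with an integer model in which stage k only shifts, adds or subtracts. This is a behavioural sketch under stated assumptions (unsigned fixed-point operands with `frac_bits` fractional bits, truncating shifts, hypothetical function name); it is not derived from the authors' Verilog.

```python
def vlc_fixed(y0, x0, p, frac_bits):
    """Integer model of a (p + 1)-stage VLC pipeline.

    y0 and x0 are fixed-point integers with frac_bits fractional bits.
    Stage k shifts x0 right by k bits and adds or subtracts the constant 2**-k,
    represented here as (1 << frac_bits) >> k. Returns z_p in the same format.
    """
    one = 1 << frac_bits
    y, z = y0, 0
    for k in range(p + 1):
        if y >= 0:          # sign of y selects subtract or add in this stage
            y -= x0 >> k
            z += one >> k
        else:
            y += x0 >> k
            z -= one >> k
    return z

FRAC = 11                               # 11 fractional bits
x0 = int(1.6065 * (1 << FRAC))          # denominator, roughly 1 + e^{-0.5}
y0 = 1 << FRAC                          # numerator 1.0
quotient = vlc_fixed(y0, x0, p=10, frac_bits=FRAC) / (1 << FRAC)
```

Each loop iteration corresponds to one pipeline stage, so the stage count (and hence the latency) follows p directly, while the hardware per stage stays a shifter plus an adder/subtractor.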


General Architecture Based on CSM-VLC Method
Different from AR | R V (rm, m_i), the input m_i is no longer needed in AR | C V (rm). In the PS module, the case rm ≥ 5 cannot occur because the upper limit of rm is 4 (see Table 6 for details). Only one parameter, p, exists in the PS module, and it is passed to the VLC module. The RHC module is replaced by a CSM-based exponential computing module (called the CEC module), which is shown in Figures 10 and 11 and not further described here. The Adder is no longer needed because the output of the CEC module is equivalent to the output of the Adder in the RHC-VLC-based architecture.

Implementation of a Specific Case
We design two specific cases, AR|RV(3, 0) and AR|CV(3), in Verilog HDL and synthesize them under the TSMC 40 nm CMOS technology. The report shows that AR|RV(3, 0) occupies an area of 4290.98 µm² and consumes 1.69 mW of power at a frequency of 1.5 GHz. At the same frequency, AR|CV(3) occupies 3196.36 µm² and consumes 1.38 mW. Compared with AR|RV(3, 0), AR|CV(3) saves 25.51% area and 18.34% power. In fact, both architectures can operate at a frequency of more than 2 GHz. In the following, we present the details of the hardware design, including word length, latency analysis and error analysis.
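As a quick check, the quoted savings follow directly from the synthesis figures above:

$$\frac{4290.98 - 3196.36}{4290.98} \approx 25.51\%, \qquad \frac{1.69 - 1.38}{1.69} \approx 18.34\%.$$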
First, let us look at the word lengths required for each module (Input Range: 1]). According to Section 2.3, the AAE of AR|RV(3, 0) is 4.77 × 10^−3 for the sigmoid function and 3.84 × 10^−3 for the tanh function. The AAE of AR|CV(3) is 4.31 × 10^−3 for the sigmoid function and 3.29 × 10^−3 for the tanh function. To achieve this accuracy, we set the fractional part of the input and output data of the top-level module to nine bits. The smallest step that nine fractional bits can represent is 1/2^9 = 1.95 × 10^−3, which is lower than the error values mentioned above, whereas eight fractional bits would not be sufficient (1/2^8 = 3.91 × 10^−3 > 3.84 × 10^−3).
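The choice of nine fractional bits can be reproduced with a short calculation. The sketch below assumes the selection rule implied by the text, namely the smallest number of fractional bits whose resolution is strictly below the error figure; the function name and the dictionary are hypothetical.

```python
def fractional_bits(error_bound):
    """Smallest f such that the resolution 2**-f is strictly below error_bound."""
    f = 0
    while 2.0 ** -f >= error_bound:
        f += 1
    return f

# AAE values quoted in the text for the two specific cases.
aae = {
    "AR|RV(3,0) sigmoid": 4.77e-3,
    "AR|RV(3,0) tanh": 3.84e-3,
    "AR|CV(3) sigmoid": 4.31e-3,
    "AR|CV(3) tanh": 3.29e-3,
}

# The top-level word length must satisfy the tightest requirement.
top_level_bits = max(fractional_bits(e) for e in aae.values())
```

Under this rule the tanh figures are the binding ones, since 2^−8 is slightly above 3.84 × 10^−3.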
From Figures 2 and 3, we know that the MAE of the RHC module (m = 0, n = 8) is 2.82 × 10^−2 for both the sigmoid and tanh functions. For AR|RV(3, 0), the MAE of the VLC module is 3.91 × 10^−3 (p = 8) for the sigmoid function and 9.76 × 10^−4 (p = 10) for the tanh function. For AR|CV(3), the MAE of the VLC module is the same as that of AR|RV(3, 0) for the sigmoid function, and 1.95 × 10^−3 (p = 9) for the tanh function. Therefore, we set the fractional part of the input and output data of the RHC module to six bits, whose resolution is 1/2^6 = 1.56 × 10^−2. The fractional part of the input and output data of the VLC module is set to 11 bits (1/2^11 = 4.88 × 10^−4). Since the MAE of computing e^(−x) with the CEC module (input range: [−2, 2]) is 2.68 × 10^−2 and that of computing e^(2x) with the CEC module (input range: [−1, 1]) is 1.32 × 10^−2, we set the fractional part of the input and output data of the CEC module to seven bits (1/2^7 = 7.81 × 10^−3). The fractional part of the input and output data of the Adder is the same as that of the RHC module. The required integer bits of each module can be calculated from the input range of the top-level module. The word length setting of each module is outlined in Table 7.
Next, we analyze the latency of AR|RV(3, 0) and AR|CV(3). In these two cases, m = 0, and n and p vary with rm and t. The RHC module has (n + 1) cascaded stages. Because n is smaller than 13 in this special case, the RHC module needs only one repeated stage (the iteration with index 4). Each stage needs one clock cycle, so the RHC module requires (n + 2) clock cycles. The VLC module has (p + 1) cascaded stages and needs (p + 1) clock cycles. The Adder needs one additional clock cycle. Overall, the latency of computing S&T with the RHC-VLC method is n + p + 4 clock cycles.
For the method based on CSM-VLC, since the CEC module can compute the value of e^(−x) or e^(2x) in one clock cycle, the total latency is (p + 2) clock cycles. From Tables 4 and 6, it can be concluded that the RHC-VLC method needs 20 clock cycles to calculate the sigmoid function and 22 clock cycles to calculate the tanh function. With the CSM-VLC method, 10 clock cycles are needed to calculate the sigmoid function and 11 clock cycles to calculate the tanh function.
For arbitrary AR|RV(rm, m_i) and AR|CV(rm), different orders of magnitude of accuracy correspond to different numbers of CORDIC iterations (n and p), as shown in Tables 4 and 6, and the total latency differs accordingly. It is important to note that when n = 13, the RHC module requires two more clock cycles to satisfy the convergence condition.
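The cycle counts quoted for the two specific cases can be reproduced with a simple latency model. This is a sketch for m = 0 only; the function names are hypothetical, and the treatment of repeated stages (one extra cycle for each of the indices 4 and 13 that does not exceed n) is an assumption based on the convergence remark above.

```python
def repeated_stages(n):
    """Number of repeated hyperbolic CORDIC iterations (indices 4 and 13) up to n."""
    return sum(1 for k in (4, 13) if k <= n)

def latency_rhc_vlc(n, p):
    """Clock cycles for the RHC-VLC method with m = 0."""
    rhc = (n + 1) + repeated_stages(n)   # cascaded stages plus repeated stages
    vlc = p + 1                          # one cycle per VLC stage
    adder = 1
    return rhc + vlc + adder

def latency_csm_vlc(p):
    """Clock cycles for the CSM-VLC method: one CEC cycle plus the VLC stages."""
    return 1 + (p + 1)

# Iteration counts quoted in the text: n = 8 with p = 8 (sigmoid) or p = 10 (tanh)
# for AR|RV(3, 0); p = 8 (sigmoid) or p = 9 (tanh) for AR|CV(3).
cycles = {
    "RHC-VLC sigmoid": latency_rhc_vlc(8, 8),
    "RHC-VLC tanh": latency_rhc_vlc(8, 10),
    "CSM-VLC sigmoid": latency_csm_vlc(8),
    "CSM-VLC tanh": latency_csm_vlc(9),
}
```

For 4 ≤ n < 13 the model reduces to the n + p + 4 expression given in the text.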
Finally, we make an error analysis of the two cases. We generate three groups of 10,000 random numbers with MATLAB and use them as the test benchmark for the example circuits. The accuracy of the specific cases is evaluated by comparing the outputs from Vivado with the results of MATLAB's original functions. The comparison shows that the orders of magnitude of the AAE and MAE are the same for the hardware implementation and the software validation. Additionally, we use another metric, "bit position error" [23], to compute the probability of error (PE) of each bit. Taking the computation of the sigmoid function as an example, Figure 17 shows that the PE of every bit is lower than 0.5. The PE of the last two bits is close to 0.5, which indicates that these two bits are not very accurate. Overall, however, the accuracy of the example circuits is as expected. In addition, AR|RV(3, 0) and AR|CV(3) have roughly the same probability of error, which is basically consistent with the software simulation results.
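The three metrics used here can be sketched as follows. The AAE and MAE are taken as the mean and the maximum of the absolute errors. The per-bit PE is modelled as the fraction of samples in which a given bit of the quantized output differs from the same bit of the quantized reference; this reading of the "bit position error" metric of [23] is an assumption, and the function names are hypothetical. Inputs are assumed to be non-negative, as for the sigmoid function.

```python
FRAC_BITS = 9  # fractional bits of the top-level output, as chosen in the text

def quantize(value, frac_bits=FRAC_BITS):
    """Round a non-negative value to an integer number of LSBs."""
    return int(round(value * (1 << frac_bits)))

def error_metrics(approx, exact, frac_bits=FRAC_BITS):
    """Return (AAE, MAE, PE), where PE[b] is the error probability of bit b (0 = LSB)."""
    errors = [abs(a - e) for a, e in zip(approx, exact)]
    aae = sum(errors) / len(errors)
    mae = max(errors)
    pe = []
    for b in range(frac_bits):
        # Count the samples whose bit b differs between circuit output and reference.
        flips = sum(
            ((quantize(a, frac_bits) ^ quantize(e, frac_bits)) >> b) & 1
            for a, e in zip(approx, exact)
        )
        pe.append(flips / len(errors))
    return aae, mae, pe
```

A PE near 0.5 for a bit means that the bit is wrong about as often as it is right, which is why the last two bits in Figure 17 are described as not very accurate.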

Comparison with Existing Methods
In this section, to illustrate that our proposed methods offer an efficient hardware architecture with adjustable precision and high speed, we analyze, from the perspective of hardware implementation, the cost of implementing S&T with adjustable precision using other methods.

Comparison with LUT-Based Method
The LUT-based method typically stores all possible results in memory. The more data that needs to be stored in memory, the larger the hardware area. Therefore, using this method to implement a hardware architecture with adjustable precision and input range consumes a lot of logic and storage resources. If the input range is [M, N] and the required accuracy is RA = 2^−a, the amount of data to be stored is

$$\mathrm{DaTo} = (N - M) \cdot 2^{a} + 1.$$

Each piece of data is composed of m integer bits and n fractional bits, so all of the data need BitTo = DaTo(m + n) bits, that is,

$$\mathrm{BitTo} = \left[(N - M) \cdot 2^{a} + 1\right](m + n),$$

where n is equal to a. The formula shows that the hardware area required to store these data is positively correlated with the accuracy and the input range. For example, if a = 5, M = −1 and N = 1, RA is 2^−5 (3.125 × 10^−2) and DaTo is 65. With m at least 1 and n = 5, BitTo is 390. However, when the accuracy increases from 3.125 × 10^−2 to 4.883 × 10^−4, a (n = a) is at least 11, DaTo is 4097 and BitTo is 49,164. Thus, when the order of magnitude of the accuracy changes from −2 to −4, the required hardware area increases by a factor of approximately 126 (49,164/390).
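The storage estimate can be checked numerically. The sketch below assumes one stored entry per step of size 2^−a over [M, N], endpoints included, which matches the DaTo values of 65 and 4097 quoted in the text; the function name is hypothetical.

```python
def lut_storage(M, N, a, m):
    """Return (DaTo, BitTo) for a LUT over [M, N] with accuracy 2**-a.

    m is the number of integer bits; the number of fractional bits n equals a.
    """
    entries = (N - M) * 2 ** a + 1   # DaTo: one entry per step, endpoints included
    bits = entries * (m + a)         # BitTo = DaTo * (m + n) with n = a
    return entries, bits

low_precision = lut_storage(-1, 1, 5, 1)    # accuracy 2**-5
high_precision = lut_storage(-1, 1, 11, 1)  # accuracy 2**-11
growth = high_precision[1] / low_precision[1]
```

The ratio `growth` is the area growth factor discussed in the text.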
Although the LUT-based method has a low latency at low precision, a large amount of data needs to be stored at high precision. For example, to reach a precision level of 10^−4, 2048 node data and 2048 result data are needed (1/2048 = 4.88 × 10^−4). Finding the result data with a serial search would incur a huge latency. A parallel search can be adopted, but the number of parallel paths should not be large considering the area overhead. Assume that a 32-way parallel search is adopted and that the 64 node data of each way are compared serially. If each comparison consumes one clock cycle, at least 64 clock cycles are required, which is higher than the latency of our methods. Clearly, the low latency advantage of this method disappears when high precision is needed.
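The cycle count in this example follows from the table size and the degree of parallelism assumed above:

$$\frac{2^{11}}{32} = \frac{2048}{32} = 64 \ \text{serial comparisons per way} \;\Rightarrow\; \text{at least } 64 \text{ clock cycles}.$$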
Since memory costs a lot of hardware area, it would be costly and undesirable to use this method to compute S&T functions with adjustable precision and variable input ranges. The method is only suitable for a narrow input range and a low precision requirement, where its short delay can be fully exploited. Flexible precision and input ranges require varied data, which is hard to reconcile with the fixed contents of a lookup table.

Comparison with PWL-Based Method
The PWL-based method approximates the target function by many linear segments. In each input interval, the result is computed from the corresponding linear segment. Suppose the target function is f(x) and the input range is [M, N]. After PWL segmentation, the target function on [M, N] is divided into a number of short line segments. In this method, the linear segmentation of f(x) is usually carried out in advance by software simulation according to the accuracy requirement. Then, in the hardware implementation, the slope and intercept of each linear segment are stored in memory. Consequently, if the method is used to design a hardware architecture with adjustable precision or flexible input range, the segmentation process performed at the software level must be implemented in hardware. Otherwise, whenever the accuracy or input range needs to be changed, the linear segmentation must first be redone in software, which is both inefficient and tedious.
Further, we analyze the cost of designing a hardware architecture with adjustable precision or flexible input range based on the PWL method. From [24], we learn that the PWL-based method needs many division and multiplication operations to obtain the slopes and intercepts of all linear segments. Moreover, if optimal segmentation results are desired, the number of iterations is difficult to determine because it depends on the required precision and input range. In the hardware implementation, a multiplier is also required to compute the final result.
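The operations involved can be seen in a minimal PWL sketch. It uses uniform segments with chord interpolation, which is a simplifying assumption and not the optimal segmentation of [24]; all function names are hypothetical. Building the table needs one division per segment for the slope and one multiplication for the intercept, and every evaluation needs a multiplication.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_pwl(f, lo, hi, segments):
    """Return a list of (slope, intercept) pairs for uniform segments on [lo, hi]."""
    width = (hi - lo) / segments
    table = []
    for i in range(segments):
        x0, x1 = lo + i * width, lo + (i + 1) * width
        slope = (f(x1) - f(x0)) / (x1 - x0)   # one division per segment
        intercept = f(x0) - slope * x0        # one multiplication per segment
        table.append((slope, intercept))
    return table

def eval_pwl(table, lo, hi, x):
    """Evaluate the PWL approximation at x (clamped to the outermost segments)."""
    segments = len(table)
    i = int((x - lo) / (hi - lo) * segments)
    i = max(0, min(i, segments - 1))
    slope, intercept = table[i]
    return slope * x + intercept              # one multiplication per evaluation

def max_error(f, table, lo, hi, samples=801):
    """Maximum absolute error of the PWL approximation on a uniform sample grid."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(eval_pwl(table, lo, hi, x) - f(x)) for x in xs)
```

Changing the precision or the input range means rebuilding the whole table, for example `build_pwl(sigmoid, -4.0, 4.0, 32)` instead of `build_pwl(sigmoid, -4.0, 4.0, 16)`, which is the step that would have to be moved into hardware.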
Finally, we analyze the latency. To build an adjustable-precision hardware architecture, dividers and multipliers are needed to obtain all slopes and intercepts. Assume that the latency of the multiplier and of the divider is one clock cycle each (in practice, far more than one cycle is needed to reach a high working frequency) and that the input range is divided into N intervals. Each interval requires at least one multiplication and one division to obtain the slope and intercept of its fitted line segment, that is, at least 2N clock cycles in total. From [24], to obtain a maximum absolute error of 1 × 10^−3, 19 segments (N = 19) are needed to calculate the sigmoid function and 27 segments (N = 27) are needed to calculate the tanh function. Therefore, the total latency of the PWL-based method is much higher than that of our proposed methods.
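With the segment counts quoted from [24], the lower bound of 2N clock cycles gives

$$2N = 2 \times 19 = 38 \ \text{(sigmoid)}, \qquad 2N = 2 \times 27 = 54 \ \text{(tanh)},$$

compared with the 20 and 22 clock cycles (RHC-VLC) and the 10 and 11 clock cycles (CSM-VLC) reported for the specific cases above. The specific cases target a different error level, so this comparison is indicative only.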
Among the basic arithmetic operations, division and multiplication are the most complicated to implement in hardware: they consume a lot of logic resources and also reduce the working frequency of the whole architecture. Compared with the architecture proposed in this paper for implementing S&T functions with adjustable precision and extensible input range, the PWL-based method is therefore less suitable.

Comparison with Other Related Methods
In addition to the above mainstream methods, there are also some other methods to compute tanh and sigmoid, such as stochastic computing [25] or a mixture of stochastic logic and PWL [11].
From [25], we learn that the S&T functions can be implemented in stochastic computing by applying Horner's rule to the Maclaurin expansion. It is shown that, if the coefficients are alternately positive and negative and their magnitudes are monotonically decreasing, a polynomial can be implemented using multiple levels of NAND gates based on Horner's rule. Truncated Maclaurin series expansions of arithmetic functions are used to generate polynomials that satisfy these constraints. In contrast to the PWL-based method, this method does not require multipliers, but it does need a stochastic number generator to convert digital values into stochastic bit streams. Its most obvious disadvantages are that the precision is not high and the input range is limited, so it is difficult to build an architecture with adjustable precision and flexible input range to compute S&T.
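The link between Horner's rule and NAND gates can be illustrated in the probability domain. The sketch below is an assumption-laden illustration, not the circuit of [25]: it models only expected values for 0 ≤ x ≤ 1, treats all bit streams as independent, and uses the three-term Maclaurin series tanh(x) ≈ x − x³/3 + 2x⁵/15 rewritten as x(1 − (x²/3)(1 − 2x²/5)). The function names are hypothetical.

```python
import math

def nand(*probs):
    """Expected output of a NAND gate fed with independent unipolar streams."""
    product = 1.0
    for p in probs:
        product *= p
    return 1.0 - product

def tanh_horner(x):
    """Three-term Maclaurin approximation of tanh(x) in Horner form, 0 <= x <= 1."""
    x2 = x * x                            # AND of two independent copies of x
    inner = nand(x2, 2.0 / 5.0)           # 1 - (2/5) x^2
    outer = nand(x2, 1.0 / 3.0, inner)    # 1 - (x^2 / 3) * inner
    return x * outer                      # final AND with x

approx = tanh_horner(0.5)
reference = math.tanh(0.5)
```

Each level of the Horner form has the shape 1 − a·b, which is exactly the expected output of a NAND gate, so no multiplier is needed. The accuracy is limited by the truncated series and, in a real circuit, by the finite length of the bit streams, which is consistent with the limitations noted in the text.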
The method [11] can improve the accuracy of the calculation results through linear feedback registers, but it needs to produce sufficient and uncorrelated random numbers. Compared with [25], it can also reduce the area. However, it is still hard to build a hardware architecture with adjustable precision and extensible input range compared with our methods.
To reveal the strengths and weaknesses of the above methods and of our methods, we implement hardware architectures based on each of them. Synthesized under the TSMC 40 nm CMOS technology, their performance indicators are shown in Table 8. In conclusion, our methods have the following characteristics, which other methods can hardly offer all at once.

• Efficient hardware: Our RHC-VLC-based method only requires shift-and-add (or subtract) operations, which avoids the direct use of inefficient multiplication and division. For the constant multiplier used in the CSM-VLC-based method, we also made specific optimizations to improve the hardware efficiency.
• Adjustable precision: Both of our methods can easily adjust the accuracy of the calculation results by increasing or decreasing the number of CORDIC iterations.
• Extensible range: According to the application requirements, the method based on RHC-VLC can freely adjust the negative iterations of the RHC to expand or narrow the input range. In theory, the method based on CSM-VLC has no limitation on the input range at all.
• High speed: The hardware architecture based on our proposed methods can work at a frequency of 1.5 GHz or even higher, such as 2 GHz.

Conclusions
In this paper, an efficient hardware architecture with adjustable precision and extensible input range is proposed to compute the tanh and sigmoid functions for the first time. First, we introduce two methods in detail: the RHC-VLC-based method and the CSM-VLC-based method. Then, we use MATLAB to conduct accuracy simulations to find the relationship between the accuracy distribution and the number of CORDIC iterations. Next, at the RTL level, we implement a special case of each method and compare them under the TSMC 40 nm CMOS technology. Finally, we analyze the costs and drawbacks of other approaches to building an architecture with adjustable precision, and demonstrate that our proposed methods combine the following three advantages: efficient hardware, adjustable precision and high working frequency.
Author Contributions: H.C. conceived and designed the methodology and architecture to implement the sigmoid and tanh functions; H.C. performed the experiments with support from L.J. and H.Y.; H.C. analyzed the experimental results; H.C., L.J., and H.Y. contributed task decomposition and corresponding implementations; H.C. wrote the paper; Z.L. and Z.Y. helped to revise the manuscript; L.L. and Y.F. supervised the project. All authors have read and agreed to the published version of the manuscript.