A Low-Cost High Radix Floating-Point Square-Root Circuit

Abstract: In this paper, we propose an efficient, low-area architecture for a floating-point square-root circuit that complies with the IEEE-754 standard. We extend the principle of the standard SRT algorithm so that the latency and area cost of the proposed circuit are linear with the radix. In addition, no extra computation cycles are required. With 65 nm technology, the area cost of the single-precision floating-point square-root circuit based on the proposed architecture is only 6450.84 µm², and the dynamic power consumption is only 0.764 mW at 300 MHz. The implementation results show that the proposed square-root circuit can reduce the area cost by 60% to 90% compared with other designs in the literature.


Introduction
Although the square-root operation is used less often than other arithmetic operations, many instruction set architectures (ISAs), such as ARM, x86, and RISC-V, include a square-root instruction. Compared with addition or multiplication units, a square-root circuit usually has higher complexity and longer latency. The common algorithms for the square-root operation include the SRT, Goldschmidt, Taylor-series, and Newton-Raphson algorithms [1][2][3][4], which can be divided into two categories: multiplication-based approximation algorithms and digital recursive algorithms. Implementing an efficient floating-point square-root operation in hardware is a challenge, because the design must balance computing performance, area cost, and power consumption.
The multiplication-based square-root algorithms (e.g., the Newton-Raphson algorithm) usually approximate the result through an inverse operation. Their results are not obtained digit by digit; instead, the accuracy is improved step by step through multiplication and addition operations until the final result is reached. The convergence rate of these algorithms is quadratic [5], so they usually have higher computational efficiency. However, supporting the iterative calculations requires more independent multiplier and adder hardware resources, and the timing performance of the implemented circuit is limited by the latency of the multiplier. In addition, mainstream processors usually adopt the IEEE-754 standard, which makes the rounding operation of multiplication-based algorithms more complicated and the remainder more difficult to obtain.
Compared with the multiplication-based algorithms, the digital recursive algorithms need more iteration cycles, and their convergence is linear [6]. Each iteration cycle produces partial square-root digits of a fixed bit-width. At present, the most widely used digital recursive algorithm is SRT; in Intel and IBM processor cores [7,8], a lower-radix SRT algorithm is used to implement the square-root circuit. In the standard SRT algorithm, although a higher radix can improve the computational performance, the area cost of the lookup table increases quadratically with the radix [9]. For instance, Synopsys DesignWare provides a single-precision square-root circuit based on SRT-16 whose area cost is about 29 K equivalent gates. In [2], the square-root circuit is implemented based

SRT Algorithm Analysis
According to the IEEE-754 standard, any single-precision floating-point number can be written as $X = Y \times 2^{e}$, where $Y \in [1, 2)$ is the 23-bit mantissa code and $e \in [-128, 127]$ is the 8-bit exponent code. The square root of $X$ is $\sqrt{X} = \sqrt{Y} \times 2^{e/2}$, where $e/2$ can be realized by a 1-bit right shift. If $e$ is odd, $Y$ must first be shifted by 1 bit to obtain $Y^{*}$ for the mantissa square-root operation. In that case $Y^{*} \in [2, 4)$ and the result is still less than 2, so it also complies with the IEEE-754 standard. The square-root operation on $X$ is thus converted into a square-root calculation on $Y$, and $Y$ can be expressed further as (1), where $S$ is the square-root digits and $P$ is the remainder after the finite-precision square-root operation.
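The exponent and mantissa preprocessing described above can be sketched in Python. This is only an illustration under assumptions, not the paper's hardware: it works on Python floats, handles positive finite inputs only, and the helper name `split_for_sqrt` is hypothetical.

```python
import math

def split_for_sqrt(x):
    """Return (mantissa, half_exponent) with
    sqrt(x) == sqrt(mantissa) * 2**half_exponent.

    Hypothetical helper; positive finite inputs only.
    """
    if x <= 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("sketch handles positive finite inputs only")
    frac, exp = math.frexp(x)      # x = frac * 2**exp, frac in [0.5, 1)
    y, e = frac * 2.0, exp - 1     # renormalize so that y is in [1, 2)
    if e % 2:                      # odd exponent: shift the mantissa by one bit
        y, e = y * 2.0, e - 1      # now y is in [2, 4) and e is even
    return y, e // 2               # halving an even exponent is a 1-bit right shift

y_star, half_e = split_for_sqrt(879632.125)
print(y_star, half_e, math.sqrt(y_star) * 2.0 ** half_e)
```

The mantissa returned for an odd exponent lies in [2, 4), so its square root stays below 2, which matches the argument in the text.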
In the standard SRT algorithm with radix $r$, (1) can be calculated iteratively by shift and subtraction operations. Each iteration cycle produces partial square-root digits of bit-width $L = \log_2 r$. After $n$ iterations, the square-root result $S$ can be expressed as (2) and the remainder $P$ as (3), where $w_i$ is the $L$-bit partial square-root digits generated in the $i$-th iteration.
Combining (2) and (3) gives the iterative Formula (4) for the partial remainder $P_{n+1}$. In the standard SRT algorithm, $w_{n+1} = \mathrm{select}(S_n, P_n)$ is a lookup-table function, which is generally constructed with the P-D graph [11].
Formula (4) is the basic iterative process of the standard SRT algorithm, in which $w_{n+1} = \mathrm{select}(S_n, P_n)$ is usually implemented in ROM. It can be seen from (4) that the performance of the algorithm improves with the radix $r$: as $r$ increases, more bits of partial square-root digits are obtained in each iteration, and fewer iteration cycles are required. The calculation accuracy of the SRT algorithm is 1 ULP (unit in the last place). The data-path latency of a square-root circuit based on the standard SRT algorithm increases linearly with $r$, while the area cost of the lookup table increases quadratically with $r$ [5].
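The shift-and-subtract iteration can be illustrated with exact rational arithmetic. The sketch below assumes the recurrence has the form $P_n = rP_{n-1} - w_n(2S_{n-1} + w_n r^{-n})$, assumes a starting point $S_0 = 1$, $P_0 = Y - 1$ for $Y \in [1, 4)$, and replaces the lookup table by an exhaustive search for the largest digit that keeps the remainder non-negative. That selection rule is a stand-in chosen for clarity, not the P-D-graph table of the standard algorithm, and the function name is hypothetical.

```python
from fractions import Fraction

def digit_recurrence_sqrt(y, radix=16, steps=7):
    """Radix-`radix` square root of y in [1, 4) by digit recurrence.

    Returns (s, p, digits) with p == radix**steps * (y - s*s).
    """
    y = Fraction(y)
    if not 1 <= y < 4:
        raise ValueError("y must lie in [1, 4)")
    s = Fraction(1)          # integer digit of the root: sqrt(y) is in [1, 2)
    p = y - 1                # assumed start: P_0 = Y - S_0**2
    digits = []
    for n in range(1, steps + 1):
        unit = Fraction(1, radix ** n)          # r**(-n)
        # Stand-in for select(S, P): take the largest digit that keeps the
        # partial remainder non-negative (w = 0 always qualifies).
        for w in range(radix - 1, -1, -1):
            p_next = radix * p - w * (2 * s + w * unit)
            if p_next >= 0:
                break
        s += w * unit
        p = p_next
        digits.append(w)
    return s, p, digits

s, p, digits = digit_recurrence_sqrt(Fraction(5, 2))
print(digits, float(s))
```

Each pass appends one radix-16 digit, so seven passes yield 28 fractional bits; a larger radix would need fewer passes but a wider digit search.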
The area cost of the lookup table roughly quadruples for each additional bit in the width of the partial square-root digits [11,12]. Table 1 shows the area cost of the lookup table (implemented by ROM) in the standard SRT algorithm with different radices. The area costs in Table 1 are based on 65 nm technology, in which the area of a NAND2 cell is 1.8 µm × 1.4 µm. It can be seen that the area cost of the lookup table increases greatly with the radix, which limits the application of the high-radix standard SRT algorithm.

The Proposed Square-Root Algorithm
To solve the problem of the large lookup-table area in the standard high-radix SRT algorithm, we adopt a cascaded non-recovery remainder (non-restoring) division with a short bit-width to replace the lookup table used in the standard SRT algorithm.
In the standard SRT algorithm with radix $r$, partial square-root digits of $L = \log_2 r$ bits are obtained in each iteration. The proposed partial square-root digits estimation algorithm can be expressed as (5) and (6), with all parameters expressed in binary. Here $p^{*}$ is the highest $2L$ digits of the partial remainder generated in the previous iteration cycle, $p^{*}_{2L-i-1}$ is the $(2L-i-1)$-th digit of $p^{*}$, $u^{*}_{0}$ is the highest $L$ digits of $P_n$, $s^{*}$ is the highest $L$ digits of $S_n$, and $w^{*}_{n+1}$ is the estimated value of the $L$-bit partial square-root digits.
In (5), the $L$-bit partial square-root digits are obtained by the cascaded non-recovery remainder division. In addition, (6) requires only $L$-bit addition or subtraction, so, in contrast to the lookup table of the standard SRT algorithm, only a full-adder of $2L$ bit-width is needed.
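For readers unfamiliar with the non-recovery remainder (non-restoring) scheme, the following Python sketch shows the generic textbook form for unsigned integers. It is not Equation (5) itself, which operates on truncated operands; the function name and interface are assumptions made for illustration. Each stage forms both the sum and the difference and lets the sign of the previous partial remainder select one of them.

```python
def nonrestoring_divide(dividend, divisor, width):
    """Unsigned non-restoring division; returns (quotient, remainder)."""
    if divisor <= 0 or not 0 <= dividend < (1 << width):
        raise ValueError("need divisor > 0 and 0 <= dividend < 2**width")
    rem = 0
    quotient = 0
    for i in range(width - 1, -1, -1):
        shifted = 2 * rem + ((dividend >> i) & 1)
        # Both candidates are formed independently; the sign of the previous
        # partial remainder selects one, and no restoring step is taken.
        cand_sub = shifted - divisor
        cand_add = shifted + divisor
        rem = cand_sub if rem >= 0 else cand_add
        quotient = (quotient << 1) | (1 if rem >= 0 else 0)
    if rem < 0:               # single final correction of the remainder
        rem += divisor
    return quotient, rem

print(nonrestoring_divide(0xB7, 0x0D, 8))
```

Because a negative partial remainder is repaired by the addition in the following stage, every stage costs one add or subtract, which is what makes the structure easy to cascade.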
However, it should be pointed out that errors may occur because (5) and (6) do not use full-precision operands. Therefore, it is necessary to extend the iterative process of the standard SRT algorithm and correct the errors in time, so that they do not propagate through the iterations. Let $\Delta w = w_n - w_n^{*}$ be the error between the true value and the estimated value of the partial square-root digits in the proposed algorithm. The true value of the partial remainder is shown in (7), the estimated value in (8), and the error of the partial remainder $\Delta P$ can be expressed by (9):

$$P_n = rP_{n-1} - w_n\left(2S_{n-1} + w_n r^{-n}\right) \tag{7}$$

$$P_n^{*} = rP_{n-1} - w_n^{*}\left(2S_{n-1} + w_n^{*} r^{-n}\right) \tag{8}$$

$$\begin{aligned}
\Delta P = P_n - P_n^{*} &= w_n^{*}\left(2S_{n-1} + w_n^{*} r^{-n}\right) - w_n\left(2S_{n-1} + w_n r^{-n}\right) \\
&= 2S_{n-1} w_n^{*} - 2S_{n-1} w_n + w_n^{*2} r^{-n} - w_n^{2} r^{-n} \\
&= 2S_{n-1}\left(w_n^{*} - w_n\right) + \left(w_n^{*} - w_n\right)\left(w_n^{*} + w_n\right) r^{-n} \\
&= \left(w_n^{*} - w_n\right)\left(2S_{n-1} + w_n^{*} r^{-n} + w_n r^{-n}\right) \\
&= -\Delta w\left(2S_{n-1} + w_n^{*} r^{-n} + w_n r^{-n}\right)
\end{aligned} \tag{9}$$

Electronics 2021, 10, 1988

Substituting (9) into the basic recursive Formula (4) of the standard SRT algorithm, the relationship between the estimated value $P_n^{*}$ and the real value $P_{n+1}$ generated in the next iteration is shown in (10).

Consider the general iteration process of the digital recursion algorithm to analyze the error conditions of $w_n^{*}$. Assuming that $m$ is the full-precision bit-width of $P$ and $S$, let $d_p$ and $d_s$ denote the highest $L$ digits of $P$ and $S$, respectively. The real values of $P$ and $S$ can then be represented as $P = d_p + \Delta p$ and $S = d_s + \Delta s$. According to (6), only the highest $L$ digits of the operands are used in each estimation cycle. When $d_p > d_s$ or $d_p < d_s$, Equation (5) obtains the real value of the partial square-root digits from the highest $L$ digits of the two operands, and the remaining digits $\Delta p$ and $\Delta s$ do not affect the result. When $d_p = d_s$, the true value depends on the remaining digits.
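The algebra behind (9) can be checked numerically. The sketch below compares the direct difference of two remainders, computed with the recurrence in the form of (7), against the closed form at the end of (9). The function names and the random test ranges are assumptions made for this check only.

```python
from fractions import Fraction
import random

def remainder(p_prev, s_prev, w, radix, n):
    """Partial remainder after selecting digit w, in the form of (7)."""
    return radix * p_prev - w * (2 * s_prev + w * Fraction(1, radix ** n))

def remainder_error(s_prev, w_true, w_est, radix, n):
    """Closed form of (9), with delta_w = w_true - w_est."""
    unit = Fraction(1, radix ** n)
    delta_w = w_true - w_est
    return -delta_w * (2 * s_prev + w_est * unit + w_true * unit)

rng = random.Random(0)
for _ in range(1000):
    radix, n = 16, rng.randint(1, 7)
    p_prev = Fraction(rng.randint(0, 10**6), 10**6)
    s_prev = Fraction(rng.randint(10**6, 2 * 10**6), 10**6)
    w_true, w_est = rng.randrange(radix), rng.randrange(radix)
    direct = (remainder(p_prev, s_prev, w_true, radix, n)
              - remainder(p_prev, s_prev, w_est, radix, n))
    assert direct == remainder_error(s_prev, w_true, w_est, radix, n)
print("identity (9) holds on all sampled values")
```

The previous remainder cancels in the difference, so the error depends only on the two digits and on the square-root value accumulated so far.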
If $\Delta p \ge \Delta s$, the estimated result of (5) and the real result are both "1", and there is no error in the estimated result. When $\Delta p < \Delta s$, the estimated result of (5) is "1", but the real square-root digit is "0". In this case, because the residual digits are not included in the estimation process of (6), a calculation error is generated. In the worst case, when $i = 0$, the error generated in (6) accumulates in the next stage of the non-recovery remainder division, which causes the maximum error accumulation.
To achieve results in accordance with the IEEE-754 standard, it is necessary to analyze quantitatively the maximum error of the estimated partial square-root digits. Assume that in the worst case $P$ and $S$ satisfy $d_p = d_s = d$ and $\Delta p < \Delta s$; then $d > 2^{L-1}\Delta s > 2^{L-1}\Delta p$ and $P < S$. In the full-precision calculation, the $L$-bit partial square-root digits $w$ can be expressed as (11). In (11), $U_i$ is the partial remainder generated in the iterative calculation process; it can be expressed by (12), where $U_0 = P$.
Substituting $P$, $S$, and $d$ into (12), after $L$ iterations the partial remainder corresponding to the real value of the square-root digits is obtained. In the proposed algorithm, the estimated partial square-root digits $w^{*}$ can be expressed as (13), where $S^{*} = d$ and $U_i^{*}$ is the estimated partial remainder of $L$ bit-width. The calculation of $U_i^{*}$ can be expressed as (14), where $U_0^{*} = P - S^{*} = \Delta p$.
Substituting $P$, $S^{*}$, and $d$ into (14), after $L$ iterations we obtain the remainder with maximum error corresponding to the estimated partial square-root digits, and from it the error between the real remainder $U_{L-1}$ and the estimated remainder $U_{L-1}^{*}$. Together with the constraints $\Delta s > 0$ and $d > 2^{L-1}\Delta s$, this quantitative analysis yields the relationship between the estimated value and the real value of the $L$-bit partial square-root digits, which can be expressed as (15).
In (15), the error of the estimated partial square-root digits can be corrected by an $L$-bit subtracter. When an error occurs, $\Delta w = -1$, the error of the estimated partial remainder can be obtained, and the correction process is shown in (16). According to the constraint $w_n \in [0, r)$ of the SRT algorithm, in the correction process described in (16), $r - w_{n+1}$ can be realized by a simple $L$-bit subtracter, and $2S_n + (r + w_{n+1}) r^{-n-1}$ can be realized directly by bit splicing. Therefore, compared with the standard SRT square-root algorithm in (4), only one $L$-bit subtracter is added in (16). The proposed square-root algorithm can be summarized as (17)-(19).
Compared with the standard SRT algorithm, the proposed square-root algorithm avoids the use of a lookup table and has a general expression that can be extended to the design of square-root circuits with any radix.

Proposed Square-Root Architecture
According to Equations (17)-(19), a single-precision floating-point square-root circuit with radix-16 is designed. The structure of the mantissa iteration is shown in Figure 1. The structure is similar to that of a square-root circuit based on the standard SRT algorithm. In Figure 1, a partial square-root digits estimation circuit replaces the lookup table of the standard SRT algorithm, and the partial square-root digits correction circuit and the $k_n$, $H_n$ correction circuit corresponding to (17) and (18) are added. The adders and multipliers required by the standard SRT algorithm are also included.
The mantissa iterative circuit shown in Figure 1 generates 4-bit partial square-root digits in each iteration cycle. To support the 4 rounding modes specified in the IEEE-754 standard, it takes 7 cycles to obtain the complete mantissa of a single-precision square-root operation, while the rounding operation requires an additional cycle and is calculated separately.
Combining (17) with the rounding modes specified in the IEEE-754 standard, one input of the multiplier in Figure 1 is 4 bits wide, while the other input must be 33 bits wide to ensure the correctness of the result. The adder must also be 33 bits wide to complete the calculation in the last iteration cycle.
According to (13) and (14), the proposed partial square-root digits estimation circuit is shown in Figure 2, where $U$ is the highest 8 bits of the partial remainder and $S^{*}$ is the highest 4 bits of the current square-root result. In each iteration cycle, the circuit produces 4-bit partial square-root digits.
It can be seen from (14) that the estimation process of the partial square-root digits can be composed of 4 stages of cascaded full-adders. However, even if carry look-ahead adders are used, it still takes the latency of 4 full-adder stages to obtain the 4-bit partial square-root digits. To obtain better timing performance, the structure of the estimation circuit is improved in this paper.
First, for the estimation process of each square-root digit in (14), a composite adder is used instead of the full-adder, so that the addition and subtraction are carried out independently and the result is selected according to the sign of the previous stage. This reduces the carry-in delay of the full-adder. In addition, the secondary operation is expanded from (14) to (20). It can be seen from (20) that, by decomposing the secondary adder into two independent adders, the secondary adder can execute simultaneously with the former adder. Once the current-stage adder obtains its final result, the next result is available after only the latency of a one-level 2-input multiplexer. Each cascade stage thus saves about one adder level of latency, and the advantage becomes more obvious in higher-radix structures.
Table 2 shows the latency and area cost evaluation results of partial square-root digits estimation circuits with different radices. The evaluation conditions are the worst-case process corner (1.08 V, 125 °C) with 65 nm technology. From the data in Table 2, it can be seen that both the critical path delay and the area cost of the proposed partial square-root digits estimation circuit increase almost linearly with the radix. Compared with the standard algorithm, the radix-256 partial square-root digits estimation circuit costs only 1.8 K gates.
To achieve accurate results, it is necessary to detect and correct the errors of the partial square-root digits and the partial remainder during the iteration process. According to (15), the correction circuit of the partial square-root digits can be realized by a 4-bit subtracter. Based on (16), the coefficient $k$ correction circuit outputs the estimated value $w^{*}$ or $r - w^{*}$ according to the sign of the partial remainder in the previous iteration, and the result is still 4 bits wide.
According to (18), in the iterative process the coefficient $H$ can be realized by bit splicing: it is composed of $S_n$ after a 5-bit left shift and the 4-bit $w^{*}$. When $P_n \ge 0$, the 5th digit of $H$ is fixed to "0", while when $P_n < 0$, the 5th digit is fixed to "1". The structure of the coefficient $H$ correction circuit is shown in Figure 3. Since the bit-width of the partial square-root digits increases by 4 in each cycle, the splicing operation of the coefficient $H$ needs to go through 7 cycles, adding a total latency of 3 levels of 2-input multiplexers. Since the correction operations of the coefficients $k$ and $H$ are carried out in parallel, the latency added by the correction circuit is about that of one 4-bit adder level or three levels of 2-input multiplexers.
According to (18) and (19), the sign of the partial remainder is generated in the previous iteration, and the partial remainder correction circuit outputs $-k \times H$ or $k \times H$. In these two cases, an addition or a subtraction must be performed accordingly, which would normally require an independent adder and subtracter. As shown in Figure 3, we use the characteristics of the full-adder to solve this problem. When addition is performed, the input operand of the adder is $k \times H$ and the carry-in is 0. When subtraction is performed, the input operand is $\sim(k \times H) + 1$, where $\sim(k \times H)$ is a bitwise inversion that can be implemented by parallel XOR gates, and the additional "1" of the two's complement is supplied as the carry-in of the full-adder. The structure in Figure 3 can therefore perform the partial remainder correction without additional adder resources, and the latency increases by only one level of XOR gates.
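The shared adder trick described above (conditional inversion by XOR, with the carry-in supplying the "+1" of the two's complement) can be modeled in a few lines of Python. The function name is hypothetical, and the 33-bit default width is taken from the adder width stated earlier in this section; the model works modulo $2^{\text{width}}$ and does not reproduce Figure 3 itself.

```python
def add_or_subtract(a, b, subtract, width=33):
    """Compute (a + b) or (a - b) modulo 2**width with a single addition."""
    mask = (1 << width) - 1
    # XOR with an all-ones word inverts b; XOR with zero passes it through.
    b_in = (b ^ (mask if subtract else 0)) & mask
    carry_in = 1 if subtract else 0     # the "+1" of the two's complement
    return (a + b_in + carry_in) & mask

print(add_or_subtract(100, 58, subtract=False))
print(add_or_subtract(100, 58, subtract=True))
```

Only the XOR row sits in front of the adder, which is consistent with the statement that the extra latency is one level of XOR gates.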

Implementation Results and Comparison
To obtain accurate evaluation results, we use Synopsys Design Compiler to synthesize the proposed square-root circuit in 65 nm technology. Table 3 shows the synthesis results under the worst-case process corner (1.08 V, 125 °C) at a clock frequency of 300 MHz. The calculation period given in Table 3 depends on the bit-width of the operand. To support the 4 rounding modes specified in the IEEE-754 standard, sufficient square-root digits must be obtained in the iteration process. For a single-precision floating-point operand, at least 27 bits of square-root digits are required: 24 bits of standard square-root digits and 3 rounding bits (guard bit, round bit, and sticky bit). A double-precision floating-point operand requires at least 56 bits of square-root digits: 53 bits of standard square-root digits and 3 rounding bits.
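The relation between the required digit count and the number of mantissa iterations can be sketched as follows, assuming each cycle delivers exactly $\log_2 r$ bits and that the separate rounding cycle is not counted. The helper name is hypothetical, and the values it prints are not taken from Table 3.

```python
def iteration_cycles(result_bits, radix):
    """Iterations needed when each cycle yields log2(radix) result bits."""
    bits_per_cycle = radix.bit_length() - 1
    if radix < 2 or (1 << bits_per_cycle) != radix:
        raise ValueError("radix must be a power of two")
    return -(-result_bits // bits_per_cycle)   # ceiling division

for radix in (16, 64, 256):
    print(radix, iteration_cycles(27, radix), iteration_cycles(56, radix))
```

For single precision at radix-16 this gives 7 iterations, in line with the 7 mantissa cycles plus one rounding cycle described for the proposed architecture.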
In addition, in order to provide a fair comparison with the results in other reports, the implementation results with different calculation precisions and different radices based on the proposed architecture are given in Table 3. For the area cost of the square-root circuit, Table 3 gives two expressions: the leaf cell count and the cell areas.
Combining the lookup-table area in Table 1 with the area data in Table 3 illustrates the area advantage of the proposed design. When the radix is 64, the lookup table (ROM) alone costs 20,220.5 µm², while the overall area cost of the proposed square-root circuit is 9199.08 µm², which is only about 45% of the lookup table in the standard SRT algorithm. When the radix is 256, the area cost of the lookup table is 376,719.8 µm², while the area cost of the proposed square-root circuit is 12,017.88 µm², which is only about 3% of the lookup table circuit.
Comparing the area data in Tables 1 and 3 shows that the huge area cost of the lookup table limits the application of the standard SRT algorithm in high-radix square-root circuits. Therefore, in the design of high-radix square-root circuits, the proposed architecture has a more obvious advantage in area cost, and the bottleneck of the high-radix SRT square-root circuit is also resolved.
Figure 4 shows the detailed function waveform of the 32-bit radix-16 square-root circuit based on the proposed architecture. The meanings of the signals are as follows: "i_div_a" is the input data; "o_div_r" is the result output; "o_div_hskd" is the valid indication signal of the result; "man_sub_o" is the estimated value of the partial square-root digits, which corresponds to $w^{*}$ in Equation (13); "man_qds_o" is the corrected value of the partial square-root digits, which corresponds to $w_n$ in Equation (15); "rem_add_a" is the partial remainder generated in the previous iteration, which corresponds to $rP_n$ in Equation (19); "rem_add_o" is the partial remainder generated in the current iteration, which corresponds to $P_{n+1}$ in Equation (19); "rem_mul_o" corresponds to $H_n \times k_{n+1}$ in Equation (19); and "div_cnt_r" is a counter that displays the calculation cycle and is used to control the iteration process.
In Figure 4, the decimal floating-point input is 879,632.125, whose hexadecimal representation is 0x4956_C102; the square-root result is 937.887, whose hexadecimal floating-point representation is 0x446A_78C5.
As can be seen in Figure 4, after 8 cycles of iterative calculation, the proposed square-root circuit can obtain correct results.
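The input and output encodings quoted for Figure 4 can be reproduced in software. The sketch below uses Python's `struct` module to round values to IEEE-754 single precision in round-to-nearest mode; the helper names are hypothetical, and the square root is computed in double precision before being rounded to single precision, which is a software shortcut rather than a model of the circuit.

```python
import math
import struct

def float32_bits(value):
    """Bit pattern of `value` rounded to IEEE-754 single precision."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def float32_from_bits(bits):
    """Python float holding the single-precision value encoded by `bits`."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

x_bits = float32_bits(879632.125)
root_bits = float32_bits(math.sqrt(float32_from_bits(x_bits)))
print(f"0x{x_bits:08X} -> 0x{root_bits:08X}")
```

Comparing the printed words with the waveform values gives a quick software cross-check of the example operand.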
Moreover, when the partial remainder generated in the current iteration cycle is negative, there is an error in $w^{*}$. In that case $w$ is corrected in the current cycle, and the error of the partial remainder is corrected in the next iteration through the circuit corresponding to Equation (18). The correction positions of the partial remainder and square-root digits are marked in the waveform of Figure 4.
Figure 5 compares the calculation cycles of the proposed square-root circuit with those of other mainstream processors. It must be pointed out that the comparison in Figure 5 is limited to calculation cycles and does not consider the technology, frequency, area cost, or power consumption of the different processors.
As shown in Figure 5, the performance of the square-root circuits based on multiplication is slightly higher than that of the SRT-4 algorithm, and it is mainly limited by the latency of the multiplication and accumulation units. Compared with the SRT-4 algorithm, the SRT-16 algorithm obtains twice the bit-width of square-root digits in each cycle, which greatly improves the computational performance. However, the SRT-based implementations in Figure 5 show that, in the standard SRT algorithm, the area cost of the higher-radix lookup table limits its application in processor design. Even the Intel Penryn processor implements the SRT-16 algorithm with a cascade of SRT-4 stages.
A square-root circuit based on multiplication can improve performance, raising the throughput of the square-root unit to one result per cycle through a pipelined structure. However, the mainstream processors in Figure 5 all adopt the iterative structure to reduce the penalty of pipeline flushes caused by branch mispredictions. This also suggests that the proposed square-root structure is well suited to low-speed processor designs based on the RISC-V instruction set architecture. In addition, in the comparison of computational performance in Figure 5, the proposed structure requires fewer computational cycles.
Table 4 lists the comparison with other square-root circuits based on the SRT algorithm, including area cost (cell area and leaf cell count), operand precision, and power consumption. Because the designs use different technologies and frequencies, the ratio of power consumption to frequency is provided as a reference to make the power comparison fairer.
From the comparison in Table 4, it can be seen that the area cost of this paper is reduced by 37.69% compared with [15]. It should be noted that [15] uses the smaller 40 nm technology while this paper uses 65 nm technology; if the shrinkage of technology size is considered, an even greater area reduction will be achieved.
Compared with [13], the number of equivalent gates is reduced by 66.71%, and the area cost of the proposed circuit is only 6.27% of that of [16]. Compared with [17], the circuit area of this paper is smaller, while the calculation performance is nearly doubled.
Even if the same circuit is implemented in different technologies, the power consumption differs markedly. In Table 4, the power consumption of [13,16], implemented in 90 nm technology, is about nine times that of [14,15]. However, even compared with [14,15], which are implemented in 40 nm technology, the proposed structure achieves lower dynamic power under the same calculation cycles and precision. Therefore, the proposed square-root structure is also suitable for power-sensitive processor designs.
Latency in Table 4 represents the time required for the square-root circuit to complete a calculation. It can be seen from Table 4 that, in terms of maximum frequency and latency, the square-root circuit based on the proposed algorithm outperforms only [17]. However, it should be pointed out that process parameters (e.g., technology size, voltage, and temperature) have a significant impact on the maximum frequency of the circuit.
In order to avoid the impact of different process parameters, the combinational logic depth of the critical path in the circuit is generally used to evaluate the performance of the circuit. In Table 4, the maximum logic depth of the square-root circuit with radix-16 and precisions of 32 and 64 are 33 and 41 levels, respectively. However, the data of the maximum logic depth is not given in other references. Therefore, the maximum frequency or performance cannot be directly compared across technology.
However, the maximum frequency and performance of the circuit can be compared indirectly according to the implementation structure of the algorithm. For the square-root circuits given in Table 4, the performance is determined by two parameters: the maximum frequency and the calculation cycles. According to the principle of the SRT algorithm, the higher the radix $r$, the fewer iteration cycles are required to complete the calculation, and the higher the performance the square-root circuit can achieve at the same frequency. Both [16] and [17] adopt the standard SRT algorithm and the lookup table structure. However, the area comparison data show that, compared with [17], the radix of [16] is increased by four times and the circuit area cost by 17 times, while the calculation performance is improved by only two cycles. Neither [14] nor [15] adopts the lookup table structure. Instead, a cascade of lower-radix SRT square-root circuits is adopted to obtain a higher radix. Although a significant increase in circuit area is avoided, when the radix is doubled, the critical path delay of the corresponding circuit also doubles.
According to the data in Table 3, when the radix of the square-root circuit based on the proposed algorithm is increased from 16 to 64, the circuit area cost is only increased by about 1.4 times, and the critical path delay is only increased by 1.5 times. Even when the radix is 256, the circuit area increases by only 1.8 times, and the critical path delay increases by only 1.9 times. Therefore, it can be seen from the data in Tables 3 and 4 that although there is a gap in frequency compared with other reports, the proposed square-root structure has better tradeoff between the area cost and frequency, and is more suitable for applications that are sensitive to power consumption and area cost.

Conclusions
In this paper, a novel architecture for a floating-point square-root circuit based on the SRT algorithm was proposed, in which the computational performance and the area cost are linear with the radix. In the proposed architecture, a partial square-root digits estimation circuit replaces the lookup table of the standard SRT algorithm, which removes the area-cost bottleneck of the high-radix SRT algorithm. The recursive process of the standard SRT algorithm is extended so that the estimation errors of the partial square-root digits and the remainder are corrected in time, and error accumulation is eliminated by using the non-recovery remainder division and a full-adder. Compared with the standard SRT algorithm, the proposed algorithm does not need additional calculation cycles. Finally, we designed a radix-16 floating-point square-root circuit in accordance with the IEEE-754 standard and deployed it in the FPU of a RISC-V processor core. Compared with other designs in the literature, the proposed floating-point square-root circuit can reduce the area cost significantly under the same operand precision and computational performance.