RNS Number Comparator Based on a Modified Diagonal Function

Abstract: Number comparison has long been recognized as one of the most fundamental non-modular arithmetic operations to be executed in a non-positional Residue Number System (RNS). In this paper, a new technique for designing comparators of RNS numbers represented in an arbitrary moduli set is presented. It is based on a newly introduced modified diagonal function, whose strictly monotonic properties make it possible to replace the cumbersome operations of finding the remainder of the division by a large and awkward number with significantly simpler computations involving only a power of 2 modulus. Comparators of numbers represented in sample RNSs composed of varying numbers of moduli and offering different dynamic ranges, designed using various methods, were synthesized for the 65 nm technology. The experimental results suggest that the new circuits enjoy a delay reduction ranging from over 11% to over 75% compared to the fastest circuits designed using existing methods. Moreover, it is achieved using less hardware, the reduction of which reaches over 41%, and is accompanied by significantly reduced power consumption, which in several cases exceeds 100%. Therefore, it seems that the presented method leads to the design of the most efficient hardware comparators currently available for numbers represented using a general RNS moduli set.


Introduction
Parallel data processing is one of the most viable approaches to meet steadily growing needs for high-performance computations. Therefore, algorithms and data representations enjoying parallel structures, which facilitate the processing of a large amount of data efficiently, have been an area of active research for many years. One of the promising directions in this field relies on using the Residue Number System (RNS) to represent integers [1,2]. The RNS is a non-positional number system defined by the set of $n$ ($n \ge 2$) pairwise relatively prime positive integers called moduli $\{m_1, m_2, \ldots, m_n\}$.
Its dynamic range is equal to the product $M = \prod_{i=1}^{n} m_i$, which allows it to represent all $a$-bit numbers, where $a = \lfloor \log_2 M \rfloor$. Any non-negative integer $X$ such that $0 \le X < M$ can be uniquely represented in RNS as $X = \{x_1, x_2, \ldots, x_n\}$, where the $i$th digit of $X$ in RNS, $x_i = |X|_{m_i}$, is the remainder of the integer division of $X$ by the modulus $m_i$, represented in $a_i = \lceil \log_2 m_i \rceil$ bits.
... the remainder of a division. The comparison algorithm based on the new CRT-II, suggested in [17], makes it possible to reduce the maximum size of the modulo addition from $M$ to approximately $\sqrt{M}$, where $M$ is the dynamic range. The method of [22] relies on the approximate calculation of positional numbers according to CRT, whereas that of [24] makes it possible to compare signed numbers, but it also requires sign detection for each compared number. Finally, some comparators have been proposed for RNSs using special bases, e.g., the 4-moduli set composed of two pairs of conjugate moduli $\{2^n - 1, 2^n + 1, 2^{n+1} - 1, 2^{n+1} + 1\}$, as well as the 3-moduli sets $\{2^n - 1, 2^n, 2^n + 1\}$ [20] and $\{2^n - 1, 2^{n+x}, 2^n + 1\}$ [25].
In summary, the drawbacks of the previous magnitude comparison algorithms are: the need for using a redundant modulus, restricting the moduli set or time-consuming modulo operations involving large numbers (the size of the dynamic range or close). Here we will show how to extend the idea of diagonal function so that a high-speed and efficient comparator in RNS can be implemented in hardware. The new approach proposed here relies on integrating techniques from [16,26], and it is based on modifying the diagonal function of the numbers represented in RNS. The major advantages of this method are that, in addition to not requiring the computation of a remainder of division, it also leads to computations involving numbers of smaller sizes than in [26].
This paper is organized as follows. Section 2 presents the method of comparison using the SQT based on the diagonal function. Section 3 thoroughly details the theoretical background of the modified diagonal function proposed here, leading to significantly improved performance of the comparator. Performance estimations and comparison against existing circuits are provided in Section 4. Finally, some conclusions and suggestions for future research are given in Section 5.

Number Comparison Using the Sum of Quotients Technique (SQT)
In this section, we present the key ideas of the SQT method of [16], which will facilitate understanding of our method, a modification of SQT presented in Section 3. The main idea of the SQT method relies on the observation that in the finite $n$-dimensional space determined by the number of moduli $n$, the integers are ordered along straight lines, which are parallel to the main diagonal of the space. In MRC, each line represents the most significant digit of the number. However, these diagonals can be renumbered in a natural order of integers. In this case, the comparison of two numbers can be done by considering the numbers of the diagonals to which they belong. For fast determination of the diagonal to which a number belongs, a monotonically increasing function based on the Sum of Quotients (SQ) was defined, with
$$SQ = \sum_{i=1}^{n} M_i, \quad (1)$$
where $M_i = M/m_i$. Let us define the following constants:
$$k_i = \left| -h_i \right|_{SQ}, \quad 1 \le i \le n, \quad (2)$$
where $h_i = |1/m_i|_{SQ}$ is the multiplicative inverse of $m_i$ mod $SQ$ ($1 < h_i < SQ$), i.e., such an integer that $|h_i \cdot m_i|_{SQ} = 1$. (Recall that a multiplicative inverse exists provided that $m_i$ and $SQ$ are co-prime, which is indeed the case here.) It was shown that for the set of constants $k_i$ the following congruence holds:
$$\left| \sum_{i=1}^{n} k_i \right|_{SQ} = 0. \quad (3)$$
These notions are essential to defining the diagonal function as
$$D(X) = \left| \sum_{i=1}^{n} k_i \cdot x_i \right|_{SQ}, \quad (4)$$
which was shown to be monotonically increasing over the set of integers $0 \le X < M$. This method is called the Sum of Quotients Technique (SQT), because the following important equality holds:
$$D(X) = \sum_{i=1}^{n} \left\lfloor \frac{X}{m_i} \right\rfloor. \quad (5)$$
Electronics 2020, 9, 1784
The comparison of RNS numbers using SQT is summarized in the following algorithm, whose hardware implementation is shown in Figure 1.
Algorithm 1.
Input: $X = \{x_1, x_2, \ldots, x_n\}$, $Y = \{y_1, y_2, \ldots, y_n\}$.
Output: "100" if $X < Y$, "010" if $X = Y$, and "001" if $X > Y$.
Step 1. Calculate $D(X)$ and $D(Y)$. Note: These computations are independent and can therefore be executed in parallel, provided that two circuits implementing the diagonal function are available.
Step 2. Compare the values of $D(X)$ and $D(Y)$.
The main disadvantage of Algorithm 1 is that the computation of the remainder of the division by the modulus $SQ$, executed by the $n$-operand multi-operand modular adder (MOMA) mod $SQ$, is both hardware- and time-consuming. (It will be seen later that for sample moduli sets the difference between the bit sizes of $M$ and $SQ$ can be from 3 to 5 bits.) In [16], it was suggested that in the case of the equality $D(X) = D(Y)$ (the diagonal function is not strictly monotonic), an extra comparison must be executed. However, in [23], it was shown that this additional comparison can actually be done in parallel, so that the only delay penalty is two gate levels (this observation was taken into account in Figure 1). In the following section, we will show how to modify SQT to replace the MOMA mod $SQ$ with a significantly faster and simpler circuit operating modulo a power of 2.
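The diagonal function described in this section can be checked numerically. The following is a minimal Python sketch, not the circuit of [16]: the function names `sqt_constants` and `diagonal` are hypothetical, the moduli set {5, 11, 17} is chosen only for illustration, and the constants are assumed to be $k_i = |-1/m_i|_{SQ}$. It verifies that the value computed from the residues alone equals the sum of quotients, and shows that consecutive integers can share the same value, which is why SQT needs an extra comparison in the equality case.

```python
from math import prod

def sqt_constants(moduli):
    """Return (SQ, [k_1..k_n]) with k_i = |-1/m_i|_SQ; needs gcd(m_i, SQ) = 1."""
    M = prod(moduli)
    SQ = sum(M // m for m in moduli)                 # SQ = M_1 + ... + M_n
    ks = [(-pow(m, -1, SQ)) % SQ for m in moduli]    # pow(m, -1, SQ) is h_i (Python 3.8+)
    return SQ, ks

def diagonal(residues, moduli):
    """Diagonal function D(X) = |sum k_i * x_i|_SQ, computed from residues only."""
    SQ, ks = sqt_constants(moduli)
    return sum(k * x for k, x in zip(ks, residues)) % SQ

moduli = (5, 11, 17)
M = prod(moduli)
d = [diagonal([X % m for m in moduli], moduli) for X in range(M)]

# D(X) coincides with the sum of quotients for every X in the dynamic range.
assert all(d[X] == sum(X // m for m in moduli) for X in range(M))
print(d[:12])   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 3]
```

The repeated values in the printed prefix illustrate that $D(X)$ is monotonic but not strictly monotonic.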

Comparison Using the Modified Diagonal Function
Here, we describe the new method for comparison of RNS numbers based on introducing the modified diagonal function (MDF). It is based on the observation that if all constants $k_i$ are divided by $SQ$, similarly to what was done for the CRT-based sign detector proposed in [26], then the computations can be moved from the residue class $[0, SQ - 1]$ to the interval $[0, 1)$, so that computations involving integer parts of real numbers are not needed. In other words, the operation of finding the remainder of the division by $SQ$ is replaced with the more efficient operation of discarding the integer part of a number. However, the major concern with such an approach is its accuracy, because in most cases the fractional numbers cannot be represented exactly using a finite number of bits. Nevertheless, the accurate passing from computations on fraction parts to computations on integers can be done as follows:

1. Multiply each real constant by $2^N$, where $N$ is the number of bits of the fraction part, which guarantees sufficient accuracy.
2. For each real number, say $Z$, calculate $\lceil Z \rceil$, i.e., the smallest integer not less than $Z$.
3. Execute all computations modulo $2^N$ (it is sufficient to ignore all carries generated from the $(N - 1)$-th position).
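The three steps above can be sketched in Python using exact integer arithmetic. This is an illustration under stated assumptions, not the authors' implementation: the helper name `mdf_constants` is hypothetical, the constants are assumed to be $k_i = |-1/m_i|_{SQ}$, and the moduli set {13, 15, 17} with $N = 14$ is an assumption chosen because it reproduces the constants 6300, 12,014 and 14,456 that appear in the numerical example later in this section.

```python
from math import prod

def mdf_constants(moduli, N):
    """Scaled constants ceil(2^N * k_i / SQ), computed without floating point."""
    M = prod(moduli)
    SQ = sum(M // m for m in moduli)
    ks = [(-pow(m, -1, SQ)) % SQ for m in moduli]   # k_i = |-1/m_i|_SQ
    # Steps 1 and 2: scale k_i / SQ by 2^N and round up (ceiling division).
    return [-((-(k << N)) // SQ) for k in ks]

consts = mdf_constants((13, 15, 17), 14)
print(consts)   # [6300, 12014, 14456]
```

Step 3 (reduction modulo $2^N$) is applied when the weighted residues are accumulated, so it does not appear in this helper.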
Note: The only limitation of the above conversion occurs when $SQ$ divides $2^N$ (i.e., $SQ$ is a power of two), because in this case the suggested method reduces to the original one. Because in most cases $SQ$ does not divide $2^N$, henceforth we consider only this case. To determine the smallest $N$ which guarantees sufficient accuracy, we proceed as follows.
First, notice that the constants can be recalculated as
$$\overline{k}_i = \left\lceil \frac{2^N \cdot k_i}{SQ} \right\rceil = \frac{2^N \cdot k_i}{SQ} + R_i,$$
where $R_i$, $0 \le R_i < 1$, is the rounding error. Now we can define the following positional characteristic of a number, which will be called the modified diagonal function (MDF):
$$\overline{D}(X) = \left| \sum_{i=1}^{n} \overline{k}_i \cdot x_i \right|_{2^N}.$$

Theorem 1. Let $m_n$ be the largest modulus of the moduli set. If $N \ge \lceil \log_2 (SQ \cdot (m_n - 1)) \rceil$, then the MDF $\overline{D}(X)$ is strictly increasing for any $0 \le X < M$, i.e., for any $0 \le X_1 < X_2 < M$ we have $\overline{D}(X_1) < \overline{D}(X_2)$.

Proof. First, we find the value of $D(X - 1)$ for any $0 < X < M$. Because for any
therefore, according to Equality (5), we have
where $|Z|_1$ denotes discarding of an integer part of $Z$. Obviously, because $D(X) \ge D(X - 1)$, $D(X - 1)$ can be determined using Equality (10):
Now we will determine the properties of the functions $D(X)$ and $D(X - 1)$. By applying the notation introduced earlier, we obtain
which leads to
Now according to Equality (11) and by taking into account that in RNS
Because for any $0 \le i \le n$
which hence becomes
From the formulas derived above, it is obvious that $D(X) - D(X - 1)$ is the additional term of the expression equal to
By considering that
we obtain
For the function $\overline{D}(X)$ to be strictly increasing, it is necessary to satisfy the two following conditions.
... inequality makes it possible to pass from computation of the remainder of the division to the computation mod $2^N$ for $\overline{D}(X)$ in Equality (13), and hence in Equality (16) as well.
If this inequality holds and both Condition 1 and Inequality (19) are satisfied, then the value of $\overline{D}(X)$ calculated by Equality (13) is larger than the value of $\overline{D}(X - 1)$ calculated by Equality (16).
Whether these two conditions are satisfied depends on $N$. Now we will show how to determine the smallest $N$ for which both Conditions 1 and 2 hold. Because the function $D(X)$ is monotonic, $D(X) \ge D(X - 1)$. Thus, according to Equality (2)
Additionally, because $0 < R_i < 1$ and $\max$
Because
Condition 1 leads to the inequality
which in turn leads to $2^N > SQ \cdot (m_n - 1)$, and finally to
$$N \ge \lceil \log_2 (SQ \cdot (m_n - 1)) \rceil. \quad (24)$$
From Condition 1, and assuming that Inequality (24) holds, we have
Now consider the inequality
holds for $1 \le i \le n$, then Inequality (26) is true for $X$ with any number of zeros in its RNS representation. Therefore, we can estimate $N$ from Inequality (27). Recalling that $0 \le R_i < 1$, $1 \le R < n$, and $\max_{1 \le i \le n} m_i = m_n$, therefore
which implies that Inequality (27) holds if
To estimate $N$, we calculate the logarithm of the last inequality
which is identical to Inequality (24). Therefore, if Inequality (24) holds, then both Conditions 1 and 2 are satisfied, which concludes the proof.
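The bound of Theorem 1 can be checked by brute force for small moduli sets. The sketch below is an illustration only: the helper names `min_fraction_bits` and `mdf_values` are hypothetical, the constants are assumed to be $\lceil 2^N k_i / SQ \rceil$ with $k_i = |-1/m_i|_{SQ}$, and $N$ is taken as the smallest integer satisfying $2^N > SQ \cdot (m_n - 1)$. It evaluates the MDF for every $X$ in the dynamic range and confirms that the sequence is strictly increasing for two sample sets.

```python
from math import prod

def min_fraction_bits(moduli):
    """Smallest N with 2^N > SQ * (m_n - 1), m_n being the largest modulus."""
    M = prod(moduli)
    SQ = sum(M // m for m in moduli)
    return (SQ * (max(moduli) - 1)).bit_length()

def mdf_values(moduli, N):
    """MDF of every X in [0, M), obtained from the residues of X only."""
    M = prod(moduli)
    SQ = sum(M // m for m in moduli)
    ks = [(-pow(m, -1, SQ)) % SQ for m in moduli]
    cs = [-((-(k << N)) // SQ) for k in ks]          # ceil(2^N * k_i / SQ)
    mask = (1 << N) - 1                              # reduction modulo 2^N
    return [sum(c * (X % m) for c, m in zip(cs, moduli)) & mask for X in range(M)]

for moduli in [(5, 11, 17), (13, 15, 17)]:
    N = min_fraction_bits(moduli)
    v = mdf_values(moduli, N)
    assert all(a < b for a, b in zip(v, v[1:]))      # strictly increasing
    print(moduli, N)
```

The check is exhaustive only for the sets listed; it supports the theorem on these examples and is not a substitute for the proof.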
Inequality (24) can be considered as the condition that guarantees strict monotonicity of the MDF $\overline{D}(X)$: if it holds, then to compare two RNS numbers $X$ and $Y$ it suffices to compare the values of $\overline{D}(X)$ and $\overline{D}(Y)$. The above considerations can be formally summarized as the following algorithm.
Step 1. Calculate $\overline{D}(X)$ and $\overline{D}(Y)$. Note: These computations are independent and can therefore be executed in parallel, provided that two circuits implementing the MDF are available.
Step 2. Compare the values of $\overline{D}(X)$ and $\overline{D}(Y)$.
In summary, the diagonal function $D(X)$ of [16,18] is monotonic for $0 \le X < M$, whereas the MDF $\overline{D}(X)$ proposed here is strictly monotonic over this set (which makes it possible to compare numbers directly).
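The two steps of the MDF-based comparison can be modeled in software as follows. This is a behavioral sketch under stated assumptions, not the hardware design of the paper: the class `MdfComparator` and the helper `to_rns` are hypothetical names, the output codes are assumed to be the same as those listed for the SQT-based algorithm, and $N$ is chosen as the smallest integer with $2^N > SQ \cdot (m_n - 1)$.

```python
from math import prod

class MdfComparator:
    """Software model of an MDF-based comparator (illustrative only)."""

    def __init__(self, moduli):
        self.moduli = tuple(moduli)
        M = prod(self.moduli)
        SQ = sum(M // m for m in self.moduli)
        # Smallest N with 2^N > SQ * (m_n - 1), following Inequality (24).
        self.N = (SQ * (max(self.moduli) - 1)).bit_length()
        ks = [(-pow(m, -1, SQ)) % SQ for m in self.moduli]
        self.consts = [-((-(k << self.N)) // SQ) for k in ks]   # ceil(2^N * k_i / SQ)

    def mdf(self, residues):
        # Multi-operand addition modulo 2^N: carries out of bit N-1 are dropped.
        return sum(c * r for c, r in zip(self.consts, residues)) & ((1 << self.N) - 1)

    def compare(self, x, y):
        dx, dy = self.mdf(x), self.mdf(y)      # Step 1 (independent computations)
        if dx < dy:                            # Step 2
            return "100"
        return "010" if dx == dy else "001"

def to_rns(value, moduli):
    """Forward conversion: residues of value for each modulus."""
    return tuple(value % m for m in moduli)

comparator = MdfComparator((5, 11, 17))
print(comparator.compare(to_rns(100, comparator.moduli), to_rns(101, comparator.moduli)))   # 100
```

Because the MDF is strictly monotonic, no separate equality branch on the residues is needed, unlike in the SQT-based scheme.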
The differences between $D(X)$ and $\overline{D}(X)$ are illustrated for a sample 3-moduli RNS {5, 11, 17}. Figure 2 shows the diagrams of the values of the functions $D(X)$ and $\overline{D}(X)$ for the first 15 values of $X$, which clearly reflect their monotonic properties and demonstrate their differences.
... for which we have
$$\overline{D}(X) = |1 \cdot 6300 + 5 \cdot 12{,}014 + 15 \cdot 14{,}456|_{2^{14}} = 4682,$$
$$\overline{D}(Y) = |2 \cdot 6300 + 6 \cdot 12{,}014 + 16 \cdot 14{,}456|_{2^{14}} = 4684,$$
$$\overline{D}(Z) = |3 \cdot 6300 + 7 \cdot 12{,}014 + 0 \cdot 14{,}456|_{2^{14}} = 4694.$$
Recall that according to Theorem 1, the function $\overline{D}(X)$ is strictly monotonic for any $0 \le X < M$. Indeed, it is seen that: (i) $\overline{D}(X) < \overline{D}(Y)$ implies $X < Y$ and (ii) $\overline{D}(Z) > \overline{D}(Y)$ implies $Z > Y$.
Figure 3 shows the general scheme of the new comparator that implements Algorithm 2 (here $b_i = \min\{\lceil \log_2 (m_i \cdot \overline{k}_i) \rceil, N\}$). Obviously, the new comparator requires more extra hardware than its simple counterpart using any reverse converter followed by the $a$-bit comparator, because in the latter case only the small $a$-bit comparator must be added to the RNS-based processor. In our circuit, both the circuit computing $\overline{D}(X)$ and the $N$-bit comparator must be added. Nevertheless, the new comparator has two potential major advantages over the latter: (i) higher speed, because the delay of the $n$-operand MOMA mod $2^N$ is significantly smaller than that of the $n$-operand MOMA mod $M$ [30,31] in the CRT-based version of the reverse converter or of its MRC-based version (a slightly larger size of the operands handled by the final $N$-bit comparator ($N = a + \lceil \log_2 n \rceil$ vs. $a$) has little impact on the area or delay of the final comparator); and (ii) lower power consumption, because of the significantly smaller circuitry involved in performing the comparison. Both claims will be confirmed by performance estimations obtained for ASIC implementations of various basic general RNS number comparators, presented in the next section.
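The arithmetic of the numerical example in this section can be reproduced directly. In the Python sketch below, the constants and residues are those quoted in the example; the names `CONSTS`, `MASK` and `mdf` are hypothetical. The moduli set {13, 15, 17} and the integers 950, 951 and 952 are an assumption: they are consistent with the quoted constants and residues, but the source text identifying the moduli set of the example is not available here.

```python
CONSTS = (6300, 12014, 14456)   # constants quoted in the numerical example
N = 14
MASK = (1 << N) - 1             # keeping the N low-order bits = reduction mod 2^N

def mdf(residues):
    """Weighted sum of the residues with all carries above bit N-1 discarded."""
    return sum(c * r for c, r in zip(CONSTS, residues)) & MASK

X, Y, Z = (1, 5, 15), (2, 6, 16), (3, 7, 0)
print(mdf(X), mdf(Y), mdf(Z))   # 4682 4684 4694

# Assumption: the residues match three consecutive integers in RNS {13, 15, 17}.
moduli = (13, 15, 17)
assert tuple(950 % m for m in moduli) == X
assert tuple(951 % m for m in moduli) == Y
assert tuple(952 % m for m in moduli) == Z
```

The bit mask plays the role of the modulo $2^N$ adder: no division is performed at any point.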

Performance Estimations
In this section, we first present an approximate evaluation of the performance of hardware implementations of the new general RNS comparators and their three best-known counterparts, and then we provide more accurate estimations for ASIC implementations of all circuits considered.
Suppose that all $n$ RNS moduli are $l$-bit numbers. Then, the basic parameters of operands involved in modulo operations handled by four general RNS number comparators can be summarized as listed in Table 1. First, recall that all these circuits have a number of steps growing logarithmically as a function of the number of moduli $n$, i.e., they all have $O(\log n)$ delay. Second, notice that the moduli involved satisfy $\sqrt{M} < SQ < M < 2^N$, and the sizes of the corresponding operands are ordered accordingly. The three circuits based on the CRT, SQT, and MDF have a similar structure, composed of the $n$-operand modulo adder, with the major difference made by the modulus. Because neither $M$ nor $SQ$ is a power of 2 and $a > a_{SQ}$, it seems that the SQT-based circuit should involve less hardware and be faster than the CRT-based one. On the other hand, because the MDF-based circuit proposed here uses the $n$-operand adder modulo a power of 2 ($2^N$), it enjoys the major advantage of all arithmetic circuits mod $2^N$: significantly smaller delay and exceptional hardware efficiency compared to all its counterparts modulo any odd modulus, which involve cumbersome and lengthy operations of finding the remainder of the division by a large and awkward number $M$ or $SQ$. The simplicity and the speed gained by the former outweigh the minor delay/area differences due to a slightly larger final comparator of $N$-bit vs. $a$-bit and $a_{SQ}$-bit numbers. As for the comparator based on the CRT-II of [17], it executes $\lceil \log_2 n \rceil$ iterative steps on operands of growing size, involving computations modulo numbers growing up to about $\sqrt{M}$. On the one hand, $\sqrt{M}$ is not only the smallest of the moduli involved in computations by all comparators considered, but it is also involved only in the final stage of iterative computations, which suggests that it would result in some advantages.
On the other hand, the estimation of delay/area performance of this circuit is difficult, because each iterative step involves modulo computations which, despite being executed on relatively small moduli, are nevertheless time-consuming and executed serially.

Table 1.
Method    Operand size [bits]                                  Number of operands    Modulus
CRT       $a \le n \cdot l$                                    $n$                   $M$
SQT       $a_{SQ} \le (n - 1) \cdot l + \lceil \log_2 n \rceil$    $n$                   $SQ$
CRT: the direct method based on CRT; SQT: the method by Dimauro et al. [16]; CRT II: the method by Wang et al. [17]; MDF: the new method proposed here; † only the largest sizes of operands and moduli are indicated.
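To make the operand sizes discussed above concrete, the following Python sketch computes the bit widths of $M$, $SQ$ and $2^N$ for a given moduli set. It is an illustration only: the function name `modulus_widths` is hypothetical, $N$ is assumed to be the smallest integer with $2^N > SQ \cdot (m_n - 1)$, and the two moduli sets are chosen for illustration and are not taken from Table 2.

```python
from math import prod

def modulus_widths(moduli):
    """Bit widths associated with the moduli M, SQ and 2^N for a moduli set."""
    M = prod(moduli)
    SQ = sum(M // m for m in moduli)
    N = (SQ * (max(moduli) - 1)).bit_length()     # smallest N with 2^N > SQ * (m_n - 1)
    return {"M": M.bit_length(), "SQ": SQ.bit_length(), "2^N": N}

print(modulus_widths((13, 15, 17)))        # {'M': 12, 'SQ': 10, '2^N': 14}
print(modulus_widths((25, 27, 29, 31)))    # {'M': 20, 'SQ': 17, '2^N': 22}
```

For the second set, $SQ$ is 3 bits shorter than $M$, while the power-of-2 modulus is 2 bits longer; the latter costs only a slightly wider final comparator.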
To obtain more accurate complexity estimations, we synthesized all four comparators described above for various RNS moduli sets, which are grouped into two classes, listed in Table 2. Class 1 consists of 4-moduli sets, each composed of moduli of the same size $p$. Varying the size of moduli $p \in \{5, 7, 9, 11, 13\}$ makes it possible to observe the comparators' performance as a function of the dynamic range $M$, which grows only with the size of the moduli but not with their number (which remains constant). Class 2 consists of sets of moduli of the same size (we chose $p = 7$ bits), whose number $n$ varies from 3 to 8, which makes it possible to observe the comparators' performance as a function of the number of moduli. All sets of selected moduli consist of the largest existing pairwise prime moduli for a given $n$. The circuits were described in parametrized structural VHDL following identical coding guidelines and synthesized following a similar layout of module hierarchy and primitive components like adders. The additions and multiplications were implemented with register-transfer level (RTL) operators, and the selection of their architectures was left to the synthesis tool. We performed logic synthesis of the comparators for a range of target moduli sets using Cadence RTL Compiler v. 8.1 and an industrial 65 nm low-power library (STM CMOS065LP). For each design and moduli set, the minimum delay was found, which we assumed to be the smallest delay target for which the synthesis was still able to achieve a non-negative timing slack. The cell area and total power (including dynamic and leakage components) reported by the synthesis tool were taken as the area and power figures.
The complexity characteristics obtained are detailed in Tables 3-8 and visualized in Figures 4 and 5. It can be seen that the delay of the new comparator proposed here grows equally slowly (almost linearly) with increasing dynamic range or number of moduli $n$. This seems to result directly from the possibility of replacing cumbersome operations of finding the remainder of the division by a large and awkward number with significantly simpler multi-operand additions mod $2^N$. The synthesis results suggest that the new comparator proposed here is faster than all known similar circuits for all sample moduli sets considered, with delay reduction ranging from over 11% to over 75% compared to the fastest circuit designed using existing methods. The delay of the basic CRT-based implementation is only slightly larger, but only in a few cases. The largest delay is exhibited by the comparator based on the CRT-II of [17].

Moreover, the speed advantage of the new comparators was achieved using fewer hardware resources, with only two exceptions. For large dynamic ranges, hardware reduction is significant, as it can exceed 40% compared to the least complex existing designs. For all cases considered, the SQT-based method of [25] consumes more hardware resources than any other method. Hardware complexity of the basic CRT-based comparator deserves some special comments, because most of it is the reverse converter, which is used anyway as a stand-alone circuit. Therefore, it should not be considered a contributor to the overall hardware complexity.
Finally, power consumption seems to be the major advantage of the new comparators, as its reduction ranges from over 50% to over 178% for Class 1 moduli sets and from over 10% to over 130% for Class 2 moduli sets. Moreover, it was achieved using circuits which are faster for all cases considered. In this context, using specifically designed comparators instead of the CRT-based comparators (which actually require including the least amount of extra hardware: only the final comparator of $a$-bit numbers) could be of some practical interest. This is because a reverse converter activated just for the purpose of comparing numbers could be extremely power-consuming, as can be seen from the data listed in Tables 5 and 8, as well as shown in Figures 4c and 5c.

Conclusions
This paper proposes a new general approach to the comparison of numbers represented in the Residue Number System (RNS). It is based on a newly introduced concept of the modified diagonal function, which serves as a theoretical basis to develop a significantly faster and more efficient comparison algorithm. It made it possible to introduce a new positional characteristic of an RNS number which is strictly monotonic, so that it precisely reflects the relative positioning of numbers. Now, unlike in existing algorithms, computations involving cumbersome operations of finding the remainder of the division by a large and awkward number are replaced with significantly simpler computations involving only a power of 2 modulus. The newly proposed comparator and its most efficient known counterparts applicable for arbitrary RNS moduli sets, designed using various methods for several sample moduli sets, were synthesized for the 65 nm technology. Performance estimations obtained suggest that the new circuits enjoy delay reduction ranging from over 11% to over 75%, compared to the fastest circuits designed using existing methods. Moreover, it is achieved using less hardware, with a reduction that can reach over 41%, and is accompanied by significantly reduced power consumption, which in several cases exceeds 100%. Therefore, it seems that the presented method leads to the design of what are currently the most efficient hardware comparators of numbers represented using a general RNS moduli set. The magnitude comparison of RNS numbers, besides being used directly (like in some implementations of recent cryptographic algorithms using RNS), is also essential for the implementation of other RNS non-modular operations like division, sign detection, and overflow detection. Future research will include extensions of the approach proposed to handle other difficult non-modular RNS operations like sign and overflow detection.
Funding: The reported study was funded by RFBR, project number 20-37-70023 and project NCFU.

Conflicts of Interest: The authors declare no conflict of interest.

$a_{SQ}$    the number of bits to represent $SQ$
$D(X)$    the diagonal function
$\overline{D}(X)$    the modified diagonal function
$h_i = |1/m_i|_{SQ}$    the multiplicative inverse of $m_i$ mod $SQ$
$\{m_1, m_2, \ldots, m_n\}$    RNS moduli set
$n$    the number of moduli
$N$    the number of bits of the fraction part
$x_i$    the residue modulo (mod) $m_i$
$\{x_1, x_2, \ldots, x_n\}$    RNS representation of an integer $X$