New Residue Number System Scaler for the Three-Moduli Set $\{2^{n+1} - 1, 2^n, 2^n - 1\}$

This work proposes the first scaler designed specifically for the three-moduli set $M_1 = \{2^{n+1} - 1, 2^n, 2^n - 1\}$. Hence, there is no functionally similar scaler with which to compare it directly. However, when compared with the latest published scalers for a different moduli set, $M_2 = \{2^n + 1, 2^n, 2^n - 1\}$, the proposed scaler achieves better area and power performance at the cost of a longer time delay. As demonstrated in earlier publications, replacing the $(2^n + 1)$ channel in the $M_2$ moduli set by the $(2^{n+1} - 1)$ channel, to form the $M_1$ moduli set, considerably improves the overall time performance of residue-based multiply–accumulate arithmetic units.


Introduction
The residue number system (RNS) is a non-weighted number system representation. Numbers are represented using a set of relatively prime positive integers, referred to as moduli [1,2]. Specific arithmetic operations, such as addition, subtraction, and multiplication, are carried out with respect to each modulus independently of the other moduli. This feature allows parallel processing on all channels without a carry propagating across different channels. Therefore, the RNS is used in applications that depend on the aforementioned operations, such as digital signal processing and cryptography [1–5]. However, division is considered a difficult RNS operation [6].
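The channel-wise behaviour described above can be illustrated with a minimal Python sketch. The helper names (`to_rns`, `rns_add`, `rns_mul`) and the sample values are illustrative assumptions, not part of the original work; the moduli follow the set $\{2^{n+1} - 1, 2^n, 2^n - 1\}$ with $n = 3$.

```python
from math import gcd, prod

def to_rns(x, moduli):
    """Return the residues of x, one per modulus (channel)."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli):
    """Add two RNS values channel by channel; no carry crosses channels."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, moduli))

def rns_mul(a, b, moduli):
    """Multiply two RNS values channel by channel."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, moduli))

n = 3
moduli = (2**(n + 1) - 1, 2**n, 2**n - 1)   # (15, 8, 7)
# The moduli must be pairwise relatively prime.
assert all(gcd(a, b) == 1 for i, a in enumerate(moduli) for b in moduli[i + 1:])
M = prod(moduli)                            # dynamic range, 840 here

x, y = 23, 17                               # arbitrary values with x * y < M
sum_rns = rns_add(to_rns(x, moduli), to_rns(y, moduli), moduli)
prod_rns = rns_mul(to_rns(x, moduli), to_rns(y, moduli), moduli)
```

Each channel result agrees with the residue of the ordinary integer result as long as that result stays inside the dynamic range $[0, M)$.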
Scaling is an important operation needed whenever the results of computations carried out on each data set exceed specific allowable ranges within an RNS-based processor. The work published so far on RNS scaling deals either with moduli sets of general form or with the traditional set. The main scalers that deal specifically with the traditional moduli set $M_2 = \{2^n + 1, 2^n, 2^n - 1\}$ are presented in [7–13]. Those that are most efficient in terms of different metrics are presented in [12,13].
These conclusions have also been supported experimentally in terms of integrated circuit area, time delay, and power consumption for values of $n$ extending from 32 to 64 bits [17]. Additionally, in a very recent publication [18], circuit layout experiments on the moduli set $\{2^n + 1, 2^n, 2^n - 1\}$ over the range $3 \le n \le 22$ showed that the modulus $(2^n + 1)$ increases the area and latency of an RNS-based arithmetic structure when compared with modulo $2^n$ and $(2^n - 1)$ structures. A circuit layout of a multiplier and an accumulator (a single MAC structure) showed that the $(2^n + 1)$ channel requires, on average, an 18.9–35.6% increase in area and a 21.9–45.2% increase in delay as compared with the $2^n$ and $(2^n - 1)$ MACs [18]. Therefore, the $(2^n + 1)$ channel leads to a serious latency imbalance across an RNS-based processor that uses the popular three-moduli set. This imbalance increases progressively when designing a multi-MAC RNS-based processor [18]. The delay avoided by excluding modulo $(2^n + 1)$ arithmetic components, such as adders and multipliers, and replacing them with modulo $(2^{n+1} - 1)$ components is considerable, as demonstrated in [14,17,18]. Hence, using a $(2^n + 1)$-free moduli set such as $M_1$ substantially improves the overall time performance of an RNS-based processor. To the best of the author's knowledge, the scaler presented here is the first proposed in the literature to deal with $M_1$.

Decoding Analysis
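For a three-moduli set, the Chinese remainder theorem (CRT), which is used to convert an RNS value to its weighted equivalent, is defined by the following [1]:

$$X = \left| \sum_{i=1}^{3} \hat{m}_i \left| \hat{m}_i^{-1} \right|_{m_i} R_i \right|_M \tag{1}$$

where
• $M$ is the dynamic range given by $M = m_1 m_2 m_3$;
• $X$ is an integer such that $X \in [0, M)$, with the binary value of $X$ represented using $(3n + 1)$ bits;
• $\hat{m}_i = M/m_i$ and $\left| \hat{m}_i^{-1} \right|_{m_i}$ is the multiplicative inverse of $\hat{m}_i$ modulo $m_i$, where $i = 1, 2,$ and $3$;
• $R_i = |X|_{m_i}$ (the least non-negative remainder when dividing $X$ by $m_i$).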
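As a sanity check of the CRT decoding for this moduli set, the following Python sketch decodes residues with the standard CRT sum and computes the multiplicative inverses for $m_1 = 2^{n+1} - 1$, $m_2 = 2^n$, $m_3 = 2^n - 1$. The function names are hypothetical, and the three-argument `pow` with a negative exponent assumes Python 3.8 or later.

```python
from math import prod

def moduli_set(n):
    """Moduli (m1, m2, m3) = (2^(n+1) - 1, 2^n, 2^n - 1)."""
    return (2**(n + 1) - 1, 2**n, 2**n - 1)

def crt_inverses(moduli):
    """Multiplicative inverse of M/m_i modulo m_i for every channel."""
    M = prod(moduli)
    return tuple(pow(M // m, -1, m) for m in moduli)

def crt_decode(residues, moduli):
    """Standard CRT: X = | sum_i (M/m_i) * inv_i * R_i |_M."""
    M = prod(moduli)
    total = 0
    for r, m, inv in zip(residues, moduli, crt_inverses(moduli)):
        total += (M // m) * inv * r
    return total % M

n = 3
moduli = moduli_set(n)               # (15, 8, 7)
inverses = crt_inverses(moduli)      # (11, 1, 1); note 11 = -4 mod 15
x = 500
decoded = crt_decode(tuple(x % m for m in moduli), moduli)
```

For this set the inverses evaluate to $|-4|_{m_1}$, $1$, and $1$, which is why a factor of $-4$ multiplies $R_1$ in the decoding.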
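Substituting the moduli $m_1 = 2^{n+1} - 1$, $m_2 = 2^n$, and $m_3 = 2^n - 1$, together with the corresponding multiplicative inverses $|\hat{m}_1^{-1}|_{m_1} = |-4|_{m_1}$, $|\hat{m}_2^{-1}|_{m_2} = 1$, and $|\hat{m}_3^{-1}|_{m_3} = 1$, into Equation (1) leads to

$$X = \left| m_2 m_3 \left| -4R_1 \right|_{m_1} + m_1 m_3 R_2 + m_1 m_2 R_3 \right|_M \tag{2}$$

Using the notation $\lfloor \cdot \rfloor$ to refer to the floor value of $(\cdot)$, $X$ can be expressed as in [1]:

$$X = 2^n \left\lfloor \frac{X}{2^n} \right\rfloor + |X|_{2^n} \tag{3}$$

Because $X$ is represented in $(3n + 1)$ binary bits, the value $|X|_{2^n}$ represents the least-significant $n$ bits of $X$. Moreover, the value $\lfloor X/2^n \rfloor$, which represents the most-significant $(2n + 1)$ bits of $X$, is the scaled value of $X$, where the scaling factor is $2^n$. The floor value of $X/2^n$ is considered because the RNS represents only integer values [1].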
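The corresponding RNS digits of the scaled value $\lfloor X/2^n \rfloor$ are given by $(R_{1s}, R_{2s}, R_{3s})$, where $R_{is} = \left| \lfloor X/2^n \rfloor \right|_{m_i}$ for $i = 1, 2,$ and $3$. Equation (2) is rewritten as follows [1]:

$$X = m_2 m_3 \left| -4R_1 \right|_{m_1} + m_1 m_3 R_2 + m_1 m_2 R_3 - I M \tag{4}$$

where $I$ is the number of integer multiples of $M$ in the summation of the right-hand side (RHS) of Equation (2).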
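To evaluate $\lfloor X/2^n \rfloor$, the floor value of dividing Equation (4) by $2^n$ produces

$$\left\lfloor \frac{X}{2^n} \right\rfloor = m_3 \left| -4R_1 \right|_{m_1} + (2^{n+1} - 3) R_2 + m_1 R_3 - I m_1 m_3 \tag{5}$$

where, since $m_1 m_3 = 2^n (2^{n+1} - 3) + 1$, the fractional part $R_2/2^n$ is dropped in Equation (5) when taking the floor value because $R_2 < 2^n$; hence, $\lfloor R_2/2^n \rfloor = 0$. Applying modulus $m_1 m_3$ to Equation (5) produces

$$\left| \left\lfloor \frac{X}{2^n} \right\rfloor \right|_{m_1 m_3} = \left| m_3 \left| -4R_1 \right|_{m_1} + (2^{n+1} - 3) R_2 + m_1 R_3 \right|_{m_1 m_3} \tag{6}$$

then applying modulus $m_1$ to Equation (6) results in

$$R_{1s} = \left| m_3 \left| -4R_1 \right|_{m_1} + (2^{n+1} - 3) R_2 \right|_{m_1} \tag{7}$$

Recalling that $m_3 = (2^n - 1)$, the three terms on the RHS of Equation (6) can be rewritten individually, and substituting the three resulting expressions into Equation (6) leads to Equation (8). Rearranging the terms in Equation (8) produces Equation (9). Using the identity $|m_3 (\cdot)|_{m_1 m_3} = m_3 |(\cdot)|_{m_1}$, a term on the RHS of Equation (9) can be rewritten, and substituting that expression into Equation (9) leads to Equation (10). Recalling the identity $\left| |(\cdot)|_{m_1 m_3} \right|_{m_3} = |(\cdot)|_{m_3}$ [1], applying modulus $m_3$ to Equation (10) deletes the term that is an integer multiple of $m_3$. This produces $R_{3s}$. Defining the quantities $A$ and $v$, where $\wedge$ denotes a logical AND operation, Equation (10) can be rewritten as Equation (14). This allows the rewriting of Equation (14) as Equation (15). Equation (15) can in turn be rewritten in an equivalent form, and Equation (17) is likewise rewritten equivalently.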
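The end result of the decoding analysis can be checked numerically. The Python sketch below is not the formulation in terms of $A$ and $v$ used in the paper; it relies only on $X = 2^n \lfloor X/2^n \rfloor + R_2$, on $2^{n+1} \equiv 1 \pmod{m_1}$, and on $2^n \equiv 1 \pmod{m_3}$, and it recovers $R_{2s}$ through a generic two-moduli CRT. The names `scale_rns` and `check_all` are hypothetical.

```python
def scale_rns(residues, n):
    """Residues of floor(X / 2**n), computed from the residues of X."""
    m1, m2, m3 = 2**(n + 1) - 1, 2**n, 2**n - 1
    r1, r2, r3 = residues
    # X = 2**n * Y + R2 and 2 * 2**n = 1 (mod m1), so Y = 2*(R1 - R2) (mod m1).
    r1s = (2 * (r1 - r2)) % m1
    # 2**n = 1 (mod m3), so Y = R3 - R2 (mod m3).
    r3s = (r3 - r2) % m3
    # Y < m1*m3, so (r1s, r3s) determine Y; rebuild it with a two-moduli CRT.
    y = r3s + m3 * (((r1s - r3s) * pow(m3, -1, m1)) % m1)
    r2s = y % m2
    return (r1s, r2s, r3s)

def check_all(n):
    """Exhaustively compare scale_rns against a plain right shift by n bits."""
    m1, m2, m3 = 2**(n + 1) - 1, 2**n, 2**n - 1
    for x in range(m1 * m2 * m3):
        y = x >> n                      # floor(x / 2**n)
        expected = (y % m1, y % m2, y % m3)
        if scale_rns((x % m1, x % m2, x % m3), n) != expected:
            return False
    return True

# X = 50 with n = 2: residues (1, 2, 2), scaled value 12 -> residues (5, 0, 0)
example = scale_rns((50 % 7, 50 % 4, 50 % 3), 2)
```

Under these assumptions the $m_1$ channel reduces to a subtraction followed by a doubling modulo $2^{n+1} - 1$, and the $m_3$ channel to a subtraction modulo $2^n - 1$.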

Hardware Implementation
The proposed hardware implementation of the CRT-based $2^n$ scaler of the moduli set $\{2^{n+1} - 1, 2^n, 2^n - 1\}$ is shown in Figure 1. The carry-save adder (CSA) of Figure 1 consists of $(n + 1)$ full adders operating in parallel. The modulo $(2^{n+1} - 1)$ adder, modulo $(2^{n+1} - 1)$ subtractor, and modulo $2^n$ subtractor are described thoroughly in [19]. The modified $2^n$ modulo adder is described in the last paragraph of this section. It is important to mention two properties of modulo $(2^p - 1)$ arithmetic, where $p$ is a positive integer [1]. The first property states that $|2^k a|_{2^p - 1}$ is performed by rotating the $p$-bit binary representation of $a$ to the left by $k$ bits, where $a$ and $k$ are positive integers and $a < (2^p - 1)$. The second property states that $|-a|_{2^p - 1}$ is performed by obtaining the 1's complement of $a$. Therefore, the value of $|-4R_1|_{m_1}$, applied to the CSA of Figure 1, is computed by rotating the binary representation of $R_1$ 2 bits to the left and then taking the 1's complement of the rotated value. Assuming the binary representation of $R_1$ is given by $R_1 = (r_{1,n}\, r_{1,n-1} \cdots r_{1,1}\, r_{1,0})$, then $|-4R_1|_{m_1} = (\overline{r}_{1,n-2} \cdots \overline{r}_{1,0}\, \overline{r}_{1,n}\, \overline{r}_{1,n-1})$, where the overline $\overline{(\cdot)}$ denotes the complement of the bit $(\cdot)$. In the $m_1$ channel of Figure 1, however, $R_{1s}$ is obtained by rotating the output of the modulo $(2^{n+1} - 1)$ adder 1 bit to the left.
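The two modulo $(2^p - 1)$ properties can be checked with a short Python sketch. The helpers `rotl` and `ones_complement` are illustrative names, and the word width $p = n + 1$ corresponds to the $m_1$ channel. Note that the 1's complement of $0$ is the all-ones word, which is congruent to $0$ modulo $2^p - 1$ but is not the reduced representation, so the comparison below is made modulo $2^p - 1$.

```python
def rotl(a, k, p):
    """Rotate the p-bit word a to the left by k bit positions."""
    mask = (1 << p) - 1
    k %= p
    return ((a << k) | (a >> (p - k))) & mask

def ones_complement(a, p):
    """Invert every bit of the p-bit word a."""
    return a ^ ((1 << p) - 1)

def neg4_mod_m1(r1, n):
    """|-4*R1| modulo 2^(n+1) - 1: rotate left by 2, then complement."""
    p = n + 1
    return ones_complement(rotl(r1, 2, p), p)

n = 4
p = n + 1
m1 = 2**p - 1                                   # 31

# Property 1: rotation by k bits equals multiplication by 2^k modulo 2^p - 1.
prop1 = all(rotl(a, k, p) == (a * 2**k) % m1
            for a in range(m1) for k in range(1, p))

# Property 2: the 1's complement equals negation modulo 2^p - 1.
prop2 = all(ones_complement(a, p) % m1 == (-a) % m1 for a in range(m1))

# Combined use: |-4*R1| for every residue R1 of the m1 channel.
neg4_ok = all(neg4_mod_m1(r, n) % m1 == (-4 * r) % m1 for r in range(m1))
```

Both operations are pure rewiring or inversion in hardware, which is why multiplying $R_1$ by $-4$ costs no adder stage.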
The modified $2^n$ modulo adder in the $m_3$ channel is a binary adder that adds $R_3$ to $R_2$. The modified adder is built as follows: The output of the parallel prefix structure of the adder is directed into two different and parallel tracks. In the first track, a 1 is added as an input carry to the output of the parallel prefix structure. In the second track, the output carry, $c_{out}$, of the parallel prefix structure is reinserted and added as an input carry to produce $R_{3s}$ [14]. A few additional gates (not shown in Figure 1) are used to verify whether the condition $v$ is true. The result of this verification is inserted as an input carry to the modulo $2^n$ subtractor in the $m_2$ channel. This input carry bit is injected into the least-significant prefix operator [19,20].
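The reinsertion of $c_{out}$ as an input carry is the usual end-around-carry technique for addition modulo $2^n - 1$. The following Python sketch models only that arithmetic behaviour on generic operands; it does not model the parallel-prefix structure, the first track, or the condition $v$, and the function name `eac_add` is an assumption made for illustration.

```python
def eac_add(a, b, n):
    """Add two n-bit words and feed the carry-out back in as a carry-in."""
    mask = (1 << n) - 1
    s = a + b                 # plain binary sum, up to n + 1 bits
    c_out = s >> n            # carry out of the n-bit adder
    return (s + c_out) & mask # end-around carry, truncated to n bits

n = 5
m3 = 2**n - 1                 # 31

# The result is congruent to a + b modulo 2^n - 1 for all n-bit operands.
# An all-ones result is the second representation of zero.
eac_ok = all(eac_add(a, b, n) % m3 == (a + b) % m3
             for a in range(2**n) for b in range(2**n))
```

Because the all-ones word is an alternative encoding of zero, a design that needs a unique zero must detect and correct that case separately.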

Comparison and VLSI Realization
There is no scaler published in the literature for the moduli set $\{2^{n+1} - 1, 2^n, 2^n - 1\}$. The proposed scaler was therefore compared with the most recent and efficient published scalers of the traditional moduli set $\{2^n + 1, 2^n, 2^n - 1\}$ [12,13]. The unit-gate model was used as a basis for theoretical comparison [14]. All two-input monotonic gates had an area of 1 unit and a delay of 1 unit. The XOR (exclusive OR) and XNOR (exclusive NOR) gates had an area of 2 units and a delay of 2 units. Inverters were ignored. The full adder had an area of 7 units and a delay of 4 units, while the half adder had an area of 3 units and a delay of 2 units. The $(2^p - 1)$ modular adder has an area and delay of $3p \log_2 p + 5p$ and $2 \log_2 p + 3$, respectively [19,20]. The area and delay of a $2^p$ binary adder are $\frac{3}{2} p \log_2 p + 5p$ and $2 \log_2 p + 3$, respectively [19,20]. The modified modulo $2^n$ adder has an area and delay of $\frac{3}{2} p \log_2 p + 12p$ and $2 \log_2 p + 5$, respectively [19]. However, the area and delay of the $(2^p + 1)$ modular adder are $4.5p \log_2 p + 0.5p + 6$ and $2 \log_2 p + 3$, respectively [20]. Table 1 lists the area and time delay requirements of the proposed scaler and those in [12,13].

Table 1. Hardware and time requirements of the scaler proposed in this paper for $M_1 = \{2^{n+1} - 1, 2^n, 2^n - 1\}$ and of the scalers proposed in [12,13] for $M_2 = \{2^n + 1, 2^n, 2^n - 1\}$.

To obtain a more precise estimation of the area, delay, and power for the three designs under consideration, all the structures were modeled in Verilog HDL (Hardware Description Language) for values of $n$ = 6, 12, 18, 24, and 30. Synopsys Design Compiler (G-2012.06) was used to synthesize the designs and map them into 65 nm Synopsys DesignWare Digital Logic Libraries. The "place-and-route" phase was performed using the Synopsys IC Compiler. The Synopsys Power Compiler was also used to estimate the power consumed. Moreover, the Synopsys Simulator was used to verify the correctness of the design functionality. The results are shown in Table 2. The relative differences between the three designs (i.e., the proposed scaler and those of [12,13]) are listed in Table 3. Compared to the scaler of [12], Table 3 indicates that, on average, the proposed scaler had an area and power reduced by 12.7% and 11.7%, respectively. The proposed scaler had, on average, an increased time delay of 11.7%. Compared to the scaler of [13], Table 3 shows that the proposed scaler required a slightly smaller area and power, by 5.9% and 6.1%, respectively. However, it required an average time delay increase of 14.6%. Nevertheless, as mentioned in Section 1, avoiding the $(2^n + 1)$ MAC unit and replacing it with the $(2^{n+1} - 1)$ MAC unit saves very significant processing time [18].
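The unit-gate expressions for the four adder types quoted in this section can be evaluated with a small Python sketch. The function names are hypothetical, the logarithm is used exactly as printed (whether a ceiling applies for widths that are not powers of two is left open here), and the example width $p = 8$ is chosen so that $\log_2 p$ is an integer. The sketch does not reproduce Table 1.

```python
from math import log2

def mod_2p_minus_1_adder(p):
    """(area, delay) of a modulo 2^p - 1 adder in unit gates."""
    return 3 * p * log2(p) + 5 * p, 2 * log2(p) + 3

def binary_adder(p):
    """(area, delay) of a p-bit (modulo 2^p) binary adder."""
    return 1.5 * p * log2(p) + 5 * p, 2 * log2(p) + 3

def modified_adder(p):
    """(area, delay) of the modified adder used in the m3 channel."""
    return 1.5 * p * log2(p) + 12 * p, 2 * log2(p) + 5

def mod_2p_plus_1_adder(p):
    """(area, delay) of a modulo 2^p + 1 adder."""
    return 4.5 * p * log2(p) + 0.5 * p + 6, 2 * log2(p) + 3

p = 8
costs = {
    "2^p - 1": mod_2p_minus_1_adder(p),   # (112.0, 9.0)
    "2^p": binary_adder(p),               # (76.0, 9.0)
    "modified": modified_adder(p),        # (132.0, 11.0)
    "2^p + 1": mod_2p_plus_1_adder(p),    # (118.0, 9.0)
}
```

In this model the $(2^p + 1)$ adder has the same delay expression as the $(2^p - 1)$ adder but a larger area coefficient on the $p \log_2 p$ term.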

Conclusions
This paper proposes the first scaler for the moduli set $\{2^{n+1} - 1, 2^n, 2^n - 1\}$. When compared with the most recent and efficient scalers of the traditional three-moduli set $\{2^n + 1, 2^n, 2^n - 1\}$, the proposed scaler is shown to have an area- and power-efficient structure. However, the scalers of the traditional moduli set $\{2^n + 1, 2^n, 2^n - 1\}$ are more time-efficient structures, at the expense of having the modulo $(2^n + 1)$ channel. Replacing the $(2^n + 1)$ channel by the $(2^{n+1} - 1)$ channel makes the proposed moduli set a faster alternative for RNS-based applications.