High Precision Multiplier for RNS {2^n − 1, 2^n, 2^n + 1}

Abstract: The Residue Number System (RNS) is a non-weighted number system. Benefiting from its inherent parallelism, RNS has been widely studied and used in Digital Signal Processing (DSP) systems and cryptography. However, since the dynamic range of an RNS is fixed by its moduli set, it is hard to solve the overflow problem, which is easily solved in the Two's Complement System (TCS) by expanding the bit-width. For multiplication in RNS, the traditional way to deal with overflow is to scale down the inputs so that the result falls in the dynamic range. However, this leads to a loss of precision. In this paper, we propose a high-precision RNS multiplier for the three-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$, which is the most widely used moduli set. The proposed multiplier effectively improves the calculation precision by adding several compensatory items to the result. The compensatory items can be obtained directly from the preceding scalers with little extra effort. To the best of our knowledge, we are the first to propose a high-precision RNS multiplier for the moduli set $\{2^n - 1, 2^n, 2^n + 1\}$. Simulation results show that the proposed RNS multiplier achieves almost the same calculation precision as the TCS multiplier with respect to Mean Square Error (MSE) and Signal-to-Noise Ratio (SNR), and outperforms the basic scaling RNS multiplier by about 2.6–3 times with respect to SNR.

Author Contributions: ... and J.H.; Investigation, S.M. and S.H.; Data curation, S.M. and S.H.; Writing (original draft preparation), S.M. and S.H.; Writing (review and editing), S.M. and S.H.; Visualization, S.M. and S.H.; Supervision, S.M. and S.H.; Project administration, S.M. and S.H.; Funding acquisition, S.M. and S.H.


Introduction
The Residue Number System (RNS) is a non-weighted, parallel number representation system that decomposes an integer into multiple independent residues through modular operations. The bit-width of each channel is therefore much smaller than that of the original integer. As a result, RNS-based systems have the potential to achieve high calculation speed and low complexity. RNS is well suited to processing large integers, which makes it extremely useful in cryptography [1], such as Elliptic Curve Cryptography (ECC) [2] and Lattice-based Cryptography (LBC) [3]. RNS has also been widely used in DSP units, such as digital Finite Impulse Response (FIR) filters [4,5], adaptive filters [6], 8-point [7,8] and 16-point [8] Discrete Cosine Transforms (DCT), and Discrete Fourier Transforms (DFT) [9]. However, since RNS arithmetic is defined on modular operations, some basic operations remain challenging, such as sign detection [10,11], magnitude comparison [12], residue-to-binary (R/B) conversion [13], and scaling [14,15], which limits the wide application of RNS.
In DSP applications, overflow in fixed-point representation is a common issue when the dynamic range is limited. It mostly happens in multiplication and addition operations, especially in applications with a cascaded architecture, such as the Fast Fourier Transform (FFT). For TCS, this issue can be easily addressed by expanding the bit-width of intermediate computation results and then converting them back to the original bit-width. This means that the precision of the input operands is not lost, and the computation accuracy is determined only by the last conversion step. Moreover, the bit-width expansion step is very simple. In short, overflow can be handled simply and accurately in TCS fixed-point calculation.
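The TCS strategy described above can be sketched in a few lines of Python. This is only an illustration: the Q1.(k-1) fractional format, the function name `tcs_fixed_point_multiply`, and the operand values are assumptions made for the example and do not come from the paper.

```python
def tcs_fixed_point_multiply(a, b, k):
    """Multiply two k-bit two's complement fractions (assumed Q1.(k-1) format).

    The product is first held at full width, so no input precision is lost;
    the only rounding happens in the final conversion back to k bits.
    """
    lo, hi = -(1 << (k - 1)), (1 << (k - 1)) - 1
    assert lo <= a <= hi and lo <= b <= hi, "operands must fit in k bits"
    wide = a * b              # exact intermediate product (fits in 2k bits)
    return wide >> (k - 1)    # single conversion step back to k bits (floor)


k = 8
half, three_quarters = 64, 96   # 0.5 and 0.75 scaled by 2**(k-1) = 128
result = tcs_fixed_point_multiply(half, three_quarters, k)
print(result, result / 2 ** (k - 1))   # 48 0.375
```

Python integers are unbounded, so the widening is implicit here; in hardware it corresponds to a 2k-bit intermediate register.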
However, since the dynamic range of an RNS is determined by its moduli set and dynamically expanding it is difficult, avoiding overflow is much more complicated than in TCS. There are usually three approaches to handling the overflow problem in RNS. Figure 1 gives the basic structure of these three approaches, using a three-moduli RNS as an example.
(1) The first approach is based on scaling, as shown in Figure 1a. The input operands are first scaled down to ensure that the product falls in the dynamic range of the RNS [16]. We call this multiplier a basic scaling RNS multiplier. Unfortunately, the scaling operation reduces the precision of the inputs, resulting in a loss of precision in the product.
(2) The second approach is based on base conversion, as shown in Figure 1b. The input operands are first converted to a new RNS with a larger dynamic range to avoid overflow, and the multiplication results are then scaled down to the original RNS. This approach is helpful in some applications, such as FIR filters. However, in DSP algorithms with cascaded multiplications or feedback structures, such as FFT computation and Infinite Impulse Response (IIR) filters, the overall dynamic range can vary significantly, and the overhead of base conversion becomes unacceptable.
(3) The last approach is based on base extension, as shown in Figure 1c. When the dynamic range is not sufficient, one or more bases are added to extend the dynamic range of the original RNS. The base extension operation is still too complicated to be acceptable. None of the above approaches achieves performance similar to that of TCS: the first loses accuracy, while the latter two require complex algorithms and large hardware resources.
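To make the overflow problem and the first approach concrete, the following minimal Python sketch shows a product wrapping around modulo $M$ and then the basic scaling workaround. The choice n = 4, the operand values, and the split of the total scaling factor $M$ into $m_3$ (for X) and $m_1 m_2$ (for Y) are assumptions for illustration; scaling is done here on plain integers rather than in the residue domain.

```python
# Illustrative parameters (assumptions): n = 4, so the moduli are 15, 16, 17.
n = 4
m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
M = m1 * m2 * m3   # dynamic range: 4080


def to_rns(x):
    """Forward conversion: integer -> residue triple."""
    return (x % m1, x % m2, x % m3)


def rns_mul(a, b):
    """Channel-wise modular multiplication."""
    return tuple((ai * bi) % mi for ai, bi, mi in zip(a, b, (m1, m2, m3)))


X, Y = 1000, 3000

# Direct RNS product: only (X * Y) mod M is representable, so the result wraps.
wrapped = rns_mul(to_rns(X), to_rns(Y))
assert wrapped == to_rns((X * Y) % M)

# Basic scaling: shrink the operands first so that the product fits in [0, M).
xs, ys = X // m3, Y // (m1 * m2)
basic = xs * ys        # 58 * 12 = 696
exact = X * Y / M      # about 735.29
print(basic, round(exact, 2))
```

The gap between 696 and 735.29 is the precision lost by discarding the remainders of the two scaling operations.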
In RNS multiplier research, previous work has mainly focused on the efficient implementation of multipliers for specific moduli. Chen [17] proposed an efficient modulo $2^n + 1$ multiplier. Muralidharan [18] proposed a high dynamic range modulo $2^n - 1$ multiplier. Zimmermann [19] proposed a joint implementation of the modulo $(2^n \pm 1)$ multiplier. Hiasat [20] proposed a generic multiplier for any modulus. Although these implementations can efficiently and accurately calculate the modular multiplication in each RNS channel, the precision loss caused by scaling before the modular multiplier is ignored. These designs did not consider the overflow problem caused by multipliers in specific DSP applications.
In this paper, we propose a high precision overflow-free RNS multiplier design method. The proposed RNS multiplier follows an idea similar to the way overflow is avoided in TCS in order to achieve high computation accuracy and low complexity. Throughout this paper, we use the three-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$, which is widely used in RNS, to illustrate our idea for RNS multiplier design. The proposed multiplier adds several compensation items to the result calculated from the scaled inputs in order to improve its precision. The compensation items can be obtained directly from the preceding RNS scalers with little extra effort. Figure 2 illustrates the concise structure of the proposed RNS multiplier. The rest of this paper is arranged as follows. In Section 2, we introduce the basic theory of RNS. In Section 3, we propose two joint scalers and two high precision RNS multipliers. In Section 4, we explore the structure of the proposed multipliers and scalers. In Section 5, we analyze the calculation performance of the proposed RNS multiplier and implement it as a Very Large Scale Integration (VLSI) circuit to explore its hardware performance. Finally, the paper is summarized in Section 6.

Introduction of RNS
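RNS is defined by a moduli set $\{m_1, m_2, \ldots, m_L\}$, where $m_i$ and $m_j$ ($i, j = 1, 2, \ldots, L$) are coprime when $i \neq j$. An integer $X$ can be represented as

$$X \rightarrow (x_1, x_2, \ldots, x_L), \qquad (1)$$

where $x_i$ is the residue of $X$ modulo $m_i$, denoted as $x_i = \langle X \rangle_{m_i}$. Let $M = \prod_{i=1}^{L} m_i$; then $M$ is called the dynamic range of the RNS, that is, $X \in [0, M - 1]$. According to the rules of modular operation [21], for operands $X$ and $Y$, we have

$$\langle X \circ Y \rangle_{m_i} = \langle x_i \circ y_i \rangle_{m_i}, \quad \circ \in \{+, -, \times\}. \qquad (2)$$

For two coprime moduli $m_1$ and $m_2$, the modular operation has the properties given in (3).

The Chinese Remainder Theorem (CRT) is one of the fundamental theorems of RNS and is commonly used in scaling, Residue to Binary (R/B) conversion, and so on. If an integer $X \in [0, M - 1]$, then

$$X = \left\langle \sum_{i=1}^{L} M_i \left\langle M_i^{-1} \right\rangle_{m_i} x_i \right\rangle_M, \qquad (4)$$

where $M_i = M/m_i$, and $\langle M_i^{-1} \rangle_{m_i}$ is the multiplicative inverse of $M_i$ modulo $m_i$, that is, $\langle M_i \cdot \langle M_i^{-1} \rangle_{m_i} \rangle_{m_i} = 1$.

In this paper, our proposed scalers and multipliers are dedicated to the RNS $\{2^n - 1, 2^n, 2^n + 1\}$. For ease of notation, we let $m_1 = 2^n - 1$, $m_2 = 2^n$, and $m_3 = 2^n + 1$, so that the residues are $x_1 = \langle X \rangle_{m_1}$, $x_2 = \langle X \rangle_{m_2}$, and $x_3 = \langle X \rangle_{m_3}$.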
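The definitions in this section (forward conversion, channel-wise arithmetic, and CRT reconstruction) can be sketched as follows. The function names `to_rns` and `from_rns` and the values n = 4, X = 123, Y = 31 are illustrative assumptions; `pow(a, -1, m)` and `math.prod` require Python 3.8 or later.

```python
from math import prod


def to_rns(x, moduli):
    """Forward conversion: x -> (x mod m_1, ..., x mod m_L)."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli):
    """CRT reconstruction: X = < sum_i M_i * <M_i^{-1}>_{m_i} * x_i >_M."""
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        inv = pow(M_i, -1, m_i)   # multiplicative inverse of M_i modulo m_i
        total += M_i * inv * x_i
    return total % M


n = 4
moduli = (2**n - 1, 2**n, 2**n + 1)   # (15, 16, 17), M = 4080
X, Y = 123, 31
x, y = to_rns(X, moduli), to_rns(Y, moduli)

# Addition and multiplication are carried out independently in each channel.
s = tuple((a + b) % m for a, b, m in zip(x, y, moduli))
p = tuple((a * b) % m for a, b, m in zip(x, y, moduli))

print(x)                     # (3, 11, 4)
print(from_rns(s, moduli))   # 154
print(from_rns(p, moduli))   # 3813, still below M = 4080
```

The product is recovered correctly only because 3813 < M; larger operands would wrap around, which is the overflow problem discussed in the Introduction.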
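Scaling is actually a division by a constant, and the divisor is called the scaling factor. The format of the moduli set and the scaling factor are the two main factors in the complexity of the scaler. For an integer $X$ and a scaling factor $K$, the scaling result is computed by

$$\left\lfloor \frac{X}{K} \right\rfloor, \qquad (5)$$

where $\lfloor \cdot \rfloor$ represents the floor operation. Unlike in TCS, the operand $X$ in RNS is represented by multiple residues, and the final scaling result should also be represented by multiple residues. For the RNS with moduli set $\{m_1, m_2, m_3\} = \{2^n - 1, 2^n, 2^n + 1\}$, we have

$$\left\langle M_1^{-1} \right\rangle_{m_1} = 2^{n-1}, \quad \left\langle M_2^{-1} \right\rangle_{m_2} = 2^n - 1, \quad \left\langle M_3^{-1} \right\rangle_{m_3} = 2^{n-1} + 1, \qquad (6)$$

in which $M_i = M/m_i$, $i = 1, 2, 3$. Substituting (6) into (4), we get

$$X = 2^{n-1} m_2 m_3 x_1 + (2^n - 1) m_1 m_3 x_2 + (2^{n-1} + 1) m_1 m_2 x_3 - I \cdot M, \qquad (7)$$

where $I$ is an integer that ensures $0 \le X \le M - 1$.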
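As a sanity check on this step, the sketch below verifies closed-form multiplicative inverses for the moduli set and uses them in a CRT expansion. The closed forms $2^{n-1}$, $2^n - 1$, and $2^{n-1} + 1$ are values worked out for this example (they follow from $2^n \equiv 1$, $0$, $-1$ in the three channels); the function names are hypothetical, and the expansion shown is one valid form of the substitution, not necessarily written the same way as in the paper.

```python
def inverses_hold(n):
    """Check that the closed-form inverses satisfy M_i * inv_i = 1 (mod m_i)."""
    m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
    M = m1 * m2 * m3
    inv1, inv2, inv3 = 2**(n - 1), 2**n - 1, 2**(n - 1) + 1
    return ((M // m1) * inv1 % m1 == 1
            and (M // m2) * inv2 % m2 == 1
            and (M // m3) * inv3 % m3 == 1)


def crt_closed_form(x1, x2, x3, n):
    """CRT for {2^n - 1, 2^n, 2^n + 1} using the closed-form inverses."""
    m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
    M = m1 * m2 * m3
    total = (2**(n - 1) * m2 * m3 * x1
             + (2**n - 1) * m1 * m3 * x2
             + (2**(n - 1) + 1) * m1 * m2 * x3)
    return total % M   # the reduction plays the role of subtracting I * M


def scale(X, K):
    """Scaling by a constant K is a floor division."""
    return X // K


n = 4
X = 2021
print(crt_closed_form(X % 15, X % 16, X % 17, n))   # 2021
print(scale(X, 17))                                 # 118
```

The inverses only exist in this form for n >= 2, since n = 1 gives the degenerate modulus m_1 = 1.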

Design of High Precision RNS Multiplier
The proposed RNS multiplier needs the scaling results for the scaling factors $m_1$, $m_3$, $m_1 m_2$, and $m_2 m_3$. In general, four scalers are needed to implement them. To further reduce the hardware complexity, we propose two joint scalers, each of which produces the scaling results for two scaling factors at the same time: one for the scaling factors $m_1$ and $m_1 m_2$, and the other for the scaling factors $m_3$ and $m_2 m_3$. As shown in the following derivation, the scaling results for the scaling factors $m_1 m_2$ and $m_2 m_3$ can be obtained from intermediate results of the $m_1$ scaler and the $m_3$ scaler, respectively. As such, combining them into two joint scalers greatly reduces the hardware consumption.
According to (7), we derive four calculation methods for the four scaling factors $m_3$, $m_1$, $m_2 m_3$, and $m_1 m_2$, respectively.

Scaling Factor K = m_3
According to (7), the scaling operation can be represented as (8). In RNS, all residues are integers and $x_3 < m_3$, so $x_3/m_3 < 1$. Thus, we can round down (8) and obtain (9). Mapping (9) to the residue channels $m_1$ and $m_2$ and simplifying, we obtain the residues of $\lfloor X/m_3 \rfloor$ in these two channels. By using CRT, we can uniquely represent $\lfloor X/m_3 \rfloor$ with the residues of channels $m_1$ and $m_2$. However, in most DSP-oriented applications, the residue of channel $m_3$ is also required to match the original three-moduli set; it is then obtained as in (12).

Scaling Factor K = m_1
Similarly to the factor $K = m_3$, scaling for $K = m_1$ can be represented as (13), from which the values in the residue channels $m_2$ and $m_3$ follow. For the precise representation of $\lfloor X/m_1 \rfloor$, the residues in channels $m_2$ and $m_3$ are enough. However, in most applications, the residue in channel $m_1$ is still needed in the following processing. Thus, by using CRT, we obtain (16). Because $0 \le 2^n \langle X_{1,2} - X_{1,3} \rangle_{m_3} + X_{1,2} < m_2 m_3$, where $X_{1,j}$ denotes the residue of $\lfloor X/m_1 \rfloor$ in channel $m_j$, (16) reduces to (17).

Scaling Factor K = m_2 m_3
Low proposed two scaling structures for the scaling factor $K = m_2 m_3$ in [22,23], respectively. However, they have a unit error when $x_1 < x_2$. In this paper, we propose a scaling structure that obtains the scaling result accurately. According to (7), we obtain (18), where $x_2$ and $x_3$ are non-negative integers defined on the moduli $m_2$ and $m_3$.
When $x_2 - x_3 < 0$, according to (18), we obtain (19). Thus, (19) can be converted to (20). When $x_2 - x_3 \ge 0$, $x_2 - x_3$ has the same dynamic range as $x_2$, so the modulo $m_2$ operations in (20) do not change the calculation result. These two situations can therefore be combined into (20).
From (20), we can see that the scaling result for the scaling factor $m_2 m_3$ can be calculated from the result of the scaler with scaling factor $m_3$. Because $X \in [0, M - 1]$, we have $0 \le \lfloor X/(2^n(2^n + 1)) \rfloor < m_1$, so the residues in all channels of the moduli set $\{m_1, m_2, m_3\}$ are equal to $\lfloor X/(2^n(2^n + 1)) \rfloor$.

Scaling Factor K = m_1 m_2
The scaler with scaling factor $m_1 m_2$ has a structure similar to that of $m_2 m_3$. From (7), we obtain (21). Under the corresponding sign condition on $x_1 - x_2$, (21) leads to (22), which can then be converted to (23). As with the scaling factor $m_2 m_3$, the situation $x_1 - x_2 > 0$ can be combined into (23).
From (23), we can see that the scaling result for the scaling factor $m_1 m_2$ can also be calculated from the result of the scaler with scaling factor $m_1$. Because $0 \le X < M$, we have $0 \le \lfloor X/(2^n(2^n - 1)) \rfloor < m_3$, so the value in channel $m_3$ is sufficient to represent it. If the result is needed in channels $m_1$ and $m_2$ as well, it can be obtained simply by modular operations.
In summary, the calculation methods of these four different scaling factors are shown in Table 1.
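The four scaling results can also be cross-checked in software. The sketch below is an independent derivation in the residue domain, starting from $X = m_3 Z + x_3$ and $X = m_1 U + x_1$; it is not claimed to match the paper's equations term by term, and the function name `joint_scalers` and the result labels are hypothetical. It does illustrate the joint-scaler idea that the $m_2 m_3$ and $m_1 m_2$ results appear as intermediate values of the $m_3$ and $m_1$ scalers.

```python
def joint_scalers(x1, x2, x3, n):
    """Residue-domain scaling of X = (x1, x2, x3) for {2^n - 1, 2^n, 2^n + 1}."""
    m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
    half1 = 2**(n - 1)        # inverse of 2 modulo m1
    half3 = 2**(n - 1) + 1    # inverse of 2 modulo m3

    # K = m3: X = m3*Z + x3, with m3 = 2 (mod m1) and m3 = 1 (mod m2).
    z1 = half1 * (x1 - x3) % m1
    z2 = (x2 - x3) % m2
    # K = m2*m3: Z = m2*W + z2 with m2 = 1 (mod m1) and W < m1.
    w = (z1 - z2) % m1
    # Remaining channel of Z = z2 + 2^n * W, using 2^n = -1 (mod m3).
    z3 = (z2 - w) % m3

    # K = m1: X = m1*U + x1, with m1 = -1 (mod m2) and m1 = -2 (mod m3).
    u2 = (x1 - x2) % m2
    u3 = half3 * (x1 - x3) % m3
    # K = m1*m2: U = m2*V + u2 with m2 = -1 (mod m3) and V < m3.
    v = (u2 - u3) % m3
    # Remaining channel of U = u2 + 2^n * V, using 2^n = 1 (mod m1).
    u1 = (u2 + v) % m1

    return {"X/m3": (z1, z2, z3), "X/(m2*m3)": w,
            "X/m1": (u1, u2, u3), "X/(m1*m2)": v}


n = 4
X = 1000
out = joint_scalers(X % 15, X % 16, X % 17, n)
print(out["X/m3"], out["X/(m2*m3)"])   # residues of 58, and 3
print(out["X/m1"], out["X/(m1*m2)"])   # residues of 66, and 4
```

Because every intermediate is reduced modulo a single channel modulus, each line corresponds to one modular adder or a constant multiplication by a power of two in hardware.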

High Precision RNS Multipliers for Moduli Set {2^n − 1, 2^n, 2^n + 1}
We propose two high precision RNS multipliers based on CRT. For multiplicands $X$ and $Y$ defined on the moduli set $\{m_1, m_2, m_3\}$, the product is scaled by a fixed scaling factor $M = m_1 m_2 m_3$. Since $0 \le X \cdot Y/M < M$, the result is guaranteed to remain in the dynamic range of the RNS. As shown in Figure 2, we add several compensatory items to the result of Figure 1a to obtain a high precision multiplication result. The following derivation shows how the compensatory items improve the precision. As mentioned before, the added compensatory items can be obtained directly from the proposed scalers, so the proposed RNS multiplier needs only a little extra hardware in comparison with the basic scaling multiplier.
The derivation process is as follows. According to (21), we have (24). By using (8), (21), and (24), we obtain (25). According to (17), we have (26). Thus, the last added item in (25) can be expanded as (27), and (25) can be rewritten as (28).

The division operations in (28) have divisors that are not powers of 2, which are difficult to implement. To reduce hardware complexity, we approximate the divisors in terms such as $x_3 y_1/(m_1 m_2 m_3)$ and treat the resulting discrepancies as approximation errors, since these three items are smaller than 1. After these simplifications, (28) can be approximately expressed as (29), where the auxiliary term is given by (30). The first two added items in (29) involve only integer multiplications, and all their multiplicands can be obtained directly from the scalers. Moreover, although (30) contains fractional parts, all its divisors are powers of 2, so it can be implemented simply by shifting. The total approximation error of (29) is given by (31).

This kind of multiplier can also be realized by another method, whose derivation is similar to the one above. According to (13) and (17), we obtain (32). In the same way, (32) can be approximated as (33), where the auxiliary term is given by (34). The error of (33) is given by (35).

Although the multipliers implemented by (29) and (33) have the same structure, they use different scalers, which leads to different hardware complexity. It is worth noting that the hardware complexity can also be reduced by reducing the number of added items in the multipliers, at the cost of some precision. For example, since the compensatory items in (30) and (34) are relatively small but consume a lot of hardware resources, we can abandon them and obtain a simplified RNS multiplier. This provides a trade-off between calculation precision and hardware consumption. From now on, we call the proposed RNS multiplier implemented according to (29) or (33) a full compensatory RNS multiplier, and the RNS multiplier obtained by abandoning (30) or (34) a partial compensatory RNS multiplier.
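The role of the compensation items can be illustrated with an exact algebraic split of $XY/M$. Writing $X = m_3 X_s + x_3$ and $Y = m_1 m_2 Y_s + r$ gives $XY/M = X_s Y_s + X_s r/(m_1 m_2) + x_3 Y_s/m_3 + x_3 r/M$. The sketch below evaluates these four terms with exact rational arithmetic. It is an assumption-laden illustration of the idea only: it is not the paper's approximate formula (29), it keeps the exact divisors instead of replacing them by powers of 2, and the function name `split_product` and the operand values are hypothetical.

```python
from fractions import Fraction


def split_product(X, Y, n):
    """Split X*Y/M into the basic scaled product and three remainder terms."""
    m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
    M = m1 * m2 * m3
    xs, x3 = divmod(X, m3)          # scaler output for X and its remainder
    ys, r = divmod(Y, m1 * m2)      # scaler output for Y and its remainder
    basic = Fraction(xs * ys)       # what the basic scaling multiplier keeps
    comp_a = Fraction(xs * r, m1 * m2)
    comp_b = Fraction(x3 * ys, m3)
    comp_c = Fraction(x3 * r, M)    # always smaller than 1
    return basic, comp_a, comp_b, comp_c


n = 4
X, Y = 1000, 3000
M = (2**n - 1) * 2**n * (2**n + 1)
basic, a, b, c = split_product(X, Y, n)

print(float(basic))               # 696.0
print(float(basic + a + b))       # about 734.88
print(float(basic + a + b + c))   # about 735.29, equal to X*Y/M
```

In this example the basic scaled product is off by about 39, while adding the two larger remainder terms brings the error below 1, which mirrors the trade-off between partial and full compensation described above.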
The following numerical example is used to illustrate our proposed RNS multiplication algorithm. Letting X = 64 and Y = 460, Table 2 shows the detailed calculation traces of the RNS multiplication operation step by step based on (29).

Hardware Structures
For the RNS with moduli set $\{2^n - 1, 2^n, 2^n + 1\}$, some numerical operations can be replaced by bit-wise logic operations to achieve an efficient hardware structure. Let the binary representation of an $n$-bit integer $x$ be $x_{n-1} x_{n-2} \ldots x_0$ with $0 \le x < 2^n - 1$. According to the properties of modulo operations and the rules of Boolean logic [14], we can easily get

$$\langle -x \rangle_{2^n - 1} = \langle \bar{x} \rangle_{2^n - 1}, \qquad \langle 2^r x \rangle_{2^n - 1} = \mathrm{CLS}(x, r), \qquad (36)$$

where $\bar{x}$ represents the bit-wise inversion of $x$, and $\mathrm{CLS}(x, r)$ indicates that $x$ is circularly shifted to the left by $r$ bits. These two operations become slightly more complicated for modulo $2^n + 1$. For an $(n + 1)$-bit integer with binary representation $x = x_n x_{n-1} \ldots x_0$ and $0 \le x < 2^n + 1$, the corresponding relations are given by (37) and (38). The hardware structures of the scalers and multipliers in this paper are based on these operations.
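The two modulo $2^n - 1$ properties can be checked quickly in software. In the sketch below, the function names `neg_mod` and `cls` are illustrative, and the interpretation of CLS as a circular left shift on an n-bit word is the standard one assumed here.

```python
def neg_mod(x, n):
    """Negation modulo 2^n - 1 via bit-wise inversion of an n-bit word."""
    mask = (1 << n) - 1
    # The final reduction maps the all-ones word (obtained for x = 0) to 0.
    return (~x & mask) % mask


def cls(x, r, n):
    """Circular left shift of an n-bit word by r positions."""
    mask = (1 << n) - 1
    r %= n
    return ((x << r) | (x >> (n - r))) & mask


n = 5
m = 2**n - 1   # 31
x = 0b01101    # 13
print(neg_mod(x, n), (-x) % m)        # 18 18
print(cls(x, 2, n), (x * 2**2) % m)   # 21 21
```

Both operations reduce to wiring and inverters in hardware, which is why the modulus $2^n - 1$ is attractive for scaler and multiplier design.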

Multiplier Hardware Structure
The implementation block diagrams of the proposed compensatory scaling RNS multipliers are shown in Figure 3.
In Figure 3, the multiplications in the solid box represent the basic items of the proposed RNS multiplier, while the multiplications in the dotted box represent the compensation items. The scaler blocks in Figure 3 are all the proposed joint scalers, whose implementation details are shown in Figure 4. As shown in Figure 3, the final result is calculated from products of the scaler outputs without any other processing. Thus, the compensation costs only a little extra hardware.

Scaler Hardware Structure
We designed the scaling structures of the two joint scalers proposed in this paper. The implementation block diagrams are shown in Figure 4.
In Figure 4, the proposed scaling structures for the four scaling factors are implemented using the logic operations in (36) and the modular adders proposed in [19]. The modules in the two dotted-line boxes can be used separately as scaler implementations for the scaling factors $m_2 m_3$ and $m_1 m_2$, respectively. The overall implementation in Figure 4a can be used as a joint implementation for the scaling factors $m_2 m_3$ and $m_3$, while Figure 4b can be used for the scaling factors $m_1 m_2$ and $m_1$.

Figure 4. Joint scaling structures: (a) scaling factors $2^n + 1$ and $2^n(2^n + 1)$; (b) scaling factors $2^n - 1$ and $2^n(2^n - 1)$.

Figure 4a can be further optimized, since the cascaded modulo $m_1$ adders can be replaced by a Carry Save Adder (CSA) and a modulo $m_1$ adder. This optimized structure is shown in Figure 5. It is worth noting that we use this optimized structure in the implementation of Figure 4a in addition to scaling factor $m_2 m_3$, in order to get a better hardware performance.

Calculation Performance of the Proposed Multiplier
In order to verify the calculation performance of the proposed compensatory scaling multiplier, we first analyze its calculation precision in comparison with the TCS multiplier and the basic scaling RNS multiplier (see Figure 1a). Two metrics, the mean square error (MSE) and the signal-to-noise ratio (SNR), are used to evaluate the calculation precision of the three multipliers. The MSE is calculated by (39), while the SNR is defined by (40). In (39) and (40), $X_{\mathrm{real},i}$ represents the real result of the $i$th calculation, $X_{\mathrm{result},i}$ represents the $i$th result calculated by each of the three multipliers, and $X_{\mathrm{real},i,\mathrm{norm}}$ and $X_{\mathrm{result},i,\mathrm{norm}}$ represent the unit normalization of $X_{\mathrm{real},i}$ and $X_{\mathrm{result},i}$, respectively. For each $i$, the inputs of the multiplier are randomly selected from the dynamic range of the RNS. $N$ is the total number of samples calculated for each $n$; in this paper, $N$ = 10,000. The dynamic range of the RNS is varied by changing $n$ from 5 to 12. For comparison, the bit-width $k$ of the TCS is selected to guarantee that the dynamic range of the TCS is slightly bigger than that of the RNS. For example, for $n = 5$, the moduli set of the RNS is $\{31, 32, 33\}$; then $M$ = 32,736, and we choose $k = \lfloor \log_2 M + 1 \rfloor = 15$ bits for the TCS. The simulation results are shown in Figures 6 and 7.

From Figure 6, for all $n$, the proposed RNS multiplier achieves almost the same MSE as the TCS multiplier, while the basic scaling RNS multiplier suffers a relatively large MSE when $n < 8$. This means that the proposed RNS multiplier has a better calculation precision when the bit-width is not large enough.
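The bit-width rule and the two metrics can be sketched as follows. The bit-width formula $k = \lfloor \log_2 M + 1 \rfloor$ is taken from the text; the MSE and SNR functions use common textbook definitions (mean of squared errors, and signal power over error power in dB) as assumptions, since the exact normalized forms used in the paper are not reproduced here. The function names are hypothetical.

```python
import math


def tcs_bit_width(n):
    """Bit-width k = floor(log2(M) + 1) for M = (2^n - 1) * 2^n * (2^n + 1)."""
    M = (2**n - 1) * 2**n * (2**n + 1)
    return math.floor(math.log2(M) + 1)


def mse(real, result):
    """Mean square error between reference and computed results (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(real, result)) / len(real)


def snr_db(real, result):
    """Signal-to-noise ratio in dB (assumed form); requires a non-zero error."""
    signal = sum(a * a for a in real)
    noise = sum((a - b) ** 2 for a, b in zip(real, result))
    return 10 * math.log10(signal / noise)


for n in range(5, 13):
    print(n, tcs_bit_width(n))   # k = 3n for every n in this range

print(mse([1, 2, 3], [1, 2, 5]))          # 1.333...
print(round(snr_db([3, 4], [3, 5]), 2))   # 13.98
```

Since $M = 2^{3n} - 2^n$ lies just below $2^{3n}$, the rule always yields $k = 3n$ here, which matches the 15 bits quoted for $n = 5$.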
As shown in Figure 7, the proposed RNS multiplier outperforms the basic scaling RNS multiplier by about 50-140 dB as $n$ increases from 5 to 12, which reveals that the proposed RNS multiplier can greatly improve the calculation precision of multiplication in RNS while avoiding overflow. In addition, the proposed multiplier achieves an SNR curve similar to that of the TCS multiplier, with an SNR loss of about 5 dB. This is because the dynamic range of the TCS is chosen to be slightly larger than that of the RNS. Moreover, as mentioned before, the hardware consumption of the proposed RNS multiplier can be reduced by abandoning the third added item in (29) and (33). Although this leads to a loss of calculation precision, the result shown in Figure 7 suggests that our partial compensatory RNS multiplier still outperforms the basic scaling one.

Synthesis Results of RNS Multiplier Based on Design Compiler
In order to evaluate the hardware performance of the proposed RNS multipliers, we described them in VHDL and synthesized them with Synopsys Design Compiler (DC) using the SMIC 65 nm process. The partial compensatory RNS multiplier with only one compensatory item was also considered. We synthesized the three multipliers under a clock timing constraint that uses the smallest achievable clock period. The results are shown in Table 3.
In Table 3, Structure 1 (S1) means the structure in (29) and its simplified version, and Structure 2 (S2) means the structure in (33) and its simplified version. Although the structures of the two multipliers are symmetrical, they use different numbers of modulo $m_1$ and modulo $m_3$ adders. Moreover, two cascaded modulo $m_1$ adders can benefit from the optimal hardware structure in Figure 5, so the hardware consumption of S1 is slightly larger than that of S2 for each $n$. BS means the basic scaling RNS multiplier, PC means the partial compensatory RNS multiplier, and FC means the full compensatory RNS multiplier. We use the area-time product (AT) of the basic scaling RNS multiplier with Structure 1 as the basis for calculating the AT ratio for each $n$. From Table 3, for each $n$, although the proposed multiplier needs about 2-2.3 times the hardware of the basic scaling multiplier, it achieves about 2.6-3 times the SNR of the basic scaling multiplier.

To further evaluate the hardware efficiency of the proposed RNS multiplier, we calculate the SNR/AT of the three multipliers. The results are shown in Figure 8. We can see that the proposed partial compensatory RNS multiplier has the biggest SNR/AT for each $n$, which means it has the highest hardware efficiency. In addition, the proposed full compensatory RNS multiplier has a bigger SNR/AT than the basic scaling one. This indicates that our proposed RNS multiplier still outperforms the basic scaling RNS multiplier in hardware efficiency.

Hiasat proposed a few efficient RNS scalers for the moduli set $\{2^n - 1, 2^{n+p}, 2^n + 1\}$ with scaling factors $2^n$ and $2^{n+p}$ [15]. This work is a recent development in RNS scaler design, and its hardware efficiency is clearly better than that of our scalers. As far as we know, our work is the first to discuss the overall accuracy of the RNS multiplier rather than that of the modular multiplier. In our design, we use the results scaled by $2^n - 1$, $2^n(2^n - 1)$, $2^n + 1$, and $2^n(2^n + 1)$, instead of $2^n$.
Our focus in this work is not on optimizing the scaler. Any optimized scaler can be used in the proposed multiplier as long as it meets the requirements of scaling factor.

Conclusions
This work presented a high precision overflow-free multiplier design for the three-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$. The proposed RNS multiplier avoids overflow based on the scaling approach and achieves high precision by adding several compensation items that compensate for the precision loss caused by scaling. Our RNS multiplier achieves almost the same calculation precision as the TCS multiplier and outperforms the basic scaling RNS multiplier by about 2.6-3 times in SNR. In addition, the compensation items can be flexibly selected to make a trade-off between hardware resources and calculation precision. Synthesis results suggest that both our full compensatory RNS multiplier and our partial compensatory RNS multiplier outperform the traditional basic scaling RNS multiplier in hardware efficiency.