# Zero-Aware Low-Precision RNS Scaling Scheme

## Abstract

… 2^{k} modulo as the scaling factor, which results in a high-precision output with a high area and delay. Therefore, low-precision scaling based on multi-moduli scaling factors should be used to improve performance. However, low-precision scaling for numbers less than the scale factor results in zero output, which makes the subsequent operation result faulty. This paper first presents the formulation and hardware architecture of low-precision RNS scaling for four-moduli sets using new Chinese remainder theorem 2 (New CRT-II) based on a two-moduli scaling factor. Next, the low-precision scaler circuits are reused to achieve a high-precision scaler with the minimum overhead. Therefore, the proposed scaler can detect the zero output after low-precision scaling and then transform low-precision scaled residues to high precision to prevent zero output when the input number is not zero.

## 1. Introduction

… {2^{n} − 1, 2^{n}, 2^{n} + 1} [7,8,9]. The authors of [7,8] considered the modulo 2^{n} as the scaling factor. Using 2^{n} as the scaling factor resulted in simplified scalers with high-precision output. However, using only one modulo as the scaling factor is mostly applicable for addition operations, since it cannot drastically reduce the size of the numbers to prevent multiplication overflow. Due to this, the authors of [9] proposed two-moduli scaling based on 2^{n}(2^{n} + 1) as the scaling factor, which led to a low-precision output. Although this scaling factor can significantly reduce the size of the operands, the limited 3n-bit dynamic range of the three-moduli set {2^{n} − 1, 2^{n}, 2^{n} + 1} is not suitable for two-moduli scaling factors because in this three-moduli RNS system, the values of most numbers are less than the scaling factor (i.e., 2^{n}(2^{n} + 1)), which results in a zero output for the scaler, consequently making the next operation faulty. This is a significant problem which indicates the importance of a zero-aware scaling mechanism, which is not covered by previous research.
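To make the zero-output problem concrete, the following minimal Python sketch scales integers by the two-moduli factor 2^{n}(2^{n} + 1) for the three-moduli set and shows that an input below the factor collapses to the all-zero residue vector. The helper names (`to_rns`, `scale_low_precision`) and the choice n = 3 are illustrative assumptions, and the division is done on the weighted value for clarity rather than in the residue domain as a hardware scaler would do.

```python
def to_rns(x, moduli):
    """Return the residue representation of the integer x."""
    return tuple(x % m for m in moduli)


def scale_low_precision(x, n):
    """Scale x by the two-moduli factor 2^n (2^n + 1) and return RNS residues.

    Illustration only: the integer division is done on the weighted value.
    """
    moduli = (2**n - 1, 2**n, 2**n + 1)
    factor = 2**n * (2**n + 1)
    return to_rns(x // factor, moduli)


n = 3                                  # moduli (7, 8, 9), scaling factor 72
small = scale_low_precision(50, n)     # 50 is below the scaling factor
large = scale_low_precision(300, n)    # 300 is above the scaling factor
print(small, large)
```

Any later multiplication that uses `small` as an operand would also yield zero, which is the fault the zero-aware mechanism is meant to prevent.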

… {2^{k}, 2^{n} − 1, 2^{n} + 1, 2^{n + 1} − 1} [10], {2^{n} − 1, 2^{n}, 2^{n} + 1, 2^{2n + 1} − 1} [11], {2^{n} − 1, 2^{n} + 1, 2^{2n}, 2^{2n} + 1} [12], {2^{n} − 1, 2^{n} + 1, 2^{2n}, 2^{2n + 1} − 1} [13] and {2^{2n + p}, 2^{n} − 1, 2^{n} + 1, 2^{n} − 2^{(n + 1)/2} + 1, 2^{n} + 2^{(n + 1)/2} + 1} [14], should be used. However, only a limited number of works consider scaling for four-moduli sets. The authors of [15] designed a scaler based on a two-level architecture with the single-modulo scaling factor 2^{n + k}. The first level of this scaler performs scaling based on the three-moduli set {2^{n} − 1, 2^{n + x}, 2^{n} + 1}, where 0 ≤ x ≤ n, and then the second level computes the final four-moduli scaling using the composite set {2^{n + k}(2^{2n} − 1), m_{4}} [15]. This two-level architecture has high hardware requirements because it uses modular adders multiple times. Furthermore, scaling by the 2^{k} modulo is not sufficient for large dynamic range four-moduli sets to avoid overflow. In other words, the regular modulo 2^{n} scaling of the numbers based on the three-moduli set {2^{n} − 1, 2^{n}, 2^{n} + 1} is not equivalent to modulo 2^{n} scaling in the four-moduli set {2^{n} − 1, 2^{n}, 2^{n} + 1, 2^{n + 1} − 1}, since the dynamic ranges of these moduli sets are 3n- and (4n + 1)-bit, respectively. Therefore, two-moduli scaling must be used to prevent multiplication overflow for large dynamic range RNS systems.
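The 3n-bit and (4n + 1)-bit dynamic ranges quoted above can be checked numerically. The sketch below is only an illustration: the function name `dynamic_range_bits` is hypothetical, and the dynamic range is taken as the number of bits needed to represent the largest value M − 1, where M is the product of the moduli.

```python
from math import prod


def dynamic_range_bits(moduli):
    """Number of bits needed for the largest representable value M - 1."""
    return (prod(moduli) - 1).bit_length()


def three_moduli_set(n):
    return (2**n - 1, 2**n, 2**n + 1)


def four_moduli_set(n):
    return (2**n - 1, 2**n, 2**n + 1, 2**(n + 1) - 1)


n = 4
bits3 = dynamic_range_bits(three_moduli_set(n))   # expected to equal 3n
bits4 = dynamic_range_bits(four_moduli_set(n))    # expected to equal 4n + 1
print(bits3, bits4)
```

Scaling both sets by the same 2^{n} therefore leaves about n + 1 more bits in the four-moduli case, which is why a larger scaling factor is needed there.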

… {2^{n} − 1, 2^{n} + 1, 2^{2n}, 2^{2n} + 1} and {2^{n} − 1, 2^{n} + 1, 2^{2n}, 2^{2n + 1} − 1} is presented, and its performance is compared with the conventional method.

## 2. Low-Precision Scaling with Two-Moduli Scaling Factor: Mathematical Formulation

#### 2.1. Scaling Concept and CRT-II

… {m_{1}, m_{2}, m_{3}, m_{4}}:

… m_{1} is one of the moduli. Aside from that, also consider

… (x_{1}, x_{2}, x_{3}, x_{4}), which can be converted into its corresponding weighted number X using the New CRT-II conversion formulas for the generic four-moduli set {m_{1}, m_{2}, m_{3}, m_{4}} as follows [11]:

#### 2.2. General Formulations

… m_{1}m_{2}). Therefore, scaling of X by m_{1}m_{2} can be performed by considering k = m_{1}m_{2} and substituting Equation (6) into Equation (4) as follows:

… x_{1} is a residue in modulo m_{1} and the maximum value of H in Equation (16) is m_{2} − 1. Therefore, the maximum value of Z in Equation (13) can be computed as follows:

… Z_{Max} by m_{1}m_{2} is zero. Therefore, by considering this point and taking into account that T is an integer number, Equation (18) can be simplified as follows:

… m_{1}m_{2} is reduced to T, and the full reverse conversion (i.e., full computing of Equation (6)) is not needed.
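The claim that scaling by m_{1}m_{2} reduces to T can be checked exhaustively for a small moduli set. The sketch below assumes the standard New CRT-II form from [11], namely X = Z + m_{1}m_{2}T with Z = x_{1} + m_{1}H, Y = x_{3} + m_{3}P, H = |k_{2}(x_{2} − x_{1})|_{m_{2}}, P = |k_{3}(x_{4} − x_{3})|_{m_{4}} and T = |k_{1}(Y − Z)|_{m_{3}m_{4}}. These definitions are reconstructed from the surrounding text rather than quoted from the numbered equations, so they should be read as assumptions, and the function names are hypothetical.

```python
def crt2_terms(residues, moduli):
    """Return (H, P, Z, T) of the assumed New CRT-II decomposition."""
    x1, x2, x3, x4 = residues
    m1, m2, m3, m4 = moduli
    k1 = pow(m1 * m2, -1, m3 * m4)   # inverse of m1*m2 modulo m3*m4
    k2 = pow(m1, -1, m2)             # inverse of m1 modulo m2
    k3 = pow(m3, -1, m4)             # inverse of m3 modulo m4
    H = (k2 * (x2 - x1)) % m2
    P = (k3 * (x4 - x3)) % m4
    Z = x1 + m1 * H
    Y = x3 + m3 * P
    T = (k1 * (Y - Z)) % (m3 * m4)
    return H, P, Z, T


def scale_two_moduli(residues, moduli):
    """Low-precision scaling by m1*m2: the residues of T."""
    T = crt2_terms(residues, moduli)[3]
    return tuple(T % m for m in moduli)


moduli = (17, 16, 5, 3)              # n = 2 instance of {2^2n + 1, 2^2n, 2^n + 1, 2^n - 1}
X = 1000
scaled = scale_two_moduli(tuple(X % m for m in moduli), moduli)
print(scaled)
```

Only T is ever formed; the full weighted value X is never reconstructed, which is the saving the text refers to.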

… {m_{1}, m_{2}, m_{3}, m_{4}}) but with the aim of reusing the two-moduli scaler formulas to reduce the overhead. Hence, considering k = m_{1} and the main CRT-II formula of Equation (6) in Equation (4) results in

… x_{1} is less than m_{1}, and H and T are integer numbers, Equation (23) can be simplified as follows:

… _{i} are the two-moduli scaling residues. Therefore, we have

#### 2.3. Case Study: Moduli Set {2^{2n} + 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1}

… {m_{1}, m_{2}, m_{3}, m_{4}} = {2^{2n} + 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1}. According to Equation (20), we must compute T in Equation (15), and then its residues are the low-precision scaling residues. First, the following lemma computes the required multiplicative inverses.

**Lemma 1.** … k_{1} = 2^{2n − 1}, k_{2} = 1 and k_{3} = 2^{n − 1}.

**Proof of Lemma 1.**
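As a numerical check (not a proof) of the values stated in Lemma 1, the sketch below computes the three multiplicative inverses directly for several n. It assumes the usual New CRT-II definitions k_{1} = |(m_{1}m_{2})^{−1}|_{m_{3}m_{4}}, k_{2} = |m_{1}^{−1}|_{m_{2}} and k_{3} = |m_{3}^{−1}|_{m_{4}}, which are not spelled out in the text above, and the function name is hypothetical.

```python
def lemma1_inverses(n):
    """Multiplicative inverses for {2^2n + 1, 2^2n, 2^n + 1, 2^n - 1}.

    The definitions of k1, k2, k3 follow the standard New CRT-II form
    and are an assumption of this sketch.
    """
    m1, m2, m3, m4 = 2**(2 * n) + 1, 2**(2 * n), 2**n + 1, 2**n - 1
    k1 = pow(m1 * m2, -1, m3 * m4)
    k2 = pow(m1, -1, m2)
    k3 = pow(m3, -1, m4)
    return k1, k2, k3


for n in range(2, 9):
    k1, k2, k3 = lemma1_inverses(n)
    # Compare against the closed forms stated in Lemma 1.
    print(n, k1 == 2**(2 * n - 1), k2 == 1, k3 == 2**(n - 1))
```

The three-argument `pow` with exponent −1 requires Python 3.8 or later.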

**Property 1.** … v_{i} if v_{i} is represented as a k-bit binary number [11].

**Property 2.** … v_{i} (i.e., $\overline{{v}_{i}}$) if v_{i} is represented as a k-bit binary number [11].

**Property 3.** … v_{i} is represented as a k-bit binary number [17].

**Property 4.** … v_{i} is represented as a k-bit binary number [17].
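As an illustration of the bit-level identities that properties of this kind express, the sketch below checks two well-known facts for modulo 2^{k} − 1 arithmetic on k-bit operands: multiplying by 2^{p} is a p-bit circular left shift, and negation is the one's complement. It is an assumption that these correspond to Properties 1 and 2; the function names are illustrative only.

```python
def rotl(v, p, k):
    """Circular left shift of the k-bit value v by p positions."""
    p %= k
    mask = (1 << k) - 1
    return ((v << p) | (v >> (k - p))) & mask


def ones_complement(v, k):
    """Bitwise complement of v restricted to k bits."""
    return ~v & ((1 << k) - 1)


k = 5
m = (1 << k) - 1            # modulus 2^k - 1 = 31
v, p = 19, 3
shifted = rotl(v, p, k)     # candidate for (v * 2^p) mod m
negated = ones_complement(v, k)   # candidate for (-v) mod m
print(shifted, (v << p) % m, negated, (-v) % m)
```

In hardware both operations are wiring or inverters only, which is why such identities remove multipliers from the scaler datapath.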

… {2^{2n} + 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1}, x_{1} and x_{2} are (2n + 1)- and 2n-bit numbers, respectively. Therefore, Equation (33) can be simplified using Property 3 as follows:

… x_{i,j} denotes the j-th bit of the residue x_{i}, and x_{3} and x_{4} are (n + 1)- and n-bit numbers, respectively. Therefore, Equation (34) can be rewritten as

… x_{3} is a residue in modulo 2^{n} + 1. Therefore, when x_{3,n} is equal to one, the other bits are all zero, and if the n least significant bits (LSBs) of x_{3} are not all zero, then the most significant bit (MSB) of x_{3} (i.e., x_{3,n}) must be zero [12]. Therefore, by considering this point and Properties 1 and 2, Equation (38) can be simplified as follows:

… 2^{2n} − 2, and therefore, it is always less than the first and second moduli. Hence, we have

… 2^{2n} − 1. Therefore, we have

#### 2.4. Case Study: Moduli Set {2^{n} − 1, 2^{n} + 1, 2^{2n}, 2^{2n + 1} − 1}

… {2^{n} − 1, 2^{n} + 1, 2^{2n}, 2^{2n} + 1} except for 2^{2n} + 1, which is substituted with 2^{2n + 1} − 1. Due to this, it can lead to a faster RNS arithmetic unit. However, its reverse converter will be more complex. The overall process of designing the scaler for this moduli set is largely the same as for the moduli set {2^{n} − 1, 2^{n} + 1, 2^{2n}, 2^{2n} + 1} described in the previous subsection.

… {m_{1}, m_{2}, m_{3}, m_{4}} = {2^{2n + 1} − 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1}. Then, according to Equations (15)–(17), the multiplicative inverses can be computed as k_{1} = k_{2} = 1, and k_{3} = 2^{n − 1} (the proof is straightforward and similar to Lemma 1). Therefore, Equation (15) is a key formula in the scaling that can be calculated as follows:

… {2^{2n + 1} − 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1} are the same as those for the moduli set {2^{2n} + 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1} (i.e., Equations (49)–(52)), since all of them are based on T. That aside, the single-modulo scaled residues are the same as in Equations (55)–(57) except for the first scaled residue, which is as follows:

## 3. Low-Precision Scaling with Two-Moduli Scaling Factor: Hardware Design

… {2^{2n} + 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1}, the design can be considerably simplified, as presented in Figure 2. First, the H in Equation (39) is implemented using a 2n-bit regular carry-propagate adder (CPA) whose carry-in is tied to one. Aside from that, P also requires a modulo 2^{n} − 1 CPA, which can be implemented using an n-bit CPA with end-around carry (EAC) [18] based on Equation (40).
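The role of the carry-in can be illustrated with a small bit-level model. The sketch below assumes that H equals |x_{2} − x_{1}| modulo 2^{2n} (which is what k_{2} = 1 from Lemma 1 implies under the standard New CRT-II form; Equation (39) itself is not reproduced here). The subtraction is then a plain 2n-bit addition of x_{2}, the inverted low 2n bits of x_{1}, and a carry-in of one. The function name is hypothetical.

```python
def h_via_cpa(x1, x2, n):
    """Model of a 2n-bit CPA computing (x2 - x1) mod 2^(2n).

    x1 is a residue modulo 2^(2n) + 1 (2n + 1 bits); only its 2n low
    bits matter modulo 2^(2n). The carry-out of the adder is discarded.
    """
    width = 2 * n
    mask = (1 << width) - 1
    not_x1 = ~x1 & mask                 # bitwise inversion of the low 2n bits
    return (x2 + not_x1 + 1) & mask     # carry-in tied to one


n = 2
h = h_via_cpa(15, 4, n)   # same residues as X = 100 in the set (17, 16, 5, 3)
print(h)
```

No modular correction step is needed for a power-of-two modulus, since truncating to 2n bits already performs the reduction.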

… 2^{2n} − 1 CPA [18]. Then, according to Equations (49)–(52), the first and second two-moduli scaled residues are equal to T, and the third and fourth are only the reduction of T in moduli 2^{n} − 1 and 2^{n} + 1, which can be realized using an n-bit CPA with EAC and n-bit CPA with complement EAC (CEAC), respectively. Note that CPA-CEAC is a representation of the modulo 2^{n} + 1 adder which can be realized using different methods [19]. Finally, the single-modulo scaled residues can be obtained using Equations (54)–(57). The CSAs are used to compress the three operands into two, and then a modulo adder produces the scaled residue. It can be seen that in the customized version of the scaler for the moduli set {2^{2n} + 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1}, the units for m_{1} and m_{2} reduction in the two-moduli scaling part are removed, since the scaled residues are equal to T. Aside from that, the second single-modulo scaled residue is H, and hence, the required m_{2} reduction unit is removed.

**Algorithm 1.** Zero-aware RNS scaling.

**Input:** Non-zero RNS number (x_{1}, x_{2}, x_{3}, x_{4})

**Output:** Non-zero scaled RNS number (s_{1}, s_{2}, s_{3}, s_{4})

1. Calculate the low-precision scaled residues (sl_{1}, sl_{2}, sl_{3}, sl_{4}).
2. If (sl_{1}, sl_{2}, sl_{3}, sl_{4}) ≠ (0, 0, 0, 0), then return (sl_{1}, sl_{2}, sl_{3}, sl_{4}).
3. Calculate the high-precision scaled residues (sh_{1}, sh_{2}, sh_{3}, sh_{4}).
4. If (sh_{1}, sh_{2}, sh_{3}, sh_{4}) ≠ (0, 0, 0, 0), then return (sh_{1}, sh_{2}, sh_{3}, sh_{4}).
5. Return the original residues (x_{1}, x_{2}, x_{3}, x_{4}).
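The steps of Algorithm 1 can be sketched in software as follows. This is a behavioural model under stated assumptions, not the hardware design: the low-precision residues are taken as the residues of T (scaling by m_{1}m_{2}), the high-precision residues as the residues of H + m_{2}T (scaling by m_{1}), and H and T follow the standard New CRT-II form. The function name and the returned mode label are hypothetical.

```python
def zero_aware_scale(residues, moduli):
    """Return (scaled_residues, mode) following the steps of Algorithm 1."""
    x1, x2, x3, x4 = residues
    m1, m2, m3, m4 = moduli
    # Assumed New CRT-II terms; only H and T are needed for scaling.
    k1 = pow(m1 * m2, -1, m3 * m4)
    k2 = pow(m1, -1, m2)
    k3 = pow(m3, -1, m4)
    H = (k2 * (x2 - x1)) % m2
    P = (k3 * (x4 - x3)) % m4
    T = (k1 * ((x3 + m3 * P) - (x1 + m1 * H))) % (m3 * m4)

    # Steps 1-2: low-precision scaling by m1*m2.
    low = tuple(T % m for m in moduli)
    if any(low):
        return low, "low"
    # Steps 3-4: high-precision scaling by m1, reusing H and T.
    high = tuple((H + m2 * T) % m for m in moduli)
    if any(high):
        return high, "high"
    # Step 5: the input is smaller than m1, so it is passed through.
    return tuple(residues), "unscaled"


moduli = (17, 16, 5, 3)
for X in (1000, 100, 7):
    print(X, zero_aware_scale(tuple(X % m for m in moduli), moduli))
```

The high-precision branch reuses the H and T values already computed for the low-precision branch, mirroring the circuit reuse described in the text.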

## 4. Performance Evaluation

… {2^{2n + 1} − 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1} is fully designed in [15] based on the scaling factor 2^{2n}, as shown in Figure 3. To perform a technology-independent performance comparison, the unit-gate (U-G) model is used, following [15]. All the assumptions considered in [15] for estimating the area and delay of modular adders are also adopted here for a fair comparison, as shown in Table 1.

## 5. Conclusions

## Funding

## Conflicts of Interest

## References

1. Chang, C.H.; Molahosseini, A.S.; Zarandi, A.A.E.; Tay, T.F. Residue Number Systems: A New Paradigm to Datapath Optimization for Low-Power and High-Performance Digital Signal Processing. IEEE Circuits Syst. Mag. **2015**, 15, 26–44.
2. Samimi, N.; Kamal, M.; Afzali-Kusha, A.; Pedram, M. Res-DNN: A Residue Number System-Based DNN Accelerator Unit. IEEE Trans. Circuits Syst. I Regul. Pap. **2020**, 67, 658–671.
3. Deng, B.; Srikanth, S.; Jain, A.; Conte, T.; Debenedictis, E.; Cook, J. Scalable Energy-Efficient Microarchitectures with Computational Error Tolerance via Redundant Residue Number Systems. IEEE Trans. Comput. **2021**, in press.
4. Omondi, A.R.; Premkumar, B. Residue Number Systems: Theory and Implementation; Imperial College Press: London, UK, 2007.
5. Molahosseini, A.S.; Zarandi, A.A.E.; Martins, P.; Sousa, L. A Multifunctional Unit for Designing Efficient RNS-Based Datapaths. IEEE Access **2017**, 5, 25972–25986.
6. Kong, Y.; Phillips, B. Fast Scaling in the Residue Number System. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2009**, 17, 443–447.
7. Chang, C.H.; Low, J.Y.S. Simple, Fast, and Exact RNS Scaler for the Three-Moduli Set {2^{n} − 1, 2^{n}, 2^{n} + 1}. IEEE Trans. Circuits Syst. I Regul. Pap. **2011**, 58, 2686–2697.
8. Low, J.Y.S.; Chang, C.H. A VLSI Efficient Programmable Power-of-Two Scaler for {2^{n} − 1, 2^{n}, 2^{n} + 1} RNS. IEEE Trans. Circuits Syst. I Regul. Pap. **2012**, 59, 2911–2919.
9. Low, J.Y.S.; Tay, T.F.; Chang, C.H. A unified {2^{n} − 1, 2^{n}, 2^{n} + 1} RNS scaler with dual scaling constants. In Proceedings of the 2012 IEEE Asia Pacific Conference on Circuits and Systems, Kaohsiung, Taiwan, 2–5 December 2012.
10. Patronik, P.; Piestrak, S.J. Design of Reverse Converters for General RNS Moduli Sets {2^{k}, 2^{n} − 1, 2^{n} + 1, 2^{n + 1} − 1} and {2^{k}, 2^{n} − 1, 2^{n} + 1, 2^{n − 1} − 1} (n even). IEEE Trans. Circuits Syst. I Regul. Pap. **2014**, 61, 1687–1700.
11. Molahosseini, A.S.; Navi, K.; Dadkhah, C.; Kavehei, O.; Timarchi, S. Efficient reverse converter designs for the new 4-moduli sets {2^{n} − 1, 2^{n}, 2^{n} + 1, 2^{2n + 1} − 1} and {2^{n} − 1, 2^{n} + 1, 2^{2n}, 2^{2n} + 1} based on new CRTs. IEEE Trans. Circuits Syst. I Regul. Pap. **2010**, 57, 823–835.
12. Zarandi, A.A.E.; Molahosseini, A.S.; Sousa, L.; Hosseinzadeh, M. An Efficient Component for Designing Signed Reverse Converters for a Class of RNS Moduli Sets with Composite Form {2^{K}, 2^{P} − 1}. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2017**, 25, 48–59.
13. Sousa, L.; Antao, S. MRC-Based RNS Reverse Converters for the Four-Moduli Sets {2^{n} + 1, 2^{n} − 1, 2^{n}, 2^{2n + 1} − 1} and {2^{n} + 1, 2^{n} − 1, 2^{2n}, 2^{2n + 1} − 1}. IEEE Trans. Circuits Syst. II **2012**, 59, 244–248.
14. Hiasat, A. A Reverse Converter and Sign Detectors for an Extended RNS Five-Moduli Set. IEEE Trans. Circuits Syst. I Regul. Pap. **2017**, 64, 111–121.
15. Sousa, L. 2^{n} RNS Scalers for Extended 4-Moduli Sets. IEEE Trans. Comput. **2015**, 64, 3322–3334.
16. Garcia, A.; Lioris, A. A Look-Up Scheme for Scaling in the RNS. IEEE Trans. Comput. **1999**, 48, 748–751.
17. Vassalos, E.; Bakalis, D. CSD-RNS-based Single Constant Multipliers. J. Signal Process. Syst. **2012**, 67, 255–268.
18. Piestrak, S.J. A high speed realization of a residue to binary converter. IEEE Trans. Circuits Syst. II **1995**, 42, 661–663.
19. Vergos, H.T.; Bakalis, D.; Efstathiou, C. Fast modulo 2^{n} + 1 multi-operand adders and residue generators. Integration **2010**, 43, 42–48.

**Figure 1.** The block diagram of the proposed zero-aware low-precision scaler for the generic RNS four-moduli set {m_{1}, m_{2}, m_{3}, m_{4}}.

**Figure 2.** The proposed scaler for the moduli set {2^{2n} + 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1} with scale coefficients (2^{2n} + 1) 2^{2n} and 2^{2n}.

**Figure 3.** The single-modulo scaler for the special moduli set {2^{2n + 1} − 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1} with scaling factor 2^{2n} proposed in [15].

**Table 1.**The area and delay formulas for different n-bit modulo adders based on the U-G model reported in [15].

Modulo | Adder | Area | Delay |
---|---|---|---|
${2}^{n}-1$ | CPA-EAC | $3n\lceil {\mathrm{log}}_{2}n-1\rceil +12n$ | $2\lceil {\mathrm{log}}_{2}n-1\rceil +3$ |
${2}^{n}-1$ | CSA-EAC | $7n$ | $4$ |
${2}^{n}$ | CPA | $1.5n\lceil {\mathrm{log}}_{2}n\rceil +5n$ | $2\lceil {\mathrm{log}}_{2}n\rceil +3$ |
${2}^{n}+1$ | CPA-CEAC | $4.5n\lceil {\mathrm{log}}_{2}n\rceil +0.5n+6$ | $2\lceil {\mathrm{log}}_{2}n\rceil +3$ |
${2}^{n}+1$ | CSA-CEAC | $7n$ | $4$ |

**Table 2.**The area and delay formulas based on the U-G model for different components of the proposed double-modulo scaler.

Component | Area | Delay |
---|---|---|
2n-bit CPA | $3n\lceil {\mathrm{log}}_{2}n\rceil +13n$ | $2\lceil {\mathrm{log}}_{2}n\rceil +5$ |
n-bit 2 × 1 MUX | $3n$ | $2$ |
n-bit CPA-EAC | $3n\lceil {\mathrm{log}}_{2}n-1\rceil +12n$ | $2\lceil {\mathrm{log}}_{2}n-1\rceil +3$ |
2n-bit Simplified CSA-EAC1 | $10n+4$ | $4$ |
2n-bit Simplified CSA-EAC2 | $6n+4$ | $4$ |
2n-bit CSA-EAC | $14n$ | $4$ |
2n-bit CPA-EAC | $6n\lceil {\mathrm{log}}_{2}n\rceil +24n$ | $2\lceil {\mathrm{log}}_{2}n\rceil +3$ |
n-bit CSA-CEAC | $7n$ | $4$ |
n-bit CPA-CEAC | $4.5n\lceil {\mathrm{log}}_{2}n\rceil +0.5n+6$ | $2\lceil {\mathrm{log}}_{2}n\rceil +3$ |

**Table 3.** The total area and delay estimations for the RNS scalers based on the four-moduli set {2^{2n + 1} − 1, 2^{2n}, 2^{n} + 1, 2^{n} − 1}.

Scaler | Scale Factor | Area | Delay |
---|---|---|---|
Proposed Low-Precision | 2^{2n} (2^{2n + 1} − 1) | $19.5n\lceil {\mathrm{log}}_{2}n\rceil +95.5n+14$ | $6\lceil {\mathrm{log}}_{2}n\rceil +27$ |
Proposed High-Precision | 2^{2n} | $31.5n\lceil {\mathrm{log}}_{2}n\rceil +160.5n+14$ | $8\lceil {\mathrm{log}}_{2}n\rceil +34$ |
[15] High-Precision | 2^{2n} | $\left(28.5n+6\right)\lceil {\mathrm{log}}_{2}n\rceil +150.5n+44$ | $6\lceil {\mathrm{log}}_{2}n\rceil +25$ |
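The closed-form estimates in Table 3 are easy to evaluate for a concrete n. The sketch below simply transcribes the three area and delay expressions from the table; the function and key names are hypothetical, and the results are in the unit-gate model's abstract units rather than physical area or time.

```python
def ceil_log2(n):
    """Return ceil(log2(n)) for an integer n >= 1 without floating point."""
    return (n - 1).bit_length()


def table3_estimates(n):
    """Evaluate the Table 3 (area, delay) formulas for a given n."""
    lg = ceil_log2(n)
    return {
        "proposed_low": (19.5 * n * lg + 95.5 * n + 14, 6 * lg + 27),
        "proposed_high": (31.5 * n * lg + 160.5 * n + 14, 8 * lg + 34),
        "ref15_high": ((28.5 * n + 6) * lg + 150.5 * n + 44, 6 * lg + 25),
    }


for n in (4, 8, 16):
    for name, (area, delay) in table3_estimates(n).items():
        print(n, name, area, delay)
```

Evaluating the formulas side by side makes it straightforward to see, for a chosen n, how the cost of the added high-precision path compares with the scaler of [15].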


© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sabbagh Molahosseini, A.
Zero-Aware Low-Precision RNS Scaling Scheme. *Axioms* **2022**, *11*, 5.
https://doi.org/10.3390/axioms11010005
