An Efficient CRT-Base Power-of-Two Scaling in Minimally Redundant Residue Number System

In this paper, we consider one of the key problems in modular arithmetic. Scaling in the residue number system (RNS) is known to be a rather complicated non-modular procedure that requires expensive and complex operations at each iteration. Hence, it is time-consuming and demands considerable hardware for implementation. We propose a novel approach to power-of-two scaling based on the Chinese Remainder Theorem (CRT) and the rank form of the number representation in RNS. By using minimal redundancy of the residue code, we optimize and speed up the rank calculation and the parity determination of divisible integers in each iteration. The proposed enhancements make power-of-two scaling simpler and faster than the currently known methods. After calculating the rank of the initial number, each iteration of modular scaling by two is performed in one modular clock cycle. The computational complexity of the proposed method of scaling by a constant $S_l = 2^l$, in terms of the required modular addition operations and lookup tables, is estimated as $k$ and $2k + 1$, respectively, where $k$ is the number of primary non-redundant RNS moduli. The time complexity is $\lceil \log_2 k \rceil + l$ modular clock cycles.


Introduction
Nowadays, high-performance computing is progressing extremely rapidly. This places qualitatively new demands on the number-theoretic methods and computational algorithms being designed. That is why creating fundamentally new and efficient computing tools for fast and reliable parallel data processing is especially important. Modular computational structures occupy a special place among them. Modular arithmetic, i.e., the arithmetic of RNS, forms their mathematical basis.
The inherent parallelism and carry-free properties of RNS provide a high potential for accelerating arithmetic operations compared with conventional weighted number systems (WNS). The main advantage of RNS is its ability to decompose large integers into a set of small residues and to process them in parallel in independent modular channels.
In this regard, one of the most promising directions in this area is the development of high-speed parallel modular computational structures as well as the enhancement of their functionality and optimization. In this case, the main optimization criteria are the …
Lemma 1 (Euclid's Division Lemma). For any $X \in \mathbb{Z}$ and a positive integer $m$, there exists a unique pair of integers $Q$, $R$ such that
$$X = Qm + R, \tag{1}$$
where $R \in \mathbb{Z}_m = \{0, 1, \ldots, m - 1\}$.
The residue code $(\chi_1, \chi_2, \ldots, \chi_k)$ corresponds to the set of all integers $X$ satisfying the system of simultaneous linear congruences
$$X \equiv \chi_i \pmod{m_i}, \quad i = 1, 2, \ldots, k. \tag{2}$$
The following statement is true [9,23,24].
Theorem 1 (Chinese Remainder Theorem). Let the moduli $m_1, m_2, \ldots, m_k$ be pairwise prime, $M_k = \prod_{i=1}^{k} m_i$, $M_{i,k} = M_k / m_i$, and $\mu_{i,k} = |M_{i,k}^{-1}|_{m_i}$ ($i = 1, 2, \ldots, k$). Then the system of congruences (2) has a unique solution, the class of residues modulo $M_k$, defined by the congruence
$$X \equiv \sum_{i=1}^{k} M_{i,k}\, \mu_{i,k}\, \chi_i \pmod{M_k}. \tag{3}$$
The practical application of the RNS assumes that each residue code $(\chi_1, \chi_2, \ldots, \chi_k)$ corresponds to exactly one integer. Therefore, certain sets of representatives of residue classes are used as the number range to ensure the required single-valued correspondence. Since $M_k$ integers can be represented in the given RNS, the set $\mathbb{Z}_{M_k} = \{0, 1, \ldots, M_k - 1\}$ is usually used in computer applications as the RNS operating range.
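The decoding mapping $\Phi^{-1}_{RNS}: \mathbb{Z}_{m_1} \times \mathbb{Z}_{m_2} \times \cdots \times \mathbb{Z}_{m_k} \to \mathbb{Z}_{M_k}$ based on the CRT (3) executes according to the rule
$$X = \left| \sum_{i=1}^{k} M_{i,k}\, \mu_{i,k}\, \chi_i \right|_{M_k}. \tag{4}$$
Applying Euclid's Division Lemma (1), we can write
$$\mu_{i,k}\, \chi_i = \left\lfloor \frac{\mu_{i,k}\, \chi_i}{m_i} \right\rfloor m_i + \chi_{i,k}, \tag{5}$$
where $\chi_{i,k}$ is a normalized residue modulo $m_i$:
$$\chi_{i,k} = |\mu_{i,k}\, \chi_i|_{m_i}; \tag{6}$$
$\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$.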
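Substituting (5) into (4), and taking into consideration (6), we have
$$X = \left| \sum_{i=1}^{k} M_{i,k} \left( \left\lfloor \frac{\mu_{i,k}\, \chi_i}{m_i} \right\rfloor m_i + \chi_{i,k} \right) \right|_{M_k},$$
which, since $M_{i,k}\, m_i = M_k$, is equivalent to
$$X = \left| \sum_{i=1}^{k} M_{i,k}\, \chi_{i,k} \right|_{M_k}. \tag{7}$$
Since the summands in (7) vary within narrower bounds, the use of (7), which is a normalized analog of (4), is preferable for constructing RNS arithmetic. Equation (7) is called the CRT-form of representing the integer $X = (\chi_1, \chi_2, \ldots, \chi_k)$ from the RNS number range $\mathbb{Z}_{M_k}$.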
The mapping $\Phi_{RNS}$ is an isomorphism with respect to the basic arithmetic operations. The operation $\circ \in \{+, -, \times\}$ on arbitrary elements $A$ and $B$ given by their residue codes $A = (\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $B = (\beta_1, \beta_2, \ldots, \beta_k)$ is carried out by the rule
$$A \circ B = (\gamma_1, \gamma_2, \ldots, \gamma_k), \tag{8}$$
where $\gamma_i = |\alpha_i \circ \beta_i|_{m_i}$ ($i = 1, 2, \ldots, k$).
In the RNS, according to (8), modular addition, subtraction, and multiplication are performed independently for each modulus $m_i$ ($i = 1, 2, \ldots, k$). It must be noted that (8) is correct only if the result $A \circ B$ of the arithmetic operation does not go beyond the RNS number range, i.e., if $A \circ B \in \mathbb{Z}_{M_k}$.
The inherent code parallelism of the RNS illustrated by (8), which consists of decomposing arithmetic operations on the integers $A$ and $B$ into independent small word-length operations on the like digits $\alpha_i$ and $\beta_i$ of the residue code, is the main advantage of modular arithmetic compared with the arithmetic of weighted number systems (WNS). Realizing this advantage to the fullest extent is a key strategic goal of all computer applications in the RNS.
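The channel-wise rule for modular operations can be illustrated with a minimal sketch. The moduli set (3, 5, 7) and the helper names `to_rns` and `rns_op` are assumptions chosen for illustration only; they do not come from the paper.

```python
import operator

MODULI = (3, 5, 7)  # illustrative pairwise coprime moduli (assumption), M_k = 105


def to_rns(x, moduli=MODULI):
    """Encode an integer as its residue code (chi_1, ..., chi_k)."""
    return tuple(x % m for m in moduli)


def rns_op(a, b, op, moduli=MODULI):
    """Apply op independently in every modular channel."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, moduli))


a, b = to_rns(17), to_rns(4)
# Each result stays inside the range [0, 105), so the residue codes agree
# with the codes of the integer results.
assert rns_op(a, b, operator.add) == to_rns(17 + 4)
assert rns_op(a, b, operator.sub) == to_rns(17 - 4)
assert rns_op(a, b, operator.mul) == to_rns(17 * 4)
```

No channel needs a carry or any other information from its neighbors, which is the source of the parallelism described above.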
As is known, in contrast to the positional code, the residue code $(\chi_1, \chi_2, \ldots, \chi_k)$ of the number $X$ does not explicitly contain information about its value. Therefore, implementing RNS arithmetic operations that require the so-called positional characteristics, which give information about the location of a number in the RNS range $\mathbb{Z}_{M_k}$, encounters specific difficulties. Such procedures, in contrast to modular ones, are called non-modular.
The efficiency factor of RNS arithmetic, to a decisive extent, is determined by the optimality of the applied non-modular procedures. At the same time, the main factor that has the most impact on the quality indicators of algorithms for non-modular operations is the computational complexity of calculating the positional characteristics of the residue code and related integer representation forms.
As for Equation (7), its direct application as the general form of integers for building non-modular procedures is practically impossible due to the complexity of straightforward implementation, especially in the case of large M k . At the same time, the use of the specific positional characteristics enables us to obtain from (7) the relevant forms of integer representation, which have good implementation properties and make it possible to overcome the problem of time-consuming addition operations modulo M k .
As follows from (7), the difference
$$\sum_{i=1}^{k} M_{i,k}\, \chi_{i,k} - X$$
is a multiple of $M_k$. Hence, the following equality holds:
$$X = \sum_{i=1}^{k} M_{i,k}\, \chi_{i,k} - \rho_k(X)\, M_k. \tag{9}$$
The positional characteristic $\rho_k(X)$ is called the rank of the number $X$. In essence, the rank $\rho_k(X)$ is a CRT reconstruction coefficient that indicates how many times the upper bound $M_k$ of the number range is exceeded when the integer value of the number $X$ is calculated from its residue code $(\chi_1, \chi_2, \ldots, \chi_k)$.
Equation (9) is called the rank form of the integer $X$.
From (9), it also follows that the rank $\rho_k(X)$ is the quotient of the integer division of the sum $\sum_{i=1}^{k} M_{i,k}\, \chi_{i,k}$ by $M_k$.
Hence, we obtain
$$\rho_k(X) = \left\lfloor \frac{1}{M_k} \sum_{i=1}^{k} M_{i,k}\, \chi_{i,k} \right\rfloor = \left\lfloor \sum_{i=1}^{k} \frac{\chi_{i,k}}{m_i} \right\rfloor. \tag{10}$$
Therefore, since $\chi_{i,k} \in \mathbb{Z}_{m_i}$ ($i = 1, 2, \ldots, k$), the inequality $0 \le \rho_k(X) \le k - 1$ holds. Compared with (7), Equation (9) does not contain the time-consuming reduction modulo $M_k$. Therefore, designing non-modular procedures in RNS arithmetic on the basis of the rank form has a substantial advantage over the canonical CRT implementation.
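The rank form can be checked numerically with a short sketch. The function name `rank_form` and the moduli set (3, 5, 7) are illustrative assumptions; the rank is obtained here directly from the known integer, which is only a reference computation and not an efficient method. The modular inverse via `pow(a, -1, m)` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import floor, prod


def rank_form(x, moduli):
    """Return (normalized residues chi_{i,k}, rank rho_k(X)) for 0 <= x < M_k."""
    M = prod(moduli)
    normalized = []
    total = 0
    for m in moduli:
        Mi = M // m                 # M_{i,k}
        mu = pow(Mi, -1, m)         # mu_{i,k}, inverse of M_{i,k} modulo m_i
        n = (mu * (x % m)) % m      # normalized residue chi_{i,k}
        normalized.append(n)
        total += Mi * n
    # total - x is a multiple of M_k; the quotient is the rank
    return normalized, (total - x) // M


moduli = (3, 5, 7)
norm, rank = rank_form(17, moduli)
print(norm, rank)  # [1, 2, 3] 1

# The same rank as the integer part of a sum of k proper fractions
assert rank == floor(sum(Fraction(n, m) for n, m in zip(norm, moduli)))
# The rank never exceeds k - 1
assert all(0 <= rank_form(x, moduli)[1] <= len(moduli) - 1 for x in range(105))
```

For X = 17 the weighted sum is 35·1 + 21·2 + 15·3 = 122 = 17 + 1·105, so the range bound is exceeded exactly once.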

The Approaches Currently Used to Calculate the Rank of a Number
The rank of a number as a main RNS integral characteristic was first studied in [1], and later in [2]. The rank evaluation algorithm consists of a slow $k$-step iterative procedure of sequential additions, modulo a large number, of specific constants defined by the chosen RNS moduli-set $\{m_1, m_2, \ldots, m_k\}$. Moreover, the upper bound of the rank $r(X)$ depends on the values of the weights $\mu_{1,k}, \mu_{2,k}, \ldots, \mu_{k,k}$ (see (4)) and can be rather large for most moduli-sets suitable for practical use. If we assume that the processing of such long $L$-bit numbers ($L = \lceil \log_2 M_k \rceil$) is comparable in time with $k$ operations on the small residues, then the complexity of this method is $O(k^2)$. Because of that, this approach to the rank calculation is time-consuming and practically unacceptable for high-performance computing, especially when $M_k$ is huge.
The so-called "extra modulus method" for rank calculation has been proposed in [25]. It rearranges the canonical CRT implementation into an exact integer equation, i.e., the same form as (9). To be able to retrieve the value of the CRT reconstruction coefficient, i.e., the rank of a number, the extra modulus $m_e$ must satisfy the following conditions: $m_e > k$, and $m_e$ is any integer prime to $M_k$. In this way, the slow and challenging addition modulo $M_k$ in the straightforward CRT implementation is replaced by subtraction and multiplication modulo $m_e$. Thus, we have an extra modular channel for rank calculation. This method works correctly provided that the proper redundant residue $|X|_{m_e}$ is available. Hence, the "extra modulus method" is suitable for the base extension operation. However, when the number under consideration results from a modular addition or subtraction operation [26], the method cannot be used because of possible overflow or underflow, respectively: in such a case, the exact value of $|X|_{m_e}$ is not available. Therefore, this method is not applicable for sign determination and magnitude comparison of two numbers in RNS.
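The idea of the extra modulus method can be sketched as follows, assuming the redundant residue $|X|_{m_e}$ is known exactly. The function name `rank_via_extra_modulus` and the parameter choices (moduli 3, 5, 7 and extra modulus 4) are illustrative assumptions, not taken from [25].

```python
from math import prod


def rank_via_extra_modulus(residues, x_mod_me, moduli, m_e):
    """Recover the rank using only arithmetic modulo m_e.

    Relies on the rank form X = sum(M_i * chi_{i,k}) - rank * M_k, reduced
    modulo m_e. Requires gcd(m_e, M_k) = 1 and m_e > k so that the rank
    (at most k - 1) is determined by its residue modulo m_e.
    """
    M = prod(moduli)
    acc = 0
    for chi, m in zip(residues, moduli):
        Mi = M // m
        n = (pow(Mi, -1, m) * chi) % m        # normalized residue chi_{i,k}
        acc = (acc + (Mi % m_e) * n) % m_e    # small-modulus accumulation
    return ((acc - x_mod_me) * pow(M, -1, m_e)) % m_e


moduli, m_e = (3, 5, 7), 4
x = 17
print(rank_via_extra_modulus([x % m for m in moduli], x % m_e, moduli, m_e))  # 1
```

The large addition modulo $M_k$ is replaced by accumulation modulo the small $m_e$, at the price of carrying one additional residue.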
A different approach to evaluating the CRT reconstruction coefficient is proposed in [27][28][29]. The main idea of the so-called "fractional domain method" consists in the representation of the reconstruction coefficient r as an integer part of a sum of at most k proper fractions (see (10)). The value r is recursively estimated by approximating terms of a fraction χ i,k /m i . To avoid division by the modulus m i in the fraction, the denominator m i is replaced by 2 n (m i < 2 n ), while the numerator χ i,k is approximated by its most significant υ bits (υ < n) (i = 1, 2, . . . , k). Since the division by powers of 2 is equivalent to simple shifts, then the calculation of r can be implemented by addition only.
The main drawbacks of this method consist of the following. First of all, full-precision fractional computations are required. In any case, such calculations are slower than operating on smaller word-length, and the full-precision fractional bits require substantial storage. On the other hand, the number of iterations required is of the order of the bit-length needed for the approximation. For example, the method employing a fractional interpretation of the CRT [27] needs a very high precision of log 2 (kM k ) bits. The method proposed in [28,29] uses a sequential bit-by-bit manner for evaluating reconstruction coefficient r. The iterative structure of this method makes it very slow in the case of large word-length numbers.
There are also approaches to reconstruct the integer value of RNS number based on the CRT by using special moduli-sets with a limited number of moduli such as m = 2 n + d (d ∈ {−1, 0, 1}) [5,8]. The main drawback of these methods consists of a small number of the selected moduli, typically from three to five. Such moduli sets are suitable for the efficient implementations of digital signal processing algorithms but completely not applicable for the processing of large numbers which are widely used in cryptography.
In recent decades, the CRT algorithm, corresponding forms of number representation, and the methods of integer reconstruction by residue code have been intensively studied, especially concerning their application in high-performance computing. The major efforts are aimed at reducing the computational complexity of calculating the main integral characteristics of residue code.
There are some new approaches for calculating an approximate value of the rank of a number which allow us to reduce the computational complexity of complicated nonmodular operations in RNS arithmetic [30][31][32]. The method proposed in [30] is based on the so-called interval floating-point characteristic which provides information about the range of changes in the relative value of RNS representation. Generally, it enables us to perform effectively such operations as magnitude comparison, sign determination, and overflow detection. The concept of an approximate value of the rank of a number is introduced in [31]. This approach allows us to reduce the computational complexity of the decoding from residue code to binary representation and decrease the size of the required coefficients. Based on the properties of the approximate value and arithmetic properties of RNS, a new method for error detection, correction, and controlling computational results has been proposed. In [32], a new original general-purpose technique for CRT basis extension and scaling in RNS using floating-point arithmetic for the rank estimation is proposed for a homomorphic encryption scheme. The main algorithmic improvements focus on optimizing decryption and homomorphic multiplication in the RNS using the CRT to represent and manipulate the large coefficients in the ciphertext polynomials.
The rank positional characteristic has been thoroughly investigated in [33,34]. As shown there, the rank $\rho_k(X)$ has a simple structure, high modularity of calculation, and a small range of values. The rank $\rho_k(X)$ is a sum of two small numbers, namely, the inexact rank $\hat{\rho}_k(X) < k$ and the two-valued rank correction $\Delta_k(X) \in \{0, 1\}$:
$$\rho_k(X) = \hat{\rho}_k(X) + \Delta_k(X), \tag{11}$$
where $\hat{\rho}_k(X)$ and $\Delta_k(X)$ are defined by (12)-(14).
In conventional non-redundant RNS, as follows from (11)-(14), the calculation of the inexact rank $\hat{\rho}_k(X)$ is reduced to a summation of $k$ small residues $R_{1,k}(\chi_1), R_{2,k}(\chi_2), \ldots, R_{k,k}(\chi_k)$ modulo $m_k$, taking into account the number of overflows occurring during the modular addition operations. At the same time, as demonstrated in [34], the main computational cost is associated with the estimation of the rank correction $\Delta_k(X)$. Its evaluation requires concurrent modular addition operations in all independent modular channels corresponding to the primary RNS moduli $m_1, m_2, \ldots, m_k$. These computations can be implemented easily by pre-computation and lookup table techniques. As a result, the total numbers of modular addition operations and lookup tables required for calculating the rank $\rho_k(X)$ are $(k^2 + 5k - 10)/2$ and $(k^2 + k - 2)/2$, respectively.
As shown in [34], the minimum redundancy residue code enables optimization of the rank calculation. It assumes the extension of non-redundant residue code (χ 1 , χ 2 , . . . , χ k ) of the number X by the redundant residue χ 0 = |X| m 0 concerning extra modulus m 0 = 2, i.e., by adding the parity of the number X to its residue representation. Therefore, in minimally redundant RNS, the number X ∈ Z M k is represented by its minimally redundant residue code (χ 0 , χ 1 , . . . , χ k ). So, the total residue code length increases by only one bit.
The main advantage of minimally redundant RNS compared with non-redundant analogs consists of a significant simplification of calculating the rank correction ∆ k (X) and, accordingly, the rank ρ k (X).
The use of the minimum redundancy residue code makes it possible to replace in (11) the rank correction $\Delta_k(X)$, whose evaluation is time-consuming and requires addition operations in all modular channels, with a trivially calculated binary attribute $\delta_k(X) \in \{0, 1\}$. In this case,
$$\rho_k(X) = \hat{\rho}_k(X) + \delta_k(X) \tag{15}$$
and
$$\delta_k(X) = \left| \chi_0 + \sum_{i=1}^{k} \psi_i + \hat{\rho}_k(X) \right|_2, \tag{16}$$
where $\psi_i = |\chi_{i,k}|_2$ is the least significant bit of the normalized residue $\chi_{i,k}$ ($i = 1, 2, \ldots, k$).
Compared with non-redundant analogs, the use of minimally redundant RNS significantly reduces the complexity of calculating the rank $\rho_k(X)$ in terms of both the required modular addition operations and lookup tables. The corresponding computational cost is $k$ modular addition operations and $k$ one-input lookup tables. The time complexity depends only on the number of primary RNS moduli and equals $T_{rank} = \lceil \log_2 k \rceil$ modular clock cycles.
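Why the redundant parity bit makes the correction trivial can be seen from the rank form (9) reduced modulo 2: with odd primary moduli, every $M_{i,k}$ and $M_k$ is odd, so the parity of the rank equals the parity of $\chi_0$ plus the least significant bits of the normalized residues. The sketch below checks this; the function names and the moduli (3, 5, 7) are assumptions for illustration, and the inexact rank is simply passed in as a number, since its computation follows [33,34] and is not reproduced here.

```python
from math import prod


def normalized_residues(x, moduli):
    """Normalized residues chi_{i,k} = |mu_{i,k} * chi_i| mod m_i."""
    M = prod(moduli)
    return [(pow(M // m, -1, m) * (x % m)) % m for m in moduli]


def rank_parity(chi0, normalized):
    """Parity of the rank from the parity bit chi0 and the bits psi_i."""
    return (chi0 + sum(n & 1 for n in normalized)) % 2


def rank_correction(chi0, normalized, inexact_rank):
    """Binary correction: 1 if the inexact rank is one below the rank, else 0."""
    return (rank_parity(chi0, normalized) + inexact_rank) % 2


moduli = (3, 5, 7)          # odd pairwise coprime primary moduli (assumption)
x = 17                      # its rank in this RNS is 1
norm = normalized_residues(x, moduli)
print(rank_parity(x % 2, norm))          # 1
print(rank_correction(x % 2, norm, 0))   # 1: an inexact rank of 0 must be raised
print(rank_correction(x % 2, norm, 1))   # 0: an inexact rank of 1 is already exact
```

Only one-bit additions are involved, in contrast to the modular additions in all channels needed for the correction in the non-redundant case.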
As shown in [34], the transition to the minimum redundant residue code decreases the computational complexity of the rank calculation from the order $O(k^2/2)$ to $O(k)$ with respect to the required modular addition operations and lookup tables. Thus, the computational complexity reduction factor increases with the number $k$ of non-redundant moduli $m_1, m_2, \ldots, m_k$ and asymptotically approaches the threshold $k/2$.
The use of minimally redundant RNS ensures significant optimization of calculating the rank $\rho_k(X)$ of the number $X$. Moreover, this also applies to the implementation of the CRT algorithm and, correspondingly, to the execution of various non-modular operations based on it. First of all, this is due to the extremely simple evaluation of the two-valued characteristic $\delta_k(X) \in \{0, 1\}$ as well as the modular structure of the main equation for the inexact rank $\hat{\rho}_k(X)$ (see (12)). This circumstance radically simplifies the calculation of the rank $\rho_k(X)$ in minimally redundant RNS in comparison with conventional non-redundant RNS and, consequently, makes it possible to construct faster variants of RNS arithmetic that are optimal with respect to computational complexity.
Therefore, the application of minimally redundant residue representation takes priority over conventional non-redundant RNS arithmetic to implement the scaling procedures based on the rank form of a number.

The Main Types of Scaling Algorithms in RNS Arithmetic
In the conventional WNS, power-of-two scaling is performed simply by right shifting. In the RNS, this procedure is substantially more difficult because of the non-positional nature of the residue code.
The classical power-of-two scaling method consists of the residue code conversion to binary representation, scaling in the conventional WNS, and converting the result back to the RNS.
Unlike the WNS, the residue code does not contain explicit information about the integer value of the represented number. Therefore, in addition to its usual purpose, which consists of limiting the undesirable growth of calculation results, the scaling in RNS is also used to detect the position of integers in a particular range (i.e., to evaluate their values), rounding, and solving other similar tasks. This operation is often used in more complex non-modular procedures such as general modular division. Many different scaling algorithms, which do not require RNS-to-binary conversion, have been presented in the literature. A detailed review of the known modular scaling methods is presented in [8].
The essence of the modular scaling operation is to obtain some integer approximation $\tilde{X} = (\tilde{\chi}_1, \tilde{\chi}_2, \ldots, \tilde{\chi}_k)$ to the fraction $X/S$, where $X = (\chi_1, \chi_2, \ldots, \chi_k)$ is an arbitrary element of the RNS number range $\mathbb{Z}_{M_k}$, and $S$ is a constant factor (scale). The fraction $X/S$ is usually approximated by the integers $\lfloor X/S \rfloor$ and $\lceil X/S \rceil$ ($\lfloor x \rfloor$ and $\lceil x \rceil$ are the floor and ceiling functions of $x$, respectively).
The most important aspect of the scaling problem in RNS is to ensure the high flexibility of the created algorithmic tools. That implies adoption of the set S = {S 0 , S 1 , . . . , S Λ−1 } of scales S l > 1 (l = 0, 1, . . . , Λ − 1) which is usually chosen based on the criterion for the minimum calculating error under a given constraint on the number of scaling factors.
All known scaling techniques can be classified into four main categories: scaling by a product of several RNS moduli [35][36][37][38], scaling by an integer [39,40], scaling by an arbitrary rational number [41], and scaling by a power of two [42][43][44].
In the first group, many scaling methods take the scaling factor $S$ as a product of $l$ moduli, i.e., of the form $S = M_l$ ($l = 1, 2, \ldots, k - 1$) [35][36][37][38]. That makes it easier to obtain the residues $\tilde{\chi}_{l+1}, \tilde{\chi}_{l+2}, \ldots, \tilde{\chi}_k$ of the approximation $\tilde{X}$ to the fraction $X/S$. The remaining residues $\tilde{\chi}_1, \tilde{\chi}_2, \ldots, \tilde{\chi}_l$ can be calculated fairly easily within procedures based on one of the base extension algorithms [2,35,45]. Due to the small word length of residues, pre-computation and lookup table techniques are suitable for modular scaling.
In [35], the base extension algorithm uses the reverse conversion of residue code to mixed-radix representation. The method proposed in [36] requires a redundant modulus to evaluate the CRT reconstruction coefficient, i.e., the rank of a number, to complete the base extension procedure. In [38], the suggested approach is entirely based on a lookup tables technique, while all the required tables have two inputs. At the same time, the memory costs are too high when the number of chosen moduli is sufficiently large. The method proposed in [37] enables one to carry out base extension and exact scaling without some system redundancy only by using additional lookup tables.
The CRT-base technique for modular scaling by an integer has been suggested in [39]. Here, the main idea is to approximate the CRT calculating relation for reconstructing the integer value of RNS numbers. This enables the substitution of large modulo M k addition in the canonic CRT-decoding scheme by smaller word-length modular addition operations. In [40], the proposed method uses minimum redundancy for modular scaling by arbitrary positive scales. The distinctive feature of the algorithm consists of using the interval index as a positional characteristic of residue code. At the same time, the interval index can be calculated fast and lightly by modular addition of small residues in the kth modular channel corresponding to the modulus m k from the RNS moduli-set {m 1 , m 2 , . . . , m k }.
In the case of arbitrary rational scale S, an efficient basis for modular scaling is the approach presented in [41]. The main feature is that for the scales of the form S = p/q, the numbers p and q can take any integer values for which the fraction qX/p does not exceed the upper bound of the RNS number range. In addition, both the number qX and the results of intermediate calculations may not satisfy the specified requirement.
The scaling methods in the fourth group implement division by constants of the form $S = 2^l$ ($l = 1, 2, \ldots, \Lambda$), $\Lambda \le \log_2 M_k$ [7,42,43]. General approaches to solving this task are based mainly on the bisection method. It consists of calculating the recurrence relation $X^{(j+1)} = \lfloor X^{(j)}/2 \rfloor$ for $j = 0, 1, \ldots, l - 1$. In this case, $X^{(0)} = X$, and $X^{(l)} = \lfloor X/2^l \rfloor$. The residue $\chi_i^{(j+1)}$ ($i = 1, 2, \ldots, k$) of the approximation $X^{(j+1)}$ is determined as
$$\chi_i^{(j+1)} = \left| \left( \chi_i^{(j)} - |X^{(j)}|_2 \right) \cdot |2^{-1}|_{m_i} \right|_{m_i}, \tag{17}$$
while all the primary moduli $m_1, m_2, \ldots, m_k$ are coprime odd numbers. The last condition ensures that 2 and $m_i$ ($i = 1, 2, \ldots, k$) are relatively prime and, correspondingly, that the modular multiplicative inverse of 2 exists, i.e., the number $|2^{-1}|_{m_i} = (m_i + 1)/2$.
As follows from (17), scaling by 2 requires the parity detection of the number $X^{(j)}$, $j = 0, 1, \ldots, l - 1$. So, there is a need for a base extension operation to the extra modulus 2. An iterative algorithm for scaling by the factor $S = 2^l$ proposed in [42] is implemented in $l$ steps. The parity of the intermediate results is checked at each iteration using the base extension operation suggested in [25]. In [43], the power-of-two scaling technique is applied to realize a digital filter in quadratic RNS. The scaling algorithm presented in [44] focuses on arbitrary moduli sets with large dynamic ranges and requires only machine-precision integer and floating-point operations. It is used for software implementation of rounding and exponent alignment procedures in a multiple-precision RNS-based arithmetic library for parallel CPU-GPU systems.
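One step of the bisection method can be sketched as follows. The parity of the current number is taken here from the known integer purely for illustration; in an actual RNS implementation it has to come from a base extension or another non-modular procedure. The helper names and the moduli (3, 5, 7) are assumptions.

```python
def to_rns(x, moduli):
    """Residue code of x for the given moduli."""
    return tuple(x % m for m in moduli)


def halve_rns(residues, parity, moduli):
    """Residues of floor(X / 2), given the residues and the parity of X.

    For an odd modulus m the inverse of 2 is (m + 1) // 2.
    """
    return tuple(((chi - parity) * ((m + 1) // 2)) % m
                 for chi, m in zip(residues, moduli))


MODULI = (3, 5, 7)  # odd pairwise coprime moduli (illustrative)
x = 53
code = to_rns(x, MODULI)
for _ in range(3):                      # scale by 2**3
    code = halve_rns(code, x % 2, MODULI)
    x //= 2                             # stand-in for the parity oracle
print(code)  # (0, 1, 6), the residue code of 53 // 8 == 6
```

Subtracting the parity first makes the number even, so the multiplication by the inverse of 2 in each channel is an exact division.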

A Novel Approach for Calculating the Rank of a Number Resulting from Scaling by 2
In RNS, the rank $\rho_k(X) \in \mathbb{Z}_k = \{0, 1, \ldots, k - 1\}$ is a principal positional characteristic since all the non-modular operations, such as magnitude comparison, sign determination, overflow detection, general division, scaling, residue-to-binary conversion, and others, can be implemented on its basis. Because the rank $\rho_k(X)$ enables estimation of the integer value of the RNS number $X$, the development of efficient methods and algorithms for its calculation is of primary importance in building efficient variants of RNS arithmetic and, accordingly, high-performance modular computational structures.
Let us show that the rank form (9) of the number representation in residue arithmetic creates a basis for constructing relatively fast and sufficiently simple iterative algorithms for the implementation of division by constant S l = 2 l (l = 1, 2, . . . , Λ, Λ ≤ log 2 M k ). In this case, the following theorem is fundamental for solving the problem of modular scaling by powers of 2.
Theorem 2. Let in RNS with pairwise prime odd moduli $m_1, m_2, \ldots, m_k$ the arbitrary number $X = (\chi_1, \chi_2, \ldots, \chi_k)$ from the range $\mathbb{Z}_{M_k}$ having rank $\rho_k(X)$ be given. Then the rank of the integer $\tilde{X} = \lfloor X/2 \rfloor$ satisfies the equation … where $\rho_k(1)$ is the rank of the number 1, and $\bar{x}$ denotes the negation of the Boolean value $x$.
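Proof. As follows from the rank form (9), the number 1 in a given RNS has the following form:
$$1 = \sum_{i=1}^{k} M_{i,k}\, \mu_{i,k} - \rho_k(1)\, M_k. \tag{21}$$
Therefore, we can write
$$X - |X|_2 = \sum_{i=1}^{k} M_{i,k} \left( \chi_{i,k} - |X|_2\, \mu_{i,k} \right) - \left( \rho_k(X) - |X|_2\, \rho_k(1) \right) M_k. \tag{22}$$
Then, in accordance with Euclid's Division Lemma (1), from (22) we have
$$\chi_{i,k} - |X|_2\, \mu_{i,k} = \left\lfloor \frac{\chi_{i,k} - |X|_2\, \mu_{i,k}}{m_i} \right\rfloor m_i + \left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i}.$$
Thus,
$$X - |X|_2 = \sum_{i=1}^{k} M_{i,k} \left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i} - \left( \rho_k(X) - |X|_2\, \rho_k(1) - \sum_{i=1}^{k} \left\lfloor \frac{\chi_{i,k} - |X|_2\, \mu_{i,k}}{m_i} \right\rfloor \right) M_k. \tag{23}$$
Since for each least nonnegative residue $\chi \in \mathbb{Z}_m$ modulo an arbitrary odd modulus $m$ there is a unique formal quotient $|\chi/2|_m$, and $|\chi/2|_m = (\chi + m|\chi|_2)/2$ (see, for example, [1]), then
$$\tilde{\chi}_{i,k} = \left| \frac{\left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i}}{2} \right|_{m_i} = \frac{1}{2} \left( \left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i} + m_i \left| \left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i} \right|_2 \right).$$
Therefore,
$$\frac{1}{2} \left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i} = \tilde{\chi}_{i,k} - \frac{m_i}{2} \left| \left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i} \right|_2.$$
Taking this into account, from (23) we get
$$\tilde{X} = \frac{X - |X|_2}{2} = \sum_{i=1}^{k} M_{i,k}\, \tilde{\chi}_{i,k} - \frac{1}{2} \left( \rho_k(X) - |X|_2\, \rho_k(1) - \sum_{i=1}^{k} \left\lfloor \frac{\chi_{i,k} - |X|_2\, \mu_{i,k}}{m_i} \right\rfloor + \sum_{i=1}^{k} \left| \left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i} \right|_2 \right) M_k.$$
Hence, according to the rank form of number representation (9), we conclude that the following equation for the rank $\rho_k(\tilde{X})$ of the number $\tilde{X}$ is valid:
$$\rho_k(\tilde{X}) = \frac{1}{2} \left( \rho_k(X) - |X|_2\, \rho_k(1) - \sum_{i=1}^{k} \left\lfloor \frac{\chi_{i,k} - |X|_2\, \mu_{i,k}}{m_i} \right\rfloor + \sum_{i=1}^{k} \left| \left| \chi_{i,k} - |X|_2\, \mu_{i,k} \right|_{m_i} \right|_2 \right). \tag{24}$$
If the number $X$ is even, then $|X|_2 = 0$, so that $\lfloor \chi_{i,k}/m_i \rfloor = 0$ and $|\chi_{i,k}|_{m_i} = \chi_{i,k}$ ($i = 1, 2, \ldots, k$). Therefore, in this case, Equation (24) takes the form
$$\rho_k(\tilde{X}) = \frac{1}{2} \left( \rho_k(X) + \sum_{i=1}^{k} |\chi_{i,k}|_2 \right),$$
which corresponds to (18).
If the number $X$ is odd, then $|X|_2 = 1$, and it is easy to check that the terms $-\lfloor (\chi_{i,k} - \mu_{i,k})/m_i \rfloor$ and $\left| |\chi_{i,k} - \mu_{i,k}|_{m_i} \right|_2$ in (24) take values in $\{0, 1\}$ and are expressed through $\omega_i$ and $\varphi_i$, the two-valued quantities determined by (19) and (20), respectively. In this case, Equation (24) takes the form
$$\rho_k(\tilde{X}) = \frac{1}{2} \left( \rho_k(X) - \rho_k(1) - \sum_{i=1}^{k} \left\lfloor \frac{\chi_{i,k} - \mu_{i,k}}{m_i} \right\rfloor + \sum_{i=1}^{k} \left| \left| \chi_{i,k} - \mu_{i,k} \right|_{m_i} \right|_2 \right),$$
which also corresponds to (18). The theorem is proved.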
As follows from Theorem 2, the rank $\rho_k(\tilde{X})$ of the number $\tilde{X} = \lfloor X/2 \rfloor$ can be calculated rapidly and easily from the known value of the rank $\rho_k(X)$ of the initial number $X$. This circumstance makes it possible to optimize and significantly speed up the execution of the power-of-two scaling operation. In this case, it is not necessary at each iteration to calculate the rank of the intermediate scaling result from its residue code. The complete rank calculation is necessary only for the initial number $X$ at the preliminary stage of the scaling procedure.
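The rank update for one halving step can be sketched directly from the rank form (9). The sketch below is a derivation-based illustration, not a transcription of the theorem's formulas: the local names `omega` and `phi` denote, respectively, a borrow flag and a parity flag and are assumptions that need not coincide with the $\omega_i$ and $\varphi_i$ of (19) and (20). The moduli must be odd and pairwise coprime.

```python
from math import prod


def half_rank(normalized, rank, parity, moduli):
    """Normalized residues and rank of floor(X / 2).

    Inputs are the normalized residues chi_{i,k}, the rank and the parity of X.
    """
    M = prod(moduli)
    mus = [pow(M // m, -1, m) for m in moduli]      # normalized residues of 1
    rank_of_one = (sum((M // m) * mu for m, mu in zip(moduli, mus)) - 1) // M
    acc = rank - parity * rank_of_one
    new_normalized = []
    for n, mu, m in zip(normalized, mus, moduli):
        d = n - parity * mu
        omega = 1 if d < 0 else 0                   # borrow in channel i
        c = d + m * omega                           # |chi_{i,k} - parity * mu|_m
        phi = c & 1                                 # parity of that residue
        acc += omega + phi
        new_normalized.append((c + m * phi) // 2)   # formal quotient |c / 2|_m
    return new_normalized, acc // 2                 # acc is always even


# X = 17 in the RNS (3, 5, 7): normalized residues [1, 2, 3], rank 1, odd
print(half_rank([1, 2, 3], 1, 1, (3, 5, 7)))  # ([1, 3, 1], 1), i.e. X = 8
```

Every per-channel quantity depends only on one small residue and one bit, so it can be read from a one-input lookup table.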

A Novel Power-of-Two Modular Scaling Based on the Rank Positional Characteristic in Minimally Redundant RNS
Theorem 2 implies the following step-by-step algorithm for power-of-two scaling in minimally redundant RNS with primary pairwise prime odd moduli $m_1, m_2, \ldots, m_k$, extra modulus $m_0 = 2$, and scales of the form $S_l = 2^l$ ($l = 1, 2, \ldots, \Lambda$, $\Lambda = \lfloor \log_2 M_k \rfloor$).

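S.2. For the residue number $X^{(j)}$, the two-valued quantities $\omega_i^{(j)}$ and $\varphi_i^{(j)}$ ($i = 1, 2, \ldots, k$) are obtained by formulas similar to (19) and (20), namely: …
S.3. The residues … of the minimally redundant residue code and the rank $\rho_k(X^{(j+1)})$ of the number $X^{(j+1)} = \lfloor X^{(j)}/2 \rfloor$ are determined, respectively, according to the rules …
S.4. The redundant residue $\chi_0^{(j+1)}$ is calculated according to the equation following from the rank form (9):
$$\chi_0^{(j+1)} = \left| \sum_{i=1}^{k} \psi_i^{(j+1)} + \rho_k\!\left(X^{(j+1)}\right) \right|_2,$$
where $\psi_i^{(j+1)} = |\chi_{i,k}^{(j+1)}|_2$. In essence, it determines the parity of the number $X^{(j+1)}$. If $j = l - 1$, then the number $X^{(j+1)} = X^{(l)} = \lfloor X/2^l \rfloor$ is the required number, and the scaling process ends. Otherwise, the variable $j$ is incremented by one ($j = j + 1$), and the algorithm returns to step S.2.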
For its hardware implementation, the most important feature of the above recursive scaling algorithm is that the specified operations on steps S.2, S.3, and S.4 can be combined in time and carried out within one modular clock cycle. Due to this circumstance, after obtaining the rank ρ k (X), each iteration of RNS number scaling by 2, i.e., each shift of its integer value by one bit to the right, is performed in one modular clock cycle.
Since the calculating process of the rank ρ k (X) has a pipeline structure, with the appropriate organization of computations the described scaling procedure at low hardware costs provides a reasonably high speed.
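The overall flow of the iterative procedure can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm verbatim: the initial rank is obtained from the known integer as a stand-in for the rank computation of [33,34], the per-channel flags `omega` and `phi` are named for illustration only, and the function name `scale_pow2` is hypothetical. The primary moduli must be odd and pairwise coprime.

```python
from math import prod


def scale_pow2(x, l, moduli):
    """Return (chi0, residues, rank) of floor(x / 2**l), 0 <= x < M_k."""
    M = prod(moduli)
    Mi = [M // m for m in moduli]
    mu = [pow(a, -1, m) for a, m in zip(Mi, moduli)]
    rank_of_one = (sum(a * u for a, u in zip(Mi, mu)) - 1) // M

    # Preliminary stage: minimally redundant code and rank of the input.
    chi0 = x % 2
    chi = [x % m for m in moduli]
    norm = [(u * c) % m for u, c, m in zip(mu, chi, moduli)]
    rank = (sum(a * n for a, n in zip(Mi, norm)) - x) // M  # stand-in step

    for _ in range(l):
        acc = rank - chi0 * rank_of_one
        for i, m in enumerate(moduli):
            # residue of floor(X / 2): subtract the parity, multiply by 1/2
            chi[i] = ((chi[i] - chi0) * ((m + 1) // 2)) % m
            d = norm[i] - chi0 * mu[i]
            omega = int(d < 0)              # borrow flag
            c = d + m * omega
            phi = c & 1                     # parity flag
            acc += omega + phi
            norm[i] = (c + m * phi) // 2    # new normalized residue
        rank = acc // 2                     # rank of the halved number
        # parity of the halved number from the rank form reduced modulo 2
        chi0 = (sum(n & 1 for n in norm) + rank) % 2
    return chi0, tuple(chi), rank


print(scale_pow2(53, 3, (3, 5, 7)))  # (0, (0, 1, 6), 1): the code of 6
```

After the preliminary stage the loop never touches the integer value: each iteration uses only the current parity bit, the small residues, and the rank, which is what allows one iteration per modular clock cycle in a table-based hardware design.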
It follows from the above that all the necessary calculations within the scaling algorithm can be implemented using tabular computational structures.
For example, the calculation of the inexact rank $\hat{\rho}_k(X)$ of the initial number $X$ is reduced to a summation of the set of small residues $R_{1,k}(\chi_1), R_{2,k}(\chi_2), \ldots, R_{k,k}(\chi_k)$ modulo $m_k$. Simultaneously, we take into account the number of overflows that occur when performing these modular addition operations (see (12)-(14)). Therefore, we need $k$ one-input lookup tables to store the given set, while the bit length of the recorded residues is $\lceil \log_2 m_k \rceil$ ($i = 1, 2, \ldots, k$). At the same time, the estimation of the two-valued rank correction $\delta_k(X)$ (see (16)) requires the set $\psi_1, \psi_2, \ldots, \psi_k$ of least significant bits of the normalized residues $\chi_{i,k}$ ($i = 1, 2, \ldots, k$) of the number $X$ (see (6)).
Similarly, the sets of binary flags $\psi$ … (see (26)-(28)) enable us to obtain the integer $\Delta^{(j)}$ required for rank calculation in the corresponding iterations of the scaling procedure ($j = 0, 1, \ldots, l - 1$). All these binary sets can also be recorded in the appropriate lookup tables.
Thus, the content of the $i$th lookup table corresponding to the input residue $\chi_i^{(j)}$ … ($i = 1, 2, \ldots, k$), ($j = 0, 1, \ldots, l - 1$).
Below we present the proposed scaling method in the form of a pseudo-code algorithm.
Let us evaluate the computational complexity of the proposed iterative power-of-two scaling method. As follows from the above, Algorithm 1 requires a total of $T_{scal} = T_{rank} + T_{iter} \times l$ modular clock cycles. According to [33,34], in minimally redundant RNS, the time complexity of calculating the rank $\rho_k(X)$ of the initial number $X$ depends only on the number $k$ of primary RNS moduli and can be evaluated as $T_{rank} = \lceil \log_2 k \rceil$. At the same time, all calculations within each iteration, consisting of obtaining both the minimally redundant residue code $(\chi_0^{(j+1)}, \chi_1^{(j+1)}, \ldots, \chi_k^{(j+1)})$ and the rank $\rho_k(X^{(j+1)})$ of the number $X^{(j+1)} = \lfloor X^{(j)}/2 \rfloor$ ($j = 0, 1, \ldots, l - 1$), can be performed in one modular clock cycle by using the lookup table technique. Therefore, $T_{iter} = 1$. Hence, the time complexity of the algorithm is $T_{scal} = \lceil \log_2 k \rceil + l$ modular clock cycles.
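The cost estimates stated for the method can be tabulated with a small helper. The function name `scaling_cost` is hypothetical; the counts $k$ and $2k + 1$ for modular additions and lookup tables are those quoted in the abstract, and the rounding of $\log_2 k$ up to an integer number of clock cycles is an assumption about the intended formula.

```python
def scaling_cost(k, l):
    """Estimated cost of scaling by 2**l with k primary moduli."""
    rank_cycles = (k - 1).bit_length()      # equals ceil(log2(k)) for k >= 1
    return {
        "clock_cycles": rank_cycles + l,    # T_rank + T_iter * l, T_iter = 1
        "modular_additions": k,
        "lookup_tables": 2 * k + 1,
    }


print(scaling_cost(8, 10))
# {'clock_cycles': 13, 'modular_additions': 8, 'lookup_tables': 17}
```

For a fixed number of moduli the time grows linearly with the shift amount $l$, while the hardware cost does not depend on $l$ at all.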

S.3.2. We calculate the non-redundant residue code and the rank of the number $X^{(2)} = \lfloor X^{(1)}/2 \rfloor$: …

S.3.3. We calculate the non-redundant residue code and the rank of the number $X^{(3)} = \lfloor X^{(2)}/2 \rfloor$: …
Since $j = l - 1 = 2$, the scaling procedure ends, and the number $X^{(3)}$ is the desired solution.
To verify the obtained result, according to the rank form (9), we find … The result is correct.
The above example shows that the use of minimally redundant RNS largely optimizes and speeds up the power-of-two scaling procedure compared with the conventional non-redundant RNS. First of all, this is due to the extreme simplicity of calculating the inexact rank $\hat{\rho}_k(X)$ and estimating the two-valued characteristic $\delta_k(X)$ of the initial number $X$, as well as to the trivial operations for obtaining the rank $\rho_k(X^{(j)})$ ($j = 0, 1, \ldots, l - 1$) at each iteration of the scaling procedure (see Theorem 2). Therefore, the proposed minimally redundant residue representation takes priority over non-redundant analogs in optimizing and speeding up scaling and other non-modular procedures based on the CRT implementation using the rank characteristic.

Discussion
Let us now discuss the theoretical and practical aspects of the approach proposed in this paper.
As followed from (17), the power-of-two scaling algorithm based on the bisection method requires the parity detection of the number X (j) (j = 0, 1, . . . , l − 1) at each iteration. Therefore, fast calculating the residue concerning extra modulus m 0 = 2 is a significantly important task.
In conventional non-redundant RNS, the parity detection of the number $X^{(j)} = (\chi_1^{(j)}, \chi_2^{(j)}, \ldots, \chi_k^{(j)})$ is usually based on estimating the integer value of $X^{(j)}$ by means of specific positional characteristics. The generally accepted ones are the digits of the mixed-radix representation, the core function, the rank of a number, and the interval index [1][2][3]5,8].
In RNS arithmetic, the parity check of a number belongs to the class of complicated non-modular operations with high computational costs. Its computational complexity is comparable to that of the reverse conversion from the residue code into the mixed-radix representation or of the calculation of the rank of a number.
Generally, in non-redundant RNS, the implementation of a parallel parity check algorithm requires $O(k^2)$ modular addition operations [33,34], so it can become computationally expensive for large values of $k$. Thus, an efficient implementation of the power-of-two scaling algorithm based on the bisection method requires a faster and optimized RNS parity detection technique.
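To make the role of parity concrete, the following Python sketch (Python 3.8 or later) performs bisection-style scaling directly on residues. It assumes pairwise coprime odd moduli and assumes that one halving step has the form $\chi_i^{(j+1)} = |(\chi_i^{(j)} - |X^{(j)}|_2) \cdot 2^{-1}|_{m_i}$, which is our reading of a bisection step and not a quotation of (17). The parity is obtained here by a full CRT reconstruction, used only as a stand-in for the costly non-modular step discussed above. All function names and the sample moduli are hypothetical.

```python
from math import prod


def to_rns(x, moduli):
    """Residue code of x with respect to the given moduli."""
    return [x % m for m in moduli]


def crt(residues, moduli):
    """Reference CRT reconstruction (slow, non-modular; used only as an oracle)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


def halve(residues, parity, moduli):
    """One bisection step: residues of (X - parity) / 2, valid for odd moduli."""
    return [((r - parity) * pow(2, -1, m)) % m for r, m in zip(residues, moduli)]


def scale_pow2(x, l, moduli):
    """Residue code of floor(x / 2**l), obtained by l successive halvings."""
    residues = to_rns(x, moduli)
    for _ in range(l):
        # Stand-in parity oracle: this is the expensive step in non-redundant RNS.
        parity = crt(residues, moduli) % 2
        residues = halve(residues, parity, moduli)
    return residues


moduli = [3, 5, 7, 11]  # hypothetical odd, pairwise coprime moduli
print(crt(scale_pow2(1000, 3, moduli), moduli))  # prints 125
```

Each halving itself is purely modular; only the parity of the current number requires non-modular information, which is why the cost of parity detection dominates the conventional procedure.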
In this article, the proposed approach to power-of-two scaling is based on using the rank of a number as the main RNS positional characteristic. Therefore, in our case, obtaining the residue modulo $m_0 = 2$ is reduced to the calculation of the rank $\rho_k(X^{(j)})$ followed by the use of the rank form (9). Thus, determining the parity of a number has a computational complexity identical to that of the rank calculation in terms of the numbers of required modular addition operations $R_{\mathrm{MO}}$ and lookup tables $R_{\mathrm{LUT}}$.
At the same time, obtaining the residue code $(\chi_1^{(j+1)}, \chi_2^{(j+1)}, \ldots, \chi_k^{(j+1)})$ of the number $X^{(j+1)} = X^{(j)}/2$ needs $k$ additional lookup tables (see (17)). Therefore, the computational cost of the iterative procedure of scaling by $S_l = 2^l$ consists of $S_{\mathrm{MO}} = R_{\mathrm{MO}} \times l$ modular addition operations and $S_{\mathrm{LUT}} = R_{\mathrm{LUT}} + k$ lookup tables, whereas the time complexity is $T_{\mathrm{scal}} = T_{\mathrm{iter}} \times l$ modular clock cycles, where $T_{\mathrm{iter}}$ is the execution time of one iteration of the bisection method.
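The reduction of parity detection to rank calculation can be illustrated with a short Python sketch (Python 3.8 or later). It assumes that the rank form (9) has the standard CRT shape $X = \sum_{i=1}^{k} M_i\,|M_i^{-1}\chi_i|_{m_i} - \rho(X)\,M$ with $M = \prod m_i$ and $M_i = M/m_i$, and that all primary moduli are odd. Under these assumptions every $M_i$ and $M$ is odd, so $|X|_2$ equals the sum of the normalized residues plus the rank, taken modulo 2. The rank is computed here by direct integer division as a stand-in for the table-based method of [33,34]. The function names and symbols are ours.

```python
from math import prod


def to_rns(x, moduli):
    """Residue code of x with respect to the given moduli."""
    return [x % m for m in moduli]


def rank_and_parity(residues, moduli):
    """Rank and parity of X from its residues, assuming the standard rank form."""
    big_m = prod(moduli)
    weighted_sum = 0    # sum of M_i * |M_i^{-1} * chi_i| mod m_i
    normalized_sum = 0  # sum of the normalized residues alone
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        c = (r * pow(m_i, -1, m)) % m
        weighted_sum += m_i * c
        normalized_sum += c
    # Stand-in: the rank is read off by integer division, not by lookup tables.
    rank = weighted_sum // big_m
    # Valid only for odd moduli, where M_i and M are all odd.
    parity = (normalized_sum + rank) % 2
    return rank, parity


moduli = [3, 5, 7]  # hypothetical odd, pairwise coprime moduli
print(rank_and_parity(to_rns(52, moduli), moduli))
```

Once the rank is known, the parity costs only a few one-bit additions, which is consistent with the statement that parity determination has the same complexity as rank calculation.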
Thus, in conventional non-redundant RNS, the computational cost of the canonical power-of-two scaling procedure based on the bisection method (17) and the rank calculation method described in [34] is estimated by (32) and (33).
The main advantage of the proposed approach to power-of-two scaling over the existing ones consists in the use of minimally redundant RNS and the novel method for calculating the rank of the number resulting from division by two (see Theorem 2) in each iteration of the scaling algorithm. This enables a significant reduction of the computational complexity of the scaling algorithm.
As follows from [34], the corresponding computational cost of calculating the rank $\rho_k(X)$ of the initial number $X$ is $R_{\mathrm{MO}} = k$ and $R_{\mathrm{LUT}} = k$ in terms of required modular addition operations and lookup tables, respectively. Furthermore, the execution time of the rank calculation is $T_{\mathrm{rank}} = \log_2 k$ modular clock cycles (see Section 3).
It is important to note that all calculations at each iteration are implemented using the lookup table technique and the simplest combinational logic circuits.
As shown above, the minimally redundant residue code of the number $X^{(j+1)} = X^{(j)}/2$ $(j = 0, 1, \ldots, l-1)$ is obtained in only one modular clock cycle and requires $k + 1$ additional lookup tables. The first $k$ of these lookup tables are used for obtaining the residues $\chi_1^{(j+1)}, \chi_2^{(j+1)}, \ldots, \chi_k^{(j+1)}$, while the remaining one gives us the rank $\rho_k(X^{(j+1)})$ of the number $X^{(j+1)}$ (see (29) and (30)). So, at each iteration, no additional modular operations are required.
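The total numbers of required modular addition operations and lookup tables are estimated, respectively, as
$$S_{\mathrm{MO}} = R_{\mathrm{MO}} = k \qquad (34)$$
and
$$S_{\mathrm{LUT}} = R_{\mathrm{LUT}} + (k + 1) = 2k + 1. \qquad (35)$$
The time complexity of the novel power-of-two scaling algorithm is $T_{\mathrm{scal}} = T_{\mathrm{rank}} + l = \log_2 k + l$ modular clock cycles.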
Thus, the use of minimally redundant RNS and the novel approach to rank calculation at each iteration of power-of-two scaling (see Theorem 2) enables a significant decrease in the computational complexity. The corresponding reduction factors of the computational complexity, in terms of the required modular addition operations (see (32) and (34)) and lookup tables (see (33) and (35)), are presented below in Tables 1 and 2.
It should be noted that the use of the novel method for calculating the rank $\rho_k(X^{(j+1)})$ of the number $X^{(j+1)} = X^{(j)}/2$ $(j = 0, 1, \ldots, l-1)$ at each iteration of the scaling procedure (see Theorem 2) in non-redundant RNS gives us the following computational cost: $S_{\mathrm{LUT}} = R_{\mathrm{LUT}} + (k + 1) = \frac{1}{2}(k^2 + 3k)$.
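The cost estimates can be compared numerically with a small Python sketch. It assumes that a reduction factor is the ratio of the conventional cost ($R_{\mathrm{MO}} \times l$ modular additions and $R_{\mathrm{LUT}} + k$ lookup tables) to the proposed cost ($k$ modular additions and $2k + 1$ lookup tables). The conventional rank costs are passed as parameters because their closed forms from [34] are not reproduced in this section. The function names and the sample values are purely illustrative and are not taken from Tables 1 and 2.

```python
from fractions import Fraction


def conventional_cost(r_mo, r_lut, k, l):
    """Conventional bisection scaling: (modular additions, lookup tables)."""
    return r_mo * l, r_lut + k


def proposed_cost(k):
    """Proposed method: k modular additions and 2k + 1 lookup tables."""
    return k, 2 * k + 1


def reduction_factors(r_mo, r_lut, k, l):
    """Assumed definition: conventional cost divided by proposed cost."""
    conv_mo, conv_lut = conventional_cost(r_mo, r_lut, k, l)
    prop_mo, prop_lut = proposed_cost(k)
    return Fraction(conv_mo, prop_mo), Fraction(conv_lut, prop_lut)


# Hypothetical conventional rank costs r_mo = 10 and r_lut = 9, for illustration only.
c_mo, c_lut = reduction_factors(r_mo=10, r_lut=9, k=4, l=3)
print(c_mo, c_lut)  # prints 15/2 13/9
```

With this definition the factor for modular additions grows linearly with $l$, whereas the factor for lookup tables does not depend on $l$.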
Simultaneously, the time complexity is T scal = log 2 k + l + 1 modular clock cycles. As can be seen, the reduction factors of the computational complexity of power-of-two scaling based on Theorem 2 in minimally redundant RNS compared with conventional non-redundant RNS are represented by the following fractions In this case, as follows from (40), the reduction factor C MO (k) = C MO (k, 1) does not depend on the value S l = 2 l (l = 0, 1, . . . , Λ − 1). At the same time, C LUT (k) ≈ C LUT (k).
The dependence of the reduction factors $C_{\mathrm{MO}}(k)$ and $C_{\mathrm{LUT}}(k)$ on the number of primary RNS moduli $k$ is presented in Table 3. Thus, the use of minimally redundant RNS and the novel approach to calculating the rank of a number at each iteration of the bisection method radically simplifies power-of-two scaling compared with conventional non-redundant RNS. This makes it possible to construct faster and more cost-efficient RNS-oriented computing procedures that make wide use of scaling algorithms.

Conclusions
As shown in this paper, the use of minimum-redundancy residue code enables the construction of efficient scaling procedures based on the CRT due to optimizing the calculation of the rank of a number, a principal positional characteristic in RNS arithmetic.
At the initial stage of the power-of-two scaling procedure, to calculate the rank of the initial number, we apply the approach to rank calculation proposed by one of the authors in [33,34]. It reduces to the summation of the small word-length residues $R_{1,k}(\chi_1), R_{2,k}(\chi_2), \ldots, R_{k,k}(\chi_k)$, taking into account the number of overflows that occur during the modular addition operations modulo $m_k$, and the fast calculation of the two-valued rank correction $\delta_k(X) \in \{0, 1\}$ (see (12) and (16)).
We propose a novel approach to power-of-two scaling based on Theorem 2. Using minimal residue code redundancy, we have optimized and sped up the rank calculation and parity determination of the numbers that result from division by two at each iteration of the bisection method. Each iteration of modular scaling by two is performed in only one modular clock cycle. Thus, owing to the proposed improvements, the power-of-two scaling procedure becomes simpler and faster than the currently known methods.
The computational complexity of the proposed method of scaling by the constant $S_l = 2^l$, in terms of the required modular addition operations and lookup tables, is estimated as $k$ and $2k + 1$, respectively, where $k$ is the number of primary non-redundant RNS moduli. The time complexity is $\log_2 k + l$ modular clock cycles.
The use of minimally redundant RNS and a novel approach to calculating the rank of a number at each iteration of the bisection method enables a significant decrease in the computational complexity of power-of-two scaling. The corresponding reduction factors for the required modular addition operations and lookup tables are given in Tables 1-3.
The proposed approach to power-of-two scaling is in line with the development trends of modern high-performance computing based on RNS arithmetic. It supports the implementation of an extensive class of tasks in various areas of science and technology, primarily in cryptography and digital signal processing.