An Efficient Parallel Reverse Conversion of Residue Code to Mixed-Radix Representation Based on the Chinese Remainder Theorem

In this paper, we deal with the critical problems in residue arithmetic. The reverse conversion from a Residue Number System (RNS) to positional notation is a main non-modular operation, and it constitutes a basis of other non-modular procedures used to implement various computational algorithms. We present a novel approach to the parallel reverse conversion from the residue code into a weighted number representation in the Mixed-Radix System (MRS). In our proposed method, the calculation of mixed-radix digits reduces to a parallel summation of the small word-length residues in the independent modular channels corresponding to the primary RNS moduli. The computational complexity of the developed method concerning both required modular addition operations and one-input lookup tables is estimated as $O(k^2/2)$, where k equals the number of used moduli. The time complexity is $O(\lceil \log_2 k \rceil)$ modular clock cycles. In pipeline mode, the throughput rate of the proposed algorithm is one reverse conversion in one modular clock cycle.


Introduction
Along with the improvement of computer technology, the development and implementation of new effective approaches to the organization and realization of computational tasks are some of the main ways to increase the data processing speed. At present, high-performance computing is developing extremely rapidly. These reasons lead to qualitatively new requirements imposed on number-theoretic methods and computational algorithms. Practically all well-known approaches to high-performance computing use certain parallel forms of data representation and processing. In recent decades, special consideration has been given to the so-called modular computational structures. Their arithmetic foundation is the Residue Number System (RNS), whose ideological roots go back to the classic topics of number theory and abstract algebra. The RNS is a non-positional number system with inherent parallelism and occupies a place of particular importance due to its carry-free properties, which provide a high potential for accelerating arithmetic operations.
The main advantage of RNS is its unique ability to decompose large word-length numbers into a set of smaller word-length residues, which are processed in parallel in the independent modular channels. The inherent parallelism of RNS avoids the carry propagation that arises in addition, subtraction, and multiplication, which is usually time-consuming in a weighted number system (WNS). In this regard, the modularity and carry-free properties make computation fast and efficient. Therefore, the RNS presents one of the most efficient means for increasing data processing speed.
Due to its carry-free property, the residue arithmetic is exceptionally suitable for a broad class of applications in which addition and multiplication are the dominant arithmetic operations. In any case, it has excellent potential for many substantial applications in such areas as digital signal processing, cryptography, distributed information and communication systems, information security systems, fault tolerance, cloud computing, and others. Moreover, these RNS applications may be effectively embedded in processor platforms functioning according to the conventional information-processing approach [2,5,8]. For the reasons mentioned above, residue arithmetic represents an efficient mathematical tool for the high-speed implementation of various computational tasks.
The reverse conversion and base extension are the most critical topics in residue arithmetic. In contrast to the conventional WNS, these operations, along with other central non-modular procedures such as magnitude comparison, sign determination, overflow detection, general division, and scaling, are relatively hard to implement. They are time-consuming and costly because their structure is more complicated than that of modular operations.
As is known, to perform non-modular operations, it is necessary to carry out the binary reconstruction of the integer by its residue code, which in general is hampered by the non-weighted nature of the RNS. This circumstance negates to a substantial extent the main advantages of residue arithmetic.
Therefore, the development of novel approaches and methods for fast number reconstruction by its residue code has significant importance in high-performance computing based on parallel algorithmic structures of RNS, especially for the high-speed implementation of digital signal processing applications and public-key cryptosystems. That should enable the extensive use of residue arithmetic in many priority areas of science and technology.
In this paper, we present a novel approach to the parallel reverse conversion from the residue code into the mixed-radix representation. In the proposed method, the calculation of mixed-radix digits reduces to a parallel summation of the small word-length residues in the independent modular channels corresponding to the primary RNS moduli.
The paper is structured as follows. Sections 2 and 3 discuss the basic theoretical concepts of the research. Section 4 describes the mathematical background of the proposed reverse conversion method. Sections 5 and 6 present a numerical example and an analysis of the computational cost, respectively. Section 7 provides discussion, and Section 8 concludes the paper.

The Basic Concepts of the Residue Arithmetic
Abstract algebra and number theory form the theoretical basis of residue arithmetic [12,13].
An RNS is defined by an ordered set $\{m_1, m_2, \ldots, m_k\}$ of k pairwise relatively prime moduli, where each modulus $m_i \geq 2$ (i = 1, 2, . . . , k), and the greatest common divisor of $m_i$ and $m_j$ equals 1, i.e., $\gcd(m_i, m_j) = 1$ for $i \neq j$. For convenience, we assume that the default order of moduli is ascending, i.e., $m_1 < m_2 < \cdots < m_k$.
In the given RNS, it is possible to represent $M_k$ integer numbers, where $M_k$ is the product of all moduli, $M_k = \prod_{i=1}^{k} m_i$. Therefore, the set $Z_{M_k} = \{0, 1, \ldots, M_k - 1\}$ is usually used as an RNS dynamic range.
Every number $X \in Z_{M_k}$ has a unique representation in the form of a k-tuple of small integers $(\chi_1, \chi_2, \ldots, \chi_k)$, which is called a residue code, where $\chi_i$ is the least non-negative remainder of the division of X by $m_i$ (i = 1, 2, . . . , k). We can notationally write this relation as
$$X = (\chi_1, \chi_2, \ldots, \chi_k), \quad \chi_i = |X|_{m_i}.$$
The main advantage of the residue arithmetic over conventional binary arithmetic consists of carrying out addition, subtraction, and multiplication in parallel at the level of small word-length residues. The modular operations $\bullet \in \{+, -, \times\}$ on integers $A = (\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $B = (\beta_1, \beta_2, \ldots, \beta_k)$ are performed independently in each modular channel in compliance with the computational rule:
$$A \bullet B = \left( |\alpha_1 \bullet \beta_1|_{m_1}, |\alpha_2 \bullet \beta_2|_{m_2}, \ldots, |\alpha_k \bullet \beta_k|_{m_k} \right), \tag{1}$$
where $\alpha_i = |A|_{m_i}$ and $\beta_i = |B|_{m_i}$, i = 1, 2, . . . , k.
In other words, the arithmetic operations on long-word operands are decomposed into modular channels with operands that are no larger than the corresponding modulus. Moreover, all the modular channels are entirely independent of each other. The carry-free nature of modular operations (1) is one of the most attractive features of residue arithmetic [1,3,8].
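The channel-wise rule for modular operations can be illustrated with a minimal Python sketch. The moduli set (3, 5, 7, 11), the operands, and the helper names `to_rns` and `rns_op` are illustrative assumptions and are not taken from the paper.

```python
import operator
from math import gcd, prod

def to_rns(x, moduli):
    """Residue code of x: the tuple (x mod m_1, ..., x mod m_k)."""
    return tuple(x % m for m in moduli)

def rns_op(a, b, moduli, op):
    """Apply op channel by channel; no data moves between channels."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, moduli))

# Illustrative pairwise relatively prime moduli (an assumption of this sketch).
moduli = (3, 5, 7, 11)
assert all(gcd(p, q) == 1 for i, p in enumerate(moduli) for q in moduli[i + 1:])
M = prod(moduli)  # size of the dynamic range {0, ..., M - 1}

A, B = 17, 23
a, b = to_rns(A, moduli), to_rns(B, moduli)

# Each result agrees with the residue code of the integer result reduced modulo M.
assert rns_op(a, b, moduli, operator.add) == to_rns((A + B) % M, moduli)
assert rns_op(a, b, moduli, operator.sub) == to_rns((A - B) % M, moduli)
assert rns_op(a, b, moduli, operator.mul) == to_rns((A * B) % M, moduli)
```

Note that the subtraction result is the residue code of (A − B) mod M, which is why representing signed values or detecting overflow requires a non-modular operation.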
Therefore, compared with the conventional WNS, the RNS simplifies and speeds up the addition and multiplication operations. This fundamental advantage of the residue arithmetic strongly appears in the case of implementing computational procedures, which mainly contain long segments consisting of only sequences of modular arithmetic operations. In this case, the primary moduli set is chosen so that the final results of the computational procedure always belong to the used dynamic range for any allowed values of input operands. At the same time, the intermediate results can even exceed the boundaries of the dynamic range.
Along with the carry-free modular operations, there are also the so-called non-modular operations such as residue-to-binary conversion, base extension, magnitude comparison, sign determination, overflow detection, general division, scaling, etc. These operations are complicated and quite time consuming, and their significant computational complexity limits the applications of the residue arithmetic and restricts its widespread usage for high-speed computing.
To perform the non-modular operations, it is required to consider all residues in the k-tuple (χ 1 , χ 2 , . . . , χ k ). Furthermore, it is necessary to determine the integer value of the number by its residue code, which in general is hampered by the non-positional nature of the RNS. The crucial problem of efficient implementation of non-modular operations is constantly receiving considerable attention by modern researchers [2,5,8].
The applicability of residue arithmetic is mainly determined by the computational complexity and feasibility of non-modular operations, which are used as a basis for implementing more complex computational algorithms in RNS. At the same time, the fundamental problem of residue arithmetic, which unfortunately has not yet been completely resolved, consists of reducing the computational complexity of non-modular operations. Due to a lack of efficient methods and algorithms for implementing non-modular operations, residue arithmetic is mainly suitable when modular additions and multiplications make up the bulk of the required computations. In this case, the number of used non-modular operations is relatively small. This circumstance restricts the widespread use of the RNS to a narrow class of specific tasks.

Reverse Conversion of the Residue Code to Conventional Representation
The root problem of residue arithmetic is that the weighted value of the integer X depends on all the residues χ 1 , χ 2 , . . . , χ k . The reconstruction of an integer by its residue code, i.e., the reverse conversion, is one of the most difficult non-modular operations in residue arithmetic. Moreover, this operation underlies all the other non-modular procedures.
Despite the extensive current studies on residue arithmetic and its applications, there is a need to develop novel efficient approaches and methods for reconstructing an integer number by its residue code. This should enable the extensive use of residue arithmetic for high-speed computing in many priority fields, first of all, in various digital signal processing and cryptographic applications.
There are two principal techniques of reverse conversion: the canonical method based on the Chinese Remainder Theorem (CRT) and the conversion of the residue code to a weighted representation in the Mixed-Radix System (MRS) [1,2,5,8,14–18]. In general, all other conversion methods represent different variants of these two methods.
Below, we describe the mathematical background of these methods.

CRT-Base Conversion Method
When the moduli $m_1, m_2, \ldots, m_k$ are pairwise relatively prime, the integer number X and its residue code $(\chi_1, \chi_2, \ldots, \chi_k)$ are related by the equation:
$$X = \left| \sum_{i=1}^{k} M_{i,k}\, \chi_{i,k} \right|_{M_k}, \tag{2}$$
where $M_{i,k} = M_k / m_i$, $\chi_{i,k} = \left| M_{i,k}^{-1} \chi_i \right|_{m_i}$, and $M_{i,k}^{-1}$ denotes the multiplicative inverse of $M_{i,k}$ modulo $m_i$ (i = 1, 2, . . . , k). In essence, Equation (2) represents the CRT [10,19,20]. In the last decades, considerable efforts have been directed to reducing the complexity of the CRT implementation and to enabling its application in high-speed computing [2,5,8,21–23]. The main idea of these methods is to replace the inner multiplications and additions modulo $M_k$ with simpler operations (see (2)).
Consider the CRT-number
$$X_k = \sum_{i=1}^{k} M_{i,k}\, \chi_{i,k}. \tag{3}$$
As follows from (2), the difference $X_k - X$ is a multiple of $M_k$. Therefore, the following exact integer equality holds:
$$X = X_k - M_k\, \rho_k(X). \tag{4}$$
The unique integer number $\rho_k(X)$ is a normalized rank (or, briefly, rank) of the number X [3,4,7].
Equation (4) is called a rank form of the integer X. In essence, the rank $\rho_k(X)$ is a reconstruction coefficient that indicates how many times the dynamic range $M_k$ is exceeded when converting the residue code $(\chi_1, \chi_2, \ldots, \chi_k)$ to the integer X.
In contrast to (2), Equation (4) does not contain a very time-consuming reduction modulo M k . Therefore, when we have the efficient method for the rank ρ k (X) computation, the reverse conversion algorithm constructed on the basis of (4) has a substantial lead over the canonical CRT implementation (2).
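The relation between the CRT-number and the rank can be sketched in Python as follows. The sketch assumes the usual CRT form with weights $M_{i,k} = M_k/m_i$ and normalized residues $\chi_{i,k} = |M_{i,k}^{-1}\chi_i|_{m_i}$; the function names and the moduli are illustrative assumptions. Here the rank is obtained by a plain integer division, which is exactly the costly step that an efficient rank computation is meant to avoid.

```python
from math import prod

def crt_number(residues, moduli):
    """Unreduced CRT sum: sum over i of M_{i,k} * chi_{i,k}."""
    M = prod(moduli)
    total = 0
    for chi, m in zip(residues, moduli):
        M_i = M // m
        # chi_{i,k} = |M_{i,k}^{-1} * chi_i| mod m_i
        # (pow with exponent -1 needs Python 3.8 or later)
        chi_ik = (pow(M_i, -1, m) * chi) % m
        total += M_i * chi_ik
    return total

def crt_with_rank(residues, moduli):
    """Return (X, rank) such that X = crt_number - rank * M_k."""
    rank, x = divmod(crt_number(residues, moduli), prod(moduli))
    return x, rank

moduli = (3, 5, 7)          # illustrative moduli
residues = (1, 2, 3)        # residue code of 52
x, rank = crt_with_rank(residues, moduli)
assert x == 52
assert crt_number(residues, moduli) == x + rank * prod(moduli)
```

Since every summand is smaller than $M_k$, the rank in this sketch never exceeds k − 1.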

MRS-Base Conversion Method
In the MRS defined by a set $\{m_1, m_2, \ldots, m_k\}$ of pairwise relatively prime moduli, the integer $X \in Z_{M_k}$ is represented by the k-tuple $(x_k, x_{k-1}, \ldots, x_1)$ of mixed-radix digits, resulting in
$$X = \sum_{l=1}^{k} M_{l-1}\, x_l = x_1 + m_1 x_2 + m_1 m_2 x_3 + \cdots + M_{k-1} x_k, \tag{5}$$
where $M_0 = 1$, $M_l = \prod_{i=1}^{l} m_i$, and $x_l \in Z_{m_l}$ (l = 1, 2, . . . , k). It is well known that the MRS surpasses the RNS when performing non-modular operations such as magnitude comparison, sign determination, and overflow detection. Therefore, the mixed-radix representation is the most widely applied tool for the implementation of non-modular procedures, along with the other generally accepted integral characteristics of the residue code such as the rank of a number, core function, interval index, parity function, diagonal, and quotient functions [3,4,7,24–33].
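The mixed-radix digits are computed one after another, starting from $x_1$. This sequential calculation procedure, called a chained algorithm, can be written in the general form
$$x_l = |X_l|_{m_l}, \quad l = 1, 2, \ldots, k, \tag{6}$$
where
$$X_1 = X, \quad X_{l+1} = \frac{X_l - x_l}{m_l}, \quad l = 1, 2, \ldots, k - 1. \tag{7}$$
In residue form, the exact division by $m_l$ in (7) is carried out in each remaining channel i > l as a multiplication by the multiplicative inverse of $m_l$ modulo $m_i$. From (6) and (7), it follows that the considered computational process requires two modular operations: subtraction and multiplication by the multiplicative inverse. Thus, the most crucial advantage of this algorithm is its high modularity. However, its strictly sequential nature prevents its general use for the construction of high-performance parallel computing procedures.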
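The chained procedure can be sketched in Python under the standard textbook formulation (subtract the current digit, then multiply by the inverse of the current modulus in every remaining channel). The function names and the moduli are illustrative assumptions; the paper's exact notation may differ.

```python
def mixed_radix_chained(residues, moduli):
    """Sequential RNS-to-MRS conversion; returns digits x_1, ..., x_k
    (least significant first)."""
    r = list(residues)
    k = len(moduli)
    digits = []
    for l in range(k):
        x_l = r[l]                      # current digit is the residue in channel l
        digits.append(x_l)
        for i in range(l + 1, k):       # update the remaining channels
            inv = pow(moduli[l], -1, moduli[i])   # inverse of m_l modulo m_i
            r[i] = ((r[i] - x_l) * inv) % moduli[i]
    return digits

def from_mixed_radix(digits, moduli):
    """Evaluate x_1 + m_1*x_2 + m_1*m_2*x_3 + ..."""
    value, weight = 0, 1
    for d, m in zip(digits, moduli):
        value += weight * d
        weight *= m
    return value

moduli = (3, 5, 7)                       # illustrative moduli
digits = mixed_radix_chained((1, 2, 3), moduli)   # residue code of 52
assert from_mixed_radix(digits, moduli) == 52
```

The outer loop cannot start step l + 1 before step l has finished, which is the sequential dependency discussed above.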

A Novel CRT-Base RNS-to-MRS Reverse Conversion Method
Now, we describe a proposed new method for calculating mixed-radix digits x 1 , x 2 , . . . , x k of the number X by its residue code (χ 1 , χ 2 , . . . , χ k ).
Consider the CRT-number X k . According to (3), we have By Euclid's Division Lemma, the integer m k χ i,k can be written as where $\lfloor x \rfloor$ denotes the largest integer less than or equal to x. Substituting (9) into (8), we obtain where Taking into account (9), we have we can reduce the right side of equality modulo m k . Hence, the residue R i,k (χ i ) can be calculated as At the same time, from (13) it follows that Similarly, taking into account Equations (10)-(13), the numbers X i (i = k − 1, k − 2, . . . , 1) can be written by turns as . . .
where $M_0 = 1$, $S_1(X) = \chi_1$, and the integers $S_l(X)$ (l = 2, 3, . . . , k) are calculated according to (12)-(15) with the index k replaced by l. Finally, substituting the above equations for $X_l$ (l = k − 1, k − 2, . . . , 1) by turns into (10), we obtain
$$X_k = \sum_{l=1}^{k} M_{l-1}\, S_l(X). \tag{16}$$
At the same time, according to Euclid's Division Lemma, we have
$$S_l(X) = R_l(X) + m_l\, Q_l(X), \tag{17}$$
where $R_l(X) = |S_l(X)|_{m_l}$ and $Q_l(X) = \lfloor S_l(X)/m_l \rfloor$ are the remainder and quotient of the division of $S_l(X)$ by the modulus $m_l$, respectively. Therefore, taking into account (12) with the index k replaced by l, the integers $R_l(X)$ and $Q_l(X)$ can be computed as
$$R_l(X) = \left| \sum_{i=1}^{l} R_{i,l}(\chi_i) \right|_{m_l}, \tag{18}$$
$$Q_l(X) = \left\lfloor \frac{1}{m_l} \sum_{i=1}^{l} R_{i,l}(\chi_i) \right\rfloor. \tag{19}$$
From (19), it follows that $Q_l(X)$ equals the number of overflows that occur when calculating the sum $R_l(X)$ of the residues $R_{1,l}(\chi_1), R_{2,l}(\chi_2), \ldots, R_{l,l}(\chi_l)$ modulo $m_l$ (l = 2, 3, . . . , k).
Substituting (17) into (16), we obtain where Let us draw attention to Equations (21) and (22). It is evident that the number X . . , k (see Equation (5)). At the same time, x Bearing in mind that Q 1 (X) = 0, the number X (Q) k−1 can be written as where Q 1 (X) = 0, Q 2 (X) = Q 1 (X) = 0, and Q l (X) = Q l−1 (X) for l ≥ 3. Therefore, taking into account (19), the integer Q l (X) can be calculated as Thus, the integer X (Q) k−1 (see Equations (23) and (5)) can be represented by a k-tuple x Consequently, that entails the fulfillment of the condition Z l−1 ⊂ Z m l , which leads to inequality Thus, when the moduli set {m 1 , m 2 , . . . , m k } meets the conditions (25), we have that X Note that the integer X = 0 (see Equation (5)). Now, let us return to Equation (20). According to Euclid's Division Lemma, the sum of two mixed-radix numbers X Hence, substituting (26) into (20), we obtain Taking into account the rank form of the number X (4), from (27) we have From (28), it follows that the mixed-radix representation of the number X, i.e., k-tuple (x k , x k−1 , . . . , x 1 ), can be calculated as a result of the addition of two mixed-radix numbers X (see (21) and (23)) in the basis {m 1 , m 2 , . . . , m k }. Note that x are calculated as the sum of the residues R 1,l (χ 1 ), R 2,l (χ 2 ), . . . , R l,l (χ l ) modulo m l along with the counting of occurred overflows according to (18) and (24) (l = 2, 3, . . . , k).
Furthermore, in the MRS with the bases $m_1, m_2, \ldots, m_k$, we calculate the sum of two numbers $X^{(R)}_k$ and $X^{(Q)}_{k-1}$. As a result, we obtain the mixed-radix representation $(x_k, x_{k-1}, \ldots, x_1)$ of the number X. Table 1 given below presents the pre-calculation components (see Equations (31) and (32)). It should be recalled that $R_{1,1}(\chi_1) = \chi_1$. The abbreviation LUT means lookup table. The bit-length of residues is $b_l = \lceil \log_2 m_l \rceil$ (l = 1, 2, . . . , k). Here, and further, $\lceil x \rceil$ denotes the smallest integer greater than or equal to x.
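The overall flow can be sketched in Python under one reading of the derivation above. The sketch assumes that the table entries are $R_{i,l}(\chi_i) = \lfloor m_l \chi_{i,l} / m_i \rfloor$ for i < l and $R_{l,l}(\chi_l) = \chi_{l,l}$, with $\chi_{i,l} = |M_{i,l}^{-1}\chi_i|_{m_i}$ and $M_{i,l} = M_l/m_i$; these formulas follow from Euclid's Division Lemma applied to $m_l \chi_{i,l}$ but are a reconstruction, and all names (`build_luts`, `rns_to_mrs`, `direct_mrs`) and the moduli are hypothetical. The loops run sequentially here, whereas the method performs the channel sums in parallel.

```python
from math import prod

def build_luts(moduli):
    """luts[l][i][chi] holds the assumed entry R_{i,l}(chi), 0-based indices."""
    luts = []
    for l, m_l in enumerate(moduli):
        M_l = prod(moduli[: l + 1])
        channel = []
        for i in range(l + 1):
            m_i = moduli[i]
            inv = pow(M_l // m_i, -1, m_i)          # inverse of M_{i,l} modulo m_i
            table = []
            for chi in range(m_i):
                chi_il = (inv * chi) % m_i           # normalized residue chi_{i,l}
                table.append(chi_il if i == l else (m_l * chi_il) // m_i)
            channel.append(table)
        luts.append(channel)
    return luts

def rns_to_mrs(residues, moduli):
    """Return (digits x_1..x_k, rank) from per-channel sums and overflow counts."""
    luts = build_luts(moduli)
    k = len(moduli)
    R, Q = [], []
    for l in range(k):                               # independent modular channels
        s = sum(luts[l][i][residues[i]] for i in range(l + 1))
        q, r = divmod(s, moduli[l])                  # overflow count and residue sum
        Q.append(q)
        R.append(r)
    # Mixed-radix addition of (R_k, ..., R_1) and the overflow counts shifted
    # one position up, with inter-digit carries.
    digits, carry = [], 0
    for l in range(k):
        shifted_q = Q[l - 1] if l >= 1 else 0
        carry, d = divmod(R[l] + shifted_q + carry, moduli[l])
        digits.append(d)
    return digits, Q[k - 1] + carry                  # final carry-out contributes to the rank

def direct_mrs(x, moduli):
    """Reference conversion by repeated division (for checking only)."""
    digits = []
    for m in moduli:
        x, d = divmod(x, m)
        digits.append(d)
    return digits

moduli = (3, 5, 7, 11)                               # illustrative four-moduli set
for x in range(prod(moduli)):
    code = tuple(x % m for m in moduli)
    assert rns_to_mrs(code, moduli)[0] == direct_mrs(x, moduli)
```

The exhaustive check over the whole dynamic range only confirms that this reading is arithmetically consistent for the chosen moduli; it says nothing about hardware cost, where the shifted overflow counts must also fit as digits of the next modulus.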

Table 1. Columns: Input Residue; Number and Scope of LUTs; Output Residue Set.
Table 2 presents the results of calculations in the modular channels according to Equations (29) and (30). It should be recalled that in the first modular channel, corresponding to the modulus $m_1$, no calculations are carried out.

Table 2. Columns: Modular Channel; Input Data; Output Data.
The above allows us to formulate the following theorem.

A Numerical Example of the Proposed Conversion Method
The main idea of the proposed approach to reverse conversion is illustrated below by a simple numerical example. For convenience, we consider a four-moduli RNS.

The Computational Cost of the Reverse Conversion Method
As follows from the results mentioned above, the calculation of the mixed-radix digits $x_1, x_2, \ldots, x_k$ reduces to the independent and parallel summation of small residues $R_{1,l}(\chi_1), R_{2,l}(\chi_2), \ldots, R_{l,l}(\chi_l)$ modulo $m_l$ in the lth modular channel (l = 1, 2, . . . , k), taking into account the number of overflows occurring during the modular addition operations (see (29)-(32)).
Let us evaluate the time required to perform the parallel reverse conversion.
First, we consider the calculation of mixed-radix digits of the numbers $X^{(R)}_k$ and $X^{(Q)}_{k-1}$ (see (29) and (30)). As can be seen, there are no modular addition operations in the first modular channel corresponding to the modulus $m_1$. In the second channel, we have only one addition operation modulo $m_2$. Furthermore, two additions modulo $m_3$ are performed in the third channel, and so on. Thus, in the lth modular channel, we have l − 1 additions modulo $m_l$ (l = 2, 3, . . . , k). These calculations are easily parallelized and pipelined. Therefore, the required computation time for calculating these digits is $T_k = \lceil \log_2 k \rceil$ modular clock cycles. The subsequent addition of the two mixed-radix numbers in the basis $\{m_1, m_2, \ldots, m_k\}$ involves two additional modular clock cycles, taking into account the inter-digit carries. Therefore, the execution time of the reverse conversion equals $T_{conv} = T_k + 2$ modular clock cycles. Thus, the overall time is $t_{conv} = T_{conv}\, t_{mod}$, where $t_{mod}$ denotes the modular clock cycle time. At the same time, when pipelined, the throughput rate of the proposed conversion method is one conversion in one modular clock cycle.
Consider now the evaluation of the required computational cost. Due to the small word-length of residues in the k-tuple $(\chi_1, \chi_2, \ldots, \chi_k)$, the pre-computation and lookup table techniques are suitable for reverse conversion implementation. So, we can use one-input lookup tables depending on the residue word-length in each modular channel.
At the beginning stage of the reverse conversion, in the lth channel corresponding to the modulus $m_l$, the number of lookup tables required to store the residue set $R_{1,l}(\chi_1), R_{2,l}(\chi_2), \ldots, R_{l,l}(\chi_l)$ equals $N_{lut}(l) = l$. At the same time, the word length of recorded residues is $b_l = \lceil \log_2 m_l \rceil$ bits (l = 2, 3, . . . , k). In the first modular channel, $N_{lut}(1) = 0$ since $S_1(X) = \chi_1$.
Then, the overall number of one-input lookup tables in all modular channels is equal to
$$N_{lut} = \sum_{l=2}^{k} N_{lut}(l) = \sum_{l=2}^{k} l = \frac{k^2 + k - 2}{2}.$$
The summation of the residues $R_{1,l}(\chi_1), R_{2,l}(\chi_2), \ldots, R_{l,l}(\chi_l)$ modulo $m_l$ requires $N_{add}(l) = l - 1$ modular addition operations (l = 2, 3, . . . , k). At the same time, all independent calculations are realized in parallel in the corresponding modular channels.
Taking into account that the two least significant digits of $X^{(Q)}_{k-1}$ are zero, the addition of the mixed-radix numbers $X^{(R)}_k$ and $X^{(Q)}_{k-1}$ at the final stage of the reverse conversion requires 2(k − 2) modular addition operations.
Hence, the overall number of modular addition operations in all modular channels is equal to
$$N_{add} = \sum_{l=2}^{k} (l - 1) + 2(k - 2) = \frac{k^2 + 3k - 8}{2}.$$
When pipelined, the throughput rate of the proposed method is one reverse conversion in one modular clock cycle.
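The counts above can be tallied with a small Python sketch. It uses only the per-channel figures stated in the text (l lookup tables and l − 1 additions in channel l for l ≥ 2, plus 2(k − 2) additions at the final stage) and the execution time of $\lceil \log_2 k \rceil + 2$ modular clock cycles; the function name `conversion_cost` is a hypothetical helper, and k ≥ 2 is assumed.

```python
def conversion_cost(k):
    """Return (N_lut, N_add, T_conv) for k >= 2 moduli."""
    n_lut = sum(l for l in range(2, k + 1))                 # l tables in channel l
    n_add = sum(l - 1 for l in range(2, k + 1)) + 2 * (k - 2)
    t_conv = (k - 1).bit_length() + 2                       # ceil(log2 k) + 2 cycles
    return n_lut, n_add, t_conv

for k in (4, 8, 16, 32):
    n_lut, n_add, t_conv = conversion_cost(k)
    # Closed forms quoted for the method.
    assert n_lut == (k * k + k - 2) // 2
    assert n_add == (k * k + 3 * k - 8) // 2
    print(k, n_lut, n_add, t_conv)
```

The integer expression `(k - 1).bit_length()` is used instead of a floating-point logarithm so that the ceiling is exact for every k.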

Discussion
As follows from [1], the calculation of the mixed-radix digits $x_1, x_2, \ldots, x_k$ (see (6) and (7)) requires k − 1 addition and k − 1 multiplication operations; in this case, the overall conversion time is $k(k - 1)/2 \cdot (t_{add} + t_{mul})$, where $t_{add}$ and $t_{mul}$ denote the execution time of addition/subtraction and multiplication, respectively. The computational cost of the pipelined implementation of this algorithm is k(k − 1)/2 multiplication and k(k − 1)/2 addition operations, while the conversion time is $(k - 1)(t_{add} + t_{mul})$. The main drawback of this method is its strictly sequential nature.
The parallel conversion method described in [16] uses additional lookup tables. It requires k(k + 1)/2 lookup tables and k(k + 1)/2 adders. The conversion time is $t_{lut} + (k - 1)t_{add}$ due to the need to generate the inter-digit carries when performing addition operations. As noted in [34], the method proposed in [16] does not achieve the claimed depth of $O(\log_2 k)$ in terms of RNS processing elements. In this regard, an improved method was proposed by adding an extra k(k + 1)/2 multipliers to the hardware resources used in [16]. The implementation time is $t_{lut} + t_{mul} + (2\log_2 k + 1)t_{add}$. Hence, the time complexity of this conversion algorithm is $O(\log_2 k)$.
In [15], the mixed-radix conversion is realized by a cascaded scheme of lookup tables and adders. The computational cost for the sequential implementation is k(k − 2)/4 double-size lookup tables and k(k − 2)/4 adders, while the conversion time equals $(k/2) \cdot (t_{lut} + t_{add})$. When pipelined, the throughput rate is determined by the time $t_{lut} + t_{add}$. This method works well when the used moduli do not have a very large word-length, since the size of lookup tables increases significantly as the word-length grows.
The paper [17] presents a parallel reverse conversion method that uses the lookup table technique and requires no arithmetic or logical units. As reported, this algorithm is better than the ones presented in [15,16]. It is based on solving k(k − 1)/2 linear Diophantine equations and requires k(k − 1)/2 lookup tables of size $m_i \times m_j$, while the conversion time is $(k - 1)t_{lut}$. When pipelined, its effective conversion rate is one conversion per $t_{lut}$. So, this method is attractive for DSP implementation. However, it is not suitable for implementing cryptographic applications because of the enormous size of the required lookup tables, especially when processing large numbers.
In the paper [9], the reverse conversion method is based on modular reduction by a modified canonical CRT algorithm. This enables minimizing the bit-width of intermediate data processing. The lookup tables translate the $b_i$-bit input residues (i = 1, 2, . . . , k) into $b_{out}$-bit output integers, where $b_i = \lceil \log_2 m_i \rceil$, b out = 1 2 log 2 ∑ k i=1 b i , and k is the number of RNS moduli. As a result, the modular reduction of the modified k-tuple of $b_{out}$-bit integers is carried out over a ring of size $2^{b_{out}}$ such that only the $b_{out}$ least significant bits of the binary representation are maintained. In this case, all the $b_{out}$-bit outputs in the modified k-tuple are added together by an adder tree without regard to overflow, propagating the $b_{out}$ least significant bits to the output. The reverse conversion requires k lookup tables and k − 1 adders. The scope of used lookup tables is 2 b × 2 b out , $b \in \{b_1, b_2, \ldots, b_k\}$. The overall conversion time is $t_{lut} + \lceil \log_2 k \rceil\, t_{add}$.
Some reverse conversion methods use special moduli sets with a limited number of moduli, such as $m = 2^n + d$ ($d \in \{-1, 0, 1\}$) [2,8,35–40]. Their main drawback consists in the small number of selected moduli, typically from three to five. These moduli sets are suitable for efficient implementations of DSP algorithms but are not applicable to the processing of large numbers widely used in cryptography. For example, to represent 1024-bit word-length cryptographic numbers using four RNS moduli, each modular channel must have residues of 256-bit length, which is not suitable for high-performance computing.
Table 3 compares the results across multiple techniques of the reverse conversion. Here, we use the following abbreviations: LUT, lookup table; ADD, adder; MUL, multiplier. The bit length $b \in \{b_1, b_2, \ldots, b_k\}$, $b_l = \lceil \log_2 m_l \rceil$ (l = 1, 2, . . . , k).

Table 3. RNS-to-MRS reverse conversion methods. Columns: Method; Number and Scope of LUTs; ADD; MUL; Conversion Time.
As seen from above, the proposed parallel reverse conversion method has time complexity of the order $O(\lceil \log_2 k \rceil)$. In pipelined mode, it provides a high throughput rate of one reverse conversion in one modular clock cycle. At the same time, the computational complexity is of the order of $O(k^2/2)$ in terms of the number of both required arithmetic operations and one-input lookup tables.

Conclusions
In this paper, a novel approach to parallel reverse conversion of the residue code (χ 1 , χ 2 , . . . , χ k ) of the number X to mixed-radix representation (x k , x k−1 , . . . , x 1 ) is described.
The calculation of the mixed-radix digits $(x_k, x_{k-1}, \ldots, x_1)$ is reduced to a parallel summation of the small word-length residues $R_{1,l}(\chi_1), R_{2,l}(\chi_2), \ldots, R_{l,l}(\chi_l)$ modulo $m_l$ in the lth modular channel (l = 1, 2, . . . , k), taking into account the number of overflows occurring during the modular addition operations. These modular operations are performed fast and independently in each modular channel and are easily pipelined.
The computational cost of the proposed reverse conversion method is presented. In all modular channels, the total number of modular addition operations is equal to $N_{add} = (k^2 + 3k - 8)/2$. At the same time, the total number of required one-input lookup tables is $N_{lut} = (k^2 + k - 2)/2$.
The execution time of the reverse conversion equals $T_{conv} = \lceil \log_2 k \rceil + 2$ modular clock cycles. At the same time, when pipelined, the throughput rate of the proposed conversion method is one conversion in one modular clock cycle.
The proposed parallel reverse conversion method is in line with the development trends of modern high-performance computing using residue arithmetic. It can find widespread application in implementing a broad class of tasks in various areas of science and technology, first of all, in digital signal processing and cryptography.