Abstract
In this paper, a new simplified iterative division algorithm for modular numbers is developed; it is optimized on the basis of the Chinese remainder theorem (CRT) with fractions. It requires fewer computational resources than approaches based on the CRT with integers and on mixed radix number systems (MRNS). The main idea of the algorithm is (a) to transform the residue representation of the dividend and divisor into a weighted fixed-point code and (b) to find the highest power of 2 in the divisor written in a residue number system (RNS). This information is acquired using the CRT with fractions: the highest power is defined by the number of zeros standing before the first significant digit. All intermediate calculations of the algorithm involve only the operations of right shift and subtraction, which explains its good performance. Owing to these techniques, the algorithm is faster and consumes fewer computational resources, and is therefore more appropriate for the multidigit division of modular numbers than the algorithms described earlier. The new algorithm has $O(\log_2 Q)$ iterations, where $Q$ is the quotient. For multidigit numbers, its modular division complexity is $O(N)$, where $N$ denotes the number of bits in the fraction required to restore the number from its remainders. Since this $N$-bit number is written in a weighted system, the subtraction-based comparison runs very fast. Hence, this algorithm might be the best currently available.
1. Introduction
The development of an informational society poses new challenges connected with the transmission and processing of multidigit numbers. Important mathematical problems that require such calculations and also considerable computational resources, both in terms of theory and practice, arise in the applied and computational theory of numbers [,]. Most such problems involve integer calculations with numbers belonging to the large and super large computer ranges, while the results of calculations must be exact, without rounding.
Conventional computing devices have a limited word length (bit grid), which makes operations over multidigit numbers computationally expensive.
For calculations with multidigit numbers or with large-range numbers, the residue number systems (RNSs) have clear advantages over the radix number systems. Modern research is focused on the processing of multidigit data in which the values of integer variables considerably exceed the dynamic range of mass-produced computing devices (by a factor of $10^3$ to $10^6$ or even more); see Molahosseini et al. [].
Modular calculations are crucial for applications with multidigit numbers, e.g., cryptography, digital signature, and others [,].
For instance, the numbers used in cryptographic systems vary between  and  to guarantee the high-level security of protected information []. During modular processing these numbers are partitioned into small formats (several bits or several tens of bits), which appreciably speeds up their implementation.
Residue number systems attract many researchers as a base for computing devices, and the recent decade has been remarkable for the growing interest in these systems. This fact follows from numerous publications dedicated to RNS usage in digital signal processing, image processing, antinoise coding, cryptographic systems, quantum automata, neurocomputers, systems with the massive parallelism of operations, cloud computing, DNA computing, etc. [,,,,,,,,,,,,].
Residue number systems are remarkable for fast summation, subtraction, and multiplication, which explains major interest in these systems for the applications requiring large volumes of calculations. However, some operations (modulo reduction, comparison and division of numbers) are very complicated in RNSs [,]. The development of more efficient algorithms for comparison, sign detection, and division would open up new applications of RNSs [].
The existing division algorithms for RNSs [,,] can be classified as the ones based on number comparison and the ones based on number subtraction.
In the comparison-based algorithms [,,,,,,], the quotient is found using the iteration:
      
        
      
      
      
      
    
      where  and  denote the current and next dividends, respectively;  is the divisor; finally, gives the digit of the quotient. For obtaining , the quotient  has to be compared with .
In the second class of the algorithms [,,,], the quotient is calculated using the iterations . The quotient  is generated at each iteration from the complete RNS range, instead of being chosen from a small set.
The integer division algorithm [] is similar to standard binary division. Yet, the main drawback of this algorithm and its modifications is that each iteration requires a comparison of numbers.
The algorithm without such drawbacks that was suggested in Szabó and Tanaka [] replaces the real divisor with an approximate one (an RNS modulus or the product of several RNS moduli). The algorithm yields correct results under the condition , where  and  are the real and approximate divisors, respectively. Clearly, this condition may be violated for some sets of moduli (e.g., , where $p_1$ and $p_2$ are the RNS moduli).
Among the shortcomings of the above algorithm, note the need for mixed radix number systems (MRNSs), scaling operations, special logic, and tables for calculating the approximate divisor. A series of approaches were proposed for solving the division problem based on number comparison and sign detection methods, which can be classified in the following way: the algorithms [,,] employ conversions to MRNSs; the algorithms [,] formulate the problem in terms of even number detection (parity check); and the algorithms [] involve the base extension method for iterations. However, the proposed algorithms suffer from high execution time and considerable hardware cost due to the usage of MRNS, the Chinese remainder theorem (CRT), and other time-consuming operations.
Hiasat et al. [] and Hiasat [] introduced a high-speed division algorithm in which the MRNS and CRT in modular number division were replaced by the comparison of the high powers of the dividend and divisor. The execution time and hardware cost of this algorithm are smaller in comparison with the other algorithms; nevertheless, it contains redundant stages. In order to speed up the calculation of the current quotient, J.H. Yang [] suggested a division algorithm based on parity check that finds the quotient twice as fast as the algorithms [,]. But the calculations of the high powers of 2 still require much time in RNSs, and these operations are performed at each iteration. In Chung [] the original algorithm [] was simplified using division by 2 and an efficient quotient search within a probable range. At each round, this algorithm adopts a hard-to-implement parity check, which is its disadvantage.
Most of the listed algorithms contain hard-to-implement operations such as the CRT, scale operation, extension, sign detection, and comparison, which reduce their speed and cause considerable hardware cost of modular numbers division.
There exists a division algorithm in the RNS format that uses the basic set of RNS moduli in combination with an auxiliary system of moduli for storing the remainders of the dividend and divisor. The dividend and divisor in the RNS representation are converted into different RNS representations with different systems of moduli []. The usage of two RNS sets leads to higher redundancy, making it necessary to perform direct and inverse transitions from the basic set of moduli to the auxiliary one and back during division. This feature dramatically reduces the speed of calculations. Talahmeh et al. [] introduced a fast division algorithm based on index transformations over the Galois field $GF(p)$, which is easily implemented using table lookup. However, this algorithm guarantees efficient processing only for data of 6–10 bits and prime moduli. In the case of larger ranges, the algorithm has low performance because the generator of the prime $p$ must be very large to represent integers over the Galois field.
Most of the suggested iterative algorithms involve a large number of operations at each iteration. As was declared by the authors, the algorithm based on the CRT with fractions [,,] might appear to be the best because its time complexity is , where  denotes the number of RNS moduli and  the number of bits in each modulus under the assumption that the moduli are more or less the same. But this algorithm has the following disadvantages: (a) the divisor D is limited by , where P is the RNS representation range; (b) each iteration includes several operations such as summation, multiplication, comparison, and parity check; (c) at the end of the algorithm, the quotient has to be converted from the system  to the system , which adds extra computational load during modular number division.
In this paper, an alternative modular division algorithm with efficient quotient calculation using the relative values of the dividend and divisor in the fractional representation is introduced. Each iteration of the new algorithm involves the shift and subtraction of the successive intermediate results, which makes hardware implementation more efficient.
2. Approximate Positional Characteristic Calculation for Modular Numbers Based on the CRT with Fractions and Its Application to Modular Division
An RNS is a set of positive and pairwise relatively prime numbers , which are called moduli or bases. The dynamic range is given by . For an unsigned number , the additional RNS range has the form . For the signed numbers, the additional range is defined by . In this system, any integer  belonging to the range  can be uniquely described by an ordered set of remainders . Each remainder  has the modulo  representation
      
      
        
      
      
      
      
    
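Let the operation $\circ$ be arithmetic summation, subtraction, or multiplication. The most interesting property of an RNS is that, for two numbers $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$, these operations can be converted from the integer representation into independent modular operations on the moduli $p_i$, i.e.,
$$X \circ Y = \left( (x_1 \circ y_1) \bmod p_1, \; (x_2 \circ y_2) \bmod p_2, \; \ldots, \; (x_n \circ y_n) \bmod p_n \right). \qquad (2)$$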
Using model Equation (2), the dynamic range is decomposed into parts with a narrower data format so that within them all calculations are performed in parallel. As a result, the complexity of the arithmetic structures decreases accordingly.
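The channel-wise property of Equation (2) can be illustrated with a minimal Python sketch. The moduli (5, 7, 9, 11) are borrowed from the example of Section 4.2; the helper names `to_rns` and `rns_op` and the sample values 123 and 17 are illustrative assumptions, not part of the original algorithm.

```python
import operator

MODULI = (5, 7, 9, 11)  # pairwise coprime; dynamic range P = 3465

def to_rns(x, moduli=MODULI):
    """Residues of the integer x on each modulus."""
    return tuple(x % p for p in moduli)

def rns_op(op, xs, ys, moduli=MODULI):
    """Apply +, - or * independently in every residue channel."""
    return tuple(op(x, y) % p for x, y, p in zip(xs, ys, moduli))

a, b = 123, 17  # hypothetical operands; a * b stays inside the range
assert rns_op(operator.add, to_rns(a), to_rns(b)) == to_rns(a + b)
assert rns_op(operator.sub, to_rns(a), to_rns(b)) == to_rns(a - b)
assert rns_op(operator.mul, to_rns(a), to_rns(b)) == to_rns(a * b)
```

Each channel works with numbers no larger than its own modulus, which is why the channels can run in parallel on narrow data paths.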
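The number $X = (x_1, \ldots, x_n)$ in the RNS representation can be restored using the Chinese Remainder Theorem [,], i.e.,
$$X = \left| \sum_{i=1}^{n} P_i \left| P_i^{-1} \right|_{p_i} x_i \right|_P, \qquad (3)$$
where $P = \prod_{i=1}^{n} p_i$ is the dynamic range; $p_i$, $i = 1, \ldots, n$, denote the RNS moduli; $\left| P_i^{-1} \right|_{p_i}$ is the multiplicative inverse of $P_i$ modulo $p_i$; $P_i = P / p_i$; and $|\cdot|_P$ denotes reduction modulo $P$.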
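The restoration by the Chinese Remainder Theorem can be sketched in a few lines of Python. This is a minimal illustration of the standard integer CRT, not the hardware procedure of this paper; the function name `crt` is an assumption, and `pow(x, -1, m)` requires Python 3.8 or later.

```python
from math import prod

def crt(residues, moduli):
    """Restore X from its residues via the Chinese remainder theorem."""
    P = prod(moduli)                 # dynamic range
    total = 0
    for x_i, p_i in zip(residues, moduli):
        P_i = P // p_i
        inv = pow(P_i, -1, p_i)      # multiplicative inverse of P_i modulo p_i
        total += P_i * inv * x_i
    return total % P                 # final reduction modulo the large number P

assert crt((3, 4, 6, 2), (5, 7, 9, 11)) == 123
```

The final reduction modulo the large number P is the costly step that the CRT with fractions replaces by discarding an integer part.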
A well-known drawback of RNSs is the implementation complexity of non-modular operations (comparison, division), which rely on positional characteristics.
The analysis of positional characteristics shows that they can be calculated precisely or approximately. Therefore, the calculation methods of positional characteristics consist of two groups as follows:
- precise calculation methods;
- approximate calculation methods.
 
The precise calculation methods of positional characteristics were considered in [,]. In this paper, an approximate calculation method with appreciably smaller hardware cost and execution time for the operations over the positional codes of decreased digit capacity will be employed.
The approximate calculation method of positional characteristics uses the relative values of modular numbers with respect to the complete range defined by the Chinese Remainder Theorem []. This theorem associates with a positional number  its remainder representation , where , , are the least non-negative residues of this number on the RNS moduli .
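Assume that the number $X$ has the RNS representation with residues $(x_1, \ldots, x_n)$. Dividing the left- and right-hand sides of Equation (3) by the constant $P$ (the dynamic range) yields the approximate value below, where $P_i = P / p_i$ and $|\cdot|_1$ denotes the fractional part:
$$\left| \frac{X}{P} \right|_1 = \left| \sum_{i=1}^{n} \frac{\left| P_i^{-1} \right|_{p_i}}{p_i} \, x_i \right|_1 = \left| \sum_{i=1}^{n} k_i x_i \right|_1. \qquad (4)$$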
Here  are the constants of the chosen system; , , denote the digits of the number in the RNS representation on the moduli . Note that Equation (4) takes values within the interval . The final value is obtained by summation and integer part truncation, with the fractional part being retained. The fractional value  contains information about the value of the number and its sign. If , then the number  is positive, and  gives its value divided by . Otherwise,  is a negative number, and  gives its relative value. Denote by  the value of  rounded to  bits. The exact value of  satisfies the inequalities . The integer part of the number yielded by summing up  is neglected, i.e., discarded. 
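The computation described above can be sketched with exact rational arithmetic. This is an illustrative sketch only: the names `crt_constants` and `relative_value` are assumptions, the constants are kept as exact fractions rather than rounded to a fixed number of bits, and the moduli (5, 7, 9, 11) are taken from the example of Section 4.2.

```python
from fractions import Fraction
from math import prod

def crt_constants(moduli):
    """Constants of the chosen system: inverse of P_i modulo p_i, divided by p_i."""
    P = prod(moduli)
    return [Fraction(pow(P // p, -1, p), p) for p in moduli]

def relative_value(residues, moduli):
    """Sum the weighted residues and keep only the fractional part."""
    s = sum(k * x for k, x in zip(crt_constants(moduli), residues))
    return s % 1                     # the integer part is discarded

moduli = (5, 7, 9, 11)
# 123 has residues (3, 4, 6, 2); its relative value is exactly 123 / 3465.
assert relative_value((3, 4, 6, 2), moduli) == Fraction(123, 3465)
```

With exact constants the result equals the number divided by the dynamic range; rounding the constants to a finite number of bits turns this into the approximate value discussed in the text.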
The rounding of  causes inevitable errors. Introduce the notation . As was demonstrated in [],  bits after the decimal point have to be used for rounding  without considerable errors that would affect calculation accuracy. In other words, there exists a bijection between the set of numbers in the RNS representation and the set of numbers rounded to the th bit, i.e., .
Taking into account the function , the sign detection conditions can be written as follows:
      
        
      
      
      
      
    
Consider the approximate calculation method for comparing numbers in the RNS representation.
Example. Let , , , and  be the system of RNS moduli. Then , , , , , and .
The constants  for calculating the relative values are
      
      
        
      
      
      
      
    
The constants  rounded to 12 decimal places are
      
      
        
      
      
      
      
    
Compare the two numbers  and  in the RNS representation with the moduli , , , and . Note that, for numbers comparison, the critical case is the numbers differing by 1. The RNS representations of the numbers  and  are  and , respectively. Their difference is . Now, detect the sign of . First, find ; this value satisfies the first condition of model Equation (5), i.e., . Hence, the natural conclusion is that , meaning .
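The comparison of two RNS numbers through the sign of their difference can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the moduli (5, 7, 9, 11), the precision `N` (chosen generously so that rounding cannot change the sign), rounding the constants up, and the convention that a relative value in the upper half of the interval means a negative number.

```python
from math import prod

MODULI = (5, 7, 9, 11)
P = prod(MODULI)
RHO = sum(p - 1 for p in MODULI)
N = (P * RHO).bit_length() + 1       # assumed precision, not the paper's bound

# Constants scaled by 2**N and rounded up to integers (fixed-point code).
K = [-(-(pow(P // p, -1, p) << N) // p) for p in MODULI]

def frac_value(residues):
    """N-bit approximation of X / P; masking drops the integer part."""
    return sum(k * x for k, x in zip(K, residues)) & ((1 << N) - 1)

def less_than(a_res, b_res):
    """True if a < b, assuming 0 <= a, b <= (P - 1) / 2."""
    diff = tuple((x - y) % p for x, y, p in zip(a_res, b_res, MODULI))
    return frac_value(diff) >= (1 << (N - 1))   # upper half: negative difference

res = lambda x: tuple(x % p for p in MODULI)
assert less_than(res(122), res(123))            # critical case: numbers differ by 1
assert not less_than(res(123), res(122))
```

Only channel-wise subtraction, a weighted sum, and an inspection of the leading bit are needed; no conversion back to a positional integer takes place.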
3. New Division Algorithm Based on the CRT with Fractions
Consider two numbers: a dividend $a$ and a divisor $b$. Let both numbers be represented within the range , where  and , , denote the RNS moduli. For the sake of simplicity, assume that $a$ and $b$ are positive numbers. (The case of negative numbers can be considered by analogy using simple modifications.) The division algorithm calculates the quotient  and the remainder  so that , where . The detailed description of this algorithm is given in Appendix A.
Modular division includes two stages. At the first stage, the high power of  is obtained using the binary series approximation of the quotient; at the second stage, the approximation series is refined accordingly. The algorithm yields , where ;  denotes the set of powers  in the refined quotient approximation series; , , are the RNS moduli.
For modular division optimization, the classical CRT will be replaced by the CRT with fractions. In this case, following model Equation (4), the dividend, divisor, and remainder can be written as the fractions
      
      
        
      
      
      
      
    
Using the approximate values, consider the modular division algorithm based on the CRT with fractions.
The algorithm consists of two stages. The first stage is to find the high power  of the divisor by shifting  to the left by the number of zero digits standing before the first significant digit. At the second stage, the general quotient is generated by selecting the powers of 2 that form the partial quotients to be included in the general quotient approximation series. The analysis procedure starts from the highest power  and ends with the zeroth power of 2, thereby reading out the necessary powers of 2 in the RNS representation. The modular division algorithm based on the CRT with fractions works in the following way. The residue table for the  integer powers of 2 is put into memory, and the representations  and , , are obtained at the input. Then the quotient is calculated so that , where .
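The first stage amounts to counting the zero bits that precede the first significant bit of the divisor's fractional code. A minimal sketch follows, assuming the rounded fraction is stored as an unsigned integer of `n_bits` bits (an assumed fixed-point layout); the function name `highest_power` is hypothetical.

```python
def highest_power(frac_bits: int, n_bits: int) -> int:
    """Number of zero bits before the first significant bit of the fraction."""
    if not 0 < frac_bits < (1 << n_bits):
        raise ValueError("fraction must be a non-zero n_bits-wide value")
    return n_bits - frac_bits.bit_length()

# Hypothetical illustration: with P = 3465 and 18 fractional bits, the divisor 17
# has the code floor(17 / 3465 * 2**18) = 1286, i.e. seven leading zero bits.
assert highest_power(1286, 18) == 7
assert 17 * 2**7 < 3465 <= 17 * 2**8   # the shift normalises the divisor
```

In hardware the same count is produced by a shift register together with a counter, so no iterative arithmetic is required to obtain it.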
The quotient  is generated at each iteration from the powers of 2 in the RNS representation that are included or excluded depending on the sign of the subtraction chain , . The notations in this formula are the following:  as the highest power of the quotient;  as the highest digit of the quotient;  as current iteration; ; ;  and  as the fractions;  as the complete range; , , as the RNS moduli; ;  and  as the values of  and , respectively, rounded to the th bit (note that the resulting errors do not affect calculation accuracy);  as the current value and  as the successive value, which is defined by the one-position right shift of the divisor multiplied by the corresponding power of 2 (actually, this is equivalent to division by 2 and the subtraction ). Each th iteration is associated with the th binary digit in the RNS representation, which are put into memory as the residuals table for the integer powers of 2.
The well-known algorithm needs the dividend and divisor at each iteration. For the new algorithm, the dividend  and the divisor  are required only at the first iteration; all subsequent iterations involve the difference  because these values contain information about the dividend and divisor. With this method, all quotient approximation iterations are reduced to the subtraction , and the sign of this difference is used to find the desired partial quotient as the corresponding power of 2 in the RNS representation. As a result, the computational complexity of modular division considerably decreases.
The digits of the quotient are obtained by the  summation of the partial quotients using the sign of the subtraction result. If the sign is positive, the quotient is included (otherwise excluded). In contrast to the well-known algorithms, the new one allows for easy implementation: the time complexity of the iterations is defined by the execution time of shift, subtraction, and summation.
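The two stages can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the scheme of Appendix B: exact fractions are used instead of rounded fixed-point values, so rounding errors are deliberately left out; the moduli (5, 7, 9, 11) and the names `rel` and `rns_divide` are assumptions; both operands are taken to be positive.

```python
from fractions import Fraction
from math import prod

MODULI = (5, 7, 9, 11)
P = prod(MODULI)
K = [Fraction(pow(P // p, -1, p), p) for p in MODULI]  # CRT-with-fractions constants

def rel(residues):
    """Exact relative value X / P recovered from the residues."""
    return sum(k * x for k, x in zip(K, residues)) % 1

def rns_divide(a_res, b_res):
    """Return (quotient residues, remainder as a fraction of P) for a, b > 0."""
    delta, d = rel(a_res), rel(b_res)
    if d == 0:
        raise ZeroDivisionError("divisor is zero")
    j = 0
    while d * 2 < 1:             # stage 1: normalise the divisor, count the shifts
        d *= 2
        j += 1
    q = tuple(0 for _ in MODULI)
    for i in range(j, -1, -1):   # stage 2: subtract, then shift right by one
        if delta >= d:           # non-negative difference: include the power 2**i
            delta -= d
            q = tuple((r + pow(2, i, p)) % p for r, p in zip(q, MODULI))
        d /= 2
    return q, delta

res = lambda x: tuple(x % p for p in MODULI)
q, r = rns_divide(res(123), res(17))
assert q == res(7) and r * P == 4    # 123 = 7 * 17 + 4
```

The dividend and divisor enter only once; every later step works on the running difference and the shifted divisor, and the quotient is accumulated directly in the RNS from the residues of the powers of 2.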
Concerning the advantages of the new algorithm, note that the division procedure does not involve (a) intermediate numeric data in the MRNS representation and (b) difficult-to-implement RNS operations such as comparison, scaling, base extension, and sign detection. These features contribute to higher efficiency of modular division.
The new modular division algorithm for integer numbers  has the scheme presented in Appendix B. The hardware implementation of this algorithm is described in detail below.
4. Hardware Implementation of New Modular Division Algorithm
The new modular division algorithm for multidigit numbers includes the following basic parts: quotient sign detection, quotient approximation, and further refinement of the quotient approximation series.
Figure 1 illustrates the hardware implementation of the modular division algorithm, which consists of several units such as converters, summers (summators), multipliers and others. The issues of their optimization were studied in the papers [,,,].
      
    
Figure 1. Hardware implementation of the new algorithm.
  
Assume that an RNS contains  moduli and  modular processors. Let  be the number of bits required for the representation of each remainder. For making the hardware implementation complexity analysis of this algorithm simpler, consider the case in which the moduli are more or less the same. Under this hypothesis, the total length of the modular processor bus is  bits.
Let the dividend and divisor be arbitrary integers and also let the divisor be not reducible to a pairwise relatively prime number on the RNS moduli.
The hardware implementation scheme in Figure 1 includes buses  and  supplying the dividend and divisor, respectively, and quotient bus . Each of buses , and  has  bits. For division, the one-bit signal is supplied through the “Division” bus. Upon receipt of the inputs  and , the system calculates  so that , where .
At the initial state, the control unit (CU) receives the “Division” signal and then forms the following signals through the one-bit buses (see Figure 1):
- “Adj. 0,” for adjusting the zero states of the functional units;
- timing pulses (“TP”), for controlling the registers and counters;
- “Adj. MS,” for adjusting the address code of the 2:1 multiplexer (“MS”);
- “Adj. DMS,” for adjusting the address code of the 1:2 demultiplexer (“DMS”). Depending on the address code, the demultiplexer DMS switches the highest power of 2 either simultaneously to the multiplier “MTP” and the element “OR” or to the element “OR” only, whose output is connected to the input of inhibit element . Depending on the address code, the multiplexer MS switches to the output of the comparison and sign detection unit (CSDU), or directly to the divisor , or to the divisor multiplied by the highest power of ;
- “Adj. ,” for adjusting the right shift of reversible register ;
- “Adj. CSDU,” for adjusting the blocking of the comparison and sign detection unit (CSDU).
 
The dividend  and the divisor  in the RNS representation on the chosen moduli are supplied through the M-bit buses to the input of CSDU. This unit calculates the relative values of the dividend and divisor ( and ), detects their signs, and compares them. Through element  the dividend comes directly to the input of CSDU. The divisor is supplied to the input of CSDU through multiplexer MS, which has the corresponding address at the address input. The CSDU is implemented by Equations (4) and (5) and the model of Example 1. If , then CSDU forms the signal  (the divisor is greater than the dividend), which comes to the input of the control unit. Next, the division unit is adjusted to the initial state, and the quotient . If , then CSDU generates the equality signal of the dividend and divisor (“”). This signal is supplied to the input of summer , where the constant  is written. If , then CSDU forms the signal “,” and the control unit switches the division unit into the partial quotient approximation mode.
Next, the signs of the dividend  and divisor  are analyzed. The sign of the quotient  is defined by . From the output of CSDU the sign signals of the dividend and divisor (“” and “”) are supplied through the one-bit buses to the input of element exclusive or «XOR» (addition modulo 2). Using the truth table, this circuit generates the signal “” or “,” which is supplied to the input of quotient summer . If the output signal of element XOR is 0, then the quotient is assigned the positive sign (0); if it is 1, then the quotient is assigned the negative sign (1). Then, through the N-bit buses the values  and  come to the input of summer  and the shift register , respectively.
The value  is converted by  into the additional code, and the result is supplied to the input of subtractor  through the N-bit bus. The value  comes to the input of reversible shift register  through the N-bit bus. Using the timing pulses of the control unit, the content of register  (the value ) is shifted to the right to the number of zero digits standing before the first significant digit; this number is registered by counter 2 «CNT2». The resulting number of shifts corresponds to the highest power  of 2 in the divisor. The relative values are used to find the highest power of 2 in the quotient approximation series, which is registered by counter CNT2 without iterative calculations. As soon as the high significant digit of the divisor  becomes 1 during the shift procedure (similar to number normalization), register  fixes the state of counter CNT2 through the one-bit bus and enables information read-out from  (similarly to stack pointer). Counter CNT2 activates the RAM memory address defining the highest power of  in the quotient approximation series that must be in the quotient . In the initial state, buffer register  has zero value. The quotient approximation mode is completed, and at the  output the memory cell is activated that stores the highest power of  in the quotient approximation series. Thus, for approximating the quotient , the CSDU performs one comparison and one operation  for obtaining the sum of N-bit numbers, while register  performs one shift operation to  digits.
Let ; in this case, after shifting the content of register  the state of counter CNT2 is . This counter activates the memory cell containing the power of .
4.1. Refinement Mode for Quotient Approximation Series with Dividend a and Divisor b
Recall that the highest power of  is obtained in the approximation mode and stored by  in the RNS representation. Using the one-bit signal “Adj. DMS,” this power comes through the M-bit memory bus to the input of summer  via elements OR and . After that, through the one-bit bus “Adj. MS” the address code of multiplexer MS is adjusted to the next state; the value  is switched at the output of multiplexer MS and then supplied to the input of CSDU through the M-bit bus. This unit calculates  and sends the result to the input of shift register  through the N-bit bus . The old information in  is removed. The buses of the dividend $a$ and divisor $b$ are disconnected from CSDU by the signal “Adj. CSDU” coming to the inputs of  and . Using the control signals of the control unit, the content of  is supplied to an input of  through the N-bit bus. From the output of summer  the value  is supplied to the second input of summer  through the N-bit bus. Next, the signal “Adj. ” switches register  into the right shift mode.
The divisor multiplied by the highest power of 2 is subtracted from the content of summer , and the value  is calculated for detecting the sign of the result. If the sign digit of the subtraction result in summer  is 0, i.e., , then 0s are supplied to the inhibit inputs of inhibit elements  and . Through elements OR the inhibit elements  and  pass the highest power of the quotient from the second output of DMS to the input of summer . The subtraction result in summer  (i.e., the remainder of the dividend that corresponds to the rest powers of 2) is supplied, through the N-bit bus and inhibit element , first to the input of buffer register  and then to the input of summer . The old content of this summer is removed, and the new value is written. Next, the content of register  is shifted to the right, and the process is repeated as described above. If the sign digit of the subtraction result in summer  is 1 (i.e., the relative value of the divisor is greater than the dividend), then this 1 comes to the inhibit inputs of elements  and  that inhibit the supply of the corresponding power to summer  and also the supply of the subtraction result of summer  to the input of buffer register . In other words, the register saves the previous value of the subtraction result. At its input summer  receives only the refined powers of 2 that are the terms of the quotient series. The conversion process ends after analyzing the power of . Thus, in the refinement mode all redundant terms of the quotient approximation series are iteratively eliminated using uncomplicated transformations.
The approximate quotient is refined sequentially: the divisor, first multiplied by the highest power of 2, is subtracted from the dividend and is then repeatedly shifted and subtracted from the resulting partial remainders. In comparison with the well-known algorithms, a distinctive feature of the quotient approximation refinement procedure suggested in the new algorithm is the usage of the dividend  and the product  (the divisor and the highest power of 2) only at the first iteration. The subsequent iterations are based on an original principle of this paper, namely, on the one-position right shift of  at each iteration; in fact, this is equivalent to division by 2 and the subtraction of the results obtained at successive iterations, i.e., . The principle is unique because each iteration contains only two operations: a one-position right shift and a subtraction with further sign detection. As a result, the speed of division considerably increases. Each iteration of the well-known algorithms involves multiplication, summation, comparison, and parity check. Assume that each operation requires  time units; then the total time of each iteration is . Each iteration of the new algorithm consists of one subtraction and one shift, consuming  time units. Thus, the efficiency gain is about 2. The performance analysis of the new algorithm (quotient approximation and refinement) has demonstrated its considerable advantage over the well-known counterparts in terms of modular division time.
4.2. Experimental Performance Analysis: New Modular Division Algorithm Versus Well-Known Algorithms
As indicated by experiments, the behavior of the algorithm developed in this paper strongly depends on the initial data (the number of RNS moduli and their values) and also on the input data (the values of the dividend and divisor). Hence, this algorithm is difficult to study analytically.
The new modular division algorithm is similar to the algorithm presented in Hung []. However, in comparison with the optimization methods used therein, the operations of multiplication and summation have been replaced by the less demanding shift operations of partial quotients based on comparison. As a result, hardware cost and execution time have been considerably reduced. Besides, the Lu–Chiang algorithm involves the quotient correction operator with sign detection using MRNS, which dramatically decreases its speed.
In this section, the algorithm [] will be compared with the new modular division algorithm on the same numerical data. For comparison, choose the example considered in Hung [].
Let the RNS moduli be 5, 7, 9, and 11.
For example, take the dividend  and the divisor . It is required to obtain the quotient  and the remainder .
For the sake of illustrative and complete analysis, Table 1 and Table 2 provide intermediate data yielded by the algorithm [] and the new algorithm, respectively, in the course of calculating .
       
    
Table 1. Calculation of  by algorithm [].
  
       
    
Table 2. Calculation of  by the new algorithm.
  
The new modular division algorithm is attractive because it requires fewer operations (see Table 3). Table 3 shows how many operations of different types are consumed by the well-known algorithms and the new algorithm for the example [] (these data were obtained by experimental study).
       
    
Table 3. Comparison of the Hung-Parhami algorithms with the new algorithm.
  
While using the new modular division algorithm, the division specifics of fractions must be considered for avoiding rounding errors.
Actually, the function  is an approximate variation of the function . When RNS numbers are replaced by their approximate characteristics, an important issue is the accuracy of the representation  that guarantees correct results of the division operation. In accordance with the experimental studies [,,], the values of  that are used for rounding and also for restoring the positional representation of numbers can be insufficient in several cases. This aspect may considerably restrict the performance of the device.
Consider the error of the division operation. For given numbers  and  in the RNS representation, the exact quotient  is approximated by the partial quotient
        
      
        
      
      
      
      
    
In the quotient calculation problem, the algorithm outputs the integer part of the number , which corresponds to the integer part of the exact quotient . The absolute error of the quotient is bounded by
        
      
        
      
      
      
      
    
        where  and  denote the calculation errors of the functions  and , respectively. Clearly, the absolute error of the quotient is growing as the dividend increases. However, special role is played by the divisor. In the current problem, the inequality  always holds; hence, the denominator of the right-hand side of Equation (6) decreases faster than the numerator, and the error is rapidly growing for sufficiently small b (see the graph in Figure 2). The relative error is demonstrated in Figure 3.
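The growth of the absolute error for small divisors can be observed numerically. The sketch below is an illustration under assumptions that are not taken from the paper: the constants are truncated (rounded down) to `n_bits` binary digits, the dividend 1700 and the divisors 3 and 300 are hypothetical test values, and the names `rounded_rel` and `abs_error` are invented for this example.

```python
from fractions import Fraction
from math import prod

MODULI = (5, 7, 9, 11)
P = prod(MODULI)

def rounded_rel(x, n_bits):
    """Relative value of x computed with constants truncated to n_bits bits."""
    scale = 1 << n_bits
    total = sum(((pow(P // p, -1, p) * scale) // p) * (x % p) for p in MODULI)
    return Fraction(total % scale, scale)

def abs_error(a, b, n_bits):
    """Absolute error of the approximate quotient against the exact a / b."""
    return abs(rounded_rel(a, n_bits) / rounded_rel(b, n_bits) - Fraction(a, b))

# A small divisor inflates the error; more fractional bits suppress it.
assert abs_error(1700, 300, 16) < abs_error(1700, 3, 16)
assert abs_error(1700, 3, 24) < 1 < abs_error(1700, 3, 16)
```

For these sample values, 16 fractional bits are not enough to keep the error for the divisor 3 below 1, whereas 24 bits are, which is the kind of trade-off that the bound on the number of bits derived in this section is meant to settle.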
      
    
    Figure 2.
      Absolute error for approximate division of  by  with RNS moduli set {5, 7, 9, 11} and .
  
      
    
    Figure 3.
      Relative error for approximate division of  by  with RNS moduli set {5, 7, 9, 11} and .
  
Using Equation (6), estimate the value of  that guarantees exact division. Note that the function  can be calculated by the equation
        
      
        
      
      
      
      
    
where  gives the rank of number . By analogy with this formula, the rounded value of the function  is obtained using the constants  rounded to  decimal places, i.e.,
        
      
        
      
      
      
      
    
For each , the inequality  holds. This yields the following estimate for the calculation error of :
      
        
      
      
      
      
    
        where .
Furthermore, the value  converges to the exact value  as  increases. If the constants  are rounded down, then the relationship
        
      
        
      
      
      
      
    
        holds for all  from the RNS range.
Taking this inequality into account, Equation (1) can be rewritten as
        
      
        
      
      
      
      
    
In view of the remarks above, consider the worst case, which causes the largest calculation error. Choose  as the largest possible dividend and  as the smallest possible divisor. Using Equation (7),
        
      
        
      
      
      
      
    
Now, require that the error  does not exceed a given threshold . Then
        
      
        
      
      
      
      
    
Solving this inequality in  yields
        
      
        
      
      
      
      
    
The integer parts of the approximate and exact quotients coincide if . Experimental studies established that the threshold  is sufficient for exact calculations. As a result, the final bound takes the form
        
      
        
      
      
      
      
    
The right-hand side of Equation (8) is greater than the lower bound in the inequality
        
      
        
      
      
      
      
    
which is required for the exact number restoration in the algorithm. Table 4 shows the distribution of modular division errors with this bound for different sets of RNS moduli. In addition to general-form moduli, some sets of special form are also considered. Note that the share of faulty divisions varies from 0.5% to 14.3%. In accordance with Table 4, the bound in Equation (8) can be applied for exact division in RNS without any restrictions. Note that both this approach and division in RNS in general are difficult to study theoretically.
       
    
    Table 4.
    Distribution of division errors with insufficient rounding accuracy of  for different sets of RNS moduli.
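The effect of insufficient rounding accuracy can be reproduced with a small experiment. The sketch below is only an illustration under stated assumptions: the constants of the CRT with fractions are truncated (rounded down) to N binary digits, the moduli set {5, 7, 9, 11} is taken from the captions of Figures 2 and 3, and the function names (`truncated_constants`, `approx_f`, `approx_quotient`, `count_faults`) are hypothetical. The widths 8 and 64 used in the usage note are arbitrary illustrative values, not the bound of Equation (8).

```python
from math import prod

MODULI = (5, 7, 9, 11)  # moduli set from the captions of Figures 2 and 3


def truncated_constants(n_bits, moduli=MODULI):
    """Integers floor(2^N * k_i), where k_i = |P_i^{-1}|_{p_i} / p_i (assumed form)."""
    P = prod(moduli)
    return [(pow(P // p, -1, p) << n_bits) // p for p in moduli]


def approx_f(x, consts, n_bits, moduli=MODULI):
    """N-bit fixed-point approximation of F(x), scaled by 2^N."""
    return sum(k * (x % p) for k, p in zip(consts, moduli)) % (1 << n_bits)


def approx_quotient(a, b, n_bits, moduli=MODULI):
    """Shift-and-subtract quotient computed from the truncated characteristics."""
    consts = truncated_constants(n_bits, moduli)
    f_a = approx_f(a, consts, n_bits, moduli)
    f_b = approx_f(b, consts, n_bits, moduli)
    if f_b == 0 or f_b > f_a:
        # f_b == 0 can only occur for b != 0 when N is too small; it is
        # reported as quotient 0 here (a simplification of this sketch).
        return 0
    j = 0
    while (f_b << (j + 1)) <= f_a:  # first stage: highest power of 2
        j += 1
    delta = f_a - (f_b << j)
    q = 1 << j
    # Second stage. Exact left shifts of the original f_b replace the
    # successive right shifts, so no extra bits are lost (a simplification).
    for i in range(j - 1, -1, -1):
        if delta >= (f_b << i):
            q += 1 << i
            delta -= f_b << i
    return q


def count_faults(n_bits, a_step=1, b_step=1, moduli=MODULI):
    """Number of sampled pairs (a, b) whose approximate quotient is wrong."""
    P = prod(moduli)
    return sum(
        approx_quotient(a, b, n_bits, moduli) != a // b
        for a in range(0, P, a_step)
        for b in range(1, P, b_step)
    )
```

For instance, `approx_quotient(2, 1, 8)` returns 0 instead of 2, because with 8 bits the truncation error exceeds 1/P and the characteristic of 1 wraps around, whereas `count_faults(64, 23, 29)` finds no faulty division in the sampled pairs.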
  
In their paper, Hung and Parhami [] imposed a strict constraint on the choice of the divisor: it was recommended to use only divisors from the range , which considerably restricts the applicability of the method. The new approach suggested in this paper adopts the bound in Equation (8) for the number of digits and places no constraints on the divisor.
On the other hand, the algorithms [] and the new algorithm are comparable in terms of hardware cost and execution time for the MRNS operations (summation, subtraction, multiplication) and bit shift operations (right and left shifts). Indeed, the new algorithm requires fewer operations of these types, but its operands have a higher digit capacity.
5. Conclusions
The new algorithm described in this paper speeds up modular division in the RNS representation in comparison with its well-known counterparts. This is explained by the rather simple structure of the algorithm, which contains only uncomplicated operations, namely, summation and shift (for quotient approximation) as well as shift and subtraction (for quotient refinement). Being based on the CRT with fractions, the new algorithm does not include such operations as modular remainder calculation and number conversion into the MRNS representation. In comparison with the well-known RNS division algorithms, the new modular division algorithm has several considerable advantages. First, the division procedure involves no additional constraints on the dividend and divisor, such as representation range constraints. The only requirement of the new algorithm is that both parameters belong to the RNS range. Second, the new algorithm does not include any non-modular operation. Furthermore, the new algorithm uses fewer modular operations (modular summation, subtraction, multiplication) than some other RNS division algorithms. The new algorithm differs considerably from the abovementioned ones in that its iterations consist of shifts and subtractions. In comparison with the existing analogs, this algorithm appreciably decreases hardware cost and execution time for modular division.
The developed division algorithm can be used to design arithmetic-logic RNS devices and also to design problem-oriented RNS processors for digital signal processing, cryptography, etc. These new RNS applications will promote further development of this field of computational mathematics.
Author Contributions
Conceptualization, N.C. and P.L.; methodology, N.C. and P.L.; software, A.N.; validation, P.L. and M.D.; formal analysis, P.L., A.N. and M.D.; investigation, N.C.; resources, I.L.; data curation, A.L.; writing: original draft preparation, N.C. and P.L.; writing: review and editing, M.B. and M.D.; visualization, P.L., A.N. and M.D.; supervision, N.C.; project administration, M.B.; funding acquisition, N.C., P.L., M.B. and M.D.
Funding
This research was funded by Russian Federation State task No. 2.6035.2017, the Russian Foundation for Basic Research (RFBR), grant numbers 18-07-00109 A and 19-07-00130 A, and the Council on grants of the President of the Russian Federation, grant numbers SP-2245.2018.5 and MK-6294.2018.9.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
This appendix presents the modular integer division algorithm of a number a by a number b, which yields the quotient Q and also the remainder R.
The algorithm is as follows.
        
- F_a = F(a), F_b = F(b)
 - Set j = 0, Q = 0
 - If (F_b ≤ F_a) then
 - F_b = 2F_b
 - While (F_b ≤ F_a) do j = j + 1, F_b = 2F_b
 - F_b = F_b/2
 - Δ = F_a − F_b
 - Q = 2^j
 - For i = j − 1, j − 2, …, 0 do begin
 - F_b = F_b/2
 - If (Δ − F_b ≥ 0) then Q = Q + 2^i, Δ = Δ − F_b
 - end
 - end if
 - R = a − b·Q.
 
The detailed description of this algorithm is given below.
Line 1. Calculate the positional characteristics F(a) and F(b) of the dividend and divisor, respectively, with required accuracy.
Line 2. Initialize the index of iterations j = 0 and the quotient Q = 0.
Line 3. Check the condition F_b ≤ F_a: if it holds, continue the division algorithm; otherwise go to line 14.
Line 4. Shift the positional characteristic of the divisor left by one binary digit.
Line 5. While F_b ≤ F_a, increase j = j + 1 and shift the positional characteristic of the divisor left by one binary digit.
Line 6. Shift the positional characteristic of the divisor right by one binary digit.
Line 7. Calculate Δ as the difference between the positional characteristic F_a of the dividend and the positional characteristic F_b of the divisor that is shifted by j binary digits to the left.
Line 8. Set the quotient Q to 2^j.
Line 9. Start the refinement procedure of the approximation series. For each i from j − 1 to 0, do the following operations:
Line 10. Shift the positional characteristic of the divisor right by one binary digit.
Line 11. If Δ − F_b ≥ 0, then increase the quotient by 2^i and calculate the next value Δ = Δ − F_b.
Line 12. Finish the refinement procedure of the approximation series.
Line 13. Close the condition checked in line 3.
Line 14. Calculate the remainder R.
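The fourteen lines above can be sketched in Python as follows. This is a minimal illustration, not the hardware implementation: it assumes the moduli set {5, 7, 9, 11}, computes F exactly with rational arithmetic (so rounding issues do not arise), and evaluates the remainder on ordinary integers for brevity. The names `to_rns`, `positional_characteristic` and `rns_divide` are hypothetical.

```python
from fractions import Fraction
from math import prod

MODULI = (5, 7, 9, 11)  # assumed example set (pairwise coprime), P = 3465


def to_rns(x, moduli=MODULI):
    """Residues of x for each modulus."""
    return tuple(x % p for p in moduli)


def positional_characteristic(residues, moduli=MODULI):
    """Exact F(X) = |sum_i k_i * x_i|_1 with k_i = |P_i^{-1}|_{p_i} / p_i.

    With exact fractions this equals X / P for 0 <= X < P.
    """
    P = prod(moduli)
    total = Fraction(0)
    for x_i, p_i in zip(residues, moduli):
        inv = pow(P // p_i, -1, p_i)  # modular inverse (Python 3.8+)
        total += Fraction(inv * x_i, p_i)
    return total % 1  # keep the fractional part only


def rns_divide(a, b, moduli=MODULI):
    """Return (Q, R) following lines 1-14 of the algorithm in this appendix."""
    if b == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    f_a = positional_characteristic(to_rns(a, moduli), moduli)  # line 1
    f_b = positional_characteristic(to_rns(b, moduli), moduli)
    j, q = 0, 0                                                  # line 2
    if f_b <= f_a:                                               # line 3
        f_b *= 2                                                 # line 4
        while f_b <= f_a:                                        # line 5
            j += 1
            f_b *= 2
        f_b /= 2                                                 # line 6
        delta = f_a - f_b                                        # line 7
        q = 1 << j                                               # line 8
        for i in range(j - 1, -1, -1):                           # line 9
            f_b /= 2                                             # line 10
            if delta - f_b >= 0:                                 # line 11
                q += 1 << i
                delta -= f_b
    r = a - b * q  # line 14, done on plain integers in this sketch
    return q, r
```

For example, `rns_divide(100, 7)` yields the quotient 14 and the remainder 2: the first stage finds j = 3, and the refinement adds the powers 2^2 and 2^1.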
Appendix B
This appendix presents the new algorithm for modular division of numbers  that involves the relative values  and  in the CRT representation with fractions. A certain rule  is constructed to reduce each pair of numbers ,  to the fractions , , ; then there exists a collection ,  such that  and . The correctness of this algorithm can be argued as follows. Using the operation , a pair of numbers ,  is assigned the highest power  in  extracted from memory such that, if , then . If , then division ends because . If , then  and  is the desired partial quotient to be included in the general quotient. The highest power of  is a summand of the general quotient. The analysis process starts from the highest power of 2 and ends with zero power. Next, in accordance with the operation , the pair of numbers  is assigned  by the right shift of  (which is equivalent to division by 2). As a result,  and 
Depending on the value of , the second summand is included (if ) or excluded from further analysis (otherwise, as a redundant term for the quotient approximation series).
The subsequent iterations take into account only the necessary powers of 2 in parentheses. Therefore, the partial quotient is included or excluded from the general quotient using the above condition. The iterative process continues until the zero power of 2 is reached at step 0.
Consequently, the quotient is formed from the set of the partial powers of 2 that satisfy  in the RNS representation on module . Let the inequality  hold at step 0. In this case, the final result is , where L denotes the set of the necessary powers  of 2 in the general quotient.
At each iteration, the corresponding power of 2 is either eliminated or used as the partial quotient that must be in the general quotient. Since  is described by the residues table for the integer powers of 2, the partial quotient has the form , where , .
To summarize, the algorithm consists of two stages as follows. At the first stage, the left shift of  is used to find the highest power of 2. The second stage is intended to analyze the successive iterations:
		 
- ; if , then is included in the general quotient; if , then is excluded.
 - ; if , then ; otherwise .
 - …
 - ; if , then ; otherwise .
 
The correctness of this method is verified by trivial transformations. Then , and hence . Here  and  denotes each of the powers of 2 that are included in the quotient approximation series.
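The two stages can be followed on a small worked example. The numbers below are a hypothetical illustration that is not taken from the paper: a = 100 and b = 7 with the moduli set {5, 7, 9, 11}, so that P = 3465, and exact values F(x) = x/P are assumed so that rounding plays no role.

```latex
\begin{aligned}
&\text{Stage 1: } 2^{3}\cdot 7 = 56 \le 100 < 112 = 2^{4}\cdot 7
  \;\Rightarrow\; j = 3,\quad \Delta = \tfrac{100-56}{3465} = \tfrac{44}{3465},\\
&\text{Stage 2: } i=2:\ \tfrac{28}{3465} \le \tfrac{44}{3465}
  \;\Rightarrow\; 2^{2}\ \text{included},\ \Delta = \tfrac{16}{3465};\\
&\phantom{\text{Stage 2: }} i=1:\ \tfrac{14}{3465} \le \tfrac{16}{3465}
  \;\Rightarrow\; 2^{1}\ \text{included},\ \Delta = \tfrac{2}{3465};\\
&\phantom{\text{Stage 2: }} i=0:\ \tfrac{7}{3465} > \tfrac{2}{3465}
  \;\Rightarrow\; 2^{0}\ \text{excluded},\\
&Q = 2^{3} + 2^{2} + 2^{1} = 14,\qquad R = 100 - 7\cdot 14 = 2.
\end{aligned}
```

Here the quotient approximation series consists of the powers 2^3, 2^2 and 2^1, and the remaining difference 2/3465 corresponds to the remainder R = 2.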
References
- Szabó, N.S.; Tanaka, R.I. Residue arithmetic and its applications to computer technology. SIAM Rev. 1967, 11, 103–104.
 - Molahosseini, A.; Sousa, L.D.; Chang, C. Embedded Systems Design with Special Arithmetic and Number Systems; Springer International Publishing: Cham, Switzerland, 2017.
 - Asif, S.; Andersson, O.; Rodrigues, J.; Kong, Y. 65-nm CMOS low-energy RNS modular multiplier for elliptic-curve cryptography. IET Comput. Digit. Tech. 2018, 12, 62–67.
 - Vayalil, N.C.; Paul, M.; Kong, Y. A residue number system hardware design of fast-search variable-motion-estimation accelerator for HEVC/H.265. IEEE Trans. Circuits Syst. Video Technol. 2017, 29.
 - Alia, G.; Martinelli, E. NEUROM: A ROM based RNS digital neuron. Neural Netw. 2005, 18, 179–189.
 - Gomathisankaran, M.; Tyagi, A.; Namuduri, K. HORNS: A homomorphic encryption scheme for Cloud Computing using Residue Number System. In Proceedings of the 2011 45th Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, 23–25 March 2011; pp. 1–5.
 - Zheng, X.; Xu, J.; Li, W. Parallel DNA arithmetic operation based on n-moduli set. Appl. Math. Comput. 2009, 212, 177–184.
 - Jun, S.; Hu, Z. Method and dedicated processor for image coding based on residue number system. In Proceedings of the Modern Problems of Radio Engineering Telecommunications and Computer Science (TCSET), Lviv-Slavske, Ukraine, 21–24 February 2012; pp. 406–407.
 - Mohan, P.V.A. Residue Number Systems; Springer International Publishing: Cham, Switzerland, 2016.
 - Molahosseini, A.S.; Sorouri, S.; Zarandi, A.A.E. Research challenges in next-generation residue number system architectures. In Proceedings of the 2012 7th International Conference on Computer Science and Education (ICCSE 2012), Melbourne, VIC, Australia, 14–17 July 2012; pp. 1658–1661.
 - Chervyakov, N.I.; Lyakhov, P.A.; Babenko, M.G.; Garyanina, A.I.; Lavrinenko, I.N.; Lavrinenko, A.V.; Deryabin, M.A. An efficient method of error correction in fault-tolerant modular neurocomputers. Neurocomputing 2016, 205, 32–44.
 - Chervyakov, N.I.; Molahosseini, A.S.; Lyakhov, P.A.; Babenko, M.G.; Deryabin, M.A. Residue-to-binary conversion for general moduli sets based on approximate Chinese remainder theorem. Int. J. Comput. Math. 2017, 94, 1833–1849.
 - Kaplun, D.; Butusov, D.; Ostrovskii, V.; Veligosha, A.; Gulvanskii, V. Optimization of the FIR filter structure in finite residue field algebra. Electronics 2018, 7, 372.
 - Hiasat, A. Efficient RNS scalers for the extended three-moduli set $(2^{n}-1, 2^{n+p}, 2^{n}+1)$. IEEE Trans. Comput. 2017, 66, 1253–1260.
 - Kumar, S.; Chang, C.-H.; Tay, T.F. New algorithm for signed integer comparison in $\{2^{n+k},2^{n}-1,2^{n}+1,2^{n\pm 1}-1\}$ and its efficient hardware implementation. IEEE Trans. Circuits Syst. I Regul. Pap. 2017, 64, 1481–1493.
 - Nakahara, H.; Nakanishi, H.; Iwai, K.; Sasao, T. An FFT circuit for a spectrometer of a radio telescope using the nested RNS including the constant division. ACM SIGARCH Comput. Archit. News 2017, 44, 44–49.
 - Mrabet, A.; El-Mrabet, N.; Bouallegue, B.; Mesnager, S.; Machhout, M. An efficient and scalable modular inversion/division for public key cryptosystems. In Proceedings of the 2017 International Conference on Engineering & MIS (ICEMIS), Monastir, Tunisia, 8–10 May 2017; pp. 1–6.
 - Chren, W.A. A new residue number system division algorithm. Comput. Math. Appl. 1990, 19, 13–29.
 - Bajard, J.-C.; Didier, L.-S.; Muller, J.-M. A new Euclidean division algorithm for residue number systems. In Proceedings of the International Conference on Application Specific Systems, Architectures and Processors: ASAP’96, Chicago, IL, USA, 19–21 August 1996; pp. 45–54.
 - Chiang, J.-S.; Lu, M. A general division algorithm for residue number systems. In Proceedings of the 10th IEEE Symposium on Computer Arithmetic, Grenoble, France, 26–28 June 1991; pp. 76–83.
 - Gamberger, D. New approach to integer division in residue number systems. In Proceedings of the 10th IEEE Symposium on Computer Arithmetic, Grenoble, France, 26–28 June 1991; pp. 84–91.
 - Lu, M.; Chiang, J.-S. A novel division algorithm for the residue number system. IEEE Trans. Comput. 1992, 41, 1026–1032.
 - Bajard, J.; Rico, F. How to improve division in residue number systems. In Proceedings of the 16th IMACS World Congress, Lausanne, Switzerland, 21–25 August 2000; pp. 110–121.
 - Hiasat, A.A. Semi-Custom VLSI Design and Implementation of a New Efficient RNS Division Algorithm. Comput. J. 1999, 42, 232–240.
 - Hung, C.Y.; Parhami, B. Fast RNS division algorithms for fixed divisors with application to RSA encryption. Inf. Process. Lett. 1994, 51, 163–169.
 - Hung, C.Y.; Parhami, B. An approximate sign detection method for residue numbers and its application to RNS division. Comput. Math. Appl. 1994, 27, 23–35.
 - Hiasat, A.A.; Abdel-Aty-Zohdy, H.S. Design and implementation of an RNS division algorithm. In Proceedings of the 13th IEEE Symposium on Computer Arithmetic, Asilomar, CA, USA, 6–9 July 1997; pp. 240–249.
 - Yang, J.-H.; Chang, C.-C.; Chen, C.-Y. A high-speed division algorithm in residue number system using parity-checking technique. Int. J. Comput. Math. 2004, 81, 775–780.
 - Chang, C.-C.; Yang, J.-H. A Division algorithm using bisection method in Residue Number System. Int. J. Comput. 2013, 2, 59–66.
 - Talahmeh, S.; Siy, P. Arithmetic division in RNS using Galois Field GF(p). Comput. Math. Appl. 2000, 39, 227–238.
 - Chang, C.-C.; Lai, Y.-P. A division algorithm for residue numbers. Appl. Math. Comput. 2006, 172, 368–378.
 - Chervyakov, N.I.; Babenko, M.G.; Lyakhov, P.A.; Lavrinenko, I.N. An Approximate method for comparing modular numbers and its application to the division of numbers in Residue Number Systems. Cybern. Syst. Anal. 2014, 50, 977–984.
 - Patronik, P.; Piestrak, S.J. Design of reverse converters for the new RNS moduli set $\{2^{n}+1, 2^{n}-1, 2^{n}, 2^{n-1}+1\}$ (n odd). IEEE Trans. Circuits Syst. I Regul. Pap. 2014, 61, 3436–3449.
 - Tay, T.F.; Chang, C.-H.; Sousa, L. Base transformation with injective residue mapping for dynamic range reduction in RNS. IEEE Trans. Circuits Syst. I Regul. Pap. 2015, 62, 2248–2259.
 - Vun, C.H.; Premkumar, A.B.; Zhang, W. A new RNS based DA approach for inner product computation. IEEE Trans. Circuits Syst. I Regul. Pap. 2013, 60, 2139–2152.
 - Kouretas, I.; Paliouras, V. A low-complexity high-radix RNS multiplier. IEEE Trans. Circuits Syst. I Regul. Pap. 2009, 56, 2449–2462.
 - Younes, D.; Steffan, P. A comparative study on different moduli sets in residue number system. In Proceedings of the 2012 International Conference on Computer Systems and Industrial Informatics, Sharjah, UAE, 18–20 December 2012; pp. 1–6.
 
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).