Compact Word-Serial Modular Multiplier Accelerator Structure for Cryptographic Processors in IoT Edge Nodes with Limited Resources

IoT is extensively used in many infrastructure applications, including telehealth, smart homes, smart grids, and smart cities. However, IoT devices are often the weakest link in system security because they typically have limited processing and power resources. It is therefore important to implement the necessary cryptographic primitives on these devices using highly efficient finite field hardware structures. Modular multiplication is the core of cryptographic operators. Therefore, we present, in this work, a word-serial modular multiplier accelerator structure that gives the system designer the ability to manage area, delay, and energy consumption by selecting the appropriate embedded processor word size l. The modularity and regularity of the suggested multiplier structure make it well suited to implementation in ASIC technology. The ASIC implementation results indicate that the offered multiplier structure achieves area reductions over the competitive existing multiplier structures ranging from 76.2% to 98.5% for l = 8, from 73.1% to 98.1% for l = 16, and from 82.9% to 98.3% for l = 32. Moreover, the energy reduction varies from 61.2% to 98.8% for l = 8, from 67.7% to 98.3% for l = 16, and from 76.1% to 98.8% for l = 32. These results indicate that the proposed modular multiplier structure significantly outperforms the competitive ones in terms of area and consumed energy, making it more suitable for use in resource-constrained IoT edge devices.


Introduction and Related Work
The Internet of Things (IoT) is a broad network of physical devices equipped with sensors, software, electronics, and network connectivity that allows them to share data and execute tasks. IoT is a promising technology that will shape our future by providing intelligent solutions in different applications, such as smart homes, smart cities, self-driving cars, smart farming, smart grids, and telehealth. To better understand how such a system works, consider the IoT network topology of an IoT application such as telehealth. It is one of many emerging infrastructure applications that rely on IoT technology to provide services to remote users, such as stay-at-home patients, and to provide quality healthcare to remote communities [1,2]. Figure 1 shows a telehealth system that relies on IoT edge devices to deliver healthcare to remote locations.
The main entities of a telehealth system are: (a) a server, which could be a hospital or medical center, and is naturally considered a hardware root-of-trust (HRoT) due to its layered security measures; (b) the internet cloud, which is, in general, an insecure communication medium; (c) a gateway that provides an interface between the IoT edge devices and the internet; (d) edge IoT devices that comprise sensors and actuators to take measurements and deliver medications to remote patients; and (e) mobile devices that allow healthcare practitioners (doctors/nurses) to connect remotely to the telehealth system. It is clear from the figure that there are many opportunities for attack due to the use of diverse hardware platforms, diverse operating systems, insecure wireless communication media, limited processing power in many system entities, and the sheer number of people involved in the operation of the system [3,4]. Securing any system (telehealth or otherwise) involves many features, including integrity, confidentiality, authentication, non-repudiation, and availability. These security features are implemented using fundamental encryption algorithms, such as elliptic curve cryptography (ECC) and the Rivest-Shamir-Adleman (RSA) algorithm. Given the power and delay restrictions of most devices used in IoT systems, ECC is the encryption technique of choice due to its high level of security with shorter key lengths compared to common approaches such as RSA [5]. An essential operation in ECC arithmetic is modular multiplication. There is an extensive body of literature covering modular multiplication in both prime fields GF(p) and binary extension fields GF(2^m). Most of the proposed multipliers have high area and delay complexities, which make them unsuitable for resource-constrained IoT edge devices [6][7][8]. To overcome these limitations, several authors developed word-serial modular multipliers [9][10][11][12].
Systolic approaches were reported in [9,[13][14][15], and non-systolic designs were reported in [16][17][18][19]. Other authors attempted to save power and area by merging the modular multiplication and modular squaring operations [7,8,20]. However, the resulting structures were not suitable for resource-constrained IoT devices due to their high area and power requirements.
Most of the reported modular multiplier structures are classified as one-of-a-kind structures. Ad hoc approaches are adopted with no consideration of how the structure could be modified to optimize system performance parameters, such as latency, throughput, power, and area requirements. The authors of this article previously presented a systematic methodology for implementing the modular multiplication algorithm based on the algebraic approach first proposed by the first author [21]. That systematic methodology applied linear mappings to obtain modular multiplier structures. However, linear mappings have limited abilities, both in terms of the number of parallel processing elements (PEs) and the timing strategies that can be developed. This paper proposes using nonlinear techniques to map the algorithm onto parallel PEs and to obtain more flexible timing strategies. The goal of the paper is to obtain a word-serial processor accelerator for modular multiplication operations. The resulting structure gives the designer the ability to control the PE workload and the algorithm latency. The experimental results confirm that the proposed multiplier outperforms the efficient word-serial multipliers previously reported in the literature, in terms of area and consumed energy, for various embedded word sizes. These design features make the proposed design more suitable for embedded applications and other resource-constrained IoT applications.
The outline of the paper is as follows. Section 2 briefly describes the adopted modular multiplication algorithm and presents the details of its dependency graph. Section 3 presents the approach followed to derive the word-serial modular multiplier accelerator structure and its related logic details. Section 4 reports the implementation results. Section 5 concludes the work.

Algorithm of Interleaved Modular Multiplication
We can perform modular multiplication over GF(2^m) by multiplying two polynomials P(γ) and Q(γ) and reducing the result using the reduction polynomial T(γ) as:

S(γ) = P(γ)Q(γ) mod T(γ). (1)

The general polynomial formats of P(γ), Q(γ), and T(γ) can be given as:

P(γ) = ∑_{j=0}^{m−1} p_j γ^j, (2)
Q(γ) = ∑_{j=0}^{m−1} q_j γ^j, (3)
T(γ) = γ^m + ∑_{j=0}^{m−1} t_j γ^j, (4)

with p_j, q_j, t_j ∈ GF(2). By replacing Q(γ) in Equation (1) with its polynomial format given in Equation (3), Equation (1) can be represented as:

S(γ) = [∑_{i=0}^{m−1} q_i P(γ)γ^i] mod T(γ). (5)

We can arrange Equation (5) in the interleaved form:

S(γ) = ∑_{i=0}^{m−1} q_i [P(γ)γ^i mod T(γ)]. (6)

We choose to drop γ from polynomials P(γ), Q(γ), S(γ), and T(γ) to simplify the upcoming expressions. Investigating Equation (6), we notice that the multiplication product can be produced by accumulating the terms q_i [Pγ^i mod T], with 0 ≤ i ≤ m − 1.
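To make the interleaved formulation concrete, GF(2) polynomials can be modeled in software as integer bit vectors: multiplication is carry-less, and reduction repeatedly cancels the leading term using T(γ). The sketch below is an illustrative reference model under those assumptions (the function names are ours, not from the paper):

```python
def clmul(a, b):
    """Carry-less (GF(2)) multiplication of polynomials stored as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the shifted copy of a
        a <<= 1
        b >>= 1
    return result

def poly_mod(a, t, m):
    """Reduce polynomial a modulo the degree-m reduction polynomial t."""
    while a.bit_length() - 1 >= m:   # while deg(a) >= m
        shift = a.bit_length() - 1 - m
        a ^= t << shift              # cancel the leading term of a
    return a

def gf2m_mul(p, q, t, m):
    """S = P * Q mod T over GF(2^m), per Equation (1)."""
    return poly_mod(clmul(p, q), t, m)

# Example in GF(2^5) with T(γ) = γ^5 + γ^2 + 1 (0b100101):
# P = γ^3 + γ + 1 (0b01011) and Q = γ^4 + γ^2 (0b10100) give S = γ^3.
```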
Suppose P_i = Pγ^i mod T; we can represent P_{i+1} in terms of P_i as P_{i+1} = Pγ^{i+1} mod T = [Pγ^i]γ mod T = P_i γ mod T. Thus, the recursive form of P_{i+1} can be represented as:

P_{i+1} = p^i_{m−1} [γ^m mod T] + ∑_{j=0}^{m−2} p^i_j γ^{j+1}, (7)

with the initialization condition P_0 = P. The term γ^m mod T in Equation (7) is equivalent to ∑_{j=0}^{m−1} t_j γ^j, as proved in [7]. Moreover, the term ∑_{j=0}^{m−2} p^i_j γ^{j+1} represents a polynomial of order less than m.
The recursive Equation (7) can be expressed in the bit-level form as:

p^{i+1}_j = p^i_{m−1} t_j ⊕ p^i_{j−1}, 0 ≤ j ≤ m − 1, with p^i_{−1} = 0. (8)

The recursive form of the partial product S_{i+1}, 0 ≤ i ≤ m − 1, is obtained by accumulating the P_i terms as:

S_{i+1} = S_i + q_i P_i, (9)

with S_0 = 0, where S_m represents the final result S. Recursive Equation (9) can be expressed in the bit-level form as:

s^{i+1}_j = s^i_j ⊕ q_i p^i_j, 0 ≤ j ≤ m − 1. (10)
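The recursions in Equations (7)-(10) can be checked with a short bit-serial software model. This is only a functional sketch of the algorithm, not the hardware structure; it assumes operands are integers whose bit j holds the γ^j coefficient, and that t_low carries the m low-order coefficients of T (so that γ^m mod T = ∑ t_j γ^j):

```python
def interleaved_mod_mul(p, q, t_low, m):
    """Interleaved GF(2^m) modular multiplication.

    P_{i+1} = p^i_{m-1} * (γ^m mod T) + γ * (P_i without its top bit)  [Eq. (7)]
    S_{i+1} = S_i + q_i * P_i                                          [Eq. (9)]
    """
    mask = (1 << m) - 1
    P_i, S = p, 0                    # P_0 = P, S_0 = 0
    for i in range(m):
        if (q >> i) & 1:             # accumulate q_i * P_i   (Eqs. (9)/(10))
            S ^= P_i
        msb = (P_i >> (m - 1)) & 1   # p^i_{m-1}
        P_i = ((P_i << 1) & mask) ^ (t_low if msb else 0)   # Eqs. (7)/(8)
    return S                         # S_m is the final result
```

For instance, in GF(2^5) with T(γ) = γ^5 + γ^2 + 1 (so t_low = 0b00101), P = γ^3 + γ + 1 and Q = γ^4 + γ^2 yield γ^3.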

Dependency Graph
Using reference [21], the dependency graph (DG) describing the modular multiplication can be obtained from Equations (8) and (10). The indices i and j in the two equations indicate that the DG can be defined in a two-dimensional integer domain. Index i denotes the rows, and index j denotes the columns. Figure 2 presents the DG for the field size m = 5. The circled nodes compute the operations described by Equations (8) and (10). The vertical lines represent the partial product signal s^i_j, the multiplier signal p^i_j, and the irreducible polynomial coefficient t_j. The horizontal lines represent the broadcast multiplicand signal q_i. The diagonal lines represent the signal p^i_{j−1}. Signals p^{i+1}_j and s^{i+1}_j require signals p^i_{j−1} and p^i_j to be computed. The last-column nodes produce signal p^i_{m−1}, which is used inside the column's nodes and broadcast to the remaining row nodes, as indicated in Figure 2.
As we observe from Figure 2, the input signals s^0_j and p^0_j are fed at the upper row of the DG, and the output signals s^m_j are produced from the lower row.
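The data flow just described can be emulated by evaluating the DG row by row in software. The sketch below is a hypothetical model we wrote for illustration (bit lists are stored least-significant-coefficient first; t_bits holds t_0..t_{m−1}), in which each row first accumulates into S and then updates P exactly as Equations (8) and (10) prescribe:

```python
def dg_multiply(p_bits, q_bits, t_bits, m):
    """Evaluate the DG nodes row by row.

    Node (i, j) computes (coefficients over GF(2)):
      s^{i+1}_j = s^i_j XOR q_i * p^i_j            [Eq. (10)]
      p^{i+1}_j = p^i_{m-1} * t_j XOR p^i_{j-1}    [Eq. (8), p^i_{-1} = 0]
    """
    p = list(p_bits)
    s = [0] * m
    for i in range(m):
        msb = p[m - 1]    # p^i_{m-1}, produced by the last column's node
        s = [s[j] ^ (q_bits[i] & p[j]) for j in range(m)]
        p = [(msb & t_bits[j]) ^ (p[j - 1] if j > 0 else 0) for j in range(m)]
    return s              # s^m_j: the product bits from the lower row
```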

Word-Serial Accelerator Structure Exploration
We will follow a formal and systematic methodology that we previously developed in [19,[21][22][23] to map the recursive-iterative multiplier algorithm to a processor array and assign an execution schedule to each processing element (PE) in the resulting array.

Scheduling Function
Consider the two-dimensional dependence graph for the polynomial modular multiplication algorithm shown in Figure 2. We assume the processor array we would like to develop has an l-bit digit or word size.
A valid nonlinear scheduling function assigns an execution time value to each node or point P in Figure 2 according to a nonlinear expression, Equation (11), where Γ(P) assigns a time instance to node P(i, j) in the DG. Figure 3 shows the time index values after applying the nonlinear scheduling function in Equation (11) for the case when m = 5 and l = 3. The figure shows that the DG points are grouped horizontally in l-bit groups having the same execution time value. This ensures that all the bits in a single processor word are executed at the same time. An extra column of nodes is added at the left side of the DG to ensure that the number of columns in the DG is an integer multiple of l. For the general case, we need to add θ = l⌈m/l⌉ − m extra columns, with zero inputs, at the left side of the DG. As we notice from Figure 3, the output of the multiplier will be available after m⌈m/l⌉ computation steps. An important feature of the proposed scheduling function is that it provides the system designer with the ability to control the workload of the entire processor array system. For the nonlinear scheduling formula in Equation (11), we note that only one group of l bits is active at any given time instance. Therefore, the PE workload is equal in that case to the system workload. Of course, the system designer could choose another scheduling function to obtain a different system workload.
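The padding and latency relations quoted above (θ = l⌈m/l⌉ − m extra zero-input columns, output after m⌈m/l⌉ computation steps) are easy to sanity-check in a few lines; the helper names below are ours, not from the paper:

```python
import math

def padding_columns(m, l):
    """θ = l*ceil(m/l) - m zero-input columns added at the left of the DG."""
    return l * math.ceil(m / l) - m

def latency_steps(m, l):
    """Computation steps before the multiplier output is available."""
    return m * math.ceil(m / l)

# For the example of Figure 3 (m = 5, l = 3): θ = 1 extra column, 10 steps.
```

For the experimental field size 409 with l = 32, for example, this gives θ = 32·13 − 409 = 7 padding columns.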

Projection Function
The projection function approach discussed in [21] projects several nodes of the DG in Figure 3 onto a single node. This operation is necessary since each group of l nodes in Figure 3 operates only once. Therefore, to reuse the processing elements, we map several groups onto one PE. The system workload in Figure 3 implies that we need to map all the nodes of the DG onto one PE only. We propose a nonlinear projection function that maps a node P(i, j) to a new node P(x, y), where "·" in its formulation is a placeholder for the argument [21]. Figure 4 shows the resulting word-serial accelerator structure after applying the assumed projection function to Figure 3. The system consists of the following components:

1. A processor array block whose word size is l;
2. Three input registers T, P, and PL;
3. One output register S;
4. Three shift-right registers SHR-S, SHR-pd, and SHR-P (the latter inside the processor array block);
5. One rotate-right register RoT-T;
6. Four three-input MUXes (two of them inside the processor array block) to select between the inputs and partial results of variables P and T.
Register P passes bit values starting from bit p^0_{m−1}, while register PL passes the bit values of the word variable starting from bit p^0_{m−2}. The partial results stored in S and P are cycled through the shift registers SHR-S and SHR-P, respectively. The fixed words of T are rotated through the rotate-right register RoT-T. Observation of Figures 3 and 4 indicates that the rightmost bit p_d of register P is transferred diagonally to the following node after being delayed by r − 1 time steps, where r = ⌈m/l⌉, while the remaining vertical and slanted word bits of P are transferred to the following bottom nodes after being blocked for r time steps. Therefore, bit p_d should be passed through the shift-right register SHR-pd, which has depth r − 1, as shown in Figure 4. Shift-right registers SHR-P and SHR-S and rotate register RoT-T all have the same width and depth sizes of l and r, respectively. Figure 5 displays the details of the processor array block for the case l = 3 bits. The processor array contains two types of PEs. Figures 6 and 7 depict the design details of the PEs. All the PEs are interconnected in a pipeline structure to perform computing at the same time. The yellow PE in Figure 3 has two extra tri-state buffers managed by the control signal e. We discuss the role of these extra buffers below.
The operation details of the explored multiplier accelerator can be summarized for a generic field size m and word size l as follows:

1. Control signal C, which controls the selection of all MUXes, is active (C = 1) during the first ⌈m/l⌉ clock cycles to feed the input words of operands T, P, and PL to all PEs of the processor array block. The words are fed starting with the most significant words. Moreover, the most significant bit p^0_{m−1} is passed to the last PE, PE_l, and broadcast to the remaining PEs. At the first clock cycle, SHR-S is cleared to initialize the S variable with zero values.

2. Control signal C of all MUXes is inactive (C = 0) during the remaining clock cycles to feed the resulting intermediate words of P and the fixed words of T to all PEs of the processor array block. These words are passed through shift registers SHR-P and SHR-pd and rotate register RoT-T, respectively. Moreover, the resulting intermediate words of S are fed to all PEs of the processor array block through the shift register SHR-S.

3. Control signal e activates (e = 1) at clock cycles T = i⌈m/l⌉ + 1, 0 ≤ i ≤ m, to enable the tri-state buffer Tr_1 shown in Figure 6 to horizontally feed the bits p^i_{m−1}, 0 ≤ i ≤ m, to the remaining PEs. Moreover, the q_i input bits are broadcast during the same clock cycles to all PEs in the processor array block. Control signal e deactivates (e = 0) during the remaining clock cycles to enable the tri-state buffer Tr_2, displayed in Figure 6, to feed the bits of p_d through the shift register SHR-pd to the input of the processor array block, as shown in Figure 4.

4. Control signal v, shown in Figure 5, deactivates (v = 0) at clock cycles T = (i + 1)⌈m/l⌉, 0 ≤ i ≤ m, to force zero bit values onto the P words shown at the leftmost side of the DG in Figure 3. Control signal v activates (v = 1) at the remaining clock cycles to feed the p_d signal through the leftmost MUX of the processor array shown in Figure 5.

5. The resulting output words S are available at the output bus, through register S shown in Figure 4, during clock cycles T ≥ (m − 1)⌈m/l⌉ + 1.
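The activation rules of control signals C, e, and v in steps 1-5 can be tabulated cycle by cycle. The following sketch merely encodes the rules quoted above with r = ⌈m/l⌉; it is a behavioural model we wrote for illustration, not a gate-level description of the accelerator:

```python
import math

def control_schedule(m, l, cycles):
    """Per-cycle tuples (T, C, e, v, out_valid), with cycles numbered from 1.

    C = 1 for the first r cycles (operand loading), else 0.
    e = 1 at cycles T = i*r + 1, 0 <= i <= m, else 0.
    v = 0 at cycles T = (i + 1)*r, 0 <= i <= m, else 1.
    Output words of S are valid once T >= (m - 1)*r + 1.
    """
    r = math.ceil(m / l)
    sched = []
    for T in range(1, cycles + 1):
        C = 1 if T <= r else 0
        e = 1 if (T - 1) % r == 0 and (T - 1) // r <= m else 0
        v = 0 if T % r == 0 and T // r <= m + 1 else 1
        sched.append((T, C, e, v, T >= (m - 1) * r + 1))
    return sched
```

For m = 5 and l = 3 (r = 2), C is active in cycles 1-2, e in the odd cycles, v drops to 0 in the even cycles, and the output becomes valid from cycle 9.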

Complexities Analysis
The area, delay, and consumed-energy complexities of the proposed multiplier are reported and compared to the other efficient word-serial multipliers of [9][10][11][12]. Table 1 summarizes the area and delay complexities of the recommended multiplier and the previously reported efficient word-serial ones. The total count of logic gates/components in the accelerator structure estimates the area complexity. The total number of clock cycles needed to produce the product represents the latency (L) of the multiplier. The total gate delay along the longest path of the logic circuit represents the critical path delay (CPD) of the multiplier structure. The product of latency and CPD estimates the delay complexity. We represent the delays of the 2-input AND, 2-input XOR, and 2-to-1 MUX by the symbols τ_A, τ_X, and τ_MUX, respectively. Table 1. Estimation of area and delay for the adopted word-serial multipliers.

Multiplier | Tri-State | AND | XOR | MUXes | Flip-Flops | Latency | CPD

(1) The area of a three-input XOR is estimated as 1.5× the area of a two-input XOR. (2) The switches in the multiplier of [12] have the same area as a 2-to-1 MUX, as they have the same number of transistors.
The remaining notations in Table 1 are defined by the accompanying formulas. Input and output flip-flops of each multiplier structure are added to the total estimated number of its flip-flops. As we notice from Table 1, the proposed multiplier structure has a significant reduction in area compared to the other multiplier structures due to its area complexity of order O(l).
To quantify the results obtained in Table 1, we modeled the proposed multiplier structure and the adopted ones in the VHDL hardware description language and synthesized them for the recommended field size n = 409 and embedded word sizes l = 8, l = 16, and l = 32. The synthesis was performed using the NanGate Open Cell Library (15 nm, 0.8 V) and Synopsys tools version 2005.09-SP2. The synthesis design parameters reported in Table 2 are as follows:

1. Area (A) results are obtained in terms of two-input NAND gates and are reported in kilogates (kgates).

2. Total computation time (T) is reported in nanoseconds (ns).
3. Consumed power (P) is obtained at a frequency of 1 kHz in milliwatts (mW).
4. Consumed energy (E) is obtained as the product of P and T in femtojoules (fJ).
5. The area-time product (AT) is obtained as the product of A and T in kgates·ns.

Table 2. Performance parameters of the adopted word-serial multipliers for n = 409 and different values of l.

Figures 8-11 compare the obtained results of area (A), area-time product (AT), consumed power (P), and consumed energy (E), respectively, of the proposed multiplier structure with the adopted ones. Figure 8 shows that the proposed multiplier structure saves a significant amount of area, ranging from 76.2% to 98.5% at l = 8, 73.1% to 98.1% at l = 16, and 82.9% to 98.3% at l = 32, compared with the adopted word-serial multipliers. As mentioned before, the saving in area is due to the lower area complexity O(l) of the proposed design compared to the other designs. It is worth noting that the area of the proposed design differs only slightly across the different word sizes. This is due to the inverse relationship between the number of flip-flops and the word size l, as indicated in Table 1. As l increases, the number of flip-flops decreases while the number of basic logic components increases, since the latter is directly proportional to the word size l. The area complexity of a flip-flop is much higher than that of the other gates; as a result, the reduction in flip-flops has a significant impact on the overall area of the proposed multiplier. The net result is only a slight increase in the proposed design's area as its word size increases. Figure 9 displays the obtained area-time (AT) results of the proposed design and the adopted word-serial ones. We can read the results for the different word sizes l as follows:

(i) At word size l = 8, the multiplier that achieves the lowest AT is the multiplier of Pan [9]. It outperforms the proposed design by 27.8% at this word size. On the other hand, the proposed multiplier outperforms the other multipliers in AT at this word size, achieving a maximum reduction of 98.9% over the design of Hua [11]. (ii) At word sizes l = 16 and l = 32, the proposed multiplier achieves a lower AT than the other multiplier structures due to the significant reduction of its latency and computation time at these word sizes. As we notice from Table 1, the latency of the proposed multiplier is inversely proportional to the word size l; as a result, the latency decreases significantly as the word size l increases. Figure 10 shows that the proposed multiplier structure achieves a significant reduction in power consumption, ranging from 74.3% to 99.6% at l = 8, 64.1% to 99.44% at l = 16, and 73.9% to 99.4% at l = 32, compared to the other multiplier designs. The reduction in power is attributed to the lower area complexity of the proposed design over the other designs. The reduction in area reduces the total parasitic capacitance, resulting in a significant reduction in switching activity, which is one of the primary sources of power consumption. Figure 11 shows that the proposed multiplier structure achieves a pronounced reduction in energy, ranging from 61.2% to 98.8% at l = 8, 67.7% to 98.3% at l = 16, and 76.1% to 98.8% at l = 32, compared to the adopted word-serial multipliers. The energy reduction is mainly attributed to the pronounced reduction in the consumed power of the proposed design over the adopted ones.
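The percentage savings quoted throughout this section follow the usual relative-reduction formula, and E and AT are the simple products defined for Table 2. A minimal helper makes the arithmetic explicit (the numbers in the example are hypothetical placeholders, not values taken from Table 2):

```python
def reduction_percent(proposed, reference):
    """Relative saving of the proposed design: 100 * (1 - proposed/reference)."""
    return 100.0 * (1.0 - proposed / reference)

def energy(power, time):
    """E = P * T, as defined for Table 2."""
    return power * time

def area_time(area, time):
    """AT = A * T, as defined for Table 2."""
    return area * time

# Hypothetical example: 1.5 kgates versus a 10-kgate reference design
# corresponds to a 100 * (1 - 1.5/10) = 85% area reduction.
```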
From the previous analysis, we conclude that the recommended word-serial multiplier structure outperforms the competing multiplier structures in terms of area and consumed energy for the different embedded word sizes. This indicates that the proposed multiplier is suitable for edge devices in resource-constrained IoT applications.

Summary and Conclusions
This paper proposes a word-serial accelerator multiplier structure that performs multiplication in GF(2^m). The multiplier was derived using a systematic methodology that applies nonlinear scheduling and projection functions to map the nodes of the algorithm's dependency graph onto parallel processing elements. The main features of the proposed multiplier include its flexibility in managing the accelerator workload and the total number of computation time steps required to produce the output results. The regularity and modularity of the extracted processor array block make the multiplier accelerator well suited to implementation in ASIC technology. The experimental results confirm that the proposed multiplier outperforms the efficient word-serial multipliers previously reported in the literature, in terms of area and consumed energy, for various embedded word sizes, making it more suitable for embedded applications and other resource-constrained IoT applications. In future work, we will use the obtained multiplier structure as a building block for an ECC cryptographic processor to evaluate the overall reduction in the processor's area and consumed energy.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

IoT	Internet of Things
ASIC	Application-Specific Integrated Circuit
ECC	elliptic curve cryptography
DG	dependency graph
VLSI	very large scale integrated circuit
RSA	Rivest-Shamir-Adleman
CPD	critical path delay