Word-Based Systolic Processor for Field Multiplication and Squaring Suitable for Cryptographic Processors in Resource-Constrained IoT Systems

Abstract: Internet of Things (IoT) technology provides practical solutions for a wide range of applications, including, but not limited to, smart homes, smart cities, intelligent grids, intelligent transportation, and healthcare. Security and privacy issues in IoT are considered significant challenges that prohibit its utilization in most of these applications, especially healthcare applications. Cryptographic protocols should be applied at the different layers of the IoT framework, especially edge devices, to solve these security concerns. Finite-field arithmetic, particularly field multiplication and squaring, represents the core of most cryptographic protocols, and its implementation primarily affects protocol performance. In this paper, we present a compact and combined two-dimensional word-based serial-in/serial-out systolic processor for field multiplication and squaring over GF($2^m$). The proposed structure features design flexibility to manage hardware utilization, execution time, and consumed energy. Application-Specific Integrated Circuit (ASIC) implementation results of the proposed word-serial design and the competitive ones at different embedded word sizes show that the proposed structure realizes considerable savings in area and consumed energy, up to 93.7% and 98.2%, respectively. The obtained results enable the implementation of restricted cryptographic primitives in resource-constrained IoT edge devices such as wearable and implantable medical devices, smart cards, and wireless sensor nodes.


Introduction
Internet of Things (IoT) is a modern technology that connects a tremendous number of gadgets, such as smartphones, wearable devices, sensors, vehicles, and smart meters, to the internet [1,2]. It provides services and efficient solutions in numerous domains such as healthcare, smart cities, smart grid, industrial manufacturing, business management, logistics, smart homes, and intelligent transportation [3][4][5][6][7][8].
Security and privacy issues are the primary concern in most IoT-based systems. They prohibit the use of IoT in most applications, especially healthcare applications. Accordingly, efficient and practical security solutions are needed to protect IoT-based systems, and cryptographic protocols should be applied at the different layers of the IoT framework, especially edge devices. Most IoT edge devices have limited computing resources, which makes implementing traditional cryptographic algorithms, such as Rivest, Shamir, and Adleman (RSA) and the Digital Signature Algorithm (DSA) [9], impractical. Due to its short key sizes and enhanced computational efficiency, Elliptic Curve Cryptography (ECC) [10] has become the cryptographic choice for resource-constrained embedded devices such as mobile phones, smart cards, and environmental sensors. ECC's essential operation is point multiplication, which mainly depends on the basic finite-field arithmetic operations of addition, multiplication, squaring, and division/inversion. The ECC processor's overall performance primarily relies on the efficient implementation of these operations. Since the finite-field multiplier is the basic building block of the other field operations of division/inversion and exponentiation, it is considered the fundamental building block of the ECC processor. Therefore, any slight improvement in its implementation results in a significant increase in the ECC processor's overall performance.

Paper Motivation and Related Work
Field multiplication in GF($2^m$) is crucial to several field operations, such as modular exponentiation and inversion/division, as they are performed using a sequence of multiplications. Most of the previously reported multipliers over GF($2^m$) have high area and time complexities that render their realization in resource-constrained IoT edge devices highly challenging [11][12][13][14]. Therefore, it becomes important to have multiplier architectures that target this type of application. Word-serial multiplier architectures are reported in the literature to solve this problem. They offer a trade-off between speed and area complexities and, thus, give the designer more flexibility to reach the desired design. The structures of word-serial multipliers are classified into four types: Serial-In/Serial-Output (SISO) structures, Serial-In/Parallel-Output (SIPO) structures, Parallel-In/Serial-Output (PISO) structures, and scalable structures. The polynomial basis word-serial systolic multipliers using the SISO structure are presented in [15][16][17][18][19]. The polynomial basis word-serial multipliers with the SIPO structure are reported in [20][21][22][23]. The word-serial type-T Gaussian normal basis (GNB) multipliers with the PISO structure are reported in [24]. The scalable systolic multiplier structures are reported in [25][26][27][28][29][30][31].
Modular exponentiation is a fundamental part of cryptographic algorithms. There are two binary approaches used to compute modular exponentiation: the Most Significant Bit (MSB)-first approach and the Least Significant Bit (LSB)-first approach. In the LSB-first approach, the modular multiplication and squaring operations can be executed concurrently to reduce the processing time. There are many attempts in the literature to combine the multiplication and squaring operations in a unified structure to increase performance and hardware utilization [13,14,32]. To the best of our knowledge, the suggested combined multiplier-squarer structures are dedicated to high-speed applications and do not target resource-constrained applications.
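The concurrency available in the LSB-first approach can be illustrated with a short software sketch. The function names `gf2m_mul` and `exp_lsb_first` are hypothetical, polynomials are assumed to be stored as Python integers (bit i is the coefficient of $x^i$), and `g` is assumed to include the $x^m$ term. In each loop iteration the multiplication and the squaring both read the old value of `power` and neither reads the other's result, so a combined multiplier-squarer could evaluate them in the same pass.

```python
def gf2m_mul(a: int, b: int, g: int, m: int) -> int:
    """Multiply a and b in GF(2^m); g is the irreducible polynomial including x^m."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:      # add a * x^i when bit i of b is set
            r ^= a
        a <<= 1               # a <- a * x
        if (a >> m) & 1:      # reduce modulo g when degree reaches m
            a ^= g
    return r


def exp_lsb_first(base: int, e: int, g: int, m: int) -> int:
    """Right-to-left (LSB-first) binary exponentiation in GF(2^m)."""
    result = 1
    power = base
    while e:
        # Both operations depend only on the previous `result` and `power`,
        # so they are independent and could run concurrently in hardware.
        new_result = gf2m_mul(result, power, g, m) if e & 1 else result
        new_power = gf2m_mul(power, power, g, m)
        result, power = new_result, new_power
        e >>= 1
    return result
```

For example, in GF($2^4$) with $G(x) = x^4 + x + 1$ (`g = 0b10011`), `exp_lsb_first(0b0010, 4, 0b10011, 4)` evaluates $x^4 = x + 1$.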

Paper Contribution
In this paper, we propose a word-based two-dimensional SISO systolic processor for combined field multiplication and squaring over GF($2^m$). The main difference between the proposed SISO architecture and the other types of word-serial multiplier architecture is that the proposed multiplier-squarer structure is extracted by using a systematic approach [33][34][35][36][37][38][39]. In contrast, the other word-serial multiplier structures are extracted conventionally, without the use of a specific methodology. The applied approach allows the designer to construct the design in the smallest size in order to fit resource-constrained IoT edge devices that have tighter restrictions on area and power consumption. Moreover, it provides flexibility in managing the execution time and the consumed energy of these devices. Another advantage of the proposed SISO design over other conventional word-serial ones is that it provides a compact and unified structure that simultaneously performs multiplication and squaring operations. By contrast, the traditional designs compute both operations sequentially. Moreover, the proposed design has a regular structure and local interconnections that render it more suitable for VLSI implementation.

Paper Organization
This paper is organized as follows. Section 2 provides a brief explanation of the combined polynomial multiplication-squaring algorithm in GF($2^m$). Section 3 develops its associated dependency graph (DG). Section 4 explains the explored two-dimensional word-based SISO systolic processor. Section 5 provides the area and delay complexities of the proposed design and the best of the existing word-serial designs. Section 6 concludes this study.

Combined Polynomial Multiplication-Squaring Algorithm in GF($2^m$)
Let C(x) and D(x) be two polynomials in GF($2^m$) and G(x) be the irreducible polynomial in standard basis representation. These polynomials can be represented as follows: The polynomial multiplication and squaring over GF($2^m$) can be defined as follows.
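A sketch of these definitions in standard polynomial-basis notation is given here, assuming the usual coefficient indexing with all coefficients in GF(2) and a monic G(x) of degree m. The forms of R(x) and Q(x) are the ones implied by the update rules of Algorithm 1, in which R accumulates the terms $d_{i-1} C_{i-1}$ and Q accumulates the terms $c_{i-1} C_{i-1}$.

```latex
\begin{align*}
C(x) &= \sum_{i=0}^{m-1} c_i x^i, \qquad
D(x) = \sum_{i=0}^{m-1} d_i x^i, \qquad
G(x) = x^m + \sum_{i=0}^{m-1} g_i x^i, \\
R(x) &= C(x)\, D(x) \bmod G(x), \\
Q(x) &= C(x)\, C(x) \bmod G(x).
\end{align*}
```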
The products R(x) and Q(x) can be calculated using the combined algorithm, Algorithm 1, proposed by Choi in [13]. This algorithm calculates three partial polynomials C(x), R(x), and Q(x). Variables $C_i$, $R_i$, and $Q_i$ indicate the values of C(x), R(x), and Q(x) at iteration i, and $d_{i-1}$ and $c_{i-1}$ represent the (i − 1)th coefficients of input polynomials D(x) and C(x), respectively. The initial variables $R_0$ and $Q_0$ are assigned zero values, and the initial variable $C_0$ is assigned the coefficients of input polynomial C(x). In each iteration i of the for loop, the intermediate variables are updated as follows:
• Variable $C_i$ is updated by shifting $C_{i-1}$ left and reducing it using the irreducible polynomial G;
• Variable $R_i$ is updated by multiplying $C_{i-1}$ by coefficient $d_{i-1}$ and adding the result to $R_{i-1}$;
• Variable $Q_i$ is updated by multiplying $C_{i-1}$ by coefficient $c_{i-1}$ and adding the result to $Q_{i-1}$.
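The three update rules above can be sketched as a small software reference model. This is an illustration under stated assumptions rather than the listing of Algorithm 1 itself: the function name `combined_mul_sqr` is hypothetical, polynomials are stored as Python integers with bit i holding the coefficient of $x^i$, and `g` includes the $x^m$ term.

```python
def combined_mul_sqr(c: int, d: int, g: int, m: int) -> tuple[int, int]:
    """Return (R, Q) = (C*D mod G, C*C mod G) over GF(2^m)."""
    c_in = c   # coefficients c_{i-1} of the input polynomial C(x)
    ci = c     # C_0 is initialised with C(x)
    r = 0      # R_0 = 0
    q = 0      # Q_0 = 0
    for i in range(1, m + 1):
        if (d >> (i - 1)) & 1:      # R_i = R_{i-1} + d_{i-1} * C_{i-1}
            r ^= ci
        if (c_in >> (i - 1)) & 1:   # Q_i = Q_{i-1} + c_{i-1} * C_{i-1}
            q ^= ci
        ci <<= 1                    # C_i = x * C_{i-1} ...
        if (ci >> m) & 1:           # ... reduced modulo G
            ci ^= g
    return r, q
```

With $G(x) = x^4 + x + 1$, $C(x) = x^2 + x$, and $D(x) = x^2 + x + 1$, the call `combined_mul_sqr(0b0110, 0b0111, 0b10011, 4)` returns `(1, 7)`, that is, $R(x) = 1$ and $Q(x) = x^2 + x + 1$. Both results come out of the same loop, which is the property the combined hardware structure exploits.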
Algorithm 2 is the bit-level representation of Algorithm 1. Variable $c^i_{j+1}$ represents the (j + 1)th bit of C at the ith iteration. Moreover, $r^i_j$ and $q^i_j$ represent the jth bit of R and Q at the ith iteration, respectively. Notice that the index j + 1 is to be reduced modulo m.
Algorithm 2. Bit-level algorithm for multiplication and squaring over GF($2^m$).
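One possible bit-level form that is consistent with the update rules of Algorithm 1 is sketched below. It is not a transcription of Algorithm 2: the function name `bit_level_mul_sqr`, the list-based representation (index j holds the coefficient of $x^j$), and the column indexing are assumptions, and the paper's own listing indexes the C update by j + 1. Here `g_bits` holds only $g_0, \ldots, g_{m-1}$, with the $x^m$ term implicit.

```python
def bit_level_mul_sqr(c_bits, d_bits, g_bits):
    """Bit-level combined multiply/square over GF(2^m); returns (r_bits, q_bits)."""
    m = len(c_bits)
    c_in = list(c_bits)          # input coefficients c_{i-1}
    c = list(c_bits)             # running value C_{i-1}
    r = [0] * m
    q = [0] * m
    for i in range(1, m + 1):
        d_prev = d_bits[i - 1]   # coefficient d_{i-1}, shared by the whole row
        c_prev = c_in[i - 1]     # coefficient c_{i-1}, shared by the whole row
        msb = c[m - 1]           # bit c^{i-1}_{m-1}, shared by the whole row
        new_c = [0] * m
        for j in range(m):
            r[j] ^= d_prev & c[j]
            q[j] ^= c_prev & c[j]
            shifted = c[j - 1] if j > 0 else 0   # zero enters at the low end
            new_c[j] = shifted ^ (msb & g_bits[j])
        c = new_c
    return r, q
```

Each inner-loop iteration corresponds to one node of a two-dimensional iteration space indexed by (i, j), and the three values `d_prev`, `c_prev`, and `msb` are the ones shared along a row.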

Algorithm Dependency Graph
Algorithm 1 is an example of a Regular Iterative Algorithm (RIA). The authors of [33] showed how to obtain the dependency graph (DG) of an RIA. Figure 1 shows the DG based on Algorithm 2 for combined polynomial multiplication-squaring in GF($2^m$). The nodes in Figure 1 represent points in the two-dimensional integer domain D, with indices i and j indicating the rows and columns, respectively, and they possess the following ranges: The figure is for the case when m = 5 bits. The algorithm has three input variables C, D, and G and two output variables R and Q. Variables R, Q, and G are represented by the vertical lines. Variable C is represented by the slanted lines (red lines). Input bits $c_{i-1}$ and $d_{i-1}$, along with the resulting intermediate bits $c^{i-1}_{m-1}$, are broadcast horizontally. The initial bits $r^0_j$, $q^0_j$, $c^0_j$, and $g_{j+1}$ are inputs to the DG, as shown at the top of Figure 1. The DG nodes (circles) execute the main operations of Algorithm 2 from steps 3 to 5. Output bits $r^m_j$ and $q^m_j$ are produced from the bottom of the DG, as indicated in Figure 1. The DG in Figure 1 can be used for design space exploration of the combined multiplication and squaring operations. The design exploration involves finding valid node scheduling functions and mapping or projecting the graph nodes to processing elements (PEs). Reference [33] explains how design space exploration can be performed by using affine and non-linear scheduling and projection functions.
The affine scheduling and projection functions cannot be used to explore word-serial systolic processors. Thus, our goal is to apply the non-linear scheduling and projection techniques discussed in [33] to the developed algorithm, Algorithm 2, in order to explore the most efficient two-dimensional word-serial systolic processor that is able to satisfy any

Combined Two-Dimensional SISO Multiplier-Squarer
A SISO combined multiplier-squarer requires feeding in polynomials C, D, and G in a word-serial fashion at the start of the iterations and then obtaining the Q and R polynomials in a word-serial fashion. Let us assume that we would like to perform w iterations at the same time; i.e., we would like to feed in w bits of the polynomial inputs and obtain w bits of the partial results. There are several nonlinear task scheduling and projection functions that can be used to obtain different two-dimensional SISO combined multiplier-squarers. The most efficient ones are discussed in the following sections.

Two-Dimensional SISO Task Scheduling
Following the scheduling methodology explained in [33], we can extract the following nonlinear scheduling function to partition D into w × w equitemporal zones: where 1 ≤ i ≤ m + µ and −µ ≤ j < m − 1. Figure 2 shows the node timing (scheduling time) for the case when m = 5 and w = 2. Notice that we added an extra column on the right and an extra row at the bottom to render the number of columns and rows integer multiples of w. In general, we should add µ extra columns and µ extra rows to render the number of columns and rows an integer multiple of w, where $\mu = w \lceil m/w \rceil - m$. Figure 3 shows the node timing (scheduling time) for the case when m = 5 and w = 4. In this case µ = 3; thus, we had to add three extra columns on the right and three extra rows at the bottom to render the number of columns and rows an integer multiple of w. Therefore, the LSBs of inputs C and G and the LSBs of the initial values of intermediate variables R and Q should be padded by µ zeros on the right, as shown in Figure 3. Furthermore, the MSBs of inputs D and C should be padded by µ zeros at the bottom, as shown in the same figure.
The equitemporal zones are shown as light red boxes with the associated time index values indicated in red numerals within each zone. Notice that the bits of $c^{i-1}_{m-1}$ are computed at the nodes of column m − 2, as shown in Figure 3, and broadcast horizontally along with the bits of $d_{i-1}$ and $c_{i-1}$ to the nodes of row i − 1.
One last detail needs to be mentioned here and is best explained with reference to two adjacent equitemporal zones executing at times n and n + 1. Figure 4 illustrates this situation. The north and east inputs to zone i are available at times n and n + 1, respectively. However, we notice that input $C_n$ only affects the west output $C_w$ and $C_e$ only affects the south output $C_s$. Hence, at time n, output $C_w$ is valid while output $C_s$ is not, since $C_e$ still has to be added to it. This results in an increase of one time step in the total number of iterations needed to produce the final result. Therefore, the total number of iterations needed to complete the combined multiplication/squaring computation is given by:
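The padding rule can be checked numerically with a short sketch. It assumes the padding is $\mu = w \lceil m/w \rceil - m$, which agrees with the two worked cases in the text (one extra row and column for m = 5, w = 2, and three for m = 5, w = 4). The helper names `padding_bits` and `padded_grid` are hypothetical.

```python
def padding_bits(m: int, w: int) -> int:
    """Return mu = w * ceil(m / w) - m, the zero bits added to rows and to columns."""
    words = (m + w - 1) // w      # ceil(m / w) using integer arithmetic
    return w * words - m


def padded_grid(m: int, w: int) -> tuple[int, int, int]:
    """Return (mu, padded side length m + mu, number of w-bit zones per side)."""
    mu = padding_bits(m, w)
    side = m + mu
    return mu, side, side // w
```

For the field size used later in the synthesis experiments, `padding_bits(409, w)` evaluates to 7 for each of w = 8, 16, and 32, because 416 is the next multiple of all three word sizes.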

Two-Dimensional SISO Task Projection
Given the scheduling time in Figure 3, we note that only w × w nodes are active at a given time. Following the projection technique explained in [33], we can extract the following nonlinear projection function that maps a point p(i, j) ∈ D of Figure 3 to a point p in the PE space: where "dot" is a place holder for the argument.
Our systolic array will now consist of $w^2$ PEs arranged in w rows and w columns in addition to the necessary registers. Figure 5 shows the word-based two-dimensional SISO systolic processor. Notice that we added two registers for the input C: the north C register feeds the words of operand C to the systolic array starting from the most significant words, while the east register $C_{i-1}$ feeds the words of operand C to the systolic array starting from the least significant words. Figure 6 shows the details of the two-dimensional word-based SISO systolic array for the case when m = 5 and w. The operation of the two-dimensional SISO systolic processor can be summarized for the generic values of m and w as follows:

1. At time n = 1, MUXes $M_C$ and $M_G$, shown in Figure 5, are set to pass the w MSBs of operands C and G, respectively, to the systolic array block. Moreover, the FIFO buffers of R and Q are reset at the same time to pass zero inputs to the systolic array block, since the initial values of R and Q are zeros as indicated in Algorithm 1. Notice that the control signals y and z are set to 0 and 1, respectively, during this time step. The control signal y = 0 enables the tristate buffer shown in Figure 7 for all the light blue PEs of the systolic array (Figure 6) to pass the computed w bits of $C^{i-1}_{m-1}$, 1 ≤ i ≤ w. The computed word of $C^{i-1}_{m-1}$, along with the w LSBs of $D_{i-1}$ and $C_{i-1}$, 1 ≤ i ≤ w, is passed horizontally to the remaining PE nodes of the systolic array. Moreover, the control signal y = 0 forces the bits of $C^i_m$, 1 ≤ i ≤ w, through the AND gate shown in Figure 7 to have zero values, as shown at the left edge of the DG (Figure 3).

2. At time instances $1 < n \le \lceil m/w \rceil$, MUXes $M_C$ and $M_G$ are still set to pass the remaining words of inputs C and G, one word at each time step, to the systolic array. These operand words are used with the horizontally passed words of $C^{i-1}_{m-1}$, $D_{i-1}$, and $C_{i-1}$, 1 ≤ i ≤ w, to compute the intermediate words of R, Q, and C in a word-serial fashion. The resulting words of R, Q, and C are pipelined through the FIFOs of R, Q, and C shown in Figure 5, respectively. These FIFOs have a width of w bits and a depth of u − 1, where $u = \lceil m/w \rceil$. Notice that the depth of the R and Q FIFOs ensures that the initial values of R and Q remain zero during these time instances.

3. At time instances $n > \lceil m/w \rceil$, MUXes $M_C$ and $M_G$ pass the computed C words stored in FIFO-C and the G words stored in FIFO-G to the systolic array, one word at each time step. These words, along with the computed R and Q words that are stored in FIFO-R and FIFO-Q and the broadcast words of $C^{i-1}_{m-1}$, $D_{i-1}$, and $C_{i-1}$, $kw < i \le (k + 1)w$, $1 \le k \le \lceil m/w \rceil - 1$, are used to update the intermediate partial results of R, Q, and C in a word-serial fashion, one word at each time step.

4. At time instances $n = k \lceil m/w \rceil + 1$, $0 \le k \le \lceil m/w \rceil - 1$, the tristate buffer shown in Figure 7 is enabled (y = 0) in all the light blue PEs of the systolic array (Figure 6) to pass horizontally the computed w bits of $C^{i-1}_{m-1}$, $kw < i \le (k + 1)w$, along with the w bits of inputs $D_{i-1}$ and $C_{i-1}$, $kw < i \le (k + 1)w$, to the remaining PE nodes of the systolic array. Notice that the $D_{i-1}$ and $C_{i-1}$ registers, shown in Figure 5, feed the systolic array with the input words of $D_{i-1}$ and $C_{i-1}$ during these time instances. Furthermore, during these time instances, the control signal (y = 0) forces the bits of $C^i_m$, $kw < i \le (k + 1)w$, through the AND gate shown in Figure 7 to have zero values, as shown at the left edge of the DG of Figure 3. At the remaining time instances, this control signal is equal to one.

5. During time instances $n = k \lceil m/w \rceil + 1$, $1 \le k \le \lceil m/w \rceil$, the control signal z shown at the right side of Figure 6 is equal to zero to feed the zero values of C, shown at the right edge of the DG of Figure 3, to the systolic array. At the remaining time instances, this control signal is equal to one.

6. Through time instances n ≥ m w m+µ−1 w , the resulting output words of R and Q will be loaded in a word-serial fashion, one word at each time step, in registers R and Q shown in Figure 5, respectively.
An important note is that the vertical w-bit words of R, Q, and G and the horizontal w-bit words of $C^{i-1}_{m-1}$, $D_{i-1}$, and $C_{i-1}$ are delayed by one time step inside the systolic array, as shown in Figure 6. This is represented by the D registers (squares) shown in this figure. It creates a one-time-step difference between the PEs above the D registers (squares) and the PEs below them. This time difference arises because the intermediate words of C, resulting from the left column (blue cells) of the systolic array shown in Figure 6, are produced starting from the second time step, so the words of R, Q, G, $C^{i-1}_{m-1}$, $D_{i-1}$, and $C_{i-1}$ must be delayed, as shown in Figure 6, to synchronize the operation. This results in the extra time step needed to complete the combined multiplication/squaring computation, as explained before in Equation (8).
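The word streams referred to in the steps above can be sketched in software as follows. The sketch only models how an m-bit operand might be cut into w-bit words and in which order the words are presented; it does not model the PEs, FIFOs, or control signals. The function names are hypothetical. It assumes that the north-side operands (C and G) are padded with µ zeros below the LSB and streamed most significant word first, and that the east-side operands are padded with µ zeros above the MSB and streamed least significant word first; the text states this order for the east C register, and the same order for D is an assumption.

```python
def split_words(value: int, m: int, w: int, pad_lsb: bool) -> list[int]:
    """Split an m-bit operand into w-bit words (least significant word first)."""
    mu = w * ((m + w - 1) // w) - m   # zero padding so that m + mu is a multiple of w
    if pad_lsb:
        value <<= mu                  # mu zeros appended below the LSB
    # Otherwise the mu zeros sit above the MSB, which is implicit for an integer.
    n_words = (m + mu) // w
    mask = (1 << w) - 1
    return [(value >> (k * w)) & mask for k in range(n_words)]


def north_stream(value: int, m: int, w: int) -> list[int]:
    """Words as fed from the north side: LSB-padded, most significant word first."""
    return split_words(value, m, w, pad_lsb=True)[::-1]


def east_stream(value: int, m: int, w: int) -> list[int]:
    """Words as fed from the east side: MSB-padded, least significant word first."""
    return split_words(value, m, w, pad_lsb=False)
```

For m = 5, w = 4, and the operand `0b10110`, `north_stream` yields the words `[0b1011, 0b0000]` and `east_stream` yields `[0b0110, 0b0001]`.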

Experimental Results and Discussion
In this section, we compare the proposed two-dimensional word-serial combined multiplier-squarer structure and the best of the existing word-serial multiplier structures [18,23,40,41] in terms of area and time complexities. The area is estimated in terms of the numbers of tristate buffers, 2-input AND gates, 2-input XOR gates, 2-input multiplexers, and flip-flops. The time is represented by latency and Critical Path Delay (CPD).
The estimated area and time complexities of the compared structures are given in Table 1. In this table, the field size and word size are represented by m and w, respectively. $T_A$ represents the delay of a 2-input AND gate. $T_X$ represents the delay of a 2-input XOR gate. $T_{MUX}$ represents the delay of a 2-to-1 MUX. The notations $F_1$, $F_2$, $F_3$, $L_1$, $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$ are described by the following equations.
$F_1$ represents the number of flip-flops in the design of Pan et al. [18].
$F_2$ represents the number of flip-flops in the design of Hua et al. [40].
$F_3$ represents the number of flip-flops in the design of Chen et al. [41].
$L_1$ represents the latency of the design of Chen et al. [41].
$\tau_1$ represents the critical path delay of the design of Pan et al. [18].
$\tau_2$ represents the critical path delay of the design of Hua et al. [40].
$\tau_3$ represents the critical path delay of the design of Chen et al. [41].
$\tau_4$ represents the critical path delay of the proposed design.
For fair comparison, we added the area complexity of Input/Output registers for each design structure.
By inspecting Table 1, we observe that the expressions representing the estimated number of logic gates or components of the multiplier structures of Pan et al. [18] and Xie et al. [23] are approximately of order $O(mw)$. On the other hand, they are of order $O(w^2)$ for the other multiplier structures, except for the MUX and flip-flop components of the proposed design, which are of order $O(w^2)$ and $O(m/w)$, respectively. Since m is much larger than w, we can conclude that the area complexity of the multiplier structures of Pan et al. [18] and Xie et al. [23] will be higher than that of all the other multiplier structures, including the suggested one. By examining the gate-count expressions of the developed multiplier-squarer and the multipliers of Hua et al. [40] and Chen et al. [41], we recognize that the proposed design has a lower number of flip-flops. The flip-flop area expression is of order $O(m/w)$ for the suggested multiplier-squarer and of order $O(w^2)$ for the multipliers of Hua et al. [40] and Chen et al. [41]. Therefore, for large values of w, the number of flip-flops in the proposed design will be substantially lower than in the other multipliers. According to standard CMOS library data, the flip-flop consumes the largest area on the chip compared to the other gate types. Thus, reducing the number of flip-flops in the design structure considerably reduces the overall area, which accounts for the insignificant increase in the area of the proposed design as w increases. It is worth noting that the suggested design performs multiplication and squaring operations simultaneously, whereas the compared ones perform both operations in sequence; hence, reducing the area of the developed design for different word sizes can be considered a notable achievement.
By examining the latency expressions in Table 1, we can conclude that the multiplier structure of Pan et al. [18] has the lowest latency and that the multiplier structure of Hua et al. [40] has the highest latency among the multiplier structures, including the recommended one. The numerical results displayed in Table 2 show that the proposed design has lower latency than the designs of Hua et al. [40] and Chen et al. [41] and higher latency than the multiplier designs of Xie et al. [23] and Pan et al. [18] for the common field size m = 409. Furthermore, we can conclude from Table 1 that the latency of all multiplier structures typically decreases as the word size w increases, as it is inversely proportional to w.
By analyzing the expressions of the CPD for all word sizes w, we can conclude that the multiplier structures of Xie et al. [23], Hua et al. [40], and Chen et al. [41] have a fixed and lower CPD. On the other hand, the CPD values of Pan et al. [18] and the developed design increase as w increases. We cannot accurately determine, from the estimated expressions, which design structure has the best execution time, due to the difficulty of estimating how much the latency decreases when w increases. The numerical results given in Table 2 settle the question of which multiplier structure presents the better execution time.
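The gap between these growth orders can be made concrete with a short numeric sketch. It evaluates only the leading-order terms $mw$, $w^2$, and $\lceil m/w \rceil$ named above, for the field size m = 409 and the word sizes used in the synthesis experiments; these are not the gate-count expressions of Table 1, and the function name `growth_terms` is hypothetical.

```python
def growth_terms(m: int, w: int) -> dict[str, int]:
    """Leading-order terms only: m*w, w^2, and ceil(m/w)."""
    return {
        "mw": m * w,                    # order of the gate counts in [18] and [23]
        "w2": w * w,                    # order of the remaining gate counts
        "m_over_w": (m + w - 1) // w,   # order of the proposed design's flip-flop count
    }


if __name__ == "__main__":
    for w in (8, 16, 32):
        print(w, growth_terms(409, w))
```

At w = 32, for instance, the $mw$ term is 13088, the $w^2$ term is 1024, and the $\lceil m/w \rceil$ term is 13, which illustrates why a flip-flop count that scales with m/w stays small as w grows.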
The designs in Table 1 are described in VHDL and synthesized for the common field size m = 409 and different values of w (8, 16, 32) to obtain real implementation results. We used the NanGate (15 nm, 0.8 V) Open Cell Library and Synopsys tools version 2005.09-SP2 for synthesis. We used the typical corner ($V_{DD}$ = 0.8 V and $T_j$ = 25 °C) and unit drive strength for all the utilized primitives.
Testing the proposed design starts by evaluating the consumed power at a frequency of 1 kHz for each multiplier structure. Then, through simulation using Mentor Graphics ModelSim SE 6.0a tools, we accumulated the switching activities in the Switching Activity Interchange Format (SAIF) file to obtain the power report. Next, we designed a testbench to simulate the suggested multiplier-squarer structure. The testbench has a single loop over 400 input combinations of 32 bits each, allowing the user to validate the correctness of the outputs. In order to check the correctness of the resulting output, we used an error flag to indicate whether the implemented design is working correctly. If the error flag is '0' at the end of the simulation, the multiplier-squarer structure works correctly. On the other hand, if it is '1', the multiplier-squarer design is not operating correctly. In order to allow the examiner to inspect the output generated from each input set, we utilized a "wait statement" to produce a delay of 50 ns between test vectors. Furthermore, we performed a post-layout simulation to include the additional pin cost and the propagation delay of all gates. Accordingly, we can achieve an accurate evaluation of the area, time, and consumed power. The obtained results are listed in Table 2. The design metrics used to compare the proposed and the existing word-serial designs are defined as follows:

1. Latency: the total number of clock cycles needed to complete a single operation;
2. Area (A): the estimated design area in terms of the equivalent area of a 2-input NAND gate;
3. CPD: the synthesized critical path delay;
4. Time (T): the total computation time required to complete a single operation;
5. Power (P): the consumed power obtained at 1 kHz;
6. Energy (E): the consumed energy, which is obtained by multiplying power (P) by the total computation time (T).
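The relations among these metrics can be sketched as follows. Energy as the product of power and time follows the definition above; computing the time as latency multiplied by CPD is an assumption (the usual convention for a design clocked at its critical path delay) and is not stated explicitly in the text. The function names and any numbers passed to them are illustrative and are not values from Table 2.

```python
def computation_time(latency_cycles: float, cpd: float) -> float:
    """Total time T, assuming one clock period equals the critical path delay."""
    return latency_cycles * cpd


def energy(power: float, time: float) -> float:
    """Consumed energy E = P * T, as in the metric definitions."""
    return power * time


def saving_percent(reference: float, proposed: float) -> float:
    """Percentage saving of the proposed value relative to a reference design."""
    return 100.0 * (reference - proposed) / reference
```

For example, `saving_percent(200.0, 50.0)` returns `75.0`; the area, power, and energy savings quoted for the proposed design are percentages of this kind, taken against each competing design.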
For a fair comparison, the compared multiplier structures of [18,23,40,41] should perform multiplication and squaring operations in sequence, and this doubles the obtained synthesis results of the time and consumed power/energy of these designs, as indicated in Table 2. For a better explanation of the obtained results, we visualized the area, time, power, and energy results by using the charts shown in Figures 9-12, respectively. Figure 9 indicates that the proposed design structure saves area at the different values of w by percentages ranging from 9.1% to 92.6% at w = 8, 11.6% to 93.7% at w = 16, and 20.7% to 91.9% at w = 32 over the existing designs. The design of Pan et al. [18] saves 45.8% and 9.3% time at w = 8 and w = 16, respectively, over the best of the other designs, including the proposed one. The design of Xie et al. [23] saves 18.4% time at w = 32 over the best of the other designs, including the proposed one. Figure 10 indicates that the multiplier of Pan et al. [18] has the lowest computation time at w = 8 among all the designs, including the recommended one (at least 40% lower time than the multiplier of Xie et al. [23]). At w = 16 and w = 32, the multiplier of Xie et al. [23] has the lowest computation time among the designs (a reduction of at least 0.6% at w = 16 and 6.5% at w = 32 over the multiplier of Pan et al. [18]). The multiplier of Xie et al. [23] outperforms the remaining designs at w = 16 and w = 32 due to the significant reduction in its latency compared to the other multiplier designs. Figure 11 indicates that the developed design has the lowest power consumption at all word sizes due to its significant reduction in area. The savings in area reduce the parasitic capacitance and, thus, the switching activities in the entire circuit, which are among the main contributors to power consumption.
We noticed from Table 2 that the proposed design reduces power consumption at different word sizes by percentages ranging from 10.8% to 98.5% at w = 8, 12.5% to 98.4% at w = 16, and 34.7% to 98.5% at w = 32 over the existing designs. Figure 12 indicates that the proposed design structure achieves a significant reduction in energy over all the remaining multipliers at the different values of w. By observing the energy results in Table 2, we notice that the proposed design saves energy at the different values of w by percentages ranging from 9.6% to 97.3% at w = 8, 17.1% to 97.5% at w = 16, and 29.0% to 98.2% at w = 32 over the existing designs. The reduction in the consumed energy of the proposed design, at all word sizes, is mainly attributed to the fair values of its execution time (T) and the lower values of the consumed power. As we notice, the proposed design has the lowest area, power, and consumed energy at all of the embedded word sizes. Thus, it enables the implementation of cryptographic processors in resource-constrained IoT edge devices such as hand-held devices, wearable and implantable medical devices, wireless sensor nodes, smart cards, and radio frequency identification (RFID) devices.

Summary and Conclusions
This paper presented a new efficient two-dimensional word-based SISO systolic processor for performing the multiplication and squaring operations concurrently over GF($2^m$). The proposed systolic processor structure shares the data path, and this results in saving more area and power resources. We applied non-linear scheduling and projection functions to the algorithm dependency graph to explore the proposed systolic processor core. The applied non-linear scheduling and projection functions give the designer more flexibility to control the processor workload and the execution time. The size of the systolic array in the processor core does not depend on the field size, which renders the proposed design more suitable for implementation in embedded and ultra-low-power devices. Implementation results of the proposed two-dimensional combined word-serial systolic processor structure and the best of the existing word-serial multiplication designs show that the proposed structure achieves significant savings in area and consumed energy at different values of the embedded word size. This renders it more suitable for constrained implementations of cryptographic primitives in resource-constrained IoT edge devices such as hand-held devices, wearable and implantable medical devices, wireless sensor nodes, smart cards, and radio frequency identification (RFID) devices.