This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Typically, commercial sensor nodes are equipped with MCUs clocked at a low frequency (

Wireless Medical Sensor Networks (WMSNs) have several benefits. New medical infrastructure can replace wired telemetry applications. This is important in fields related to ambulatory monitoring or rehabilitation, where WMSNs can provide additional flexibility [

Generally, medical applications utilize commercial sensor nodes based on low-power MCUs. Further, these nodes generally utilize a 2.4 GHz transmitter based on the IEEE 802.15.4 communication protocol [

In this manuscript, we investigate the role of FPGAs in the development of infrastructure for sensor networks. In this respect, we explore a variety of topics:

How an authentication-encryption (AE) mode of a block cipher (AES) can be implemented by maximizing the utilization of the embedded resources of the FPGA, such as the DSP blocks (Section 3).

How finite field arithmetic (e.g., addition and multiplication) can be implemented through the DSP blocks of the FPGA for achieving a reduction in area (Section 4).

How cryptographic accelerators can be implemented in FPGA-based nodes or nodes based on the combination of MCU and FPGA for extending the IEEE 802.15.4 security suite with key establishment schemes (Section 4.4).

Finally, we present the design of a cryptographic core, implemented in VHDL and utilizing the described components. All the resources of the FPGA are optimally used for the implementation of the different cryptographic algorithms, based on known designs, with a good trade-off between speed and area. The proposed design can be used to accelerate and perform massive encryption and authentication primitives in applications with a large number of nodes, such as a patient monitoring application, either based on a Wireless Sensor Network (WSN) or Wireless Body Area Network (WBAN).

This manuscript is structured as follows. First, in Section 2, we describe other implementations of the IEEE 802.15.4 security suite that have been proposed in the literature and summarize our contributions. Then, in Section 3, we outline our implementation. In Section 4, we detail the proposed implementation of the NIST P-192 and B-163 curves. Next, in Section 5, we combine the designs sketched out in Sections 3 and 4. This results in a cryptographic accelerator compliant with the IEEE 802.15.4 standard and extended with Elliptic Curve Cryptography (ECC) capabilities that can be compared with other implementations in the literature. Finally, we describe our future work in Section 6 and end in Section 7 with some conclusions.

Several authors have proposed FPGA-based designs compliant with the IEEE 802.15.4 in the literature. Hamalainen

Our design attempts to improve those architectures according to the following facts. First, we have selected a low-power FPGA (Artix-7). In contrast, Song

Besides, the utilization of DSPs for constructing large multipliers in cryptographic designs is not uncommon in the literature. Güneysu

The IEEE 802.15.4 standard utilizes cryptographic techniques based on symmetric-key cryptography for ensuring data confidentiality, authenticity, integrity and replay protection [

AES-128 requires 10 rounds for each encryption process. In each round, four different operations manipulate an internal state of 16 bytes. These operations are based on the GF(2^{8}) extension field. The elements of this field are expressed as polynomials according to the form a_{7}x^{7} + … + a_{1}x + a_{0}. The set of coefficients of each polynomial forms an eight-bit vector, represented in GF(2^{8}) and

Only the encryption part of the AES is reviewed here, since its decryption part is not utilized in the CCM mode. The inner four operations of each round in the AES encryption are the following. The ^{8}) inversion in tandem with an ^{8}) multiplication and the addition of an eight-bit constant (^{8}) multiplications of a 4 × 4 matrix made of constants.
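The GF(2^{8}) arithmetic underlying these operations can be illustrated with a minimal software sketch. It assumes the reduction polynomial x^{8} + x^{4} + x^{3} + x + 1 specified for AES in FIPS-197; the function names `gf256_mul` and `gf256_inv` are hypothetical, and the code models the arithmetic only, not the hardware architecture described in this manuscript.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (AES reduction polynomial, FIPS-197)


def gf256_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements given as 8-bit coefficient vectors."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a      # addition in GF(2^8) is a bitwise XOR
        b >>= 1
        a <<= 1              # multiply a by x
        if a & 0x100:
            a ^= AES_POLY    # reduce when the degree reaches 8
    return result


def gf256_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254 (0 maps to 0, as in the S-box)."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a)
    return result
```

For instance, `gf256_mul(0x57, 0x83)` reproduces the worked product given in FIPS-197. A hardware design would typically replace the exponentiation loop with a lookup table or composite-field logic.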

The

By using the AES

The architecture of the

Since the CCM mode only requires the encryption part of AES, it can be implemented with two extra XOR gates of 32 bits (

In this section, we describe how the finite field arithmetic of two standardized curves (particularly the B-163 and the P-192 curves [

ECC was independently proposed by Victor Miller in 1985 and by Neal Koblitz in 1987 [

Elliptic Curves (ECs) are generally represented over prime fields (F_{p}) or binary extension fields (GF(2^{m}) or F_{2^{m}}). The latter is generally preferred for hardware implementations, since the main operations are based on logic functions and shifts.

Prime fields in the form of _{192} = 2^{192} − 2^{64} − 1 [

On the other hand, in binary extension fields in the form of ^{m}^{163} + ^{7} + ^{6} + ^{3} + 1 [

However, in order to optimize the implementation of ECC arithmetics and avoid implementing the division operation, a number of inverse-free coordinate systems have been proposed in the literature. The importance of selecting a coordinate system stems from the fact that a reduced number of either additions or multiplications is preferred in an energy-constrained design. Therefore, in order to reduce the number of cycles required for performing a point operation in a cryptographic implementation, it is important to carefully choose the coordinate system. In the next section, we describe a number of coordinate systems generally utilized in the literature. We utilize [

Elliptic curves over prime fields are defined by the simplified Weierstrass equation y^{2} = x^{3} + a_{4}x + a_{6}, with a_{4}, a_{6} ∈ F_{p}

Standard projective coordinates utilize triples represented by (X_{1}, Y_{1}, Z_{1}). They are derived from an affine point (X_{1}/Z_{1}, Y_{1}/Z_{1}), Z_{1} ≠ 0. In this system of coordinates, a point addition (PA) requires 12 multiplications (12M) and two squarings (2S), whereas a point doubling (PD) requires seven multiplications (7M) and five squarings (5S). Besides, Jacobian coordinates utilize triples, (X_{1}, Y_{1}, Z_{1}), derived from the affine point (X_{1}/Z_{1}^{2}, Y_{1}/Z_{1}^{3}), Z_{1} ≠ 0. The PA and PD require 12M + 4S and 8M + 3S operations, respectively.

Finally, Chudnovsky-Jacobian coordinates utilize points represented with five coordinates _{1}, _{1}, _{1},

On the other hand, in F_{2^{m}}, curves are defined by the equation y^{2} + xy = x^{3} + a_{2}x^{2} + a_{6}, with a_{2}, a_{6} ∈ F_{2^{m}}

Similarly to prime fields, projective and Jacobian coordinates can be utilized. Standard projective coordinates require 16M + 2S (PA) and 8M + 4S (PD) operations, whereas with Jacobian coordinates a PA is performed in 16M + 3S operations and a PD in 11M + 3S. Besides, the López-Dahab (LD) system of coordinates derives the triple, (X_{1}, Y_{1}, Z_{1}), from the affine point (X_{1}/Z_{1}, Y_{1}/Z_{1}^{2}), Z_{1} ≠ 0. Performing a PA via LD coordinates requires 13M + 4S operations, whereas PD is performed in 5M + 4S operations.
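To make the impact of the coordinate system concrete, the following sketch estimates the cost of a point multiplication from the operation counts quoted in this section. It is a rough model under stated assumptions: a plain double-and-add ladder with one PD per scalar bit and one PA for every second bit on average, and a squaring cost equal to one multiplication unless `s_over_m` says otherwise. The names `COSTS` and `double_and_add_cost` are hypothetical.

```python
# (multiplications, squarings) per point addition (PA) and point doubling (PD),
# copied from the operation counts quoted in the text and tables.
COSTS = {
    "prime": {
        "standard projective": {"PA": (12, 2), "PD": (7, 5)},
        "Jacobian": {"PA": (12, 4), "PD": (8, 3)},
        "Chudnovsky-Jacobian": {"PA": (11, 3), "PD": (5, 6)},
    },
    "binary": {
        "standard projective": {"PA": (16, 2), "PD": (8, 4)},
        "Jacobian": {"PA": (16, 3), "PD": (11, 3)},
        "Lopez-Dahab": {"PA": (13, 4), "PD": (5, 4)},
    },
}


def double_and_add_cost(field, system, bits, s_over_m=1.0):
    """Estimated cost, in field multiplications, of a bits-bit point multiplication.

    Assumes double-and-add: `bits` doublings plus `bits / 2` additions on average.
    """
    pa_m, pa_s = COSTS[field][system]["PA"]
    pd_m, pd_s = COSTS[field][system]["PD"]
    pa = pa_m + s_over_m * pa_s
    pd = pd_m + s_over_m * pd_s
    return bits * pd + (bits / 2) * pa


if __name__ == "__main__":
    for system in COSTS["binary"]:
        print(system, double_and_add_cost("binary", system, 163))
```

Under these assumptions, the López-Dahab system yields the lowest estimate among the binary-field options for a 163-bit scalar, which is consistent with the preference for fewer multiplications expressed in the text.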

According to

In the case of the B-163 curve, we have selected the LD coordinates, since it requires a reduced number of multiplications in comparison with the standard projective and Jacobian coordinates (

In this section, we describe our implementation of the P-192 curve operations. These components are utilized for extending the IEEE 802.15.4 security suite using key negotiation schemes based on ECC.

Integer modular addition and subtraction are performed mod p_{192} = 2^{192} − 2^{64} − 1 in the P-192 curve. Algorithms 1 and 2 represent modular addition and subtraction mod p_{192}, respectively.

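Algorithm 1. Modular addition mod p_{192}.
Input: a = (a_{191}, …, a_{0}) and b = (b_{191}, …, b_{0}), modulus p_{192} = 2^{192} − 2^{64} − 1.
Output: c = a + b mod p_{192}.
1: c_{1} = a + b
2: c_{2} = c_{1} − p_{192}
3: if c_{2} ≥ 0 then
4: return c_{2}
5: else
6: return c_{1}
7: end if

Algorithm 2. Modular subtraction mod p_{192}.
Input: a = (a_{191}, …, a_{0}) and b = (b_{191}, …, b_{0}), modulus p_{192} = 2^{192} − 2^{64} − 1.
Output: c = a − b mod p_{192}.
1: c_{1} = a − b
2: c_{2} = c_{1} + p_{192}
3: if c_{1} < 0 then
4: return c_{2}
5: else
6: return c_{1}
7: end if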
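The modular addition and subtraction mod p_{192} described above can be sketched in software as follows. This is a behavioural model only, assuming both operands are already reduced (0 ≤ a, b < p_{192}); the function names `mod_add` and `mod_sub` and the intermediate names `c1` and `c2` are hypothetical.

```python
P192 = 2**192 - 2**64 - 1  # NIST P-192 prime modulus


def mod_add(a: int, b: int, p: int = P192) -> int:
    """Modular addition: one addition, one trial subtraction, one selection."""
    c1 = a + b
    c2 = c1 - p
    return c2 if c2 >= 0 else c1


def mod_sub(a: int, b: int, p: int = P192) -> int:
    """Modular subtraction: one subtraction, one trial addition, one selection."""
    c1 = a - b
    c2 = c1 + p
    return c2 if c1 < 0 else c1
```

Both results are computed unconditionally and only the final selection depends on the sign, which mirrors a datapath with two adders followed by a multiplexer.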

The DSP48E1 block [

Consequently, we utilize the DSP blocks as 48-bit adders with carry support. We have utilized four DSP blocks for implementing a full 192-bit operation. In order to optimize the design of the adder/subtractor and perform both operations using only one component, we rely on the design proposed by [ 2^{k} c_{2} is computed as c_{1} + (2^{k} − p_{192}) instead of c_{1} − p_{192} in the addition process (

The addition of two operands (e.g., A and B) requires one cycle in the DSP block. Then, an extra cycle is required to propagate the carry among the blocks. Consequently, four cycles are required for performing one modular addition or subtraction, since there are two 192-bit adders in the proposed design.
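The following sketch models how a 192-bit modular addition might be split into four 48-bit additions with carry propagation, followed by a second adder that adds the constant 2^{k} − p_{192} rather than subtracting p_{192}. It assumes k = 192 and reduced operands; the helper names (`to_limbs`, `add_192`, `mod_add_192`) are hypothetical, and the cycle-level behaviour of the DSP blocks is not modelled.

```python
LIMB_BITS = 48           # width of one DSP adder
LIMBS = 4                # 4 x 48 bits = 192 bits
MASK = (1 << LIMB_BITS) - 1
P192 = 2**192 - 2**64 - 1


def to_limbs(x: int) -> list:
    """Split a 192-bit integer into four 48-bit limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(LIMBS)]


def add_192(a: int, b: int):
    """Return (sum mod 2^192, carry_out) using four chained 48-bit additions."""
    carry = 0
    out = 0
    for i, (x, y) in enumerate(zip(to_limbs(a), to_limbs(b))):
        s = x + y + carry
        out |= (s & MASK) << (LIMB_BITS * i)
        carry = s >> LIMB_BITS       # carry into the next limb
    return out, carry


def mod_add_192(a: int, b: int) -> int:
    """Modular addition mod p192 built from two 192-bit adders."""
    c1, carry1 = add_192(a, b)                  # first adder: a + b
    c2, carry2 = add_192(c1, 2**192 - P192)     # second adder: c1 + (2^192 - p192)
    # a + b >= p192 exactly when one of the two additions overflows 192 bits
    return c2 if (carry1 or carry2) else c1
```

Adding 2^{192} − p_{192} lets the carry-out act as the comparison result, so no separate subtractor or comparator is needed.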

The NIST curves utilize pseudo-Mersenne primes for performing fast reductions using only additions and subtractions [

The reduction consists of four additions that can be executed in the adder/subtractor. Consequently, a modular reduction can be achieved in 16 cycles.

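Algorithm. Fast reduction modulo p_{192}.
Input: c = (c_{0}, …, c_{5}), where each c_{i} is a 64-bit word.
Output: c mod p_{192}.
1: s_{0} = (c_{2}, c_{1}, c_{0})
2: s_{1} = (0, c_{3}, c_{3})
3: s_{2} = (c_{4}, c_{4}, 0)
4: s_{3} = (c_{5}, c_{5}, c_{5})
5: return s_{0} + s_{1} + s_{2} + s_{3} mod p_{192}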
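The fast reduction can be checked against a generic modulo operation with a short sketch. It relies on the identity 2^{192} ≡ 2^{64} + 1 (mod p_{192}), which follows directly from the definition of the modulus. The names `reduce_p192` and `join` are hypothetical, and the final loop of conditional subtractions is a software simplification rather than the authors' datapath.

```python
P192 = 2**192 - 2**64 - 1
WORD = 64
MASK64 = (1 << WORD) - 1


def reduce_p192(c: int) -> int:
    """Reduce a value of up to 384 bits modulo p192 using word re-arrangement."""
    w = [(c >> (WORD * i)) & MASK64 for i in range(6)]   # words c_0 .. c_5

    def join(hi: int, mid: int, lo: int) -> int:
        """Build a 192-bit value from three 64-bit words."""
        return (hi << 128) | (mid << 64) | lo

    s0 = join(w[2], w[1], w[0])
    s1 = join(0, w[3], w[3])
    s2 = join(w[4], w[4], 0)
    s3 = join(w[5], w[5], w[5])
    r = s0 + s1 + s2 + s3
    while r >= P192:          # at most a few subtractions are needed
        r -= P192
    return r
```

Because the sum of the four terms is below 4 · 2^{192}, only a small, bounded number of final subtractions can occur.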

The DSP48E1 block supports 25 × 18-bit multiplications, which can optionally be coupled with a 48-bit accumulator. Generally, the multiplication operation is based on two main operations. First, a group of partial products are computed. Then, they are shifted and accumulated for generating the final result.

In the literature, multiplication techniques are generally categorized among parallel and sequential multipliers [

Given that we can process a 25 × 18-bit product at a time, we can use several DSPs for generating and accumulating the partial products in parallel. In this case, since we work with 192-bit operands, they can be decomposed into 12 segments of 16 bits and processed using 16 × 16-bit multiplications. This decomposition is based on the addition of 12 segments shifted

If we operate the product, ^{2} = 144 partial products that can be added according to the displacement, 2^{k}^{k}^{22}^{k}

Finally, 23 accumulated partial products can be added together for obtaining the final result. This is done using one DSP block in addition mode. This operation is based on shifting each partial product ^{ik}_{23}_{k}_{1} ≪ _{0}.

Each MACC operation requires an initial delay (one cycle) to fill the pipeline of the DSP block and an extra cycle for each subsequent multiplication and addition. At the same time, the results of each MACC are accumulated in another DSP block, selected by a multiplexer coupled to a counter. However, given that the first half of the partial accumulations (indices 0–11) and the second half (indices 12–22) are generated at the same time, the second half is stored in a BRAM while the first one is processed. Then, this BRAM is read through a counter and added (

Since all the partial products are computed in parallel and are added as soon as the first accumulated partial product (index 0) is generated, the number of cycles for computing a multiplication is 23 + 1 (MACC delay) + 1 (DSP addition) = 25 cycles.
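The decomposition into 16-bit segments can be modelled in a few lines. The sketch below forms the 12 × 12 = 144 partial products, accumulates them into 2 · 12 − 1 = 23 columns (one per displacement of 16 bits) and finally adds the shifted columns. It is a functional model only: the names `segments` and `mul_192` are hypothetical, and the parallel scheduling across DSP blocks and the BRAM buffering are not represented.

```python
SEG_BITS = 16
SEGS = 12                      # 12 x 16 bits = 192 bits
SEG_MASK = (1 << SEG_BITS) - 1


def segments(x: int) -> list:
    """Split a 192-bit operand into twelve 16-bit segments, least significant first."""
    return [(x >> (SEG_BITS * i)) & SEG_MASK for i in range(SEGS)]


def mul_192(a: int, b: int) -> int:
    """Full 384-bit product of two 192-bit operands via 16 x 16-bit partial products."""
    a_seg, b_seg = segments(a), segments(b)
    acc = [0] * (2 * SEGS - 1)                   # 23 accumulators
    for i in range(SEGS):
        for j in range(SEGS):
            acc[i + j] += a_seg[i] * b_seg[j]    # one multiply-accumulate step
    # Final addition: shift each accumulated column into place and sum.
    return sum(col << (SEG_BITS * k) for k, col in enumerate(acc))
```

Each accumulator sums at most 12 products of 32 bits, so its value stays below 2^{36} and fits comfortably in a 48-bit accumulator.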

In this section, we describe how we have implemented the different units for performing operations in GF(2^{163}). These operations are then contrasted with those of the NanoECC and TinyECC libraries in Section 5.

Addition in ^{m}^{163}) (

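Algorithm. GF(2^{m}) multiplication in GF(2^{163}).
GF(2^{163}) multiplication ∀ a, b ∈ GF(2^{163}), with irreducible polynomial x^{163} + x^{7} + x^{6} + x^{3} + 1.
Input: a = (a_{0}, …, a_{162}), b = (b_{0}, …, b_{162}) ∈ GF(2^{163}).
Output: c = (c_{0}, …, c_{162}) ∈ GF(2^{163}).
1: … _{i} … _{162} = 1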

In our design, the GF(2^{163}) multiplication operation is performed using a bit-serial approach. The product is then reduced by an irreducible polynomial (x^{163} + x^{7} + x^{6} + x^{3} + 1,
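A bit-serial multiplication in GF(2^{163}) can be sketched as follows. The code uses the common most-significant-bit-first shift-and-add formulation with interleaved reduction, which is an assumption: the exact variant used in the design is not stated in the surviving text. Field elements are held as Python integers whose bit i is the coefficient of x^{i}, and the names `gf2m_add` and `gf2m_mul` are hypothetical.

```python
M = 163
# Irreducible polynomial x^163 + x^7 + x^6 + x^3 + 1 of the B-163 field
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2m_add(a: int, b: int) -> int:
    """Addition in GF(2^m) is a bitwise XOR of the coefficient vectors."""
    return a ^ b


def gf2m_mul(a: int, b: int) -> int:
    """MSB-first bit-serial multiplication: one bit of b is consumed per iteration."""
    c = 0
    for i in range(M - 1, -1, -1):
        c <<= 1                 # multiply the running result by x
        if c >> M:              # degree reached 163: reduce by XORing f(x)
            c ^= F
        if (b >> i) & 1:
            c ^= a              # conditionally add the multiplicand
    return c
```

The loop runs 163 times, one iteration per operand bit, which matches the 163 cycles reported for the B-163 multiplier.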

The IEEE 802.15.4 standard does not describe how keys are generated. Those operations are supposed to be provided by the protocol upper layers. Since shared keys need to be renegotiated by the intended parties before the message counter overflows (

ECDH is a key agreement protocol that establishes a shared secret between two non-authenticated parties. It follows a similar approach as the Diffie-Hellman (DH) key exchange [

The strength of ECDH resides in the Elliptic Curve Discrete Logarithm Problem (ECDLP), _{G}
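The ECDH exchange can be illustrated with a deliberately tiny example. The sketch below does not use P-192 or B-163: it assumes the textbook toy curve y^{2} = x^{3} + 2x + 2 over GF(17) with base point G = (5, 1) of order 19, uses affine coordinates, and picks arbitrary private scalars. All names are hypothetical, and a curve of this size offers no security.

```python
# Toy parameters for illustration only (not a NIST curve).
P, A, B = 17, 2, 2        # curve y^2 = x^3 + A*x + B over GF(P)
G = (5, 1)                # base point
N = 19                    # order of G


def point_add(p1, p2):
    """Add two affine points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                                   # p2 is the inverse of p1
    if p1 == p2:
        s = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P   # tangent slope
    else:
        s = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P          # chord slope
    x3 = (s * s - x1 - x2) % P
    y3 = (s * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mult(k, point):
    """Right-to-left double-and-add point multiplication."""
    result = None
    while k:
        if k & 1:
            result = point_add(result, point)
        point = point_add(point, point)
        k >>= 1
    return result


# Each party keeps a private scalar and publishes the matching point.
d_a, d_b = 3, 7                                  # arbitrary example private keys
q_a, q_b = scalar_mult(d_a, G), scalar_mult(d_b, G)
shared_a = scalar_mult(d_a, q_b)                 # computed by party A
shared_b = scalar_mult(d_b, q_a)                 # computed by party B
```

Both parties arrive at the same point because d_a(d_b G) = d_b(d_a G). Each side performs two point multiplications, one for its public key and one for the shared secret. The modular inverse via `pow(x, -1, P)` requires Python 3.8 or later.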

On the other hand, ECIES is an authenticated encryption protocol based on EC. ECIES has been standardized by several organizations, such as ANSI, IEEE, SECG and ISO/IEC [

ECIES consists of three main components:

A primitive that generates a MAC for authenticating each message. It can be based on HMAC-SHA-1, HMAC-SHA-2 (where SHA denotes the secure hash algorithm) or AES-CBC-MAC. Since we have already implemented AES-CBC-MAC for supporting the IEEE 802.15.4 security suite, we have selected this technique.

A Key Derivation Function (KDF) that generates a shared key. In this case, two standards are supported: X.9.63-KDF and NIST 800-5. We have selected X.9.63-KDF, which consists of a message digest generation via SHA-1 or SHA-2. In this respect, we have implemented SHA-256 for computing the KDF (Section 4.4.3).

A symmetric encryption algorithm, either based on XOR or AES with 128, 192 and 256 key-lengths in CBC or CTR modes. Moreover, Triple DES (3DES) in CBC mode is also supported. Hence, we rely on AES-128 in CTR or CBC mode since it is available from the implementation of the IEEE 802.15.4 security suite (Section 3.1).
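As an illustration of the key derivation component, the sketch below follows the counter-based construction commonly attributed to ANSI X9.63 and SEC 1, in which each output block is the hash of the shared secret, a 32-bit big-endian counter starting at 1, and optional shared information. It uses SHA-256 from the Python standard library. The function name `x963_kdf` and the 16-byte split into an encryption key and a MAC key are assumptions made for the example.

```python
import hashlib


def x963_kdf(shared_secret: bytes, key_len: int, shared_info: bytes = b"") -> bytes:
    """Derive key_len bytes as SHA-256(Z || counter || SharedInfo), counter = 1, 2, ..."""
    out = b""
    counter = 1
    while len(out) < key_len:
        block = hashlib.sha256(
            shared_secret + counter.to_bytes(4, "big") + shared_info
        ).digest()
        out += block
        counter += 1
    return out[:key_len]


def derive_ecies_keys(shared_x: bytes, shared_info: bytes = b""):
    """Split 32 derived bytes into a 128-bit encryption key and a 128-bit MAC key."""
    material = x963_kdf(shared_x, 32, shared_info)
    return material[:16], material[16:]
```

With AES-128 for both encryption and CBC-MAC, a single SHA-256 invocation already yields the 32 bytes of key material needed, so the loop runs only once in that configuration.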

Moreover, two parameters are required before a message is sent from A to B:

The public key of party B generated as K_{B} = k_{b}G, where k_{b} is the private key of B.

Additional information, represented as S_{{1,2}}.

The first part of the scheme derives a shared secret, _{E}_{MAC}_{E}_{MAC}_{1}). Then, the message is encrypted and authenticated. Finally, the material for deriving the shared secret,

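Algorithm. ECIES encryption.
Input: message m, public key K_{B}.
1: generate a random number r and compute R = rG
2: P = (P_{x}, P_{y}) = rK_{B}
3: S = P_{x}
4: k_{E} ∥ k_{MAC} = KDF(S ∥ S_{1})
5: c = E_{k_{E}}(m)
6: d = MAC_{k_{MAC}}(c ∥ S_{2})
7: return R ∥ c ∥ d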

Consequently, the number of operations by a node that encrypts and sends a message through ECIES (A) consists of two point multiplications, the generation of keys (_{E}_{MAC}

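Algorithm. ECIES decryption.
Input: R ∥ c ∥ d, private key k_{b}.
1: P = (P_{x}, P_{y}) = k_{b}R = k_{b}rG = rK_{B}
2: S = P_{x}
3: k_{E} ∥ k_{MAC} = KDF(S ∥ S_{1})
4: if d ≠ MAC_{k_{MAC}}(c ∥ S_{2}) then reject the message
5: m = E^{−1}_{k_{E}}(c)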

As noted before, SHA-256 has been implemented in the proposed accelerator to perform the KDF during the key establishment process. The secure hash algorithm, SHA-256, is part of the SHA-2 family, standardized by NIST [_{i}_{j}

The message scheduler is initialized with the padded message at the beginning of the hash computation, whereas the main pipeline registers (_{0}, Σ_{1}, _{0} and _{1}) defined in [

We have constructed two accelerators based on the NIST curves, B-163 and P-192 (^{163}) have also been utilized. Finally, a Finite State Machine (FSM) orchestrates the execution of PA, PD and PM primitives between the different components of the core (

Since the number of pins available in the target FPGA (Artix-7) is not enough for supporting two input operands and one output operand of 128/163/192 bits, we rely on a simplified slave bus interface based on the Wishbone interconnection standard [

We have performed software power analysis in the designs described in this manuscript through the Xilinx Power Analyzer (XPA) [

We have depicted the PAR results of each implemented arithmetic circuit for performing operations on the P-192 and B-163 curves in p_{192} modulus in three blocks of BRAM in the P-192 adder/subtractor. Moreover, the P-192 multiplier utilizes one block of BRAM for storing the second half of the partial products, while the first half is being accumulated. Finally, the B-163 multiplier stores the GF(2^{163}) irreducible polynomial in two blocks of BRAM.

According to

Finally, the area is also dominated by the slices required by the SHA-256 implementation together with the set of registers that stores three pairs of coordinates in projective and LD form. We have implemented all the 32-bit arithmetic and logic operations of the SHA-256 algorithm via XOR gates, obtaining a reduction in area of 19.91% (

We have generated a post-PAR simulation model of the P-192 and B-163 accelerators. First, we have simulated the execution of several operations for generating the corresponding signal activity file at 10 MHz. The selection of this frequency stems from the fact that this accelerator will run at the typical frequency that

Given the area utilization of the SHA-256 implementation, this is the component of the accelerator that requires more power (53 mW in the P-192 accelerator and 49 mW in the B-163). The rest of the operations are executed in the B-163 accelerator with a reduction of 2–8 mW in comparison with the P-192 implementation, according to the achieved reduction in area (Section 5.3). Moreover, despite that the B-163 operations are performed through smaller operands, the fact that the ^{163}) multiplication requires 19.25 ^{163}) multiplications can improve both the time and energy consumption.
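The relation between execution time, power and energy in the performance summaries can be reproduced with a one-line calculation. The sketch assumes that the reported times are in microseconds, that the second figure of each power pair is the total power in milliwatts, and that energy is reported in millijoules; these units are inferred from the numbers rather than stated in the surviving text. The function name `energy_mj` is hypothetical.

```python
def energy_mj(time_us: float, power_mw: float) -> float:
    """Energy in millijoules for an operation lasting time_us at power_mw."""
    return time_us * 1e-6 * power_mw      # seconds x milliwatts = millijoules


if __name__ == "__main__":
    # GF(2^163) multiplication: 19.25 us at 40 mW
    print(energy_mj(19.25, 40))
    # P-192 multiplication: 4.65 us at 47 mW
    print(energy_mj(4.65, 47))
```

Under these assumptions, the computed values agree with the tabulated energies for the multiplications and the point multiplications to within rounding, which supports the interpretation that a longer execution time can outweigh a lower power figure.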

Finally,

As depicted in

Finally, it is worth noting that we are using the XC7A100TL FPGA, which is one of the largest platforms of the Artix-7 series. A smaller device, such as the XC7A20S (2,500 slices, 60 DSP48E1), would be better suited to this design, since lower power consumption and a lower price are expected. Nevertheless, this platform was not available at the time of writing.

The utilization of FPGAs for sensor node construction adopts the typical threat model of FPGA-based systems. That means that an attacker generally has two main interests in the platform: recovering the secret keys and disrupting the system. Consequently, the unused I/O pins of the FPGA must be protected against leakage, and they must reject any request. Moreover, the programming interface of the FPGA must be locked against non-authorized readings and updates. In this respect, since we are using an SRAM-based FPGA, an external non-volatile memory is required to store the FPGA configuration, and bitstream encryption must be activated to avoid tampering. Alternatively, anti-fuse and FLASH-based FPGAs can be used to avoid this problem, as well as to mitigate the impact of side-channel attacks. Moreover, a number of authors have proposed different techniques to avoid these attacks on FPGAs based on masking, hiding and utilizing random-based arithmetic [

In this manuscript, we have presented the design of two cryptographic accelerators suitable for FPGA-based nodes, extended with key negotiation capabilities. The proposed platform is based on the low-power Xilinx Artix-7 FPGA. Moreover, we have taken advantage of the DSP48E1 slice for reducing the area figures of our design. In this respect, we have replaced the logic functions in the AES folded architecture described by Chodowiec

The authors declare no conflict of interest.

Organization of the proposed AES-CCM architecture.

Proposed organization for the key schedule.

p_{192} modular adder and subtractor [

One-hundred and ninety-two-bit multiplier design.

Organization of the proposed B-163 adder.

Organization of the secure hash algorithm (SHA)-256 implementation.

Organization of the P-192 accelerator.

Performance of coordinate systems in prime fields. PA, point addition; PD, point doubling; M, multiplication; S, squaring.

Coordinate system | PA | PD |
---|---|---|
Standard projective | 12M + 2S | 7M + 5S |
Jacobian | 12M + 4S | 8M + 3S |
Chudnovsky-Jacobian | 11M + 3S | 5M + 6S |

Performance of coordinate systems in binary extension fields.

Coordinate system | PA | PD |
---|---|---|
Standard projective | 16M + 2S | 8M + 4S |
Jacobian | 16M + 3S | 11M + 3S |
López-Dahab | 13M + 4S | 5M + 4S |

Place and Route (PAR) results of the cryptographic algorithms implemented only using LUTs (XC7A100TL).

Component | f_{max} (MHz) | Cycles | Slices | BRAMs | DSPs |
---|---|---|---|---|---|
P-192 modular adder/subtractor | 173.361 | 4 | 399 | 3 | - |
P-192 multiplier | 188.460 | 25 | 986 | 1 | - |
B-163 adder/subtractor | 410.231 | 1 | 219 | - | - |
B-163 multiplier | 445.177 | 163 | 312 | 2 | - |

PAR results of the cryptographic algorithms implemented only using DSPs (XC7A100TL).

Component | f_{max} (MHz) | Cycles | Slices | Area reduction (%) | BRAMs | DSPs |
---|---|---|---|---|---|---|
P-192 modular adder/subtractor | 92.237 | 4 | 302 | 24.31 | 3 | 8 |
P-192 multiplier | 188.460 | 25 | 433 | 56.08 | 1 | 24 |
B-163 adder/subtractor | 224.298 | 1 | 132 | 39.72 | - | 4 |
B-163 multiplier | 259.700 | 163 | 271 | 13.14 | 2 | 8 |

PAR results of the two proposed accelerators.

Platform | Artix-7 (XC7A100TL) | Artix-7 (XC7A100TL) |
---|---|---|
f_{max} (MHz) | 51.244 | 51.244 |
# of Slices | 1,418 | 603 |
# of BRAMs (36 kb) | 4 | 2 |
# of BRAMs (18 kb) | 20 | 21 |
# of DSP48E1 slices | 63 | 38 |

PAR results of the SHA-256 implementation.

Platform | Artix-7 (XC7A100TL) | Artix-7 (XC7A100TL) |
---|---|---|
f_{max} (MHz) | 96.834 | 42.817 |
# of Slices | 688 | 551 |
# of BRAMs (36 kb) | - | - |
# of BRAMs (18 kb) | 9 | 9 |
# of DSP48E1 slices | 0 | 32 |

Performance summary of the P-192 accelerator at 10 MHz. ECDH, Elliptic Curve Diffie-Hellman.

Operation | Time (μs) | Power (mW) | Energy (mJ) |
---|---|---|---|
AES | 5.55 | 8/45 | 2.49 × 10^{-4} |
SHA-256 | 9.45 | 15/53 | 5 × 10^{-4} |
Multiplication | 4.65 | 10/47 | 2.18 × 10^{-4} |
Addition | 2.85 | 7/47 | 1.33 × 10^{-4} |
Point addition | 72.25 | 8/46 | 0.003 |
Point doubling | 86.75 | 10/48 | 0.004 |
Point multiplication | 23,056 | 10/48 | 1.10 |
ECDH | 45,112 | 10/48 | 2.21 |
ECIES | 46,129 | 10/48 | 2.21 |

Performance summary of the B-163 accelerator at 10 MHz.

Operation | Time (μs) | Power (mW) | Energy (mJ) |
---|---|---|---|
AES | 5.55 | 5/43 | 2.38 × 10^{-4} |
SHA-256 | 9.45 | 12/49 | 4.63 × 10^{-4} |
Multiplication | 19.25 | 3/40 | 7.70 × 10^{-4} |
Addition | 1.95 | 4/41 | 7.99 × 10^{-5} |
Point addition | 252.95 | 2/40 | 0.01 |
Point doubling | 319.55 | 2/40 | 0.01 |
Point multiplication | 83,850.35 | 2/40 | 3.35 |
ECDH | 167,700 | 2/40 | 6.70 |
ECIES | 167,720 | 2/40 | 6.70 |

Comparison on execution time (ms) with other Elliptic Curve Cryptography (ECC) and AES-128 implementations in commercial sensor nodes (B-163).

NanoECC (160-bit)-MICA2 [ | 1,270 | - | - | - |
NanoECC (160-bit)-Tmote Sky [ | 720 | - | - | - |
TinyECC (160-bit)-MICAz [ | - | 3,956.17 | 5,746.2 | - |
TinyECC (160-bit)-Tmote Sky [ | - | 2,075.5 | 3,590.42 | - |
TinyECC (160-bit)-Imote2 (13 MHz) [ | - | 571.28 | 915.31 | - |
Healy | - | - | - | 0.32383 |
Healy | - | - | - | 2.022 |

Comparison on energy consumption (mJ) with other ECC and AES-128 implementations in commercial sensor nodes (B-163).

^{-4} | ||||

NanoECC (160-bit)-MICA2 [ | 30.02 | - | - | - |
NanoECC (160-bit)-Tmote Sky [ | 7.95 | - | - | - |
TinyECC (160-bit)-MICAz [ | - | 94.95 | 137.91 | - |
TinyECC (160-bit)-Tmote Sky [ | - | 16.61 | 24.78 | - |
TinyECC (160-bit)-Imote2 (13 MHz) [ | - | 16.83 | 26.95 | - |
Healy et al.-CC2420 [ | - | - | - | 0.0084 |
Healy et al.-MICAz [ | - | - | - | 0.0525 |