Efficient Implementation on Low-Cost SoC-FPGAs of TLSv1.2 Protocol with ECC_AES Support for Secure IoT Coordinators

Abstract: Security management for IoT applications is a critical research field, especially when taking into account the performance variation across very different IoT devices. In this paper, we present high-performance client/server coordinators on low-cost SoC-FPGA devices for secure IoT data collection. Security is ensured by using the Transport Layer Security (TLS) protocol based on the TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 cipher suite. The hardware architecture of the proposed coordinators is based on SW/HW co-design: the hardware accelerator core implements Elliptic Curve Scalar Multiplication (ECSM), which is the core operation of Elliptic Curve Cryptosystems (ECC), while the control of the overall TLS scheme is performed in software by an ARM Cortex-A9 microprocessor. Building the ECC accelerator core around an ARM microprocessor improves not only ECSM execution but also the performance of the overall cryptosystem. The integration of the ARM processor also makes it possible to exploit embedded Linux features for high system flexibility. As a result, the proposed ECC accelerator requires limited area, with only 3395 LUTs on the Zynq device used to perform high-speed, 233-bit ECSMs in 413 µs with a 50 MHz clock. Moreover, the generation of a 384-bit TLS handshake secret key between client and server coordinators requires 67.5 ms on a low-cost Zynq 7Z007S device.


Introduction
The growing penetration of the Internet of Things (IoT) [1] in our daily life, in fields such as smart homes, smart enterprises, smart hospitals or smart cities, which require a large number of interconnected IoT devices, raises IoT data security concerns. In fact, large amounts of information are transferred through heterogeneous networks, ranging from local wireless sensor networks (WSN) to Wide Area Networks (WAN). Fortunately, Transport Layer Security (TLS) [2] provides end-to-end secure information transfer over insecure channels by combining heterogeneous cryptographic protocols such as symmetric schemes (e.g., 3DES, AES) [3,4], secure hash functions (e.g., SHA-1, SHA-2, SHA-3) [5,6] and public-key algorithms (e.g., RSA, ECDH, ECDSA) [7]. The latter cryptosystems are computationally intensive, due to the complex operations required by public-key protocols, and may not be generally usable due to hardware limitations. However, in controlled environments, such as local WSNs, simplified protocols implemented over compact cryptoprocessors can be a solution [8]. In the case of IoT coordinator nodes, which need to transfer data to the Internet, Secure Sockets Layer (SSL) [9] or TLS are the preferred solutions.
In the IoT paradigm, different agents are involved, such as sensors, cameras, actuators or microchips, which collect and transfer information through the Internet. As it is difficult to regulate the performance of all IoT devices, security management for IoT applications becomes much more difficult than for a single device [8]. Due to the low-performance hardware resources of a large number of IoT agents [9], the targeted cryptographic algorithms are not suitable for implementation on every IoT device [10]. Hence, we propose to design high-performance client/server coordinators on low-cost SoC-FPGA devices for secure IoT data collection, as shown in Figure 1. The IoT Client coordinators (IoTC1, IoTC2) collect data from IoT agents (A1, A2, A3, A4, A5, A6) and send it to the server through the Internet. The IoT Server coordinator (IoTS) acts as an interface between the IoTCs and the server's memory, where these data will be stored. The secure data transfer between the IoTCs and the IoTS is ensured by the TLSv1.2 protocol, in order to protect the information from unauthorized users. In fact, TLSv1.2 makes it possible to generate a shared secret key between the IoTS and each IoTC (Key1, Key2) that can be used to encrypt/decrypt data with private-key algorithms.
Electronics 2019, 8, x FOR PEER REVIEW
In this paper, we focus on securing data transferred from/to IoT coordinators by means of the TLS protocol, since SSL is considered insecure [11]. Efficient implementations of these protocols as embedded cryptosystems can be problematic, since the target devices are usually very limited in terms of power, resources and timing. Several TLS/SSL embedded cryptosystem implementations have been proposed in the literature [12][13][14][15][16][17]. OpenSSL [18,19] is the most widely deployed library for TLS/SSL applications, providing software implementations of the basic cryptographic functions. With software-only TLS/SSL implementations [17,20], servers can be overloaded with heavy cryptographic operations, which results in long response times. To alleviate this bottleneck, dedicated hardware coprocessors [12][13][14][15][16][17]21], known as Network Security Processors (NSP), have been proposed as a solution to free these servers from cryptographic operations and allow flexible management. Nevertheless, although effective efforts have been made [10] to accelerate encryption methods, NSPs can incur a hardware resource overhead [12,16] in order to achieve high performance, due to the intensive computations required by cryptographic algorithms. This constraint paves the way for a HW/SW co-design implementation approach that provides a trade-off between security, area and speed. This approach is based on implementing the computing-intensive cryptosystems in hardware [22,23], while the control of the TLS/SSL protocols is performed in software using microprocessors. In this context, Field Programmable Gate Array (FPGA) devices are suitable platforms, as they provide reconfigurability, flexibility and high performance. This is of special interest for the new FPGA generations, such as Zynq from Xilinx or Stratix 10 SoC from Intel, which integrate advanced components in a single chip, including ARM microprocessors, Advanced eXtensible Interface (AXI) buses, embedded memory or DSPs, and completely match the System on Chip (SoC) paradigm.
In this work, we present a carefully designed SW/HW implementation of the client/server TLSv1.2 protocol for IoTCs and IoTSs, which is implemented on low-cost FPGAs/SoCs suitable for IoT applications. The use of modern FPGA-based SoCs enables an optimal trade-off between security, flexibility, area, and speed. Flexibility makes algorithm modifications easier while leaving the hardware architecture fixed. Among the supported TLS cipher suites, we have selected Elliptic Curve Diffie-Hellman Ephemeral (ECDHE) [24], Elliptic Curve Digital Signature Algorithm (ECDSA) [25], Advanced Encryption Standard (AES-128) [3], Keyed-Hash Message Authentication Code (HMAC) [26] and Secure Hash Algorithm (SHA256) for our implementation. These algorithms are combined to generate 384-bit TLS shared secret keys. The interest in ECCs [7] is justified by the fact that these systems provide better security with smaller key sizes when compared to the RSA method [27], and they are especially suitable for hardware implementation when binary fields are used [28].
Therefore, the paper provides two main contributions. The first one is the proposed SW/HW partitioning for efficient TLSv1.2 negotiations. The main idea is to implement the core operation of ECC, which is ECSM, within a scalable hardware coprocessor accelerator and to integrate it around an ARM microprocessor. Meanwhile, the control of the ECDHE and ECDSA protocols and the execution of the AES-128 algorithm and the HMAC and SHA256 functions are handled by the ARM microprocessor. The second contribution is the proposed internal architecture of the ECC accelerator, with low area requirements while maintaining high performance. It is based on time-area optimized finite field units and the use of dual-port block RAMs as registers. In addition, the I/Os of this ECC accelerator are 32 bits wide, which allows easier integration with 32-bit microprocessors (e.g., ARM, PowerPC and MicroBlaze) via 32-bit buses (e.g., AXI and PLB).
The rest of this paper is organized as follows: Section 2 presents the TLSv1.2 handshake protocol and the considered ECC cryptosystems. Section 3 is devoted to the description of the internal architecture of our ECC accelerator. The proposed FPGA-based IoTS and IoTC designs, the performance evaluation on a Xilinx Zynq device and comparisons with other works in the literature are presented in Section 4. Finally, conclusions are presented in Section 5.

Transport Layer Security Protocol
The TLSv1.2 protocol makes it possible to generate a shared private key between IoTSs and IoTCs for each session, based on the cipher suite agreed during the TLS handshake. The TLS handshake between an IoTC and the IoTS is illustrated in Figure 2.
The negotiation is based on sending and receiving records, which are blocks of data. TLSv1.2 begins with ClientHello() (Step 1), in which the IoTC provides the cipher suite of the supported cryptographic algorithms and compression methods. It also provides random client data (Rand_IoTC) to be used later in the handshake. Then, the IoTS replies with ServerHello() (Step 2), providing random server data (Rand_IoTS) and the selected cryptographic and compression methods to be used during the TLS process. In the proposed designs, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 is the supported TLS cipher suite. Once the Hello step is done, the IoTS and the IoTC each compute, in parallel, an ephemeral private/public keypair (Steps 3 and 4) using an EC-based keypair generation algorithm [7], and send the public key to the other party. The server uses ECDSA to sign its ephemeral public key (Ps) in Step 4 and sends the signature to the client. The IoTC then verifies the received signature using the ECDSA verification algorithm (Step 5). If the verification is successful, the IoTC sends its public key (Pc). Then, a 384-bit shared secret key is generated (Step 6) by combining ECDHE and HMAC-SHA256: the first algorithm provides a 256-bit PreMasterSecret key, while the second generates a 384-bit MasterSecret key. From the latter, two 128-bit secret keys (client_write_key, server_write_key) are extracted. Finally, in order to check that the handshake was not tampered with (Step 7), the IoTC and the IoTS encrypt "ping" and "pong" with the AES algorithm, using the server write key and the client write key, respectively. They then exchange the encrypted messages, and each party decrypts the received message with the appropriate key. If the "ping" and "pong" messages are recovered, the handshake is accepted; otherwise, the TLS handshake process has been tampered with.
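The MasterSecret derivation in Step 6 can be sketched with the pseudorandom function that TLS 1.2 defines in RFC 5246 (P_hash over HMAC-SHA256 with the "master secret" label). This is a minimal illustration using only the Python standard library, not the authors' software; the random inputs are placeholders, and the way the two 128-bit write keys are cut from the result is not specified in the text, so the sketch stops at the 384-bit MasterSecret.

```python
import hashlib
import hmac
import os


def p_sha256(secret: bytes, seed: bytes, length: int) -> bytes:
    """P_hash data expansion (RFC 5246, Section 5) using HMAC-SHA256."""
    out = b""
    a = seed  # A(0) = seed
    while len(out) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()  # A(i) = HMAC(secret, A(i-1))
        out += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return out[:length]


def master_secret(pre_master: bytes, rand_client: bytes, rand_server: bytes) -> bytes:
    """48-byte (384-bit) MasterSecret: PRF(pre_master, "master secret", randoms)."""
    seed = b"master secret" + rand_client + rand_server
    return p_sha256(pre_master, seed, 48)


# Placeholder inputs (assumption): in the handshake, pre_master comes from ECDHE
# and the two random values come from ClientHello() and ServerHello().
pre_master = os.urandom(32)
rand_iotc = os.urandom(32)
rand_iots = os.urandom(32)

ms = master_secret(pre_master, rand_iotc, rand_iots)
print(len(ms) * 8, "bit MasterSecret")
```

Both coordinators run the same derivation on the same three inputs, so they obtain the same MasterSecret without ever transmitting it.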

Elliptic Curve Cryptography
In ECC, most elliptic curves are defined over prime fields ($F_p$) or binary fields ($GF(2^m)$) [29]. As binary fields are more suitable for hardware implementations [23], in this paper we are interested in $GF(2^m)$, where a field element $k$ is a binary vector of fixed length $m$:

$$k = (k_{m-1}, \ldots, k_1, k_0), \quad k_i \in \{0, 1\}$$

The arithmetic is defined by the polynomial representation:

$$k(x) = k_{m-1}x^{m-1} + \cdots + k_1 x + k_0$$

An elliptic curve $E$ defined over $GF(2^m)$ consists of a set of points $P$ represented by the coordinates $(x_p, y_p)$, where $x_p$ and $y_p$ are elements of $GF(2^m)$ satisfying the Weierstrass equation [28]:

$$y^2 + xy = x^3 + ax^2 + b, \quad a, b \in GF(2^m), \; b \neq 0$$

The design of ECC schemes consists of three parts, namely, the curve domain parameters, the key generation and the encryption/decryption algorithms. In the literature, several standard curve domains are recommended with different key lengths $m$, where $m$ is a prime number in the set {163, 233, 239, 283, 409, 571}. The sect233r1 (NIST B-233) curve [30], defined over $GF(2^{233})$, is widely used in TLSv1.2. It is recommended for HW implementations when high speed and low area consumption are intended [31].
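To make the polynomial representation concrete, the following sketch stores a field element of $GF(2^{233})$ as a Python integer whose bit $i$ is the coefficient $k_i$. Addition is then a bitwise XOR, and multiplication is a shift-and-XOR product reduced modulo the NIST reduction polynomial $x^{233} + x^{74} + 1$ used for degree 233. The function names are illustrative, and the bit-serial style is chosen for readability; it is not the hardware structure described later.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1  # x^233 + x^74 + 1


def gf_add(a: int, b: int) -> int:
    """Field addition: coefficient-wise XOR (no carries in GF(2))."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Field multiplication: shift-and-XOR with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached 233: subtract (XOR) the modulus
            a ^= POLY
    return r


def gf_sqr(a: int) -> int:
    """Field squaring, expressed here through the generic multiplier."""
    return gf_mul(a, a)


def gf_inv(a: int) -> int:
    """Field inversion by Fermat's little theorem: a^(2^M - 2)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    result, base, e = 1, a, (1 << M) - 2
    while e:
        if e & 1:
            result = gf_mul(result, base)
        base = gf_sqr(base)
        e >>= 1
    return result


a = 0x1234_5678_9ABC_DEF0_1234_5678_9ABC_DEF0
print(hex(gf_mul(a, gf_inv(a))))  # the product of an element and its inverse
```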
The EC-based key generation algorithm produces the keypair $(d, Q)$, where the private key $d$ is an $m$-bit integer and $Q$ is a point on $E$. The keypair $(d, Q)$ is calculated as follows:

$$Q = d \times G$$

where $G$ is the generator point defined by sect233r1.
In the literature, several standard cryptographic protocols based on ECs are reported. In this work, we use the ECDHE_ECDSA protocols, as they are used in TLSv1.2 for secret key exchange and digital signature, respectively. In the following, we present the ECDHE and ECDSA algorithms.

Elliptic Curve Diffie Hellman Ephemeral
ECDH stands for the EC-based Diffie-Hellman key agreement protocol. It ensures the establishment of a shared secret key between two parties over an insecure channel. This key can then be used by a symmetric cryptosystem for data encryption. In the literature, two versions of ECDH are reported, namely, static ECDH and ephemeral ECDH (ECDHE). The difference is that the first version always uses the same keypairs, while the second generates new keypairs for each connection. The ephemeral version is recommended in TLS protocols. The shared secret key is obtained by applying the following steps:

• The two parties generate random $m$-bit private keys n1 and n2.
• They compute their public keys Q1 = n1 × G and Q2 = n2 × G.
• IoTC and IoTS exchange Q1 and Q2.
• Each party computes the shared point Q = n1 × Q2 = n2 × Q1.
• Extract the shared secret key from the coordinates of the shared point Q.

We note that ECDHE requires the execution of four ECSMs. However, Q1 and Q2 are computed in parallel, as are Q = n1 × Q2 and Q = n2 × Q1, so only two ECSMs lie on the critical path. The execution time ($T_{ECDHE}$) of the ECDHE algorithm can therefore be estimated as:

$$T_{ECDHE} = T_{rand} + 2\,T_{ECSM}$$

where $T_{rand}$ represents the execution time of the $m$-bit secure random generation and $T_{ECSM}$ corresponds to the execution time of a single ECSM.
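The steps above can be exercised on a toy binary curve. The sketch below is an illustration under loud assumptions: it uses the tiny field $GF(2^8)$ (reduction polynomial $x^8 + x^4 + x^3 + x + 1$) and the made-up curve $y^2 + xy = x^3 + x^2 + 1$ rather than sect233r1, finds a base point by brute force, and uses plain affine double-and-add instead of the accelerator's algorithm. It offers no security; it only shows that both parties reach the same point Q.

```python
import secrets

M, POLY = 8, 0x11B      # toy field GF(2^8), NOT the field used in the paper
A, B = 1, 1             # toy curve y^2 + xy = x^3 + A*x^2 + B (assumption)
INF = None              # point at infinity


def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def gf_inv(a):
    result, e = 1, (1 << M) - 2   # Fermat inversion: a^(2^M - 2)
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


def on_curve(P):
    if P is INF:
        return True
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ B


def ec_add(P, Q):
    """Affine group law on a binary curve (covers doubling and inverses)."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 ^ y2 == x1:                      # Q = -P = (x1, x1 + y1)
            return INF
        lam = x1 ^ gf_mul(y1, gf_inv(x1))      # doubling slope
        x3 = gf_mul(lam, lam) ^ lam ^ A
        return x3, gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))     # addition slope
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1


def ec_mul(k, P):
    """Scalar multiplication k x P by right-to-left double-and-add."""
    R = INF
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R


def find_point():
    """Brute-force a point with x != 0 to act as the base point G."""
    for x in range(1, 1 << M):
        for y in range(1 << M):
            if on_curve((x, y)):
                return x, y
    raise ValueError("no point found")


G = find_point()
n1 = secrets.randbelow(255) + 1        # private key of one party
n2 = secrets.randbelow(255) + 1        # private key of the other party
Q1, Q2 = ec_mul(n1, G), ec_mul(n2, G)  # public keys, exchanged in the clear
shared_1 = ec_mul(n1, Q2)
shared_2 = ec_mul(n2, Q1)
print(shared_1 == shared_2)
```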

Elliptic Curve Digital Signature Algorithm
ECDSA is an EC-based DSA algorithm proposed in 1992 by Scott Vanstone [25]. It is used to protect data integrity: signing a message makes tampering during transfer detectable. This protocol consists of two algorithms, namely signature generation and signature verification. In our work, the first procedure is performed by the IoTS to sign its ephemeral public key. Meanwhile, the second procedure is executed by the IoTC to check whether the received public key belongs to the server or to a third party. Pseudocode descriptions of the two algorithms are presented in Algorithm 1 and Algorithm 2, respectively, while their detailed justification and description can be found in [32,33].

Algorithm 1. Elliptic Curve Digital Signature Generation.
Inputs: private key d, message msg, domain parameters (m, a, b, G, n, h)
Outputs: signature (r, s)
The signature for msg is then (r, s).

In our work, the message msg of Algorithm 1 is the concatenation of the coordinates of the IoTS ephemeral public key. The resulting signature of the message msg is represented by (r, s). The execution time ($T_{alg1}$) of Algorithm 1 can be estimated as:

$$T_{alg1} = T_{rand} + T_{Hash} + T_{ECSM} + T_{MI} + 2\,T_{MM} + T_{MA}$$

$T_{alg1}$ combines the following execution times: the secure random generation of k ($T_{rand}$), the secure hash function ($T_{Hash}$) for the computation of e, a single ECSM ($T_{ECSM}$) for computing the coordinates of the point R, and a Modular Inversion ($T_{MI}$), two Modular Multiplications ($T_{MM}$) and a Modular Addition ($T_{MA}$) to obtain s.

Algorithm 2. Elliptic Curve Digital Signature Verification.
Inputs: message msg, signature (r, s), domain parameters (m, a, b, G, n, h), sender's public key P
Outputs: accept or reject signature

According to Algorithm 2, the verification of the signature requires the execution of the secure hash function and the computation of one MI and two MMs for the calculation of v, u1 and u2. To compute the coordinates of the point X, two ECSMs and a single Elliptic Curve Point Addition (ECPA) are required. The execution time ($T_{alg2}$) of Algorithm 2 can be estimated by Equation (6):

$$T_{alg2} = T_{Hash} + T_{MI} + 2\,T_{MM} + 2\,T_{ECSM} + T_{ECPA} \qquad (6)$$
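For reference, the standard ECDSA computations that account for these operation counts can be written as follows. The symbols k, e, R, r, s, v, u1, u2 and X follow the text; $H$ denotes the hash function, $x_R$ and $x_X$ the x-coordinates of R and X read as integers, and the exact step ordering inside Algorithms 1 and 2 is an assumption based on the textbook formulation.

```latex
\begin{aligned}
\textbf{Signature generation:}\quad
  & k \in [1, n-1] \ \text{chosen at random}, \qquad e = H(\mathit{msg}), \\
  & R = k \times G, \qquad r = x_R \bmod n, \\
  & s = k^{-1}\,(e + d\,r) \bmod n
    \quad\text{(one MI, two MMs, one MA)}. \\[6pt]
\textbf{Signature verification:}\quad
  & e = H(\mathit{msg}), \qquad v = s^{-1} \bmod n, \\
  & u_1 = e\,v \bmod n, \qquad u_2 = r\,v \bmod n, \\
  & X = u_1 \times G + u_2 \times P
    \quad\text{(two ECSMs, one ECPA)}, \\
  & \text{accept if and only if } x_X \bmod n = r .
\end{aligned}
```

Verification succeeds for a valid signature because $X = (e + d\,r)\,s^{-1} \times G = k \times G = R$, so both sides reduce to the same x-coordinate.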
To use ECDHE_ECDSA, both the IoTC and the IoTS must be able to perform ECSM, which is the main operation of most ECC protocols. This operation is considered the most expensive one for embedded systems in terms of hardware requirements and timing performance. Therefore, we propose to implement a dedicated ECC hardware accelerator for high-speed ECSM computation in order to enhance the overall performance of TLS execution, while also taking area usage into consideration. In the following, the considered ECSM algorithm and the internal hardware architecture of the ECC coprocessor are described.

ECC Accelerator Design
Depending on the representation of the scalar and the points, several fast and regular ECSM algorithms are reported in the literature [34]. In this work, the ECSM is performed with the Montgomery Power Ladder (MPL) algorithm over a projective coordinate system [29]. With field operations made explicit, this algorithm uses the binary representation of the scalar k, as shown in Algorithm 3. The use of the MPL algorithm [35] is often suggested to withstand side-channel attacks, since it performs the Elliptic Curve Point Addition (ECPA) and Elliptic Curve Point Doubling (ECPD) in parallel regardless of the current scalar bit value. On the other hand, the introduction of the projective point coordinate system [36] within the ECPA and ECPD computations ensures high performance by avoiding a Modular Inversion (MI) at each iteration of the main loop. This operation is the most complex and costly to implement on embedded systems when compared to Modular Addition (MA), Modular Squaring (MS) and Modular Multiplication (MM), which are also required in ECPA and ECPD calculations. Hence, the combination of MPL and the projective system enhances not only the security of the design but also the overall cryptosystem performance. These features make the chosen algorithms suitable for efficient ECC hardware implementations when security, high speed and low area are targeted.
The computation of ECSM based on Algorithm 3 requires three steps: initialization (lines 1 and 2), main loop (lines 3 to 11) and calculation of the resulting point coordinates (lines 12 and 13). The first step performs two field squarings (line 1) and a single ECPD (line 2). The second step performs, at each iteration, an ECPA (lines 5 and 8) followed by an ECPD (lines 6 and 9). These operations are executed in the projective system by performing a set of field additions, field multiplications and field squarings. In the final step, two field inversions are required (lines 12 and 13) to obtain the coordinates $(x_3, y_3)$ of the resulting point.
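The three steps can be illustrated with an x-only Montgomery ladder over $GF(2^{233})$, written here with the López-Dahab projective formulas as they are commonly presented in the literature. This is a hedged software sketch, not a transcription of Algorithm 3: the curve coefficient `B_COEFF` is a toy value rather than the sect233r1 constant, the recovery of $y_3$ is omitted, the loop is not constant-time, and all function names are illustrative.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1
B_COEFF = 0x1B                      # toy curve coefficient b (assumption)


def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def gf_sqr(a):
    return gf_mul(a, a)


def gf_inv(a):
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    result, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_sqr(a)
        e >>= 1
    return result


def mdouble(X, Z):
    """Projective doubling: X' = X^4 + b*Z^4, Z' = X^2 * Z^2."""
    X2, Z2 = gf_sqr(X), gf_sqr(Z)
    return gf_sqr(X2) ^ gf_mul(B_COEFF, gf_sqr(Z2)), gf_mul(X2, Z2)


def madd(X1, Z1, X2, Z2, x):
    """Differential addition of two points whose difference has x-coordinate x."""
    t1, t2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
    Z3 = gf_sqr(t1 ^ t2)
    return gf_mul(x, Z3) ^ gf_mul(t1, t2), Z3


def ladder_x(k, x):
    """Return the x-coordinate of k x P, given the x-coordinate of P."""
    assert k >= 1 and x != 0
    # Initialization: (X1:Z1) represents P and (X2:Z2) represents 2P.
    X1, Z1 = x, 1
    X2, Z2 = gf_sqr(gf_sqr(x)) ^ B_COEFF, gf_sqr(x)
    # Main loop: one addition and one doubling per scalar bit, whatever its value.
    for i in range(k.bit_length() - 2, -1, -1):
        total = madd(X1, Z1, X2, Z2, x)
        if (k >> i) & 1:
            X2, Z2 = mdouble(X2, Z2)
            X1, Z1 = total
        else:
            X1, Z1 = mdouble(X1, Z1)
            X2, Z2 = total
    # Final step: a single inversion converts back to the affine x-coordinate.
    return gf_mul(X1, gf_inv(Z1))


x_p = 0x1234567
print(hex(ladder_x(0xC0FFEE, x_p)))
```

Because no inversion appears inside the loop, the cost per scalar bit is a fixed number of multiplications and squarings, which is what makes the projective ladder attractive for hardware.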
The internal architecture of the proposed ECC accelerator for ECSM computation based on Algorithm 3 is presented in Figure 3. The proposed architecture consists of three finite field arithmetic units over $GF(2^{233})$ (an adder, a squarer and a multiplier, called RNOKOA11C), a dual-port RAM (dp_RAM), a 233-bit register (reg_k), a control unit, and three multiplexers (mux). The field units perform field addition, field squaring and field multiplication, respectively. The dp_RAM block is used for storing the coordinates $(x, y)$ of the point P, the intermediate results of Algorithm 3 and the coordinates $(x_3, y_3)$ of the resulting point. Meanwhile, the reg_k register is used to store the scalar value k to manage the main loop. It is also used as temporary storage for the $(x, y)$ values before they are transferred to the RAM. The control unit is responsible for coordinating the integrated components of the internal architecture to perform ECSM. The proposed architecture provides an excellent trade-off between area and performance, based on the following aspects:

1. Exploiting the block RAMs available in FPGA devices within the internal architecture of the ECC accelerator instead of using registers, thus saving Look-Up Table (LUT) resources at the expense of introducing some extra clock cycles.

2. Integrating the I/O interface into the ECC processing unit, taking advantage of the shift register reg_k.

3. Avoiding the use of a dedicated field divider/inverter by using the Itoh-Tsujii algorithm (ITA) [7], thus requiring only the multiplier and the squaring units. In this case, our ECC accelerator needs 353 clock cycles to perform the 231 squarings and 10 multiplications required for a $GF(2^{233})$ field inversion. This performance overhead is acceptable, taking into account that only two field inversions are required.
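The inversion in item 3 can be sketched in software as follows. The sketch builds $a^{2^{232}-1}$ with a binary addition chain on $m - 1 = 232$ and then squares once more to obtain $a^{2^{233}-2} = a^{-1}$. It is an illustration of the Itoh-Tsujii idea with illustrative function names, not the accelerator's control sequence; under this particular chain the counters report 231 squarings and 10 multiplications, with the closing squaring counted separately.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1


def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def gf_sqr(a):
    return gf_mul(a, a)


def itoh_tsujii_inverse(a):
    """Return (a^-1, chain squarings, chain multiplications) in GF(2^233)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    n_sqr = n_mul = 0
    beta, k = a, 1                      # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:          # bits of m-1 after the leading 1
        t = beta
        for _ in range(k):              # t = beta^(2^k)
            t = gf_sqr(t)
            n_sqr += 1
        beta = gf_mul(t, beta)          # beta = a^(2^(2k) - 1)
        n_mul += 1
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_sqr(beta), a)   # beta = a^(2^(k+1) - 1)
            n_sqr += 1
            n_mul += 1
            k += 1
    # Closing squaring (not included in n_sqr): a^(2^233 - 2) = a^-1.
    return gf_sqr(beta), n_sqr, n_mul


a = 0xDEADBEEF_0123456789ABCDEF
inv, squarings, multiplications = itoh_tsujii_inverse(a)
print(squarings, multiplications, gf_mul(a, inv) == 1)
```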
It is worth mentioning that the input (i_port) and output (o_port) ports of the proposed ECC accelerator are 32 bits wide. This means that the ECC accelerator can be easily integrated with various 32-bit microcontrollers through 32-bit buses.

Field Multiplier Unit
As shown in Algorithm 3, the field multiplier is the unit with the most noticeable effect on the performance of the scalar-point multiplication, and thus requires a careful design. The RNOKOA11C multiplier unit is based on an improvement of the Karatsuba-Ofman Algorithm (KOA) [37], named the Non-Overlapping KOA (NOKOA) multiplier [38]. The NOKOA multiplier performs a field multiplication in only one clock cycle, thus enabling high-performance ECC accelerators. However, its area requirements are excessive for implementation on low-cost devices [22]. In [22], two modifications of NOKOA, requiring 3 and 9 clock cycles to complete a field multiplication, are presented. These modifications, named NOKOA3C and NOKOA9C, respectively, require less area but are not suitable for use in our ECC scalar-point multiplication unit, due to their lack of output registers. In fact, the use of RAM blocks instead of registers makes it necessary to register the result provided by the multiplier. Figure 4 shows the proposed architecture of the RNOKOA11C multiplier, which meets the requirements imposed by the use of RAM as registers. It has a recursive structure, consisting of a lower-level NOKOA multiplier, a control unit, two multiplexers, two XOR networks, and the RT, RE, RO and MO registers. This new multiplier requires 11 clock cycles to perform a field multiplication.
Table 1 shows synthesis results comparing the NOKOA9C [22] and RNOKOA11C multipliers over the $GF(2^{233})$ finite field. These results have been obtained using Xilinx ISE 14.4 on Virtex 5 devices (xv5vlx110-3f1760). As shown, the number of LUTs is almost the same, because the additional register required by RNOKOA11C is absorbed into the LUTs occupied by the XOR network. Small differences in the number of LUTs and in delay are due to optimizations performed by the software tool. Regarding the number of clock cycles, RNOKOA11C requires 11 clock cycles instead of the 9 clock cycles required by NOKOA9C, but it fits the requirements for being the multiplier unit of our ECC accelerator, which has been named MP_ECC_B-233_RNOKOA11C.
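The Karatsuba-Ofman idea underlying these multipliers can be shown in software. The sketch below implements the classic recursive KOA for binary polynomials (three half-size products instead of four), followed by reduction modulo $x^{233} + x^{74} + 1$. It is plain KOA, not the non-overlapping NOKOA variant or the registered RNOKOA11C structure, and the operand width of 256 bits and the recursion threshold are arbitrary choices made for this illustration.

```python
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1


def clmul_schoolbook(a, b):
    """Carry-less (GF(2)[x]) product by shift-and-XOR, without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def clmul_karatsuba(a, b, bits=256, threshold=32):
    """Carry-less product of two operands below 2^bits using recursive KOA."""
    if bits <= threshold:
        return clmul_schoolbook(a, b)
    half = bits // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    lo = clmul_karatsuba(a0, b0, half, threshold)
    hi = clmul_karatsuba(a1, b1, half, threshold)
    # (a0 + a1)(b0 + b1) + lo + hi equals the two cross terms a0*b1 + a1*b0.
    mid = clmul_karatsuba(a0 ^ a1, b0 ^ b1, half, threshold) ^ lo ^ hi
    return lo ^ (mid << half) ^ (hi << (2 * half))


def reduce_233(c):
    """Reduce a polynomial modulo x^233 + x^74 + 1, highest bit first."""
    for i in range(c.bit_length() - 1, 232, -1):
        if (c >> i) & 1:
            c ^= POLY << (i - 233)
    return c


a = (1 << 232) | 0x0123456789ABCDEF_FEDCBA9876543210
b = (1 << 231) | 0xDEADBEEF_CAFEBABE_0BADF00D
product = reduce_233(clmul_karatsuba(a, b))
print(hex(product))
```

In hardware, the same decomposition trades one sub-multiplier for a few XOR networks, which is why KOA-style structures are attractive when LUT area is the limiting resource.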

Implementation of MP_ECC_B-233_RNOKOA11C
In order to check the suitability of MP_ECC_B-233_RNOKOA11C for medium-performance applications, such as IoT coordinators/gateways, it has been implemented on a MiniZed board [39] with a Zynq 7Z007S device from Xilinx. This low-cost device includes a single-core ARM Cortex-A9 microprocessor and 14400 LUTs of programmable logic for software/hardware co-design. The software tool used for this implementation was Vivado 2018.2 from Xilinx. For comparison purposes, the design has also been implemented on Virtex 5 devices using Xilinx ISE 14.4. Implementation results are presented in Table 2, where MP_ECC_B-233_RNOKOA11C is compared to other ECC scalar-point multipliers with similar area. Table 2 shows that this new design requires less than half the area of the other implementations, while providing similar performance figures. Thus, it is well suited to the target application. It should also be noted that our design includes a 32-bit I/O interface, while the other alternatives do not include such a feature.

FPGA Implementation of TLS Cryptosystem
Among the algorithms of the considered TLS cipher suite, HMAC, SHA256 and AES are characterized by their high-performance implementations, due to their relative mathematical simplicity. ECDHE and ECDSA are characterized by their high security, but are considered the most time- and area-consuming, as they involve complex operations over large prime numbers. To achieve the best trade-off between flexibility, area and speed, a SW/HW co-design implementation approach is presented in this work. The proposed partitioning is based on implementing ECSM within a compact ECC hardware accelerator for faster execution. The dedicated core is integrated around an embedded ARM microprocessor. The rest of the operations required for TLS negotiation are managed in SW by the processor. Figure 5 presents the hardware architecture of the proposed embedded system. The hardware architecture was implemented on the Xilinx Zynq-7Z007S SoC device of the Avnet MiniZed development board [39] for both the IoTS and IoTC coordinators. As mentioned in the previous section, this low-cost device consists of a single-core ARM Cortex-A9 microprocessor, able to run at up to 666.666 MHz, along with 100 block RAMs and 14400 slice LUTs for software/hardware co-design. The MiniZed board also includes a Murata "Type 1DX" LBEE5KL1DX wireless module for wireless communications.

The proposed architecture contains a single ARM Cortex-A9 microprocessor (PS), the MP_ECC_B-233_RNOKOA11C accelerator, an AXI interconnect bus and a Wireless_mgr controller. The latter is used for the WiFi connection of the IoTS and IoTC designs with gateways that provide internet access. The AXI bus allows 32-bit data/instruction exchanges between the ARM microprocessor and the ECC accelerator. It runs with a 50 MHz clock. The ARM processor controls not only the ECC accelerator but also all TLS processes. The roles assigned to the processor are as follows:

• Generation of 256-bit random numbers.
• Execution of AES, HMAC and SHA256 functions.
• Computation of finite field inversions, multiplications and additions required for ECDSA.
• Control of ECDHE and ECDSA algorithms.
• Control of internet communication between the IoTS and the IoTCs.
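As an illustration of the software-side roles listed above, the following Python sketch generates a 256-bit random value and uses HMAC-SHA256 in the pseudo-random function defined for TLS 1.2 in RFC 5246, which expands a pre-master secret into a 48-byte (384-bit) master secret. It is a minimal sketch and not the authors' code: the function names random_256(), p_sha256() and master_secret() are hypothetical, and the use of random.SystemRandom is an assumption made here to obtain operating-system entropy.

```python
import hashlib
import hmac
import random

# Assumption: OS-backed entropy source; the paper only states that the
# Python "random" library is used.
_rng = random.SystemRandom()


def random_256() -> bytes:
    """Return a 256-bit random value as 32 big-endian bytes."""
    return _rng.getrandbits(256).to_bytes(32, "big")


def p_sha256(secret: bytes, seed: bytes, length: int) -> bytes:
    """P_hash expansion of RFC 5246, Section 5, with HMAC-SHA256."""
    out = b""
    a = seed  # A(0) = seed
    while len(out) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()      # A(i)
        out += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return out[:length]


def master_secret(pre_master: bytes, client_random: bytes,
                  server_random: bytes) -> bytes:
    """Derive the 48-byte TLS 1.2 master secret from the pre-master secret."""
    seed = b"master secret" + client_random + server_random
    return p_sha256(pre_master, seed, 48)
```

In such a scheme, the pre-master secret would be the shared value agreed through ECDHE, while the two random values are the ones exchanged in the hello messages.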

ECC Accelerator Integration around ARM Processor
For connecting the ECC accelerator with the ARM processor through the AXI bus, Xilinx's Intellectual Property InterFace (IPIF) is used for 32-bit data/instruction exchange, as shown in Figure 6. The IPIF is configured with four 32-bit registers: InsIn, DataIn, InsOut and DataOut. The processor manages the ECC core by writing instruction codes to the InsIn register. The DataIn register is used to transfer the digits of the input point coordinates and the scalar from the ARM to the req_k register. The control unit uses the InsOut register to notify the processor that the coordinates of the point resulting from the ECSM computation are ready. The DataOut register transfers the resulting point coordinates to the processor.

Each ECSM execution requires three steps: ECC core reset, transmission of the inputs, and retrieval of the resulting point coordinates. Before starting the ECSM computation, the ARM processor resets the ECC accelerator by sending the 0x000000001 instruction. After that, the control unit stores in the dp_RAM sixteen 8-bit digits of the ECSM input point coordinates, followed by eight 8-bit digits of the scalar, all transmitted from the processor. It must be noted that the processor transmits the 0x000000002 instruction after each digit to prepare the control unit to receive the next digit. Once the twenty-four 8-bit digits of the inputs are loaded, the control unit manages the field units to perform the ECSM computation. During this time, the InsOut register value is 0x000000000. When the ECC accelerator completes the execution, the control unit changes the InsOut register value to 0x000000003 in order to notify the processor that the ECSM execution is done, and then sends sixteen 8-bit digits of the resulting point coordinates. The processor issues the 0x000000004 instruction after receiving each digit to order the control unit to send the next digit.
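The register-level handshake described above can be sketched in Python against a software stand-in for the accelerator. This is only an illustrative model under stated assumptions: the class MockEccCore, its methods and the function run_ecsm() are hypothetical names, and the mock does not compute a scalar multiplication (it returns a placeholder combination of its inputs) so that only the instruction sequence is exercised.

```python
# Instruction codes quoted in the text (numeric values only).
RESET, NEXT_IN, DONE, NEXT_OUT = 0x1, 0x2, 0x3, 0x4
N_IN, N_OUT = 24, 16  # 16 coordinate digits + 8 scalar digits in, 16 digits out


class MockEccCore:
    """Hypothetical software model of the InsIn/DataIn/InsOut/DataOut registers."""

    def __init__(self):
        self._reset()

    def _reset(self):
        self.ins_out = 0x0      # 0 while idle or busy
        self.data_out = 0x0
        self._pending = 0
        self._inputs = []
        self._outputs = []
        self._idx = 0

    def write_data_in(self, digit):
        self._pending = digit

    def write_ins_in(self, code):
        if code == RESET:
            self._reset()
        elif code == NEXT_IN:
            self._inputs.append(self._pending)
            if len(self._inputs) == N_IN:
                self._finish()
        elif code == NEXT_OUT:
            self._idx += 1
            if self._idx < len(self._outputs):
                self.data_out = self._outputs[self._idx]

    def _finish(self):
        # Placeholder for the ECSM: NOT a real scalar multiplication.
        pt, k = self._inputs[:16], self._inputs[16:]
        self._outputs = [pt[i] ^ k[i % 8] for i in range(N_OUT)]
        self.data_out = self._outputs[0]
        self.ins_out = DONE


def run_ecsm(core, point_digits, scalar_digits, max_polls=1000):
    """Drive one execution: reset, send 24 digits, poll, read 16 digits."""
    core.write_ins_in(RESET)
    for digit in list(point_digits) + list(scalar_digits):
        core.write_data_in(digit)
        core.write_ins_in(NEXT_IN)      # tell the core to take the digit
    for _ in range(max_polls):          # bounded polling of InsOut
        if core.ins_out == DONE:
            break
    else:
        raise TimeoutError("accelerator did not signal completion")
    result = []
    for _ in range(N_OUT):
        result.append(core.data_out)
        core.write_ins_in(NEXT_OUT)     # request the next result digit
    return result
```

On real hardware, the accesses to the mock object would be replaced by reads and writes to the memory-mapped registers.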
Table 3 summarizes the hardware resources occupied by the ECC accelerator and by the proposed architecture for the IoTS and IoTC coordinators on the Zynq-7Z007S device. The results are shown in terms of slice LUTs and selected RAM blocks. From Table 3, it must be noted that the difference in hardware resources between the IoT designs and the ECC accelerator is 5108 LUTs and 2 RAMs. This is due to the AXI interconnect bus and the Wireless_mgr controller. The proposed ECC accelerator requires only 24% of the total available LUTs in the targeted device, while the overall design occupies 60% of them. Moreover, the proposed architecture requires only 9 block RAMs.
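The utilization figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the paper (3395 LUTs for the accelerator, a 5108-LUT difference for the complete design, and 14400 available LUTs); the variable and function names are illustrative.

```python
TOTAL_LUTS = 14400   # slice LUTs available on the Zynq 7Z007S
ECC_LUTS = 3395      # ECC accelerator alone
EXTRA_LUTS = 5108    # AXI interconnect bus + Wireless_mgr controller


def utilisation(luts, total=TOTAL_LUTS):
    """Return the share of the device LUTs used, in percent."""
    return 100.0 * luts / total


overall_luts = ECC_LUTS + EXTRA_LUTS          # 8503 LUTs
ecc_pct = utilisation(ECC_LUTS)               # about 23.6 %
overall_pct = utilisation(overall_luts)       # about 59.0 %
print(f"ECC core: {ecc_pct:.1f} %, overall design: {overall_pct:.1f} %")
```

These values are consistent with the rounded figures of 24% and 60% given in the text.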

Software Development
The proposed IoTS and IoTC coordinators run on embedded Linux. The Linux boot image for Zynq (BOOT.bin) and the Linux system image (image.ub) are loaded to the QSPI flash and the eMMC memory, respectively, both available on the board. These files are generated with the Xilinx Petalinux 2018.2 tool from the hardware description file (bitstream.bit) of the proposed hardware architecture. Embedded Linux is used because the OS allows flexible use of the WiFi module for internet communication between the IoTS and IoTCs using TCP/IP client/server sockets. Figure 7 summarizes the software development required to implement the TLS1.2 protocol.

To implement the TLS1.2 protocol between the IoTS and IoTC coordinators, the Server_TLS.py and Client_TLS.py Python scripts have been developed for the respective designs. Python has been used in order to exploit the socket, random, hashlib and hmac libraries for TCP/IP socket communication, random number generation, SHA256 and HMAC execution, respectively. Since Python is an interpreted language, which makes its execution slower, the AES and ECC algorithms are implemented in C for faster execution. The resulting C code is compiled into a shared library (AES_ECC.so), which is imported and used in the Client/Server_TLS.py files. The C code and the shared library are generated using the Xilinx Software Development Kit (XSDK) tool. The shared library consists of two C source files, namely AES.c and ECC.c. The first file defines the AES encryption() and decryption() functions. The second file describes the ECDSA() and ECDHE() functions for performing the considered ECC protocols. These two functions are based on the ECC_driver.c file and on the finite field functions required in the ECDSA algorithm.

It must be noted that the inputs and the outputs of the AES and ECC functions are based on radix-2^8 and radix-2^32 representations, respectively. Radix-2^8 is used since the AES algorithm performs 8-bit operations, while radix-2^32 is considered for the ECC algorithms not only because the ARM is a 32-bit microprocessor but also because of the 32-bit AXI bus, where data/instructions are transferred digit-by-digit in serial mode. The representation of large numbers in radix-2^8 and radix-2^32 is performed in Python using ctypes.c_int from the ctypes library.

In the ECDSA protocol, Algorithm 1 and Algorithm 2 require MA, MM and MI (modular addition, multiplication and inversion) computations over 256-bit operands. These computations are provided by the MA(), MM() and MI() functions. MM is based on the Montgomery radix-2^32 modular multiplication algorithm [27,43]. On the other hand, MI is executed by modular exponentiation (Mexp) according to Fermat's little theorem [7], as shown in Equation (7). This theorem and, consequently, Equation (7) are valid when n is a prime integer. The easiest way to perform Mexp is the left-to-right binary method [44].
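Two of the steps described above, the radix-2^32 digit representation and the modular inversion by exponentiation, can be sketched in Python as follows. This is a minimal sketch rather than the authors' code: the helper names are hypothetical, the least-significant-digit-first ordering is an assumption, and ctypes.c_uint32 is used here (instead of c_int) so that digits up to 2^32 - 1 are stored without sign wrap-around. The inversion relies on Fermat's little theorem, a^(n-2) ≡ a^(-1) (mod n) for prime n, evaluated with the left-to-right binary method using plain modular multiplication rather than the Montgomery form.

```python
import ctypes


def to_digits(value, radix_bits, n_digits):
    """Split an integer into n_digits digits of radix 2**radix_bits.

    Assumption: least-significant digit first.
    """
    mask = (1 << radix_bits) - 1
    return [(value >> (radix_bits * i)) & mask for i in range(n_digits)]


def from_digits(digits, radix_bits):
    """Inverse of to_digits()."""
    return sum(d << (radix_bits * i) for i, d in enumerate(digits))


def to_c_array(value, n_digits=8):
    """Pack a value of up to 256 bits into a C array of 32-bit digits."""
    return (ctypes.c_uint32 * n_digits)(*to_digits(value, 32, n_digits))


def mexp_left_to_right(base, exponent, modulus):
    """Left-to-right binary modular exponentiation."""
    result = 1
    for bit in bin(exponent)[2:]:           # most-significant bit first
        result = (result * result) % modulus
        if bit == "1":
            result = (result * base) % modulus
    return result


def mod_inverse(a, n):
    """Modular inverse of a modulo a prime n via Fermat's little theorem."""
    return mexp_left_to_right(a % n, n - 2, n)
```

An array built by to_c_array() is the kind of argument that could be handed to a function of a shared library loaded through ctypes, although the actual signatures of the AES_ECC.so functions are not given in the text.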
The ECC_driver.c file contains the C drivers that control the ECC accelerator. It is composed of four functions: reset_hw(), send_crd_sclr(), Get_result() and VR_adr_hw(). The first three functions reset the ECC accelerator, send the inputs of Algorithm 3, and retrieve the resulting point coordinates, respectively. As our designs run on embedded Linux, the ARM processor needs, at system initialization, to generate a virtual address (ECC_vr_adr) for the ECC accelerator and map it to its physical address (ECC_BASE_ADDR). This step is carried out by the VR_adr_hw() function, which executes the following call:

ECC_vr_adr = mmap(NULL, pg_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, (ECC_BASE_ADDR & ~(pg_size-1)));

Once the virtual address is generated, the addresses of the four registers used for data/instruction exchange are calculated from it.

Table 4 presents the execution time of the developed crypto functions for the TLS1.2 protocol, as well as the time of the whole TLS1.2 process. The reported figures include the following execution times:

• Data representation from large numbers to radix-r for AES and ECC computations.
• Client/Server data exchange via sockets.

The proposed design performs a single 233-bit ECSM using the ECC accelerator in 400 µs. Moreover, the IoTS and IoTC perform the ECDHE procedure in 1.7 ms. This time depends on the size of n1 and n2, which are required in the ECDHE procedure. In our case, the size of both n1 and n2 is 233 bits. For the ECDSA protocol, the IoTS generates the signature in 3.5 ms, while the IoTC checks the received signature in 4.1 ms. These two times are linked to the bit-size of the k generated in line 1 of Algorithm 1 and of the intermediate results (u1, u2) of Algorithm 2. Finally, the generation of the 384-bit secret key between the IoTS and the IoTC based on the TLS1.2 protocol is achieved in 67.5 ms. Figure 8 shows a screenshot of one measurement of the TLS1.2() execution on the server side (Figure 8a) and on the client side (Figure 8b).
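The address mapping performed by VR_adr_hw(), described earlier in this section, relies on the fact that mmap() expects a page-aligned file offset, so a physical address is split into a page base and an in-page offset. The Python sketch below illustrates this arithmetic and a possible register access pattern. It is a sketch under several assumptions that are not taken from the paper: the base address 0x43C00000, the register offsets 0x0 to 0xC, the use of /dev/mem, and all function names are hypothetical. Python also does not guarantee that each access reaches the bus as a single 32-bit transfer, which is one reason a C driver is preferable on real hardware.

```python
import mmap
import os
import struct

ECC_BASE_ADDR = 0x43C00000  # hypothetical physical address of the accelerator
# Hypothetical layout: four consecutive 32-bit registers.
REG_OFFSETS = {"InsIn": 0x0, "DataIn": 0x4, "InsOut": 0x8, "DataOut": 0xC}


def split_address(phys_addr, page_size=mmap.PAGESIZE):
    """Return (page-aligned base, offset inside the page)."""
    page_base = phys_addr & ~(page_size - 1)
    in_page = phys_addr & (page_size - 1)
    return page_base, in_page


def map_accelerator(phys_addr=ECC_BASE_ADDR, dev="/dev/mem"):
    """Map one page of physical memory (Linux only, needs privileges)."""
    page_base, in_page = split_address(phys_addr)
    fd = os.open(dev, os.O_RDWR | os.O_SYNC)
    try:
        mem = mmap.mmap(fd, mmap.PAGESIZE, mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE, offset=page_base)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
    return mem, in_page


def write_reg(mem, in_page, name, value):
    """Write a 32-bit little-endian value to the named register."""
    struct.pack_into("<I", mem, in_page + REG_OFFSETS[name], value)


def read_reg(mem, in_page, name):
    """Read a 32-bit little-endian value from the named register."""
    return struct.unpack_from("<I", mem, in_page + REG_OFFSETS[name])[0]
```

For experimentation without hardware, write_reg() and read_reg() also work on an anonymous mapping created with mmap.mmap(-1, mmap.PAGESIZE).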

Comparison with Some Recent Works
In order to compare our proposal with other works, an ad hoc experimental setup has been prepared, which consists of two MiniZed boards hosting Zynq devices and communicating over WiFi. The devices on both boards include the MP_ECC_B-233_RNOKOA11C accelerator for cryptographic operations; one of the MiniZed boards acts as the server and the other one acts as the client for the TLS handshake. Figure 9 shows a picture of this experimental setup. Table 5 shows the performance comparison between our design and some FPGA-based TLS/SSL implementations. The comparisons are made in terms of occupied slice LUTs, selected RAM blocks and execution time for a single TLS/SSL handshake negotiation.


Figure 1. Global scheme of the targeted IoT (Internet of Things) application.

Algorithm 3. Montgomery ladder over projective coordinates, making field operations explicit.

Figure 6. Integration of ECC accelerator with AXI (Advanced eXtensible Interface) bus.

Figure 8. Screenshot of TLS1.2() execution time on (a) the server side and (b) the client side.

Table 2. MP_ECC_B-233_RNOKOA11C implementation results and comparison to other implementations.

Table 3. Hardware resource requirements of the proposed architectures.

Table 4. Execution time of the involved crypto functions for TLS execution.