Hardware Performance Evaluation of Authenticated Encryption SAEAES with Threshold Implementation

SAEAES is the authenticated encryption algorithm instantiated by combining the SAEB mode of operation with AES, and a candidate in NIST’s lightweight cryptography competition. Using AES gives the advantage of backward compatibility with the existing accelerators and coprocessors that the industry has invested in so far. Still, newer lightweight block ciphers (e.g., GIFT) outperform AES in compact implementations, especially with side-channel attack countermeasures such as threshold implementation. This paper aims to implement the first threshold implementation of SAEAES and evaluate the cost we are trading for the backward compatibility. We design a new circuit architecture using the column-oriented serialization based on the recent 3-share and uniform threshold implementation (TI) of the AES S-box based on the generalized changing of the guards. Our design uses 18,288 GE, with AES occupying 97% of the total area. Meanwhile, the circuit area is roughly three times that of the conventional SAEB-GIFT implementation (6229 GE) because of the large memory size needed for AES’s non-linear key schedule and the extended states for satisfying uniformity in TI.


Introduction
There is an increasing demand for secure data communication between embedded devices in many areas, including automotive, industrial, and smart-home applications. To enable cryptography in resource-constrained devices, researchers have studied lightweight cryptography that has a good performance in implementation by design. Lightweight cryptography emerged from block cipher design [1], which now covers a larger area in cryptography, including authenticated encryption (AE). In particular, NIST is running a standardization process for lightweight AE algorithms (NIST LWC) [2].
Side-channel attack (SCA) [3,4] is a considerable security risk in lightweight cryptography's main targets: embedded devices under a hostile environment in which a device owner attacks the device with physical possession. Consequently, NIST LWC considers the grey-box security model with side-channel leakage [5]. In addition to security, the cost of implementing SCA countermeasures in resource-constrained devices is a big issue because SCA countermeasures multiply the cost.
Threshold implementation (TI) [6] is an SCA countermeasure based on multi-party computation (MPC) [7]. TI is popular for hardware implementations because it can provide the security in the presence of glitches, i.e., transient signal propagation through a combinatorial circuit, which is inevitable in common hardware design. Consequently, there are an increasing number of papers reporting authenticated encryptions with TI [8][9][10]. Researchers are even optimizing the algorithms for TI: the TI-friendly S-boxes [11,12] and the TI-friendly modes of operation [13,14].
SAEAES is an instantiation of the SAEB mode of operation [15] with the standard block cipher AES [16], and is a NIST LWC candidate. Choosing AES is a practical decision for providing backward compatibility with the numerous AES accelerators and coprocessors that the industry has invested in so far. However, few NIST LWC candidates chose AES (COMET [17], mixFeed [18], and SAEAES [19] out of the 32 candidates) because newer lightweight primitives outperform AES in lightweight implementations. The impact of using AES is even larger with TI. Many lightweight algorithms, such as GIFT [20] and SKINNY [21], use an S-box for which an efficient, i.e., 3-share and uniform, TI is available [21]. In contrast, this is not the case for AES [22], which was standardized before TI became popular. The early AES TI compensated for this disadvantage by refreshing the output share by adding fresh randomness [23][24][25], but this raised another implementation challenge of generating fresh randomness at a high rate. Daemen's changing of the guards [26] in 2017 opened the door for enabling a uniform TI for a larger class of functions, and its generalization enabled the first 3-share TI for AES without fresh randomness in 2019 [27].

Purpose and Approach
The question that naturally arises is the cost of the backward compatibility: how many more gates do we need by choosing AES instead of other lightweight algorithms with TI? The question has been unanswered because of the gap between the conventional works on lightweight AE and efficient TI implementation: the conventional SAEAES implementations are all without TI [15,19,28]. The purpose of this paper is to implement the first threshold implementation of SAEAES and to evaluate the cost we are trading with the backward compatibility.
Our approach is to extend the recent AES implementation with the 3-share and uniform TI using the generalized changing of the guards [27], but we redesign the AES circuit architecture to satisfy the additional requirements imposed by the mode of operation. Then, we evaluate our design's performance and compare it with the previous implementation of SAEB-GIFT [13]: the same mode of operation instantiated with the state-of-the-art lightweight block cipher GIFT [20].

Contributions
Here we summarize our key contributions.

(I) Identification of design challenges in extending AES implementation to SAEAES (Section 4)
Our design is based on the 3-share and uniform TI of AES using the generalized changing of the guards [27]. We identify that the mode of operation enforces the byte order, making the conventional row-oriented serialization inefficient [23]. Also, the mode of operation should preserve the secret key that the on-the-fly key schedule overwrites.

(II) Column-oriented AES implementation (Section 5.2)
We propose a new AES circuit architecture that uses the column-oriented data serialization to address the aforementioned incompatibility with the row-oriented serialization.

(III) The first SAEAES implementation with threshold implementation (Section 5)
We show the first TI of SAEAES that uses the column-oriented serialization and the 3-share and uniform AES S-box. The design has an independent key store for preserving the secret key until the next AES call.

(IV) Improved TI of key array (Section 5.5)
We show the concrete realization of the key array for TI that reduces the register size by 216 bits or 32% from the original design [27].

(V) Performance evaluation and comparison (Section 6)
We synthesize our design using a standard cell library to evaluate its circuit area in GE (gate equivalents). We show that our design uses 18,288 GE with TI, composed of AES (14,256 GE, 78%), the key store (3422 GE, 19%), and the mode of operation (610 GE, 3%). Compared with the conventional SAEB-GIFT implementation that uses 6229 GE [13], the SAEAES implementation is roughly three times larger. We identify the non-linear key schedule and the extended states for satisfying uniformity as the major factors for this difference.

Organization
This paper is organized as follows. We begin by reviewing the algorithm of SAEAES in Section 2, and the previous TI of AES in Section 3. Then, we state the design challenges we address in the paper in Section 4. We describe our proposed design in Section 5, followed by the performance evaluation in Section 6. Section 7 is the conclusion.

Authenticated Encryption
Authenticated encryption (AE) is a cryptographic algorithm that provides confidentiality and integrity using a symmetric key. An AE encryption algorithm converts a plaintext and associated data into a ciphertext and an authentication tag. The corresponding AE decryption algorithm converts the ciphertext and the unencrypted associated data back into the original plaintext. By using the tag during the decryption, the algorithm checks the integrity, i.e., detects changes in the original ciphertext or the associated data, to prevent forgery attacks. A common AE construction is to combine a block cipher with a mode of operation, and AES [16] with the Galois/counter mode (AES-GCM) is by far the most popular AE, approved by NIST SP800-38D [29] and RFC5288 [30], and is used in major systems including SSL/TLS.
Lightweight cryptography is a branch of cryptography on designing cryptographic algorithms that achieve efficient performance in resource-constrained devices. The demand for such lightweight cryptography is higher than ever before because of the recent technology trend of adding connectivity to embedded devices. Moreover, NIST has been running a standardization process called NIST LWC [2] since 2017, which makes lightweight AE an even more active research area.

SAEAES and its Algorithm
SAEAES [19] is an instantiation of the lightweight mode of operation SAEB [15] using AES, which is a candidate of NIST LWC [2]. We focus on SAEAES_128_64_128 with the 128-bit key, 64-bit associated data block, and 128-bit tag among the ten variants.
SAEAES is composed of the HASH, Encryption, and Decryption algorithms that process the associated data (AD), plaintext, and ciphertext blocks, respectively. SAEAES is based on the sponge construction, in which the data blocks are absorbed into the internal state in between iterated AES calls, as shown in Figure 1.
HASH consumes AD blocks $A_1, \ldots, A_n$ and a 120-bit nonce $\{N_U, N_L\}$ to generate an initial value for the subsequent Encryption or Decryption, denoted by $\{IV_U, IV_L\}$. Encryption consumes the message blocks in the same way: for the previous AES output $\{t_U, t_L\}$ and the message block $M_i$, the next AES input is $\{M_i \oplus t_U, t_L\}$ and the corresponding ciphertext block is $C_i = M_i \oplus t_U$. Decryption, on the other hand, recovers the message block $M_i = C_i \oplus t_U$ and feeds $\{C_i, t_L\}$ as the next AES input. The tag is the final AES output $\{T_U, T_L\}$, as shown in Figure 1.
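The Encryption and Decryption data flow described above can be sketched as follows. This is a minimal illustration and not the SAEAES specification: `toy_cipher` is a hypothetical stand-in for AES under a fixed key, the 8-byte block size `RATE` is an assumption, and HASH, padding, and the domain-separation constants are omitted.

```python
import hashlib

BLOCK = 16  # AES block size in bytes
RATE = 8    # bytes absorbed per cipher call (assumption for this sketch)


def toy_cipher(block: bytes) -> bytes:
    """Hypothetical stand-in for AES under a fixed key (this is NOT AES)."""
    return hashlib.sha256(b"toy-key" + block).digest()[:BLOCK]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_blocks(t: bytes, message_blocks, cipher=toy_cipher):
    """t is the previous cipher output {t_U, t_L}; returns (ciphertext, tag)."""
    ciphertext = []
    for m in message_blocks:
        t_u, t_l = t[:RATE], t[RATE:]
        c = xor(m, t_u)          # C_i = M_i xor t_U
        ciphertext.append(c)
        t = cipher(c + t_l)      # next cipher input is {M_i xor t_U, t_L}
    return ciphertext, t         # the final cipher output serves as the tag


def decrypt_blocks(t: bytes, ciphertext_blocks, cipher=toy_cipher):
    """Mirror of encrypt_blocks; returns (message, recomputed tag)."""
    message = []
    for c in ciphertext_blocks:
        t_u, t_l = t[:RATE], t[RATE:]
        message.append(xor(c, t_u))  # M_i = C_i xor t_U
        t = cipher(c + t_l)          # next cipher input is {C_i, t_L}
    return message, t
```

Both directions call the cipher only in the forward direction, and both feed the same value into the next call, so a receiver recomputes the same tag when the ciphertext is unmodified.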
SAEAES (and SAEB) satisfies the following properties that contribute to a lightweight implementation:
Minimum state size: No extra memory in addition to AES, which reduces the register size in hardware implementation.
Inverse free: No need for AES decryption. This reduces the cost of implementing inverse AES operations and the overhead for selectors or conditional branches.
XOR only: The extra operation in addition to AES is XOR only, which is more efficient than other options, such as the GF($2^{128}$) multiplication in AES-GCM.
Online: The message and ciphertext blocks are scanned only once. There is no need for a buffer storing the blocks until the second scan.
Efficient handling of repeated associated data: SAEAES can skip some operations when several Encryption/Decryption calls use the same (i.e., repeated) associated data.

Hardware Implementations of SAEAES
The original SAEB paper reported a compact hardware implementation of SAEB instantiated with AES [15] using 3502 GE. The design uses the byte-serial architecture [23] that we discuss later in more detail. Balli et al. further reduced the circuit area to 2067 GE using the bit-serial technique, and compared it with many other AEs using SKINNY and GIFT [28].
There is no TI of SAEAES as far as the authors are aware. Meanwhile, there is a TI of SAEB instantiated with the GIFT lightweight block cipher [13,20]. Caforio et al. evaluated a number of NIST LWC candidates with TI [10].

Side-Channel Attack and Countermeasure
Many embedded devices, such as a smartcard, store a service provider's key and should withstand attacks by the legitimate device owners. Under such a hostile environment, the attacker with physical access uses a side-channel attack that exploits information leakage in power consumption and/or electromagnetic radiation [4]. Designing cryptographic modules secure against such attacks is challenging, and researchers have studied new attacks and countermeasures for more than two decades.
Masking based on MPC is by far the most well-studied countermeasure against a power side-channel attack [4,7]. In (additive) masking, we encode a sensitive variable $x$ into a set of variables called a share $\mathbf{x} = [x_a, x_b, x_c]$ satisfying $x = x_a \oplus x_b \oplus x_c$, thereby randomizing and decoupling the sensitive value from the data representation. An attacker with limited access to only a proper subset of the share cannot reconstruct the original value $x$. Masking provides a way to realize a target cryptographic algorithm (e.g., AES) while preserving the shared representation, thereby providing protection against side-channel attacks.
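As a minimal software sketch of the additive (XOR) masking just described, the following encodes a value into three shares and recombines it. The function names are hypothetical and chosen for illustration; a real countermeasure operates on shares in hardware and never recombines them during the computation.

```python
import secrets


def share(x: int, bits: int = 8) -> list:
    """Split x into three shares with x = x_a ^ x_b ^ x_c."""
    x_a = secrets.randbits(bits)
    x_b = secrets.randbits(bits)
    x_c = x ^ x_a ^ x_b          # the last share fixes the XOR sum to x
    return [x_a, x_b, x_c]


def unshare(s: list) -> int:
    """Recombine the shares (for checking only)."""
    return s[0] ^ s[1] ^ s[2]


def shared_xor(s: list, t: list) -> list:
    """A linear operation such as XOR is applied share by share."""
    return [a ^ b for a, b in zip(s, t)]
```

Any two of the three shares are uniformly random and independent of `x`, which is why observing a proper subset reveals nothing about the sensitive value. Linear operations act on each share independently; non-linear operations such as an S-box need the more careful treatment discussed next.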

Threshold Implementation
Threshold implementation (TI) proposed by Nikova et al. [6] is an MPC-based countermeasure suitable for hardware implementation because it can be secure even in the presence of glitches, i.e., transient signal propagation through a combinatorial circuit, which is inevitable in common hardware design. In TI, for a target function $f$, we compose a set of functions called the sharing $\{f_a, f_b, f_c\}$ given by
$$y_a = f_a(x_b, x_c), \quad y_b = f_b(x_a, x_c), \quad y_c = f_c(x_a, x_b), \tag{1}$$
where $[x_a, x_b, x_c]$ and $[y_a, y_b, y_c]$ are the input and output shares. $\{f_a, f_b, f_c\}$ should satisfy the following three properties:
Correctness: The output shares recombine to the output of the target function, i.e., $y_a \oplus y_b \oplus y_c = f(x_a \oplus x_b \oplus x_c)$.
Non-completeness: Each function in the sharing is independent of at least one input share. The sharing given by Equation (1) satisfies non-completeness because $f_a$, $f_b$, and $f_c$ do not accept $x_a$, $x_b$, and $x_c$, respectively.
Uniformity: The sharing $\{f_a, f_b, f_c\}$ is said to satisfy uniformity if it preserves the uniform distribution: a uniform sharing generates a uniformly distributed output share given a uniformly distributed input share. The uniform distribution of the input share is a necessary condition for the TI's security. With uniformity, we can feed the output share to the next sharing, thereby enabling a cascaded connection between sharings.
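To make the correctness and non-completeness properties concrete, the sketch below uses the well-known 3-share sharing of a single AND gate as a toy target function. It is not the AES S-box sharing used in this paper. The target has two inputs $x$ and $z$, so each component function receives two shares of each input; non-completeness is visible in the function signatures, and correctness is checked exhaustively.

```python
from itertools import product


# Each component function omits one share index (non-completeness).
def f_a(x_b, x_c, z_b, z_c):
    return (x_b & z_b) ^ (x_b & z_c) ^ (x_c & z_b)


def f_b(x_a, x_c, z_a, z_c):
    return (x_c & z_c) ^ (x_a & z_c) ^ (x_c & z_a)


def f_c(x_a, x_b, z_a, z_b):
    return (x_a & z_a) ^ (x_a & z_b) ^ (x_b & z_a)


def shared_and(x_sh, z_sh):
    """Apply the three component functions to the input shares."""
    x_a, x_b, x_c = x_sh
    z_a, z_b, z_c = z_sh
    return (f_a(x_b, x_c, z_b, z_c),
            f_b(x_a, x_c, z_a, z_c),
            f_c(x_a, x_b, z_a, z_b))


def is_correct() -> bool:
    """Check y_a ^ y_b ^ y_c == x & z for all 64 share combinations."""
    for x_sh in product((0, 1), repeat=3):
        for z_sh in product((0, 1), repeat=3):
            x = x_sh[0] ^ x_sh[1] ^ x_sh[2]
            z = z_sh[0] ^ z_sh[1] ^ z_sh[2]
            y = shared_and(x_sh, z_sh)
            if y[0] ^ y[1] ^ y[2] != (x & z):
                return False
    return True
```

The nine product terms of $(x_a \oplus x_b \oplus x_c)(z_a \oplus z_b \oplus z_c)$ are distributed over the three functions so that no function sees all shares of any input.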

Composing a Sharing for a Target Function
Designing a sharing satisfying the above three properties for a given target function (e.g., S-box) is an important challenge in TI. Besides, we want to minimize the number of shares because the hardware cost grows quadratically with the number of shares. For a function having an algebraic degree of $d$, there is a generic way to construct a correct and non-complete shared function using $d + 1$ shares [31]. Since $d > 1$ for a non-linear function such as an S-box, three is the minimum number of shares. Target functions with $d > 2$ are commonly decomposed into sub-functions with $d = 2$ for realizing a 3-share sharing.

Lack of Uniformity and Refreshing
The availability of a uniform sharing depends on the target function [11]. Consequently, many lightweight algorithms, including GIFT and SKINNY, choose an S-box for which a 3-share and uniform sharing is available [21]. In contrast, the AES and Keccak S-boxes did not have a uniform sharing until very recently. The early AES TI used a non-uniform sharing, and compensated for the lack of uniformity by refreshing the output share by adding fresh randomness [23][24][25]. Although refreshing rescues a non-uniform sharing, the need for a lot of fresh randomness raised another implementation problem: the previous implementations need 2560-10,240 random bits for each encryption, which incurs a considerable cost in terms of circuit area and power consumption [27].
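The effect of refreshing can be observed on a toy example. The sketch below takes the standard correct and non-complete 3-share sharing of a single AND gate (not the AES S-box), counts how often each output sharing occurs over all input sharings of fixed unshared inputs, and repeats the count after adding two fresh random bits. The helper names are hypothetical.

```python
from collections import Counter
from itertools import product


def shared_and(x, z):
    """Correct and non-complete 3-share AND (a toy target function)."""
    (x_a, x_b, x_c), (z_a, z_b, z_c) = x, z
    return ((x_b & z_b) ^ (x_b & z_c) ^ (x_c & z_b),
            (x_c & z_c) ^ (x_a & z_c) ^ (x_c & z_a),
            (x_a & z_a) ^ (x_a & z_b) ^ (x_b & z_a))


def sharings(v):
    """All four 3-share encodings of the bit v."""
    return [(a, b, a ^ b ^ v) for a, b in product((0, 1), repeat=2)]


def refresh(y, r1, r2):
    """Add a random sharing of zero; costs two fresh random bits."""
    return (y[0] ^ r1, y[1] ^ r2, y[2] ^ r1 ^ r2)


def distribution(x, z, refreshed):
    """Count output sharings over all input sharings of (x, z)."""
    counts = Counter()
    for x_sh, z_sh in product(sharings(x), sharings(z)):
        y = shared_and(x_sh, z_sh)
        if refreshed:
            for r1, r2 in product((0, 1), repeat=2):
                counts[refresh(y, r1, r2)] += 1
        else:
            counts[y] += 1
    return dict(counts)
```

For unshared inputs $x = z = 0$, the output sharing $(0, 0, 0)$ occurs 7 times out of 16 and the other three valid sharings 3 times each, so the sharing is not uniform. After refreshing, every valid output sharing occurs equally often, at the price of two random bits per gate evaluation in this toy setting.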

Changing of the Guards
Daemen successfully realized a 3-share and uniform Keccak S-box by introducing an elegant technique called the changing of the guards [26]. The idea is to construct a sharing of a layer of S-boxes instead of each S-box.
] constructed from the neighboring input. This makes the final share uniform again in the same way as the conventional refreshing with fresh randomness. In other words, the changing of the guards recycles the previous input shares as a substitute for fresh randomness in a provable way.
This technique is applicable to any bijective mapping including the Keccak S-box, but its application to the AES S-box was not trivial: field multiplication, which is non-bijective, appears in decomposing the S-box to reduce the number of shares. Wegener and Moradi successfully decomposed the AES S-box into a series of bijective mappings and applied the changing of the guards, but it required four shares instead of three [32].
Finally, Sugawara generalized the changing of the guards [27] to support non-bijective mappings, and realized the first uniform and 3-share sharing of AES. The idea is to consider the unshared representation of Daemen's changing of the guards, as shown in Figure 2 (right). The key point is extending the original function $f$ into a modified function followed by a null function $\perp$ that maps anything to zero, which ensures the availability of a uniform sharing. By generalizing this extension to a non-bijective mapping using the generalized Feistel network, we can construct the changing of the guards sharing of non-bijective functions including field multiplication. By applying the generalized changing of the guards to each stage of the decomposed AES S-box, Sugawara proposed the first 3-share TI without using fresh randomness, shown in Figure 3.
As a side effect of extending a non-bijective function, the datapath width should be extended from 8 to 14 bits.

Design Challenges
Our approach of implementing SAEAES is to extend the previous AES implementation using the generalized changing of the guards [27]. In this section, we discuss several challenges we face in the extension.

Byte Order and Serialization
The byte-serial architecture, which is commonly used for compact AES implementations, scans the 16-byte AES state one byte per clock cycle. There are row-oriented and column-oriented serializations. Many of the conventional lightweight AES implementations, including the previous SAEAES implementation [15], follow the rigorously optimized architecture by Moradi et al. [23] that uses the row-oriented state and key arrays in Figure 4.
Figure 4. The row-oriented state and key arrays in the conventional compact AES implementation [23].
One drawback of the row-oriented serialization is its incompatibility with AES's native byte order. This incompatibility causes no problem as long as we consider a single AES call because we can absorb the difference by redefining the data representation.
However, we cannot change the data representation with a mode of operation because it specifies the byte order for supporting arbitrary-length (i.e., block-unaligned) messages. Table 1 shows the timing at which we feed an 8-byte string $m_7 m_6 \cdots m_0$ in the column- and row-oriented serializations. With the column-oriented serialization, we can feed the bytes in the original order every cycle. With the row-oriented serialization, on the other hand, we need to reorder the bytes and feed them in an interleaved manner. This reordering and synchronization cost an additional 56-bit shift register in the previous SAEAES implementation [15].
Table 1. The byte order for feeding an 8-byte string $m_7 m_6 \cdots m_0$ in the column- and row-oriented serializations.
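The interleaving can be illustrated with a small index calculation. The sketch assumes the standard AES state layout, in which byte $i$ of a block sits at row $i \bmod 4$ and column $\lfloor i/4 \rfloor$, and it assumes that a row-oriented design scans the state row by row. It does not reproduce the exact cycle numbers of Table 1.

```python
def column_order():
    """Scan the 4x4 state column by column: AES's native byte order."""
    return [4 * col + row for col in range(4) for row in range(4)]


def row_order():
    """Scan the 4x4 state row by row (assumed row-oriented scan)."""
    return [4 * col + row for row in range(4) for col in range(4)]


def feed_cycles(order, n_bytes=8):
    """Cycles at which the first n_bytes of a block enter the array."""
    return [cycle for cycle, idx in enumerate(order) if idx < n_bytes]


def feed_sequence(order, n_bytes=8):
    """Order in which the first n_bytes of a block must be supplied."""
    return [idx for idx in order if idx < n_bytes]
```

Under these assumptions, an 8-byte string is consumed in cycles 0 to 7 in its original order with the column-oriented scan, whereas the row-oriented scan consumes it in cycles 0, 1, 4, 5, 8, 9, 12, 13 in the permuted order 0, 4, 1, 5, 2, 6, 3, 7. Buffering the bytes until their slot arrives is the kind of overhead the extra shift register pays for.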

On-The-Fly Key Schedule
The on-the-fly key schedule is a common technique for reducing the register cost that updates the secret key in place during the key schedule. Although overwriting the original key is not a problem for an application that uses a single AES call (e.g., challenge-response authentication), SAEAES needs the same secret key for processing the next block. There are two ways to get the same key again: (i) storing it in another register or (ii) recovering the original data by implementing the reverse key schedule. In this paper, we use the first approach, as discussed in Section 5.3.

Scan Flip Flop and Clock Gating
The previous design further optimizes the row-oriented arrays (Figure 4) using netlist-level (cf. register-transfer level (RTL)) optimization [23]. The design uses scan flip flops (SFFs), registers with a built-in selector, which are more efficient than an individual register and selector combined. Moreover, the design uses gated clocking to control the data flow instead of implementing an enable signal using a selector with a feedback line. These techniques are very efficient and are followed by many designs [13,14,21,23].
Meanwhile, the SFF's original purpose is to instrument a design with a scan-path chain and provide a way for testing after fabrication, not to optimize a design. A logic synthesizer does not infer an SFF unless a designer explicitly instantiates the cell in the code. This type of optimization binds the design to a specific standard-cell library, and is unavailable to some designers who release a synthesizable RTL design (e.g., IP vendors). Also, some conservative coding/design rules prohibit such an aggressive optimization because of a possible incompatibility with automated scan-path insertion. Moreover, avoiding glitches on the gated clock signal needs a careful design using a dual-phase or dual-edge clock, which increases the engineering cost. To avoid these disadvantages, we use neither SFFs nor a gated clock, as we discuss in Section 5.1.

Design Policy
We use a conservative design policy based on the discussion in Section 4. We describe the design at the register-transfer level (RTL); we use no netlist-level optimization, including the direct standard-cell instantiation of SFFs, so that the design is not bound to a specific library. The design completely synchronizes to a single-edged clock and uses no gated clock.
The circuit area is the primary performance target. By considering the design choices in software/hardware codesign, and following the common practice of letting hardware do regular operations for improving the efficiency, we choose the coprocessor interface aiming at accelerating the main time-consuming part of AD processing, encryption, and decryption. The design relies on an external processor for handling exceptional cases such as padding, following the previous works [10,13-15,28]. An alternative approach is to include the padding circuit inside the design, which is more suitable for a high-speed design with direct memory access [33]. For implementing one-zero padding for SAEAES, we would need an additional input for indicating the end of the message, and some selector and AND gates for overwriting the incoming data stream.
The implementation holds several parameters during their lifetime. In particular, it preserves the secret key over multiple AES calls to eliminate the hidden cost of an external key register. The design assumes an asynchronous register interface: an external controller needs no cycle-accurate synchronization.

Column-Oriented Serialization
We propose the column-oriented arrays in Figure 5 to address the issue of the row-oriented serialization discussed in Section 4.1. Figure 5 (left) shows the column-oriented state array. Figure 5 explicitly shows selectors attached to registers because we do not use SFFs. The circuit uses the vertical links for shifting the serialized data, and the horizontal links for MixColumns and ShiftRows. In particular, we realize ShiftRows by shifting the data using the horizontal links while controlling the number of shifts using enable signals. This array finishes an AES round using 27 cycles, as shown in Table 2.

Key Array
As shown in Figure 5 (right), the column-oriented key array has a simplified datapath because the AES key schedule is itself column-oriented. It uses no horizontal link, which significantly reduces the circuit area without an SFF, i.e., when a selector does not come for free. Table 2 summarizes the key array's operations in each cycle. The key array works as a shift register for the first 16 cycles. It then feeds the bytes in the fourth column to the S-box for SubWord during the 17-20th cycles. Here, we use $S_{13}$ (cf. $S_{03}$, see Figure 5) as the S-box input to realize RotWord. The XOR gate connected to $S_{30}$ calculates the XOR between the neighboring columns while shifting the data in the 1-16th cycles.
Table 3 compares the circuit areas of the row- and column-oriented arrays after logic synthesis (see Section 6 for the tool, library, and conditions for the performance evaluation). The row-oriented arrays, (SR) and (KR) in Table 3, implement the ones in Figure 4 using SFFs, while the column-oriented arrays (SC) and (KC) implement the ones in Figure 5 using an ordinary register with an enable signal. The SFF is more efficient than a register with a selector, and (SC) is larger than (SR). Meanwhile, (KC) is smaller than (KR) because the simplified data flow eliminates most of the selectors on the registers. Although the row-oriented design is still better by 161 GE in total, the difference is relatively minor compared to the entire circuit area. Thus, it is a reasonable cost for unbinding the design from a specific standard cell library and reducing the engineering cost for handling multiple clocks.
Table 3. Performance comparison of the row- and column-oriented arrays.
Figure 6 shows the proposed SAEAES design, including the AES implementation composed of the key store, S-box implementation, and column-oriented arrays. The 8-bit buses receive the message and key bytes in AES's native order, i.e., the column orientation.
A single AES round takes 27 cycles (see Table 2), and the entire AES operation finishes in 282 cycles.

AES Implementation
The key store is an 8-bit and 16-stage shift register with feedback that stores the original key (see Section 4.2) and transfers it to the key array at the beginning of an AES encryption. We avoid a reverse key schedule because it has a significant impact on the key array and the S-box circuit in addition to doubling the latency. We use Canright's AES S-box representation [34] divided into four pipelined stages, which is necessary for TI, following the previous work [27].

SAEAES Implementation
The mode of operation is a thin wrapper by following the previous SAEB and SAEAES implementations [13,15], as shown in Figure 6. The wrapper consists of some 8-bit AND, XOR, and selector gates for changing the datapath depending on the target operation.
The SAEAES implementation supports five commands: $\mathrm{Com}_{\mathrm{INIT}}$, $\mathrm{Com}_{\mathrm{E}}$, $\mathrm{Com}_{\mathrm{N}}$, $\mathrm{Com}_{\mathrm{D}}$, and $\mathrm{Com}_{\mathrm{T}}$, each of which involves at most one AES call, as summarized in Table 4. Figure 1 shows how these commands realize SAEAES's HASH, Encryption, and Decryption. Figure 7 illustrates the active path on the simplified diagrams for each command. The XOR and AND gates control the next input to AES by combining the previous AES output, the input message/ciphertext byte, and a domain-separation constant. We use the same XOR for generating the output and tag.
The circuit has an 8-bit FIFO-like interface: a user pushes input bytes into the circuit, which updates the output bytes simultaneously. The circuit starts an AES encryption after receiving a sufficient number of bytes.
Table 4. Five commands that the SAEAES implementation supports. See Figure 1 for the corresponding operation in SAEAES, and Figure 7 for the active datapath in the proposed design.

S-box
We use the 3-share and uniform S-box [27] in Figure 3 both for the round operation and the key schedule (see Table 2 for the timing). The S-box circuit has a 42-bit datapath width, and we extend the entire datapath, including the input and output buses, to 42 bits. A user feeds the shared representation of a message/key, 42 bits at a time, over 16 cycles. The timing is the same as in the unprotected implementation, and an AES call takes 282 cycles in total.

State Array
We need to extend the S-box's input size from 8 to 14 bits for the generalized changing of the guards. To store these 14-bit data, we extended the column-oriented arrays to store 224 (=14×16) bits. We then duplicate the extended arrays to store the intermediate data in a shared representation. As a result, the state uses 672 (=14 × 16 × 3) bits, as summarized in Table 5.

Key Array and Key Store
We store the secret key in the duplicated key arrays and an independent shift register. The previous implementation [27] extends the entire key array from 8 to 14 bits, similar to the state array, resulting in 672 bits of registers. However, since only 4 out of 16 bytes go through the S-box calculation, the previous design wastes the 216 (= (16 − 4) × 6 × 3) extended bits. The previous work pointed out this inefficiency but gave no concrete realization [27].
Instead of extending the bit width of the key array, we add an independent 18-bit and 4-stage shift register for storing the extended bits ((C8) in Figure 6). As a result, the duplicated key array and the new shift register combined use 456 bits of registers as summarized in Table 5, reducing 216 bits from the previous design. We implement the key store in the same way using 456 bits of registers ((C2) and (C3) in Figure 6).
For each AES call, we need 448 (= (14 × 16 × 2)) random bits to make a shared representation of the AES's input. We use 304 (= (8 × 16 × 2 + 6 × 4 × 2)) random bits for making a key share for rekeying: 256 bits for a shared representation of the 128-bit key, and additional 48 bits for the extended bits. In addition, we need 24 random bits for initializing the S-box circuit once at the time of boot.
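The register and randomness budgets quoted in this section follow from a few products. The sketch below only recomputes the figures stated in the text; the variable names are chosen for illustration.

```python
SHARES = 3       # number of shares in the TI
BYTES = 16       # bytes in an AES state or key
BYTE_BITS = 8    # unextended width
EXT_BITS = 14    # width extended for the generalized changing of the guards
SBOX_BYTES = 4   # key bytes that pass through the S-box (SubWord)

extra_bits = EXT_BITS - BYTE_BITS                   # 6 extended bits per byte

# State array: every byte is extended and stored in three shares.
state_bits = EXT_BITS * BYTES * SHARES

# Key array in the previous design [27]: fully extended, like the state.
previous_key_bits = EXT_BITS * BYTES * SHARES

# Extended bits of the 12 key bytes that never reach the S-box.
wasted_bits = (BYTES - SBOX_BYTES) * extra_bits * SHARES

# Proposed key array: 8-bit arrays in three shares plus an
# 18-bit, 4-stage shift register for the extended bits.
key_array_bits = BYTE_BITS * BYTES * SHARES + (extra_bits * SHARES) * SBOX_BYTES

# Fresh random bits: two of the three shares are drawn at random.
state_random_bits = EXT_BITS * BYTES * (SHARES - 1)
key_random_bits = (BYTE_BITS * BYTES + extra_bits * SBOX_BYTES) * (SHARES - 1)

print(state_bits, previous_key_bits, wasted_bits, key_array_bits)
print(state_random_bits, key_random_bits)
```

The saving equals the wasted bits exactly: 672 − 216 = 456, which is the roughly 32% reduction of the key-related registers.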

Performance Evaluation
We synthesized the designs using Synopsys Design Compiler with the NanGate 45-nm standard cell library [35]. For a component-wise comparison, we preserved the module hierarchy up to the major components. Table 6 summarizes the post-synthesis performances of each component. Our unprotected SAEAES implementation uses 4690 GE. Considering that our design has the key store (961 GE), this size is comparable to that of the previous SAEAES implementation (3502 GE [15]) that needs an external key register. Our design has some disadvantages due to the conservative design policy and the compatibility with TI: (i) the state and key arrays are larger for not using the netlist-level optimization (see Table 3) and (ii) the S-box circuit is pipelined. These disadvantages, however, are canceled out by the elimination of the 56-bit shift register (roughly 400 GE) needed in the previous design for reordering the bytes [15] (see Section 4.1). Our design has room for further optimization with a more aggressive design policy.
Our TI design uses 18,288 GE. The underlying AES implementation uses 14,256 GE, which is smaller than the previous implementation with 17.1 kGE [27]. Reducing the key-related registers from 672 to 456 bits (see Section 5.5 and Table 5) is the main reason for this improvement. The sizes of the state array and the S-box circuit are similar to the previous design.
The mode of operation uses 610 GE (= 18,288 − 2877 − 545 − 14,256) or 3.3% with TI, i.e., AES (including the key store) occupies 97% of the total area. The SAEB's minimum state size and XOR-only properties [15] contribute to this small footprint. AES-GCM, on the other hand, needs an additional 512 bits of registers corresponding to 3844 GE, which expands to 1280 bits or 9610 GE with TI (with the estimated register cost of 961/128 GE/bit based on (C2) in Table 6). We note that AES-GCM also needs independent protection for its GF($2^{128}$) multiplication [36].
The register storing the shared representation of the key occupies 5161 GE or 28% (obtained by subtracting the key-related size in the unprotected implementation (2028 = 961 + 1067) from that in the TI implementation (7189 = 2877 + 3222 + 545 + 545)). Some of the previous implementations have an unprotected key schedule by considering non-profiling attacks only [21,25,37]. If we use such an unprotected key schedule in our design, the total circuit will be roughly 13 kGE by saving 5161 GE.
Table 7 compares our design with the previous implementation of SAEB-GIFT: SAEB instantiated with the GIFT block cipher [15]. The SAEB-GIFT implementation [13] uses 6229 GE, which is roughly 1/3 of our SAEAES implementation. Since both implementations use the same mode of operation (SAEB), the difference comes from that of AES and GIFT. The key store, non-linear key schedule, and S-box are the key factors of the difference.
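The area breakdown above can be recomputed from the quoted gate counts. The sketch below uses only numbers stated in the text; the variable names and the grouping into key store and key array are for illustration.

```python
total_ti = 18288                 # whole SAEAES design with TI [GE]
aes_ti = 14256                   # AES core with TI [GE]
key_store_ti = 2877 + 545        # key store with TI [GE]
saeb_gift_ti = 6229              # SAEB-GIFT with TI [13] [GE]

# Mode of operation: what remains after AES and the key store.
mode_ti = total_ti - key_store_ti - aes_ti

# Key-related registers, unprotected versus TI.
key_unprotected = 961 + 1067
key_ti = 2877 + 3222 + 545 + 545
key_overhead = key_ti - key_unprotected

# Hypothetical variant with an unprotected key schedule.
total_unprotected_ks = total_ti - key_overhead


def percent(part, whole=total_ti):
    """Share of the total TI area in percent."""
    return 100 * part / whole


print(f"mode of operation: {mode_ti} GE ({percent(mode_ti):.1f}%)")
print(f"AES: {percent(aes_ti):.0f}%, key store: {percent(key_store_ti):.0f}%")
print(f"key sharing overhead: {key_overhead} GE ({percent(key_overhead):.0f}%)")
print(f"unprotected key schedule: {total_unprotected_ks} GE")
print(f"SAEAES / SAEB-GIFT: {total_ti / saeb_gift_ti:.2f}x")
```

The ratio to SAEB-GIFT evaluates to about 2.9, which is the "roughly three times" figure quoted throughout the paper.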

Key Store
In comparing the unprotected implementations, the key store (961 GE in Table 6) is the major reason for the SAEB-GIFT implementation (2761 GE) being smaller than ours by 1929 GE. As discussed in Section 5, the SAEAES implementation uses the key store because the reverse key schedule is expensive in AES. Meanwhile, the SAEB-GIFT implementation uses an efficient reverse key schedule [15] by exploiting the GIFT key schedule defined as a simple nibble permutation [20]. The key store becomes even larger with TI for storing the shared representation of the key.

Non-Linear Key Schedule
In contrast to the non-linear AES key schedule that needs 3 shares, we can protect the GIFT's linear key schedule with only 2 shares. This 2-share representation reduces the memory capacity for the key array. Indeed, the recent TI-friendly authenticated encryptions [13,14] exploit this linear part to improve the performance with TI. As a result, the GIFT's key array is 2410 GE with TI, while AES needs 7189 GE for storing the key ((C2), (C3), (C5), and (C8)).

S-Box
Even with an unprotected key schedule, our implementation is larger than that of SAEB-GIFT by 8084 GE. The main reason is the state size increased by the generalized changing of the guards: the need for extending the data width from 8 to 14 bits increases the memory size by a factor of 1.75 (= 14/8). In contrast, GIFT needs no such extension because the designers chose an S-box that has a uniform sharing [21]. We can reduce the memory size by using the previous non-uniform sharing [23][24][25], but we then need to implement an efficient random number generator, as discussed in Section 3.
Table 7 also shows the performances of the other authenticated encryptions: (i) the KETJE-JR implementation by Arribas et al. based on the changing of the guards sharing [9] and (ii) the implementations by Caforio et al. of several NIST LWC candidates based on GIFT-128 [10], which have circuit sizes similar to that of SAEAES. These implementations are larger because they traded the circuit area for speed: they finish a round function each cycle by using multiple S-box circuits (cf. 27 cycles/round in our implementation). This comparison gives another insight into the cost of using AES.

Conclusions
We presented the first TI of the authenticated encryption algorithm SAEAES. We used Sugawara's 3-share and uniform TI of the AES S-box [27], but completely redesigned the internal data structures (the state and key arrays) because SAEAES prefers the column-oriented serialization. We showed that our design achieves 18,288 GE with TI. Meanwhile, it is roughly three times larger than the SAEB-GIFT implementation using 6229 GE [13]. Since both implementations use the same mode of operation (SAEB), AES is responsible for the larger area: the main difference comes from the larger number of registers needed for the non-linear key schedule and the larger states extended by the generalized changing of the guards.