Analysis of Area-Efficiency vs. Unrolling for eSTREAM Hardware Portfolio Stream Ciphers

Abstract: The demand for low-resource devices has increased rapidly due to advancements in Internet-of-things applications. These devices operate in environments that have limited resources. To ensure security, stream ciphers are implemented in hardware due to their speed and simplicity. Amongst different stream ciphers, the eSTREAM ciphers stand out due to their frugal implementations. This work probes the effect of unrolling on the efficiency of eSTREAM ciphers, including Trivium, Grain (Grain 80 and Grain 128) and MICKEY (MICKEY 2.0 and MICKEY-128 2.0). It addresses the question of optimal unrolling for designing high-performance stream ciphers. The increase in area consumption is also benchmarked. The analysis is conducted to identify efficient design principles for ciphers. We experimentally show that the resulting performance after unrolling may disagree with the theoretical prediction when the effects of the technology library are considered. We report pre-layout synthesis results on 65 and 130 nm ASIC technology as well as synthesis results for a Xilinx FPGA platform in support of our claim. Based on our findings, cipher design and implementation suggestions are proposed to aid hardware designers. Furthermore, we explore why and where the area-efficiency of these ciphers saturates.


Introduction
The rise of Internet-of-things applications has increased the demand for low-resource devices. To ensure security, it is essential to protect the data generated by these devices. The devices operate in environments where resources such as energy, memory, computational power and space are very limited. Achieving a high level of security with limited resources is a challenge. When implementing a cipher in hardware, it is essential to enhance throughput and minimize area consumption while ensuring security [1]. In cryptography, symmetric ciphers are usually used to encrypt data between two parties. Symmetric ciphers can be either block or stream ciphers. In block ciphers, the message is encrypted in fixed-size blocks, whereas in stream ciphers the message is encrypted digit by digit using a pseudo-random key bitstream. On resource-constrained hardware, stream ciphers are usually preferred due to their speed and simplicity [2]. eSTREAM [3] was an EU-ECRYPT project that invited submissions of stream cipher proposals. Over four years, 7 of the 34 proposals were included in the portfolio, 4 in the software profile and 3 in the hardware profile. The hardware portfolio comprises the Grain (now the Grain family: Grain-128AEAD [4], Grain-128a [5], Grain-128 [6], and Grain v.1 [7] ciphers for constrained computing environments), MICKEY [8] and Trivium [9] stream cipher proposals. These ciphers are considered hardware-efficient due to their low cost and high performance [10][11][12]. Although the security of Trivium was criticized in [13], these ciphers are considered secure, as there are no successful cryptanalytic attacks to date on Trivium [14], MICKEY [15], and Grain [16].
As the world moves closer to the paradigm of pervasive computing, high-performance security for the growing volume of information exchange is becoming more and more challenging. This need for better performance fuels and justifies academic and industrial efforts towards the design of high-performance embedded Application Specific ICs (ASICs) dedicated to a certain cryptographic function. Physical constraints are important when designing ciphers for various computing environments. For efficient hardware implementations, loop unrolling is a micro-architectural configuration that replicates the main module of an iterative process so that multiple rounds execute in one clock cycle; the number of replicas is called the unrolling factor. As a direct consequence of unrolling, higher throughput is expected, along with an increase in the circuit area and the critical path of the design. In addition to achieving hardware efficiency, unrolling comes with an added benefit for cryptographic cores. Unrolled hardware implementations are less vulnerable to side-channel analysis (SCA) attacks, since they execute multiple rounds in every clock cycle and hence require a stronger hypothesis for cryptanalysis on multiple values due to deeper diffusion of the secret state [17].
In this paper, we analyze various possible design points in the area-performance curve for VLSI designs of stream ciphers for unrolling factors beyond what the conventional design permits. For block ciphers, unrolling simply duplicates the round hardware and the area increase is predictable. For stream ciphers, the numbers are much more interesting, since only the Boolean logic is replicated up to a certain factor. Hence, stream cipher unrolling promises a lightweight route to a performance boost. With this motivation, we took up the three hardware portfolio eSTREAM ciphers for HDL implementation, along with their modified versions. In this paper, we perform unrolling by a range of factors (L = 1 implies no unrolling, L = k implies full unrolling with loop trip count k). Starting from synthesizable Verilog HDL and verification, we carried out synthesis on various ASIC technology libraries and an FPGA platform. The aim is to analyze and benchmark resource utilization, throughput and area efficiency under unrolling in these ciphers. In the process, we encountered a series of results that are novel and unconventional.
The paper is organized as follows: Section 2 reviews existing work related to the performance analysis of hardware stream ciphers. In Section 3, the specifications and design parameters of the Trivium, Grain and MICKEY ciphers are discussed. In Section 4, the unrolled hardware implementation is presented. Section 5 discusses various aspects of imprecise modeling of the effects of unrolling, whereas Section 6 explains to designers how area-efficiency grows and saturates. Finally, Section 7 concludes the paper.

Related Work
In cryptography, block ciphers require a large gate footprint and memory for implementation. For this reason, they are not implemented on resource-constrained devices. On the other hand, stream ciphers can be used in resource-constrained environments due to their simplicity and throughput. After widely used stream ciphers were proven to be insecure, the eSTREAM project was launched by the European Network of Excellence in Cryptology [18] to create more efficient and robust stream cipher algorithms. There were three rounds, after which seven algorithms were finalized. In the hardware portfolio, three algorithms were selected: Grain [7], MICKEY [8] and Trivium [9]. The Grain family was proposed to tackle the security issues in older stream cipher algorithms. MICKEY 2.0 is designed to ensure security on resource-constrained hardware, whereas Trivium is designed to provide high speed and a low gate count.
We discuss some works relevant to the aforementioned stream ciphers. In [1], the authors discuss the efficiency of the stream ciphers Trivium, Grain and MICKEY. A comparison of these lightweight stream ciphers with block ciphers is presented. Block ciphers consume a larger footprint in Gate Equivalents (GE) than stream ciphers. For instance, the traditional Advanced Encryption Standard consumes about 2400-3500 GE, whereas the lightweight stream ciphers take approximately 2000 GE. The performance of these stream cipher algorithms is further analysed in [19] through simulations written in the Java programming language. That work suggested the adoption of Grain, MICKEY and Trivium for resource-constrained hardware. In [20], five leading Phase 2 candidates of eSTREAM Profile 2 are implemented on the Xilinx Spartan 3 family and in ASIC technology. The implementations are compared with older stream cipher algorithms based on hardware efficiency criteria. For Grain, five parallelization factors (1, 2, 4, 8, 16) are used; the basic architecture occupies 122 slices at a frequency of 193 MHz. In contrast, the 16× parallel architecture reaches a maximum throughput of 2480 with 6.97 Mbps/slice. For Trivium, seven parallelization factors are used; the basic architecture occupies 188 slices at a frequency of 201 MHz, while the 64× parallel architecture reaches a maximum throughput of 12,160 with 31.34 Mbps/slice.
In [21], six different stream cipher algorithms are implemented for comparison on the Xilinx Spartan XC3S700A-4FG484 device. The metrics used to compare the algorithms are throughput-to-area ratio and consumed area. Grain v1 uses 318 slices at a frequency of 177 MHz with a throughput-to-area ratio of 0.558 Mbps/slice, while MICKEY 2.0 uses 98 slices at 250 MHz with 2.55 Mbps/slice. The Trivium implementation uses 149 slices at 326 MHz with 2.18 Mbps/slice. However, the consumption of area resources was high for Grain v1 and Trivium. A new FPGA implementation approach is proposed in [22]. The authors implement MICKEY 2.0, Trivium and Grain v1 together with two further stream ciphers, Lizard and Plantlet, in Verilog on Xilinx's Spartan7 series. In their findings, Trivium has the highest frequency with a maximum of 416 Mbps, while MICKEY 2.0 has the second-highest frequency with 384 Mbps, both in the basic versions. In the serial version, Trivium has the smallest area consumption at 133 slices, while Grain v1 has the second smallest at 26 slices. Trivium achieved the maximum throughput-to-area ratio of 165.5 Mbps/slice in the parallel version.
In the existing literature, the performance of hardware stream ciphers is computed and analyzed in terms of throughput, area efficiency and security. To the best of our knowledge, no existing work probes the effect of loop unrolling on the hardware efficiency of the stream ciphers Grain, MICKEY and Trivium. Loop pipelining and loop unrolling are two methods that can improve hardware performance by introducing parallelism across loop iterations. In loop pipelining, operations from successive iterations are overlapped so that they execute concurrently. However, pipelining increases the complexity and cost of the hardware. Loop unrolling, on the other hand, introduces multiple copies of the loop body and adjusts the loop iteration counter accordingly. For stream ciphers, unrolling can be a promising lightweight way to boost performance. Therefore, we show the effect of unrolling on the efficiency of the eSTREAM hardware ciphers. The optimal unrolling for designing high-performance stream ciphers is discussed. The results can be used by hardware designers to identify design principles for achieving better performance in stream ciphers.

Trivium, Grain and MICKEY Specifications
In this section, we present the parameters of the Grain-80, Grain-128, MICKEY 2.0, MICKEY-128 2.0 and Trivium stream ciphers. These ciphers have two initialization phases, namely key/IV setup and randomization, followed by key-stream generation [23]. The specifications are summarized in Table 1 and are discussed in the following subsections.

Grain
The Grain [24] family of stream ciphers is designed for hardware environments where resources such as power and memory are limited, for instance, radio-frequency identification (RFID). The Grain family has 80-bit and 128-bit variants, each consisting of a linear feedback shift register (LFSR), a nonlinear feedback shift register (NFSR) and a nonlinear output function. The LFSR guarantees a minimum period for the key-stream, whereas the NFSR introduces non-linearity into the cipher. The state of the NFSR is balanced by masking the input to the NFSR with the output of the LFSR. The LFSR state is represented as $S_t = (s_t, s_{t+1}, \ldots, s_{t+79})$, whereas the NFSR state is represented by $B_t = (b_t, b_{t+1}, \ldots, b_{t+79})$. The function $H(B_t, S_t)$ produces the output key-stream bit, denoted $z_t$.

Grain-80 Design Parameters
The key size of Grain-80 [7] is 80 bits, whereas the initialization vector (IV) is 64 bits. The key and IV are loaded into the shift registers as $b_i = k_i$, $0 \le i \le 79$, and $s_i = IV_i$, $0 \le i \le 63$; the remaining bits of the LFSR are set to 1. The cipher is then clocked 160 times before it produces key-stream bits. The update functions of the LFSR and NFSR are defined in Equations (1) and (2), respectively. The Boolean function $h(x)$ is defined in Equation (3), where the variables $x_0$, $x_1$, $x_2$, $x_3$ and $x_4$ correspond to $s_{t+3}$, $s_{t+25}$, $s_{t+46}$, $s_{t+64}$ and $b_{t+63}$, respectively. The output function $H(B_t, S_t)$ is given by Equation (4), where $A = \{1, 2, 4, 10, 31, 43, 56\}$.
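The loading rule and the tap mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' HDL: the function names are hypothetical, and the algebraic form of h(x) is recalled from the Grain v1 specification [7] rather than taken from this text, so it should be checked against the specification before reuse. The 160 initialization clocks are not modelled.

```python
# Indices of the NFSR bits that are XORed into the output (set A in the text).
A = (1, 2, 4, 10, 31, 43, 56)

def load_grain80(key_bits, iv_bits):
    """Key/IV setup: b_i = k_i (0..79), s_i = IV_i (0..63), remaining s_i = 1."""
    assert len(key_bits) == 80 and len(iv_bits) == 64
    nfsr = list(key_bits)
    lfsr = list(iv_bits) + [1] * 16
    return nfsr, lfsr

def h(x0, x1, x2, x3, x4):
    """Nonlinear filter. ASSUMPTION: form recalled from the Grain v1 spec [7]."""
    return (x1 ^ x4 ^ (x0 & x3) ^ (x2 & x3) ^ (x3 & x4)
            ^ (x0 & x1 & x2) ^ (x0 & x2 & x3) ^ (x0 & x2 & x4)
            ^ (x1 & x2 & x4) ^ (x2 & x3 & x4))

def output_bit(nfsr, lfsr):
    """z_t = XOR of b_{t+k} for k in A, XOR h(s_{t+3}, s_{t+25}, s_{t+46}, s_{t+64}, b_{t+63})."""
    z = h(lfsr[3], lfsr[25], lfsr[46], lfsr[64], nfsr[63])
    for k in A:
        z ^= nfsr[k]
    return z
```

Note that the highest LFSR index read here is 64 and the highest NFSR index is 63, which is the structural property that later bounds the unrolling factor.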

Grain-128 Design Parameters
Grain-128 [25] has a key size of 128 bits, whereas the size of the IV is specified to be 96 bits. The initialization is done using $b_i = k_i$, $0 \le i \le 127$, and $s_i = IV_i$, $0 \le i \le 95$, which are loaded into the NFSR and LFSR, respectively. The update functions of the LFSR and NFSR are provided by Equations (5) and (6), respectively.

MICKEY 2.0
MICKEY [26], which stands for Mutual Irregular Clocking KEY-stream generator, is a family of stream ciphers intended for resource-constrained hardware. The aim of MICKEY is to provide a high level of security with low hardware implementation complexity. There are two variants of MICKEY 2.0, MICKEY-80 and MICKEY-128, with 80-bit and 128-bit keys, respectively. There are two registers, R and S, represented by $(r_0, \ldots, r_{99})$ and $(s_0, \ldots, s_{99})$, respectively. The registers have two modes of clocking, CLOCK_R and CLOCK_S. Clocking of the shift registers introduces pseudorandomness. Four control variables are used to update the contents of R and S in a non-linear manner. The hardware-oriented forms of the two clock modes, CLOCK_R and CLOCK_S, are given in the cipher specification.

Trivium
Trivium [27] is designed to have high speed, low gate count and reasonable implementation efficiency. Trivium consists of three shift registers of different lengths. In each round, a bit is shifted into each of the three shift registers, which together form the internal state denoted by $S = (s_1, \ldots, s_{93}, s_{94}, \ldots, s_{177}, s_{178}, \ldots, s_{288})$. The structure of the internal state of Trivium is presented in [9]. Trivium generates up to $2^{64}$ bits of key stream from an 80-bit key and an 80-bit IV. The key and IV are loaded into the shift registers, and the state is updated 1152 times before key-stream bits are generated. The pseudo-code of the key-stream generation is given in the Trivium specification [9].

This section discusses the unrolling methodology for the three stream ciphers at hand. Since MICKEY is based on Jump Registers, its unrolling is different from that of Trivium and Grain, which are based on feedback shift registers (FSRs). We take MICKEY 2.0 and Grain-80 as test examples.
To study the effects of unrolling on implementation, we started from a basic design implementation of each cipher in VHDL and then attempted unrolling. For all the designs, the natural interface resulting from unrolling is used to avoid performance imbalance. For example, the basic Trivium design has two 1-bit input pins for clock and reset, two 80-bit input pins for key and IV, one 1-bit output pin for a valid signal (indicating the beginning of key-stream generation) and one 1-bit output pin carrying the key-stream. For Trivium, we purposefully unrolled beyond the state update function (up to 128) to study the effect on throughput and area performance.

Design Synthesis Setup
Synthesis is carried out using Synopsys Design Compiler Version H-2013.03-SP1, with the compile_ultra option. The synthesis is driven by throughput maximization with the max_area constraint set to 0. The frequency is scaled up until the clock constraint can no longer be met. The area is reported in equivalent NAND gates. All CMOS synthesis reported has been performed targeting either of two Faraday UMC process technology libraries (65 nm SP/RVT Low-K or 130 nm high speed FSG).

Unrolling MICKEY 2.0
Figure 1 shows the architecture of MICKEY 2.0 with no unrolling (generating 1 bit/cycle of key-stream). R and S are so-called Jump Registers. The registers jump to new values using the jump bits (cbr, cbs). Hence, unlike conventional LFSRs, the entire contents of R and S may be updated after a clock cycle, as each bit depends on its neighboring bits for its update. For n× unrolling, we update the registers n steps ahead. Since we wish to obtain n simultaneous output bits, we look ahead for n − 1 more rounds of feedback and control bits to generate n − 1 more key-stream bits on the same clock edge. In theory, any n ≥ 1 is possible; we experimented with n = 2, 3, 4, 8. Figure 2 shows the 2× unrolled hardware implementation of MICKEY 2.0 with the critical path highlighted. R1 and S1 are temporary buffers holding the next-iteration states of R and S, respectively. Algorithm 1 gives a generic n× unrolling methodology for MICKEY 2.0. For every increment in the unroll factor, future values of cbr, cbs, fbr, fbs, ibr, ibs are generated and the original R and S registers are updated. Every iteration of the loop indexed by j in Algorithm 1 delivers a batch of n bits of key-stream, and the loop executes until the requested key-stream length l is exhausted. Each iteration of the inner loop, indexed by i, produces one bit of key-stream. Unrolling of MICKEY 128 2.0 has been explained in [28]; however, unrolling beyond a factor of 2 was not explored there due to the drop in area efficiency. Moreover, the calculation of future control and feedback bits there depended on the clocking of only a few particular bits of R and S (skipping the entire register update), which does not hold true for MICKEY 2.0.

Algorithm 1: MICKEY 2.0 n-× parallelization algorithm
Input: State R_0 and S_0 after the Preclock stage
Input: Unrolling factor n, key-stream bits l
Output: n key-stream bits per clock
Return first keybit as R_0[0] ⊕ S_0[0];
Set ibr_0, ibs_0 as 0.
for j = 1 to l/n step 1 do
    for i = 0 till ≤ (n − 1) step 1 do
        Calculate cbr_i, cbs_i, fbr_i, fbs_i for R_i, S_i as per CLOCK_KG.
        Call CLOCK_R(R_i, 0, cbr_i) and let R_{i+1} be the state of R_i after clocking.
        Call CLOCK_S(S_i, 0, cbs_i) and let S_{i+1} be the state of S_i after clocking.
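The loop structure of Algorithm 1 can be mirrored in software to check that an n× unrolled core emits the same key-stream as the rolled one, only in batches of n bits per clock. The sketch below is a behavioural model under stated assumptions: `unrolled_keystream` and `toy_clock_kg` are hypothetical names, the round function is supplied by the caller, and the toy round used here is a plain rotation that has nothing to do with the real CLOCK_KG of MICKEY 2.0.

```python
def unrolled_keystream(R, S, n, l, clock_kg):
    """Model of Algorithm 1: l key-stream bits, n bits per (modelled) clock cycle.

    clock_kg(R, S) -> (R_next, S_next) is a caller-supplied single round,
    standing in for CLOCK_KG with both input bits fixed to 0.
    """
    if l % n != 0:
        raise ValueError("l must be a multiple of the unrolling factor n")
    keystream = []
    for _ in range(l // n):              # outer loop j: one clock of the unrolled core
        batch = []
        for _ in range(n):               # inner loop i: one look-ahead round
            batch.append(R[0] ^ S[0])    # key bit is r_0 XOR s_0 of the current state
            R, S = clock_kg(R, S)        # R_{i+1}, S_{i+1} feed the next replica
        keystream.extend(batch)          # all n bits appear on the same clock edge
    return keystream

def toy_clock_kg(R, S):
    """Toy round, NOT MICKEY: rotate R by one position and S by two."""
    return R[1:] + R[:1], S[2:] + S[:2]

R0 = [1, 0, 0, 1, 1, 0, 1, 0]
S0 = [0, 1, 1, 0, 1, 0, 0, 1]
rolled = unrolled_keystream(R0, S0, 1, 8, toy_clock_kg)
unrolled = unrolled_keystream(R0, S0, 4, 8, toy_clock_kg)
```

In hardware, the inner loop becomes n chained copies of the round logic, which is why the critical path lengthens with n while the register count stays fixed.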

Unrolled Design Synthesis for MICKEY
Table 2 reports the synthesis results for the MICKEY 2.0 unrolled versions at the highest possible operating frequency for 65 nm CMOS. The results contradict the conventional wisdom that unrolling always improves design efficiency. For 2× unrolling, the Throughput Per Area Ratio (TPAR) is highest, and it drops for higher unrolling factors. For each unrolled implementation, the critical path of the design varies. Consequently, the synthesis tool maps each design to a different Boolean gate implementation. Synthesis tool heuristics are hard to model; however, we try to address this in the next section. MICKEY 128 2.0 has a larger key size and internal state in comparison. Table 3 gives the synthesis results for the MICKEY 128 2.0 unrolled versions at the highest possible operating frequency for 65 nm CMOS. Unlike MICKEY 2.0, the TPAR of MICKEY 128 2.0 shows a rising trend with the unroll factor. When we go from the rolled MICKEY to the 2× unrolled MICKEY, we find that the throughput exactly doubles, whereas the area, though it increases, less than doubles. Tables 4 and 5 give the FPGA synthesis results for MICKEY 2.0 and MICKEY 128 2.0, respectively.

Unrolling Grain-80/128
For Grain-80, parts of the LFSR and NFSR (bits 65 to 79) do not contribute to the calculation of the next state of these registers, as shown in Figure 3. Consequently, the combinational logic functions ($f(x)$, $g(x)$, $h(x)$) can be replicated L times for an L-times look-ahead, i.e., unrolling: Grain-80 ×1, ×2, ×4, ×8 and ×16 can be implemented by computing L function sets $f_L(x)$, $g_L(x)$, $h_L(x)$ instead of the single set $f(x)$, $g(x)$, $h(x)$, where L = 2, 4, 8, 16. For Grain-80, L can be taken as any factor of 160 up to 16 (where 160 is the state size of Grain-80, see Table 1). Figure 4 shows the doubling of combinational logic for the unrolled (×2) version of Grain-80. Unrolling beyond 16 requires duplicating the LFSR and NFSR registers in addition to the combinational logic functions. Consequently, no improvement in area efficiency is observed. For Grain-128 too, parts of the LFSR and NFSR (bits 96 to 127) do not contribute to the next-state calculation. Consequently, unrolling up to Grain-128 ×32 is possible (L can be any factor of 256 up to 32). Beyond L = 32, no area-efficiency gain is achieved; hence, for Grain-80/128 we discuss the unrolling results up to L = 16/32, respectively.
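The look-ahead argument above can be checked with a small behavioural model: because no function reads a bit above index 64, the k-th replica of the feedback logic can simply read its taps shifted by k, and L replicas together advance the register by L positions in one step. The sketch below does this for the LFSR only. It is an illustration under assumptions: the function names are hypothetical, and the LFSR tap positions are recalled from the Grain v1 specification [7], not stated in this text.

```python
import random

N = 80                                # Grain-80 register length
LFSR_TAPS = (0, 13, 23, 38, 51, 62)   # ASSUMPTION: taps recalled from the spec [7]
MAX_TAP = 64                          # highest index read by any function (s_{t+64} feeds h)

def clock_once(state):
    """Rolled design: one feedback bit per clock, shift by one."""
    fb = 0
    for t in LFSR_TAPS:
        fb ^= state[t]
    return state[1:] + [fb]

def clock_unrolled(state, L):
    """Unrolled design: L replicated feedback functions, shift by L per clock."""
    if MAX_TAP + L > N:
        raise ValueError("look-ahead would need a bit that is not yet in the register")
    new_bits = []
    for k in range(L):                # k-th replica reads every tap shifted by k
        fb = 0
        for t in LFSR_TAPS:
            fb ^= state[t + k]
        new_bits.append(fb)
    return state[L:] + new_bits

def matches_rolled(L, trials=20, seed=1):
    """Compare one unrolled clock against L rolled clocks on random states."""
    rng = random.Random(seed)
    for _ in range(trials):
        start = [rng.randint(0, 1) for _ in range(N)]
        rolled = start
        for _ in range(L):
            rolled = clock_once(rolled)
        if rolled != clock_unrolled(start, L):
            return False
    return True
```

With `MAX_TAP = 64` the guard admits L up to 16 and rejects L = 32, matching the limit stated for Grain-80; going further would require the extra register copies mentioned in the text.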

Unrolled Design Synthesis for Grain-80/128
Tables 6 and 7 report the synthesis results for the Grain-80 and Grain-128 unrolled versions at the highest possible operating frequency for 65 nm CMOS. The results show an improvement in TPAR as the unrolling factor is increased. Consequently, Grain-80 ×16 and Grain-128 ×32 are the most efficient unrolled versions of Grain-80 and Grain-128, respectively. A similar trend of improving area efficiency can be seen in the FPGA synthesis results in Tables 8 and 9 for Grain-80 and Grain-128, respectively.
For Trivium, we skip the details of the unrolling methodology, since it follows the same idea as that of Grain, and discuss the results directly. Tables 10 and 11 benchmark the highest possible throughput against area utilization for the 65 nm and 130 nm technology libraries, respectively.
Table 12 reports the synthesis results achieved with FPGA technology. As the target board, a Xilinx Spartan6, device XC6SLX45, is used. The Xilinx ISE synthesis tool version 14.3 is used with "balanced" as the design goal. For all the designs, a conservative clock frequency of 100 MHz was selected, which could be achieved after placement and routing. For the unrolling factor of 64, the design could not fit within the I/O bounds of the device.
For predicting the efficiency improvement corresponding to the unrolling factor, a popular approach is to count the number of Boolean functions required in the unrolled implementation. The gate count is then derived by approximating each Boolean function with an equivalent number of NAND gates. This is also presented in the Trivium specification [9] and reproduced here (Table 13) for convenience. Evidently, this approach is naive and cannot capture the effects of technology reliably. This is reflected in the experimental results presented in Tables 10-12. For each unrolling factor, we compute the predicted throughput/area by assuming a doubling of throughput. To compare how well the theoretical prediction of the area-efficiency matches the practical results, we scaled the area-efficiency of the original model in each case to a value of 1 and computed the relative area-efficiency increase for all subsequent points, indexed by i, an element in the set of unrolling factors. The resulting graph is shown in Figure 5. The dotted line indicates a relative increase of 1. As long as unrolling provides a relative increase of more than 1, it is justified for an area-efficient design. It can be observed from Figure 5 that the predicted gain in area-efficiency matches well with the FPGA target technology, whereas for the ASIC technology libraries there are deviations from the prediction. On closer study, it turns out that the theoretical prediction is too optimistic compared to the ASIC technology library and too pessimistic compared to the FPGA technology library. In the following, we attempt to explain (some of) the hidden factors that contribute to the mismatch between the theoretical model and practical results. It is worth noting that, although the unrolling factor of 64 provides the highest area-efficiency, it is not significantly different from the unrolling factor of 72 for ASIC technologies. It is, therefore, advisable to explore unrolling beyond the state update function.
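The NAND-equivalent counting approach described above can be written down as a short calculation. The sketch below is an illustration, not the paper's Table 13: the component counts (288 flip-flops, 3 AND and 11 XOR gates per key-stream bit) and the NAND equivalents (12, 1.5 and 2.5) are recalled from the Trivium specification [9] and should be treated as assumptions; throughput is assumed to scale linearly with the unrolling factor at a fixed clock.

```python
FLIP_FLOPS = 288            # Trivium state bits
AND_PER_BIT = 3             # ASSUMPTION: per key-stream bit, recalled from [9]
XOR_PER_BIT = 11            # ASSUMPTION: per key-stream bit, recalled from [9]
NAND_PER_FF = 12.0          # ASSUMPTION: NAND equivalents per cell, recalled from [9]
NAND_PER_AND = 1.5
NAND_PER_XOR = 2.5

def predicted_gates(n):
    """Predicted NAND-equivalent gate count for an n-bit-per-cycle design."""
    storage = FLIP_FLOPS * NAND_PER_FF
    logic_per_bit = AND_PER_BIT * NAND_PER_AND + XOR_PER_BIT * NAND_PER_XOR
    return storage + n * logic_per_bit

def predicted_relative_efficiency(n):
    """Predicted throughput/area at factor n, normalized to the rolled design."""
    return (n / predicted_gates(n)) / (1 / predicted_gates(1))

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, predicted_gates(n), round(predicted_relative_efficiency(n), 2))
```

Under these assumptions the storage dominates the area, so the model predicts large efficiency gains from unrolling; this is exactly the kind of technology-blind prediction that the synthesis results are compared against.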

Effect of Cell Selection
Each technology library comes with a rich set of logic gates to allow an efficient implementation. Depending on the user-driven synthesis constraints, a particular cell from the library is selected for implementing a Boolean function. This choice is driven internally by synthesis heuristics and is therefore hard to model. However, once the design is mapped to a complete gate-level implementation, it is possible to reason about why a particular cell was chosen. We provide a simple example from the 130 nm technology library, moving from the basic Trivium implementation to a quadruple-unrolled version.
For the basic Trivium implementation, only one read output from each of the state registers is needed. For unrolled versions, the number of outputs of several registers increases. This leads to the selection of a different flip-flop (F/F), as shown in Figure 6. The upper part of the figure contains a snapshot of the original Trivium implementation and the lower part shows the quadruple-unrolled version of the same. The corresponding F/Fs with one and two outputs are also shown. In this example, four storage elements, $s_{89}$, $s_{90}$, $s_{91}$ and $s_{92}$, require at least two outputs. The dual-output F/F incurs more delay and area. For the design with an unrolling factor of 64, there still remain a few state bits with a single output, e.g., $s_{192}$. On the other hand, there are storage elements with as many as five outputs, e.g., $s_{225}$.

Effect of Increased Driving Load
A logic circuit is said to have a drive strength of n× if it is able to drive an output capacitive load of $n \cdot C$ with the same rise and fall delays as a reference inverter driving a load capacitance of $C$. The same logic function in a technology library often has many implementation variants depending on the output capacitive load. This allows the synthesis tool (or physical designer) to select a small-area implementation for non-critical paths as well as to adopt multiple threshold voltages for cells with low drive strength when a low-power design is targeted. The output load depends on the load capacitance of the following logic cell as well as on the number of fanouts. For higher output loads, bigger cells need to be selected from the technology library, which incurs higher delay overheads. This effect becomes prominent with increasing unrolling factors, due to sharing of combinational logic and increased fanouts of the storage cells.

Effect of Interconnect and Buffer Insertion
In submicron process technologies, interconnect delay dominates logic/transistor delay. This problem is aggravated in designs with long interconnect wires. To the first order, interconnect delay is proportional to the square of the interconnect length. To address this issue, buffers are inserted in a wire, effectively generating multiple wire segments and obtaining a linear delay increase with increasing wire length [29]. Buffers are also inserted to address increasing fanout load. Although a pre-layout synthesis result does not accurately model the effect of buffer insertion, it is reflected in the experimental results, since the synthesis is run in topographical mode. In this mode, a virtual layout of the design is created for accurate net delay and capacitance prediction. The effect of buffer insertion is observed in the reported critical path for designs with more unrolling.
Other than these factors of imprecise modeling, the difference between ASIC and FPGA technology also contributes to the mismatch. Yet another factor is the choice of design constraints and synthesis options, which can lead to hugely varying results even if the synthesis targets the same device.
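The quadratic-versus-linear behaviour described above can be made concrete with a first-order sketch. The symbols $r$ and $c$ (wire resistance and capacitance per unit length), $\ell$ (wire length), $k$ (number of wire segments) and $t_b$ (delay of one buffer) are assumptions introduced here for illustration; buffer input capacitance and output resistance are ignored.

```latex
% Unbuffered distributed RC wire versus the same wire cut into k buffered segments
t_{\mathrm{wire}}(\ell) \approx \tfrac{1}{2}\, r c\, \ell^{2},
\qquad
t_{\mathrm{buf}}(\ell, k) \approx k \left[ \tfrac{1}{2}\, r c \left( \frac{\ell}{k} \right)^{2} + t_b \right]
  = \frac{r c\, \ell^{2}}{2k} + k\, t_b .

% Setting the derivative with respect to k to zero gives the best segment count
k^{\star} = \ell \sqrt{\frac{r c}{2\, t_b}},
\qquad
t_{\mathrm{buf}}(\ell, k^{\star}) = \ell \sqrt{2\, r c\, t_b} .
```

In this simplified model the buffered delay grows linearly with $\ell$, but the number of inserted buffers $k^{\star}$ also grows with $\ell$, so longer nets in heavily unrolled designs add both delay and area.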

Growth and Saturation of Area-Efficiency: Lessons for Designer
Physical constraints play an important role in designing ciphers today, as security concerns have become ubiquitous, ranging from packet encryption in large-scale servers to security protocols in ultralight wireless sensor nodes. The prime intention behind design unrolling is to increase the area-efficiency. Consider a cryptographic accelerator with area $A$ and a throughput of $T$. By simply deploying two independent accelerators, one doubles the area to $2A$ and hopes to get a throughput of $2T$. However, this throughput improvement is not unconditional. In particular, it is assumed that the two accelerators will be provided with independent message streams, essentially requiring more ports to the system. Even under the ideal condition, where both area and throughput double, the area-efficiency of the overall system remains $T/A$. A design that can be unrolled allows the area-efficiency to be increased beyond that of the basic design. This is a key optimization technique for area-constrained cipher implementations.

Growth of Area-Efficiency
The area-efficiency stops growing around a certain unrolling factor, observed to be between 64 and 72 for Trivium. However, and more importantly, this observation is valid only for throughput-driven synthesis. If, on the other hand, we set a low frequency constraint (e.g., for low-power applications), then unrolling beyond the state update function is easily justifiable. An experiment was performed for the Trivium implementation with the 130 nm technology. For both unrolling factors, 64 and 128, the design was synthesized with a target clock of 500 MHz. The resulting area-efficiencies, in Gbps/KGates, are 7.41 and 10.05 for the 64- and 128-unrolled designs, respectively. In other words, the area-efficiency keeps growing beyond the conventional highest unroll factor of 64 for some synthesis constraints and some target technologies. To allow unrolling beyond the state update function, the cipher needs to have a "simple" update function. For example, unrolling beyond the state update failed to increase the area-efficiency in the case of SNOW 3G, ZUC and RC4 [30].
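The two reported efficiencies can be turned into implied gate counts with a line of arithmetic. The sketch below assumes that both designs meet the 500 MHz target and deliver one key-stream bit per unrolled round per cycle, so that throughput equals the unrolling factor times the clock frequency; the variable names are illustrative and the gate counts it derives are not figures quoted in the text.

```python
CLOCK_GHZ = 0.5                       # 500 MHz target clock from the experiment
reported = {64: 7.41, 128: 10.05}     # area-efficiency in Gbps/KGates, from the text

def implied_area_kgates(unroll, efficiency):
    """Area implied by a reported efficiency.

    ASSUMPTION: throughput = unroll factor x clock frequency.
    """
    throughput_gbps = unroll * CLOCK_GHZ
    return throughput_gbps / efficiency

areas = {n: implied_area_kgates(n, e) for n, e in reported.items()}
area_growth = areas[128] / areas[64]              # how much the area grew
efficiency_gain = reported[128] / reported[64]    # how much throughput/area grew

print({n: round(a, 2) for n, a in areas.items()})
print(round(area_growth, 2), round(efficiency_gain, 2))
```

Under these assumptions, doubling the unrolling factor from 64 to 128 doubles the throughput while the area grows by less than a factor of two, which is why the area-efficiency still rises at the relaxed clock constraint.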

Saturation of Area-Efficiency
Even before the area-efficiency growth stops at a certain point, the value saturates, i.e., one obtains a diminishing return in throughput for a given area increase. This can be quantified by considering a primitive model with combinational logic area ($C$), storage area ($S$), total area ($A = C + S$) and throughput ($T$). For an unroll factor of $n$, the resulting area is $A_n = S + n \cdot C$ and the new throughput $T_n$ is $n \cdot T$. Hence, the area-efficiency for an unroll factor of $n$ is

$$\frac{T_n}{A_n} = \frac{n \cdot T}{S + n \cdot C} = \frac{T}{S} \cdot \frac{n}{1 + n \cdot \frac{C}{S}} \qquad (14)$$

Equation (14) clearly shows that unrolling leads to an area-efficiency that grows proportionally to $n$ when $\frac{C}{S} \to 0$. So, the area proportion of combinational logic in the overall cipher should be as low as possible. This effect is illustrated in Figure 7, where the different curves of absolute area-efficiency correspond to different $\frac{C}{S}$ values, ranging from 10% to 50%. For a lower value of $\frac{C}{S}$, the growth is sharper and saturates late, at a high unroll factor.
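The primitive model above is easy to tabulate. The following sketch evaluates the area-efficiency $n \cdot T / (S + n \cdot C)$ for the range of $C/S$ values mentioned in the text; normalizing $T$ and $S$ to 1 is an assumption made here for illustration, and the function names are hypothetical.

```python
def area_efficiency(n, c_over_s, t=1.0, s=1.0):
    """T_n / A_n = n*T / (S + n*C), with C expressed as c_over_s * S."""
    return n * t / (s + n * c_over_s * s)

def saturation_limit(c_over_s, t=1.0, s=1.0):
    """Value approached as n grows without bound: T / C."""
    return t / (c_over_s * s)

for ratio in (0.1, 0.2, 0.3, 0.4, 0.5):
    row = [round(area_efficiency(n, ratio), 3) for n in (1, 2, 4, 8, 16, 32, 64)]
    print(f"C/S = {ratio:.1f}:", row, "limit:", saturation_limit(ratio))
```

Each curve rises monotonically but is bounded by $T/C$, so a design with a small share of combinational logic has both a steeper initial slope and a higher ceiling.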

Conclusions
In this paper, we benchmark the VLSI performance of Trivium, Grain and MICKEY with various unroll factors on 65 nm and 130 nm CMOS technology and a Xilinx FPGA. Theoretical and practical results for unrolling are compared. The unroll factors are extended beyond the conventional maximum limits to obtain higher area efficiency. For Trivium, we compared the deviation of throughput efficiency against the theoretical prediction and explained some factors behind the mismatch. The reasons and cutoff points for the saturation of area-efficiency under unrolling are explored. We outlined how area-efficiency varies with unrolling for MICKEY, Grain and Trivium. Direct and n-way parallelized designs are implemented to illustrate the feasibility of maintaining the high-throughput, low-resource and high-security qualities of the ciphers. The efficiency of a design is estimated as the ratio of throughput to area. The results presented in this paper can be used by cryptographers to choose the most efficient design. In future work, we intend to extend this study in two major directions: first, to extend the unrolling design methodology to other prominent stream ciphers and benchmark its effect by exploring critical design points in the area vs. throughput design space; second, to support countermeasures against side-channel analysis attacks. Moreover, we also intend to explore the effect of loop pipelining on stream cipher performance.

• Faraday UMC 65 nm SP/RVT Low-K process technology library. Best case condition with 1.1 V, −40 °C parameters are assumed.
• Faraday UMC 130 nm high speed FSG process technology library. Typical case condition with 1.2 V, 25 °C parameters are assumed.

Figure 6. Effect of cell selection: multiple output.

Figure 7. Area-efficiency vs. unrolling factor for different $C/S$.