Improved Filtering Techniques for Single- and Multi-Trace Side-Channel Analysis

Abstract: Side-channel analysis (SCA) attacks constantly improve and evolve. Implementations are therefore designed to withstand strong SCA adversaries. Different side channels exhibit varying statistical characteristics of the sensed or exfiltrated leakage, and implementations embed different countermeasures. This makes it crucial to improve and adapt pre-processing and denoising techniques, as well as the ability to evaluate the adversarial best-case scenario. We address two popular SCA scenarios: (1) a single-trace context, modeling an adversary that captures only one leakage trace, and (2) a multi-trace (or statistical) scenario, which models the classical SCA context. Given that horizontal attacks, localized electromagnetic attacks and remote-SCA attacks are becoming ever more powerful, both scenarios are of interest and importance. In the single-trace context, we improve on existing Singular Spectrum Analysis (SSA) based techniques by utilizing spectral property variations over time that stem from the cryptographic implementation. By adapting overlapped-SSA and optimizing over the method parameters, we achieve a significantly shorter computation time, which is the main challenge of the SSA-based technique, and a higher information gain (in terms of the Signal-to-Noise Ratio (SNR)). In the multi-trace context, a profiling strategy is proposed to optimize a Band-Pass Filter (BPF) based on a low-computational-cost criterion, which is shown to be efficient for unprotected designs and for countermeasures with a low protection level. In addition, a slightly more computationally intensive optimized 'shaped' filter is presented that utilizes a frequency-domain SNR-based coefficient thresholding. Our experimental results exhibit significant improvements over a set of various implementations embedded with countermeasures in hardware and software platforms, corresponding to varying baseline SNR levels and statistical leakage characteristics.


Introduction
Side-channel analysis (SCA) attacks over cryptographic implementations are constantly evolving and improving. Current and future primitives and their implementations are designed to enable low-cost embedding of security mechanisms as much as possible [1,2]. However, new channels and adversarial mechanisms are constantly on the rise: for example, screaming channels [3], which target Radio Frequency (RF) ID IoT devices, and multi-SoC, network and cloud SCAs, e.g., [4][5][6][7][8][9], which are directed towards more general computing platforms, to name a few. Future lightweight and post-quantum cryptography proposals are nowhere near a solution to the side-channel issue with respect to the low electronic cost requirements of countermeasures. In particular, information extraction and exfiltration mechanisms provide leakages with different statistical characteristics. For example, the electromagnetic channel provides leakages with less algorithmic noise, which are more sensitive to specific frequency ranges. Information sniffing over multi-process computation platforms shapes different characteristics of the leakage due to
• We improve upon the SSA technique adapted to the single-trace SCA context by utilizing the variations in spectral properties over time. By implementing OVSSA [21], adapting it to the single-trace scenario and optimizing over the method parameters, we achieve not only a significantly shorter computation time, which is the main Achilles' heel of the method, but also a lower data complexity and an overall higher information gain (in terms of the Signal-to-Noise Ratio (SNR)). Concretely, the proposed technique provides a ∼5× improvement in the maximum SNR for about the same number of leakage traces (data complexity). However, the main improvement is in the pre-processing evaluation time.
The time complexity of the SSA-based pre-processing technique depends primarily on a Singular Value Decomposition (SVD), which is generally quadratic in time as a function of the number of leakage time samples, n. The time complexity of the OVSSA-based pre-processing technique depends on SVDs over chunked leakage traces (fewer samples), governed by a parameter Z, i.e., n/Z. That is, the time complexity improvement is generally $O(n/Z)$, which was shown to be significant in our experiments. Z depends on the spectral characteristics of the leakage throughout the trace, and for round-based cryptographic implementations the n/Z factor is expected to yield significant improvements.
• In the multi-trace SCA context, we devise a profiling tactic to optimize a Band-Pass Filter (BPF) based on a criterion utilizing a low-computational-cost SNR metric in Section 2.4.3. Our experiments below achieve optimal results for unprotected designs. However, as the protection level increases, the (optimized) BPF shows a significant reduction in performance that can be attributed to the different and more complex spectrum of the leakage, which requires more sophisticated filters. Therefore, we also propose an optimized shaped filter utilizing a frequency-domain SNR-based coefficient thresholding for the multi-trace scenario. The results obtained when using this filter show significant improvements over all datasets and designs, yield the highest SNR compared to all the other methods with an improvement of an order of magnitude, and reduce the data complexity by a factor of ∼2.5×, as reported in Table 1.
The rest of the paper is organized as follows. In Section 2, we present the low-computational-effort toolbox we chose to evaluate security, which is used in our proposed criterion for optimal filter design.
In Section 2.3, we describe single-trace pre-processing techniques and possible improvements by adapting OVSSA (Section 2.3.2), and in Section 2.4.3 we detail the multi-trace (statistical) pre-processing techniques, while defining our optimization criterion and the set of parameters that are optimized for band-pass filters and shaped filters. In Section 3, we present the designs and countermeasures from which we collected datasets to apply these tools. In Section 4, we describe the experimental results, and the data and complexity gains for both the single-trace and multi-trace scenarios in Sections 4.1 and 4.2, respectively. Finally, a discussion of the results along with some concluding remarks is provided in Section 5.

Tools and Theory
In this section, we first discuss our optimization criterion for optimal filter design and our evaluation setting. We then describe in detail the single-trace and multi-trace pre-processing techniques proposed in this paper.

A Simple, Computationally Attractive Optimization Criterion
In this subsection, we discuss the simple metric we use to evaluate security, namely the SNR, and present the rationale for this particular metric. We further detail the evaluation scenario, i.e., a profiling adversary, in the context of filter estimation (profiling), utilized in an attack campaign. Consider an n-bit internal variable Y manipulated by a cryptographic algorithm, and y its realization. Throughout the data manipulation by a device, information leaks via side channels and is associated with the manipulated data, as well as with other physical parameters. We assume henceforth that an adversary takes a divide-and-conquer approach over a small chunk of a secret variable of sk bits, such that sk ≤ n. Denote a leakage trace by a measurement set of T time points, $i \in \{1, \ldots, T\}$. The leakage trace, corresponding to the manipulation of y, is therefore denoted by $L = \{L_1, L_2, \ldots, L_T\}$. A matrix of these leakage traces, containing a set of measurements of several realizations y, is denoted by $\mathbf{L}$.
In what follows, we focus on a univariate analysis of the leakage distribution targeting one of the most widespread adversarial scenarios, which also enables a tractable analysis. Our main goal is to devise a simple security oriented metric that can be efficiently processed on the one hand, but on the other is sufficiently statistically robust to capture the main leakage characteristics of simple countermeasures embedded within a device. The rationale for this fast processing metric is our goal to utilize it as an optimization criterion for filters, which is why the procedure is repeated across multiple optimization parameters. Theoretically, there are scenarios which can be augmented by a multivariate analysis (or a multivariate leakage distribution over multiple leakage time samples jointly), e.g., when shuffling [22] or serial masking [23,24] countermeasures are embedded. However, in our measurement and evaluation environment, we evaluate a large set of different countermeasures that are almost all low-cost and univariate by nature. We address several basic questions concerning optimal filtering with varying leakage distributions. We also use the univariate approach in the shuffled leakages scenario, though a very complex multivariate approach could be applied [22], albeit with exponential data complexity in the number of dimensions. This is due to the fact that the univariate approach is still indicative of the level of leaked information from a shuffled design.
The SNR is a statistical measure that indicates how informative a signal is within a noisy environment: it compares the power level of a signal to the power level of the background noise; hence, the assumption is that the evaluator has access to labeled leakages. Traditionally, the SNR is defined as the ratio of the signal power to the noise power. Here we use the SNR in the side-channel sense, as first proposed by Mangard [25]. This SNR has been utilized in numerous works and aims to indicate the univariate informativeness of a leakage time sample. To do so, both the signal and noise components are estimated. The signal power is estimated by first averaging out the noise in the leakage per secret variable state (y), and then computing the variance of the resulting means over y. The noise power is estimated by first computing the level of noise (variance) in the leakage per y state, and then averaging these variances over the states. Specifically, for the t-th leakage time sample, the SNR is defined by
$$\mathrm{SNR}_t = \frac{\mathrm{Var}_y\left(\mathrm{E}[L_t \mid y]\right)}{\mathrm{E}_y\left(\mathrm{Var}[L_t \mid y]\right)}. \quad (1)$$
Of course, in practice, the true quantities in Equation (1) are unknown, so we use estimates (empirical averages and variances) to obtain an estimate of the SNR. In the evaluation cases below we target y = Sbox(x ⊕ k), where x and k are the plaintext and key, respectively, and Sbox is the substitution-box step taking place in the first round of the advanced encryption standard (AES) algorithm implemented within our encryption cores.
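The estimation procedure described above (class means followed by a variance over y for the signal, class variances followed by a mean over y for the noise) can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `snr_per_sample`, the array layout (one trace per row) and the synthetic data are assumptions made for the example, not part of the evaluated setup.

```python
import numpy as np

def snr_per_sample(traces, labels):
    """Estimate the univariate SNR at every time sample.

    traces: (N, T) array, one leakage trace per row (assumed layout).
    labels: (N,) array holding the value y manipulated in each trace.
    """
    traces = np.asarray(traces, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Per-class mean and variance of the leakage at every time sample.
    means = np.stack([traces[labels == y].mean(axis=0) for y in classes])
    variances = np.stack([traces[labels == y].var(axis=0) for y in classes])
    signal = means.var(axis=0)      # variance over y of the class means
    noise = variances.mean(axis=0)  # mean over y of the class variances
    return signal / noise

# Synthetic illustration: only time sample 3 depends on the label.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=4000)
traces = rng.normal(0.0, 1.0, size=(4000, 10))
traces[:, 3] += labels
snr = snr_per_sample(traces, labels)
print(int(np.argmax(snr)))  # index of the most informative sample
```

In this toy setting the SNR peaks at the single leaking sample, which is how the metric is used to locate points of interest.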
Note that although the univariate SNR makes use of simplifying statistical assumptions regarding the leakage distribution, it is still probably the most widespread and viable tool to identify points-of-interest (POI) in time where the manipulation of a secret variable takes place, and is also used to link higher level security estimations such as the guessing entropy (GE) or the attack success rate (SR). In addition, it typically serves as a tool for valuable speedup of security evaluations. Generally, information-theoretic-based metrics, such as the ones used in [26,27], and full distribution analysis would be more statistically correct to use. However, in the context of this particular work, they are less practical for performing fast calibrations of many filter parameters due to their high computational complexity and their underlying data complexity, which are required to fully capture the leakage distribution.

A Profiled Evaluation
Template attacks [28] are performed in two subsequent (or interleaved) phases of profiling and attack. It is assumed that the adversary has obtained one device for which the secret key can be programmed (in a controlled manner), so that the leakage can be profiled. Then, another target device is attacked, and the adversary tries to extract information on its underlying key. In the context of a multi-trace (statistical) scenario, where we aim to find a viable filter, we consider the template setting; namely, we assume the leakages are y-labeled, so that the SNR can be optimized with respect to the filter parameters.

SSA Utilized for SCA Denoising
SSA is a spectral estimation method, which is utilized in classical time series analyses, multivariate statistics, multivariate geometry, dynamic systems analysis and signal processing [29][30][31][32][33][34]. SSA can be applied to a single trace (i.e., it requires a single vector observation, denoted by one measurement here). SSA can serve to decompose a signal into meaningful components (usually divided into trends, oscillations, and noise) relying on the celebrated SVD, which can be used for denoising. In our context, SSA can be applied as a pre-processing tool to each measurement (time series measurement). Then, the estimated SNR, based on a set of individually processed measurements, can be utilized to evaluate the gain in the security context. The feature of interest in this case is the ability to reduce noise with SSA, as was successfully demonstrated in [20].
Formally, denote by $x = (x_1, \ldots, x_N)^T \in \mathbb{R}^{N \times 1}$ a time series measurement (leakage trace), and further define the matrix $\mathbf{X}$ as
$$\mathbf{X} = \begin{bmatrix} x_1 & x_2 & \cdots & x_K \\ x_2 & x_3 & \cdots & x_{K+1} \\ \vdots & \vdots & \ddots & \vdots \\ x_L & x_{L+1} & \cdots & x_N \end{bmatrix} \in \mathbb{R}^{L \times K},$$
where K defines the window length of observations in $\mathbf{X}$, allocated to each row of the matrix. From the first row, each following row is skewed by one sample. The number of rows, L, is set from N by the selection of the window size K, i.e., K = N − L + 1. $\mathbf{X}$ is a general case of a "non-square Hankel matrix". A Hankel matrix is a square matrix in which each ascending skew-diagonal from left to right is constant.
The SSA procedure follows the computation of the (unnormalized) sample covariance matrix $S = \mathbf{X}\mathbf{X}^T \in \mathbb{R}^{L \times L}$, where $(\cdot)^T$ represents the transposition, computing the (nonnegative) eigenvalues of S, denoted by $\lambda_1, \ldots, \lambda_L$, and sorting them in decreasing order, where $u_1, \ldots, u_L$ represent their corresponding eigenvectors. Then, compute $\mathbf{X}_i = u_i u_i^T \mathbf{X}$, while keeping only the subset of eigenvectors associated with the trend and oscillation components, and excluding the ones associated with the noise components. That is, one needs to choose sets of indices $[I_1, \ldots, I_m]$, corresponding to sets of eigenvalues, where $I_i$ and $I_j$ are disjoint for all $i \neq j$, and the union of all the sets $\{I_j\}_j$ covers all the components $\{\mathbf{X}_i\}_{i=1}^{L}$. These sets are then associated with the trend, oscillation and noise components. Once the sets of indices associated with noise components are chosen, one can simply exclude them and compute $\mathbf{X}' = \sum_i \mathbf{X}_{I_i}$, where the sum runs over the retained sets and $\mathbf{X}_{I} = \sum_{k \in I} \mathbf{X}_k$. This decision step is typically based on thresholding the slope of the sorted descending eigenvalues as discussed below, and is generally highly heuristic; it depends on various parameters such as the noise distribution, the characteristics of the signal, and the parameters K and N. Therefore, closed-form formulas for thresholds are challenging to achieve with high coverage.
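Finally, in order to reconstruct a denoised time-series leakage trace from the resulting $\mathbf{X}'$ matrix, a Diagonal Averaging (DA) step is utilized. If $\mathbf{X}'$ is a Hankel matrix, DA can simply average over the elements on each skew-diagonal of the matrix to form the time series. Formally, for a matrix $A \in \mathbb{R}^{L \times K}$ with entries $a_{i,j}$, the n-th sample of the reconstructed series is defined as
$$\tilde{x}_n = \frac{1}{|D_n|} \sum_{(i,j) \in D_n} a_{i,j}, \qquad D_n = \{(i,j) : i + j = n + 1,\ 1 \le i \le L,\ 1 \le j \le K\}, \qquad n = 1, \ldots, N.$$
If $\mathbf{X}'$ is not a Hankel matrix, it can be Hankelized, utilizing the Hankel transform (e.g., consider [35]).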
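The embedding, eigen-decomposition, component selection and diagonal-averaging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `ssa_denoise`, the choice of passing the retained component indices explicitly through `keep`, and the noisy-sinusoid test signal are all hypothetical and only serve to show the mechanics; the selection of components by eigenvalue-slope thresholds is not reproduced here.

```python
import numpy as np

def ssa_denoise(x, L, keep):
    """Basic SSA reconstruction of a 1-D series.

    x:    time series of length N.
    L:    number of rows of the trajectory matrix (K = N - L + 1 columns).
    keep: indices of the retained components, counted in order of
          decreasing eigenvalue (0 = largest).
    """
    x = np.asarray(x, dtype=float)
    N = x.size
    K = N - L + 1
    # Trajectory matrix: row r holds x[r], ..., x[r + K - 1].
    X = np.stack([x[r:r + K] for r in range(L)])
    S = X @ X.T                            # unnormalized covariance, L x L
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending order
    U = eigvecs[:, ::-1]                   # reorder to decreasing eigenvalues
    Uk = U[:, list(keep)]
    Xd = Uk @ (Uk.T @ X)                   # sum of u_i u_i^T X over kept i
    # Diagonal averaging: entry (r, c) contributes to output sample r + c.
    out = np.zeros(N)
    counts = np.zeros(N)
    for r in range(L):
        out[r:r + K] += Xd[r]
        counts[r:r + K] += 1
    return out / counts

# Illustration on a noisy sinusoid; a pure sinusoid occupies two components.
rng = np.random.default_rng(1)
t = np.arange(400)
clean = np.sin(2 * np.pi * t / 25)
noisy = clean + rng.normal(0.0, 0.5, size=t.size)
denoised = ssa_denoise(noisy, L=50, keep=[0, 1])
print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```

Keeping all L components returns the input unchanged, which is a convenient sanity check for the embedding and averaging steps.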
An illustrative example is shown in Figure 1. Figure 1c shows an encryption current measurement quantized with a 16-bit oscilloscope versus measurement time (in time samples). Clearly, the repetitive round-based iterations of the encryption process are visible within the leakage. Figure 1a shows the sorted descending eigenvalues of S corresponding to one leakage trace after SVD over X. Finally, Figure 1b shows the reconstructed elements associated with trends, noise and oscillations. Intuitively, rapidly changing eigenvalues correspond to trends, while slowly changing eigenvalues correspond to oscillations. However, without additional restricting assumptions regarding the signal model, it is generally unclear how to predict the decay rate (i.e., slow or fast) of the ordered eigenvalues. Therefore, there is no universal method for consistent separation of the corresponding eigenvectors, and the heuristic approaches that have been proposed are only efficient to some extent. Here, we chose to classify the eigenvectors based on their maximum discrete derivative with respect to the eigenvalue index (or slope), and suggest two heuristic thresholds, denoted by a, b ∈ R, to classify the eigenvectors, as specified in Algorithm 1.

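[Algorithm 1: classification of the SSA eigenvectors based on the eigenvalue slope, using the thresholds a and b]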
In order to find the optimal parameters {a, b}, we first compute the SNR over the traces denoised with $\mathrm{SSA}_{a,b}$, listed in the matrix $\mathbf{L}$, independently for each time instance t. We then focus on the maximal SNR, based on which the optimal parameters {a, b} are chosen. Formally, this optimization is described as
$$\{a^*, b^*\} = \arg\max_{a,b}\ \max_t\ \mathrm{SNR}\left(\mathrm{SSA}_{a,b}(\mathbf{L})(t,:),\ \mathbf{y}\right).$$
Here, the matrix $\mathbf{L}$ is the leakage matrix of all recordings corresponding to y realizations, listed in a vector $\mathbf{y}$ of labels (required for the SNR procedure). Note that optimal thresholding for eigenvalue exclusion while reconstructing signals with SVD was investigated in [36]. Though there are some similarities, SSA embedding and reconstruction differ from SVD reconstruction. In addition, the threshold presented in [36] is only optimal under certain conditions on the noise (i.e., white), the dimensions of the measurement, and the variance of the eigenvalues, and is only guaranteed asymptotically. In the SCA context, in most cases, the noise is not white, nor can we meet the rest of these conditions, especially regarding the dimensions of the measurements in the single-trace context.

We Can Do Better with OVSSA
Typically, cryptographic computations generate a repetitive structure in the leakage as a result of the periodicity of the computation (e.g., rounds in an SPN, or sponge stages and iterations in asymmetric protocols) and of the periodicity of the clock strobe. However, the spectral characteristics vary over the time sections throughout the computation. This is because the leakage at the first (last) round changes abruptly from (to) computations that are not related to encryption to (from) a periodic sequence of encryption computations. Moreover, the crucial points in time for a divide-and-conquer adversary, who aims to extract leakages prior to large key diffusion, are typically situated exactly where the spectral characteristics change rapidly, in the final (resp. initial) rounds. For example, the encryption spectrogram exhibits varying spectral characteristics over time within the leakage, as illustrated in Figure 2. Therefore, any attempt to estimate spectral characteristics from the entire signal will not robustly characterize each individual time section.
In this paper, to the best of our knowledge, OVSSA is proposed for the first time in the SCA context. OVSSA is defined with the parameters q for the overlap, Z for the computation interval, and l for the window length, where n is the number of time samples (see [21]); in our case, q = 100 and Z = 201. l is determined by $\alpha \cdot (\log(n))^c$, where α is a small constant and c ∈ [1.5, 3] (see the derivation in [37] and its use in the SCA context in [20]). The Z parameter was derived in our experiments by investigating the spectrograms (e.g., see Figure 2) and observing the rate at which the spectral characteristics changed, while q was set to be roughly Z/2 for a good overlap width. We chose c to be 1.5 to reduce the run time. Figure 2 shows that the most energized frequency components change with time, and that the change roughly does not span more than 200 time samples. This is the reason for a Z value of 201 (for example). More concretely, we specifically evaluate the argument maximizing the objective criterion for each of the hardware/software designs outlined below:
$$Z^* = \arg\max_{Z}\ \max_t\ \mathrm{SNR}\left(\mathrm{OVSSA}_{Z}(\mathbf{L})(t,:),\ \mathbf{y}\right),$$
where $\mathrm{OVSSA}_Z$ implies performing OVSSA with a computation interval of Z.
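The idea of processing a trace in overlapping intervals of Z samples can be illustrated with a short sketch. It is not a reproduction of the OVSSA procedure of [21]: the helper name `overlapped_apply`, the hop of Z − q samples and, in particular, the averaging of the overlapping regions are assumptions chosen to keep the example simple. Any single-segment denoiser (for instance an SSA routine) can be passed as `denoise`.

```python
import numpy as np

def overlapped_apply(x, denoise, Z=201, q=100):
    """Apply `denoise` to overlapping segments of x and recombine them.

    Z: segment (computation interval) length; q: overlap between
    consecutive segments. Overlapping outputs are averaged, which is an
    assumption of this sketch. The last segment may be shorter than Z,
    so `denoise` must accept segments of any length.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    step = Z - q                     # hop between segment starts
    out = np.zeros(n)
    counts = np.zeros(n)
    start = 0
    while True:
        stop = min(start + Z, n)
        out[start:stop] += denoise(x[start:stop])
        counts[start:stop] += 1
        if stop == n:
            break
        start += step
    return out / counts

# With an identity "denoiser" the recombination must return the input.
rng = np.random.default_rng(3)
trace = rng.normal(size=1000)
same = overlapped_apply(trace, lambda seg: seg)
print(np.allclose(same, trace))
```

Each segment only requires a decomposition over at most Z samples, which is where the reduction in computation time relative to a decomposition over all n samples comes from.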

Multi-Trace Evaluation Criterion in the Time Domain
Generally, it is understood that data are modulated into SCA leakage traces with carrier frequencies such as the system clock. The reason for filtering the leakage is to preserve the signal contribution from frequency ranges that are informative and to exclude all other frequency bands, which contribute noise. However, it is difficult to know the type of filter required in advance, since it depends on multiple factors such as the measurement setup, the underlying device, the complexity of the digital system and the embedded countermeasures. Typically, filters are found and reported by ad hoc experimentation (trial and error), and parameters are set without any clear selection criterion, or alternatively they are optimized by heuristic methods. The range of reported filters is large: band-pass (BP) filters were used in [12][13][14], low-pass (LP) filters were recommended in [13,15], high-pass (HP) filters in [16,17] and band-stop (BS) filters in [14].
To filter traces in the frequency domain, the discrete Fourier transform (DFT) is computed over the leakages using the fast Fourier transform (FFT) [38], an efficient algorithm for computing the DFT. The magnitude and phase of the DFT components encapsulate their contribution to the overall signal. In our context, the DFT can be used to analyze the frequency components separately and to filter out the noise-contributing ones. The FFT has been shown to be effective in overcoming different side-channel countermeasures such as phase/time randomization [39], and in reducing sampling complexity [40,41].
An example of the DFT of a standard cosine wave versus the DFT of a leakage trace of one unprotected simple AES encryption is illustrated in Figure 3a,b, respectively. Even this simple example captures the fact that most of the information is concentrated in a specific frequency band, with varying amplitudes. In the following, we show the increased complexity involved in finding good filters when countermeasures are embedded. In order to fit the best filter, one must evaluate all parameters jointly per target design, i.e., without any prior knowledge. We denote the filter parameter set by {param}, and compute the SNR of the filtered traces in the time domain for each parameter set realization. Each computation must be performed on the entire dataset (assume N traces with n time samples each). We aimed to assemble sufficiently large datasets to reach convergence in the SNR (i.e., to capture an accurate estimate of the SNR). The goal was to obtain solid statistics and be able to compare the data complexity as well as SNR levels. Generally, the complexity of the fitting operation therefore grows with the number of evaluated parameter sets, since each of them requires filtering and an SNR computation over all N × n samples.
Denote a BPF as BP(w, i), where w is the width of the filter and i is the offset (or slice), as illustrated in Figure 4a. The filter is applied in the frequency domain: iFFT(BP(w, i) · FFT(trace)). Let us now formally define the optimization procedure for a BPF utilizing the time-domain metric:
$$\mathrm{BP}^* = \arg\max_{w,i}\ \max_t\ \mathrm{SNR}\left(\mathrm{iFFT}\left(\mathrm{BP}(w,i) \cdot \mathrm{FFT}(\mathbf{L}(j,:))\right),\ \mathbf{y}\right),$$
where FFT(L(j, :)) denotes that the FFT is applied over each leakage measurement in the matrix L independently (on each row).
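The profiling procedure described above, in which every candidate band-pass filter is applied to all traces and scored by the maximum time-domain SNR, can be sketched as a plain grid search. The following is a minimal illustration under stated assumptions: the rectangular mask that keeps coefficients i to i + w − 1, the use of the one-sided real FFT, and the names `band_pass` and `fit_band_pass` are choices made for the example and are not taken from the evaluated implementation.

```python
import numpy as np

def snr(traces, labels):
    """Univariate SNR per column (class-mean variance over mean class variance)."""
    classes = np.unique(labels)
    means = np.stack([traces[labels == y].mean(axis=0) for y in classes])
    variances = np.stack([traces[labels == y].var(axis=0) for y in classes])
    return means.var(axis=0) / variances.mean(axis=0)

def band_pass(traces, w, i):
    """Keep one-sided DFT coefficients i .. i+w-1 of every trace (assumed mask)."""
    spec = np.fft.rfft(traces, axis=1)
    mask = np.zeros(spec.shape[1])
    mask[i:i + w] = 1.0
    return np.fft.irfft(spec * mask, n=traces.shape[1], axis=1)

def fit_band_pass(traces, labels, widths, offsets):
    """Exhaustive search for the (w, i) pair maximizing max_t SNR."""
    best = (-np.inf, None, None)
    for w in widths:
        for i in offsets:
            score = snr(band_pass(traces, w, i), labels).max()
            if score > best[0]:
                best = (score, w, i)
    return best

# Synthetic illustration: the label modulates a tone at DFT coefficient 20.
rng = np.random.default_rng(2)
N, n = 600, 256
labels = rng.integers(0, 4, size=N)
t = np.arange(n)
tone = np.sin(2 * np.pi * 20 * t / n)
traces = labels[:, None] * tone + rng.normal(0.0, 2.0, size=(N, n))
score, w, i = fit_band_pass(traces, labels, widths=[4, 8], offsets=range(0, 128, 4))
print(w, i, score, snr(traces, labels).max())
```

The cost of the search is one filtering pass and one SNR computation over the whole dataset per (w, i) candidate, which is why a cheap criterion matters.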

Efficiency Metrics for the Multi-Trace Context
We start by listing our two main objective functions: (1) low adversarial data complexity (Data) and (2) high leakage informativeness or information leakage (InfLk). More specifically, we are interested in assessing the asymptotic value of our information leakage evaluation metric (here, the SNR) along with the data complexity when reaching its asymptotic value. We weight both factors equally, where the cost of data complexity is evaluated by $1/N_{tr}$, and $N_{tr}$ is the number of traces required to reach the asymptotic $\max_t(\mathrm{SNR})$ value. Our combined (Data · InfLk) efficiency scalar metric, $\mathrm{Eff}_{\max}$, is therefore defined as
$$\mathrm{Eff}_{\max} \triangleq \frac{\max_t\left(\mathrm{SNR}(\mathbf{L}(t,:), \mathbf{y})\right)}{\#N_{tr}}.$$
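As a small illustration of how the two factors combine, the sketch below estimates the number of traces needed for convergence from a sequence of max-SNR estimates and then forms the efficiency ratio. The convergence rule (all later estimates within a relative tolerance `tol` of the final one), the helper names and the numbers are hypothetical and serve only to make the definition concrete.

```python
import numpy as np

def traces_to_converge(counts, values, tol=0.05):
    """Smallest trace count from which the estimate stays within tol of its final value."""
    values = np.asarray(values, dtype=float)
    final = values[-1]
    within = np.abs(values - final) <= tol * abs(final)
    for k in range(len(values)):
        if within[k:].all():         # every later estimate is also within tol
            return counts[k]

def eff_max(counts, values, tol=0.05):
    """Asymptotic max-SNR divided by the number of traces needed to reach it."""
    return values[-1] / traces_to_converge(counts, values, tol)

# Hypothetical convergence curve of max_t SNR versus number of traces.
counts = [100, 200, 400, 800]
max_snr = [0.5, 0.8, 0.98, 1.0]
print(traces_to_converge(counts, max_snr), eff_max(counts, max_snr))
```

A filter that doubles the asymptotic SNR and one that halves the number of traces to convergence are rewarded equally by this ratio, which reflects the equal weighting stated above.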
The above efficiency metric is evaluated throughout this manuscript with one exception, namely for shuffling countermeasures. Because information spreads across time samples as an outcome of instruction shuffling, instead of looking at $\max_t(\mathrm{SNR})$, we evaluate the integration of the SNR across the shuffled time span, to approximately quantify the total informativeness. More precisely, the maximum in the numerator of $\mathrm{Eff}_{\max}$ is replaced by the sum of $\mathrm{SNR}(\mathbf{L}(t,:), \mathbf{y})$ over the time samples covered by the shuffled span, where $N_S$ is a parameter for the number of shuffled clock cycles of the internal variable y.

Multi-Trace Frequency-Domain Optimization Criterion
For protected designs giving rise to signals that sparsely occupy the frequency band of interest, a simple BPF cannot pass all the dominant frequencies without transmitting much of the noise with them. For example, Figure 12 shows that in the leakage spectrograms of several countermeasures, such as dual-rail and shuffling, there is more than one dominant frequency band. Hence, a different approach is required. To this end, we directly selected the dominant frequencies by isolating the frequency coefficients and using the simple univariate SNR metric in the frequency domain to select the informative ones. This makes it possible to shape the filter (as illustrated in Figure 4b).
Formally, define $\mathrm{SNR}_c := \mathrm{SNR}_{coeff}(\mathrm{FFT}(\mathbf{L}), \mathbf{y})$, which denotes the SNR computed in the frequency domain over the DFT coefficients of the traces. We then select a subset of the DFT coefficients, denoted {c}, based on a predetermined threshold:
$$\{c\} = \left\{k : \mathrm{SNR}_c[k] \ge \eta_{th} \cdot \max_{k'} \mathrm{SNR}_c[k']\right\},$$
where $\eta_{th}$ is the threshold above which we take DFT coefficients, based on their $\mathrm{SNR}_c$ value relative to the maximal $\mathrm{SNR}_c$ value.
In this context, it is interesting to consider wavelet (and other) signal decomposition methods. The discrete wavelet transform (DWT) [15,18] decomposes a signal into scaled and shifted versions of a wavelet function. In particular, it has several meta-parameters for the decomposition, such as the number of scaled versions and shifts, and the basis functions. The DWT is associated with another step for finding effective side-channel leakage filters, known as multi-resolution analysis (MRA), which is a recursive composition of low-pass and high-pass filters. The methodologies discussed in this paper can naturally extend to other transforms and coefficient domains to filter out the noise in the transformed domain using an efficient cryptographic criterion. However, we chose not to evaluate the wavelet transform because of its very large space of associated meta-parameters. Our goal here was to target low-computational-complexity optimization and fast evaluation. For slow but more complete approaches, one can also pursue information-theoretic metrics, e.g., [26,27].
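The coefficient-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the SNR of each DFT coefficient is computed here on the coefficient magnitudes of a one-sided real FFT, which is one possible reading of the frequency-domain SNR, and the names `shaped_filter` and `eta` as well as the synthetic data are hypothetical.

```python
import numpy as np

def snr(values, labels):
    """Univariate SNR per column (class-mean variance over mean class variance)."""
    classes = np.unique(labels)
    means = np.stack([values[labels == y].mean(axis=0) for y in classes])
    variances = np.stack([values[labels == y].var(axis=0) for y in classes])
    return means.var(axis=0) / variances.mean(axis=0)

def shaped_filter(traces, labels, eta):
    """Keep only DFT coefficients whose SNR is at least eta times the maximum."""
    spec = np.fft.rfft(traces, axis=1)
    snr_c = snr(np.abs(spec), labels)      # SNR per coefficient (magnitudes: assumption)
    keep = snr_c >= eta * snr_c.max()      # boolean mask = the shaped filter
    filtered = np.fft.irfft(spec * keep, n=traces.shape[1], axis=1)
    return filtered, keep

# Synthetic illustration: the label modulates a tone at DFT coefficient 20.
rng = np.random.default_rng(2)
N, n = 600, 256
labels = rng.integers(0, 4, size=N)
t = np.arange(n)
tone = np.sin(2 * np.pi * 20 * t / n)
traces = labels[:, None] * tone + rng.normal(0.0, 2.0, size=(N, n))
filtered, keep = shaped_filter(traces, labels, eta=0.5)
print(np.flatnonzero(keep), snr(filtered, labels).max())
```

Unlike a band-pass mask, the resulting mask need not be contiguous, so several separated informative bands can be retained at once, with a single parameter to tune.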

Designs and Datasets
In the experimental section below, we evaluate several different countermeasures leading to leakages with very different statistical characteristics, frequency domain characteristics, noise levels (on both hardware (HW) and software (SW) platforms), hence with different dataset sizes. Specifically, we evaluate the leakages captured from a: The baseline (no pre-processing) SNR level, #traces available, the protection mechanisms and platforms (SW or HW) are listed in Table 2.

Experimental Results
We analyzed the information gain and the data complexity of various design implementations of AES-128, and suggest improvements to existing methods. We addressed two contexts: multi-trace and single-trace. In each context, our goal was slightly different. In the multi-trace context, we aimed to achieve the highest SNR with the given toolbox (i.e., filters), while simultaneously reducing the data complexity as much as possible. In the single-trace context, we tried to achieve the highest SNR possible while efficiently processing the data (i.e., maintaining low computational and time complexity) using only a single trace. Before discussing the experimental results, we refer the reader to Table 3 for a legend of the methods and naming conventions discussed in Section 2.

Multi-Trace
The most natural filter to utilize is a BPF, given its simplicity and widespread use in the community. Our goal was to fit the best BPF to the data. Intuitively, the leakage of information from a device should only have a few dominant frequencies, since, for instance, outputs of the Sboxes are computed once per round, and each device leakage is modulated by a certain operation frequency (i.e., the clock frequency). Note that this is true for an unprotected design, but might not be valid for protected designs with, for example, randomized phase or complex analog-nature countermeasures (e.g., consider the shuffling case in Figure 9). Our goal was to devise a rigorous procedure to evaluate various bandwidths (w) and offsets (i) empirically, and to fit a different BPF to each hardware design. Based on the optimization procedure discussed in Section 2.4.1, we evaluate BP* as shown in Figure 5.
The optimization of the threshold decision and the bandwidth and offset parameters are visualized in Figure 5: Figure 5a demonstrates that setting the threshold too low (too many frequencies) or too high (too few frequencies) results in a decrease in the SNR of the filtered frequencies. Figure 5b demonstrates that there is one zone (from 500 to 600 in the DFT coefficients, roughly) in which the informativeness (evaluated via SNR) resides, where the optimal width is located in the 'dense' region of the graph. Notice that as the bandwidth increases, the graph becomes sparser. This is due to the fact that as the bandwidth increases, there are fewer frames. The parameter optimization results (CMOS) were: bandwidth (w) = 56 and offset (i) = 588 DFT coefficients, corresponding to 40 and 420 MHz, respectively.
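The two reported parameter values imply the same frequency resolution per DFT coefficient, which gives a quick consistency check and lets the informative zone quoted above be read in MHz. The worked arithmetic below uses only the numbers stated in the text; the symbol $\Delta f$ is introduced here for illustration.

```latex
\Delta f = \frac{40\ \text{MHz}}{56} = \frac{420\ \text{MHz}}{588} = \frac{5}{7}\ \text{MHz} \approx 0.714\ \text{MHz per DFT coefficient},
\qquad
500 \cdot \Delta f \approx 357\ \text{MHz}, \quad 600 \cdot \Delta f \approx 429\ \text{MHz}.
```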
Although the initial results were good, in the sense that the optimized BPF showed significant gains over the raw leakages, we predicted that the optimized BPF approach would not adapt well to more protected designs. Therefore, instead of trying to fit the best BPF, we turned to the shaped filter approach, which selectively keeps the informative frequency coefficients by using a metric to evaluate them, as discussed above.
For this purpose, a thresholding optimization procedure was defined in Section 2.4.1 to determine whether to include a certain frequency coefficient by evaluating its magnitude against the maximum value across all coefficients. The filter obtained by choosing all informative frequencies forms a custom filter tailored to each of the designs, data sets and embedded countermeasures: a shaped filter (see Figure 6). This step had a very significant impact on both the data complexity and the extracted signal level. With both methods in mind, we tested various designs and implementations. In Figure 6, the first row of sub-figures shows a visual representation of the filters, and the second row shows the corresponding SNR of the optimal filters applied to the data sets; in other words, the optimal BP and the optimal shaped filter, as defined in Equations (6) and (9), respectively. As can be seen, the optimal BP-filter worked well for unprotected designs (e.g., unprotected CMOS), but failed to keep up as the design became more protected, e.g., for dual-rail and shuffling. The shape of the SNR c filter is rather interesting: for unprotected CMOS, we only observed one frequency cluster, matching the optimal BP-Filter. For dual-rail, there were at least two main clusters, which are due to the fact that this design leaks at two carrier frequencies relating to the complete precharge phase (Return To Zero, RTZ) and the evaluation phase. For the exemplary Shuffling rp 2 test case, there are a few dominant frequencies, one of which is the clock of the micro-controller, while the others are a direct result of the shuffling operation: the Sbox computation can occur at two points in time, hence at several different frequencies. We next made one apples-to-apples comparison between the unprotected CMOS and dual-rail designs: as shown in Figure 7, the shaped SNR-based filtering method (top figure) kept providing good filtering for both the CMOS and dual-rail designs.
However, the optimal BP-Filter (bottom figure) failed to keep up with the dual-rail design in terms of SNR, yielding approximately a factor of 2 between the filters and also a change in the required data complexity, as will be detailed next. Note that the results of the dual-rail design were scaled by a factor of 10 to visualize them on the same plot as the results of the CMOS design (as noted in the figures).
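The shaped-filter idea described above (keep a frequency coefficient only if its SNR is large relative to the maximum over all coefficients) can be illustrated with a minimal NumPy sketch. This is not the exact procedure of Section 2.4.1 or Equation (9): the function names, the use of FFT magnitudes, the variance-of-means over mean-of-variances SNR estimator and the `rel_threshold` parameter are all assumptions made for illustration.

```python
import numpy as np

def frequency_snr(traces, labels):
    """Per-frequency-bin SNR estimate (assumed form): variance of the
    class means divided by the mean of the class variances, computed on
    FFT magnitudes of the traces (rows = traces, columns = time samples)."""
    spectra = np.abs(np.fft.rfft(traces, axis=1))
    classes = np.unique(labels)
    means = np.array([spectra[labels == c].mean(axis=0) for c in classes])
    variances = np.array([spectra[labels == c].var(axis=0) for c in classes])
    # Tiny constant avoids division by zero for noise-free bins.
    return means.var(axis=0) / (variances.mean(axis=0) + 1e-30)

def shaped_filter_mask(snr, rel_threshold=0.1):
    """Keep bins whose SNR is at least rel_threshold times the maximum.
    rel_threshold is a hypothetical tuning parameter."""
    return snr >= rel_threshold * snr.max()

def apply_mask(traces, mask):
    """Zero the discarded bins and return the filtered time-domain traces."""
    spectrum = np.fft.rfft(traces, axis=1)
    return np.fft.irfft(spectrum * mask, n=traces.shape[1], axis=1)
```

In a profiling setting, `frequency_snr` and `shaped_filter_mask` would be run once on labelled traces, and `apply_mask` would then be applied to the attack traces. A band-pass filter corresponds to the special case in which the mask is a single contiguous run of bins.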
To summarize, to assess the efficiency of the devised filters as compared to the raw (unfiltered) results, we evaluated the rate of convergence (ROC) of our metrics as a function of the number of samples, as shown in Figure 8. From left to right, the sub-figures relate to the CMOS, dual-rail and Shuffling rp 8 designs. The vertical grey lines indicate the approximate data complexity required for convergence. It is clear that, except for the CMOS design, the SNR c filter achieved the highest SNR (and even for CMOS, the difference is negligible: 0.001, or approximately 5%). Nevertheless, the BP-Filter always yielded a considerable improvement over the unfiltered data, though for the protected designs it was not as good as the SNR c filter. Although the difference between the BP-Filter and the SNR c filter seems small, owing to the log-scale of the figures, the numbers indicate a considerable factor of 2-3×. While the Shuffling data sets we used in the manuscript were not large enough for full convergence, we can still observe two phenomena: (a) since the implementation is in software, the SNR is of a much higher order of magnitude and the required data complexity is much lower, so attacks are very viable; (b) there is already a significant difference in the informativeness metric between the various methods.
All of the above illustrates that the filters not only provide a higher SNR, but also reduce the data complexity required to perform an attack. Recall that since both the optimized band-pass and the shaped filters are optimized versions proposed in this manuscript, the fair comparison is against the raw traces (No Filter).

A Shuffled Software Example
We next compared software implementations of instruction shuffling to demonstrate the effect of different pre-processing techniques on different countermeasures, and specifically on shuffling implementations with different levels of protection. Below, unshuffled refers to the basic unprotected software implementation, and shuffled rp i to the case where the computation order of i Sboxes is permuted (in sets of 16). We acknowledge that we cannot directly compare the following results to the hardware implementations due to the clear baseline SNR difference in the software (SW) case. However, the results still reveal important insights regarding pre-processing techniques. Thus, we only compare the different shuffling implementations to each other and to the unshuffled implementation. As can be seen in Figure 9, there was already a notable reduction in the univariate SNR for all of the shuffling methods. Furthermore, as the number of shuffled instructions increased (time variance), the SNR decreased and the number of peaks increased (i.e., the information is spread out over time). This observation led us to define a different efficiency criterion, as shown in Equation (8).
Figure 10 shows the effect of different filters on the shuffled implementations. Note that in order to fit all the plots on the same graph, the SNR of rp 4 and rp 8 were scaled: as noted on the graph, rp 4 was scaled by a factor of 2, and rp 8 by a factor of 4. This agrees nicely with theory: since the probability that the computation of the target Sbox takes place at the i-th round is uniform, the SNR levels decrease linearly with the number of shuffled instructions.
The top sub-figure in Figure 10 shows the SNR levels of the SNR c filter applied to the shuffling implementations. It is clear that as the random permutation size increases, so does the span of time samples containing useful information (i.e., for rp 2 from 500 to 600, for rp 4 from 500 to 700, and for rp 8 from 500 to 800). We can further see that the number of peaks increases with the number of shuffled instructions (two distinct peaks for rp 2, four peaks for rp 4, and at least four peaks for rp 8, with some aliasing due to the leakage measurement impedance).
The middle sub-figure in Figure 10 shows the SNR levels of the optimal BP-Filter applied to the shuffling implementations. It is evident that this filter failed to deliver the same SNR levels as the SNR c filter. Moreover, compared to the SNR c filter, the optimal BPF smooths out the gains. This was more pronounced in the SNR of rp 8 and rp 2. The smoothing effect reduced the time span in which distinct SNR levels appeared; in turn, it reduced the overall informativeness (under our criterion in Equation (8)). The bottom plot in Figure 10 shows the SNR levels of a univariate feature selection in the frequency domain applied to the shuffling implementations. This feature selection also creates a custom filter, but it too failed to keep up with the SNR levels of the SNR c filter. We elaborate further on feature selection techniques in Section 4.1.2.
Figure 11 shows the rate of convergence of the different filters compared to the baseline SNR. We used the Eff int metric. The figures show $\int_0^{N_S \cdot T} \mathrm{SNR}(\cdot)\,dt$, denoted in abbreviated form by SNR. The Eff int metric weighs not only the SNR levels but also the number of contributing time samples, in contrast to $\max_t(\mathrm{SNR})$, which considers a single time point. Note that, as discussed above, the data set was not large enough for full convergence, but it was still large enough to show considerable differences between the methods. To the right of each plot in the figure is a 'zoom in' view that visualizes the difference between the methods more clearly. In all cases, the filters were considerably better than the baseline SNR, and as the permutation size increased, the amount of information extracted by the optimal BP-Filter decreased. In contrast, the SNR c filters remained more efficient.
Another advantage of using the Eff int metric and showing $\int_0^{N_S \cdot T} \mathrm{SNR}(\cdot)\,dt$ rather than $\max_t(\mathrm{SNR})$ is that the integral of the baseline SNR over time remains the same, independently of the permutation size. This implies that the overall information leakage is preserved. Therefore, it is possible to compare the results of filters across rp i , i ∈ {2, 4, 8}.
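The difference between an integrated SNR metric and a peak SNR metric can be made concrete with a small sketch. The code below is only an illustration under assumptions: the function names are hypothetical, the integral is approximated by a plain Riemann sum with sample spacing `dt`, and the two SNR curves are synthetic toy inputs, not measured data. Equation (8) itself is not reproduced here.

```python
import numpy as np

def integrated_snr(snr, dt=1.0):
    """Riemann-sum approximation of the integral of SNR(t) over the trace."""
    return float(np.sum(snr) * dt)

def peak_snr(snr):
    """Single-point metric: the maximum SNR over all time samples."""
    return float(np.max(snr))

# Toy curves: the same total leakage concentrated in one time sample,
# or spread evenly over two time samples (as shuffling would do).
one_peak = np.zeros(100)
one_peak[50] = 0.08
two_peaks = np.zeros(100)
two_peaks[[40, 60]] = 0.04
```

For these toy inputs the peak metric halves when the leakage is spread over two positions, whereas the integrated metric is unchanged, which is the property that makes comparisons across permutation sizes meaningful.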
For rp 2 (top sub-figure), the SNR c filter yielded 1.878× compared to 1.798× for the optimal BP-Filter, where 1× denotes no pre-processing. For rp 4 , the SNR c filter yielded 1.602× compared to 1.264× for the optimal BP-Filter, and for rp 8 , the SNR c filter yielded 1.5161× compared to 1.332× for the optimal BP-Filter. Thus, there was a considerable difference between the SNR c filter and the optimal BPFs. Interestingly, the optimal filter gain was roughly 2× for the shuffling flavors (SW), but roughly 10× for the hardware implementations. This stems from the overall lower levels of noise present in software implementations; i.e., the potential to filter out noise is much lower in software, because the signal is already very large.

Multi-Trace: A Cautionary Note on Feature-Selection Tools
It is interesting to compare against popular filters from artificial intelligence (AI)/machine learning (ML) tools, since they have become highly popular for pre-processing and SCA attacks (in a general context) [42][43][44][45]. For that purpose, we implemented various feature selection (FS) tools and applied them in the frequency domain, over the FFT of the leakages, to find the best shaped filter. The two main observations deriving from this analysis are:
• As illustrated in the bottom plot in Figure 10, FS with complex statistical tools such as Mutual Information (MI) exhibits extremely poor results. For information-theoretic tools to function properly, the distribution of the leakage needs to be decently captured, which implies a large observation space; i.e., statistically, the full distribution is badly characterized and the filter is far from converging.
• Statistically simpler FS tools were also attempted, such as the Pearson correlation (ρ), to filter frequency coefficients. The experiments showed that it performed quite similarly to our SNR-based criterion. However, its results were consistently slightly poorer, since the correlation, unlike the SNR, is not scaled to the noise.
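As an illustration of the simpler of the two feature-selection approaches, the sketch below scores each frequency-domain feature by its Pearson correlation with a leakage model and keeps the highest-scoring ones. It is a generic example under assumptions, not the exact tool used in the experiments: the function names, the relative threshold and the choice of leakage model are hypothetical.

```python
import numpy as np

def pearson_per_feature(features, model):
    """Pearson correlation between a leakage model (one value per trace)
    and each feature column (e.g., FFT magnitude per frequency bin).
    Columns with zero variance would yield NaN."""
    f = features - features.mean(axis=0)
    m = model - model.mean()
    numerator = m @ f
    denominator = np.sqrt((m ** 2).sum() * (f ** 2).sum(axis=0))
    return numerator / denominator

def select_by_correlation(features, model, rel_threshold=0.5):
    """Boolean mask of features whose |rho| is at least rel_threshold
    times the largest |rho| (rel_threshold is a hypothetical parameter)."""
    score = np.abs(pearson_per_feature(features, model))
    return score >= rel_threshold * score.max()
```

Unlike an SNR-based score, ρ measures only the linear relationship with the chosen model, so a bin that leaks in a way the model does not describe would be scored low.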

Single-Trace
As discussed above, a spectrogram provides considerable information relating to the type of filter required in the multi-trace context, but also in terms of the Z parameter and the spectral characteristics that change over time, which is important for OVSSA in the single-trace context. Figure 12 illustrates the spectrogram of an exemplary trace taken from the unprotected CMOS rolled implementation (left), the dual-rail implementation (middle) and the rp 2 shuffling SW implementation (right).
These spectrograms can provide good intuition as to whether a certain filter will work well or poorly. For instance, shuffling-rp 2 demonstrates at least two dominant frequencies at any given time sample throughout the trace. The unprotected CMOS implementation exhibits only one dominant band, which is why, intuitively, BPFs will work well for unprotected CMOS, but not for shuffling. The dual-rail design demonstrates leakages at a lower frequency than the CMOS design (since the precharge and evaluation phases generate a larger effective clock period). However, for all designs, the X-axis (time) shows that the spectrum characteristics change over time, as discussed in Section 2.3.2.
Overall, in this manuscript, we processed 1 × 10^6 traces with SSA/OVSSA, where the goal was to capture enough cleaned traces to evaluate the SNR of each technique as a comparison metric. The processing took ∼5 months on a very powerful server with 50 machines, each multi-threaded over 20 processors. Each SSA computation took about 30 s, compared to 10 s for OVSSA. Figure 13 shows two exemplary pre-processed traces; as can be seen, the traces resulting from the OVSSA and SSA processing differ significantly.
Figure 14 shows an exemplary SNR following SSA and OVSSA (with 0.5 × 10^6 traces) in the top-left sub-plot. Clearly, OVSSA generated larger amplitudes but, perhaps more importantly, the noise was significantly reduced. In this example, the SNR was computed with y labeling following the first Sbox layer in an AES round; therefore, SNR peaks appear closer to the beginning of the trace. However, in later rounds we observed that the SNR of the SSA still exhibited small SNR peaks where the y-classification was clearly wrong (diffusion); OVSSA, however, cleans these regions nicely, as its SNR indicates.
The top-right sub-plot in Figure 14 shows the SNR convergence with the number of traces for the unprocessed (baseline), SSA-processed and OVSSA-processed traces. The plot shows that 0.2-0.3 × 10^6 traces are sufficient for convergence. The SNR of all pre-processing techniques converged with similar data complexity. However, the maximum SNR achieved with SSA was ∼2.5× that of the unprocessed traces, while OVSSA achieved a ∼5× improvement over the unprocessed traces.
The last important point, as mentioned in Sections 1 and 2.3.2, relates to the pre-processing time complexity. The bottom sub-plot in Figure 14 shows the time required to process the data set as a function of the data set size in samples. As both SSA and OVSSA work on a single trace at a time, both graphs are linear. Clearly, OVSSA had a much lower time complexity, with a respective gain of ∼4× over SSA (discussed next), because OVSSA works on small segments of the trace, while SSA works on the entire trace at once. All in all, OVSSA was more efficient than SSA both in time complexity and in information extraction (SNR).
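The structural difference between processing a whole trace and processing short segments can be sketched as follows. This is a minimal textbook-style SSA (Hankel embedding, truncated SVD, diagonal averaging) together with a simplified segment-wise variant that averages overlapping outputs. It is not the OVSSA procedure of Section 2.3.2: the function names, the `window`, `rank`, `segment_len` and `hop` parameters and the overlap-averaging rule are assumptions made for illustration.

```python
import numpy as np

def ssa_denoise(trace, window, rank):
    """Basic SSA: embed the trace in a Hankel (trajectory) matrix, keep
    the leading `rank` singular components, then average anti-diagonals."""
    trace = np.asarray(trace, dtype=float)
    n = trace.size
    k = n - window + 1
    # idx[i, j] = i + j, so column j holds trace[j : j + window].
    idx = np.arange(window)[:, None] + np.arange(k)[None, :]
    u, s, vt = np.linalg.svd(trace[idx], full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # Diagonal averaging: every entry (i, j) contributes to sample i + j.
    out = np.zeros(n)
    counts = np.zeros(n)
    np.add.at(out, idx, low_rank)
    np.add.at(counts, idx, 1.0)
    return out / counts

def segmented_ssa_denoise(trace, segment_len, hop, window, rank):
    """Simplified segment-wise SSA: run SSA on overlapping segments and
    average the overlapping outputs (assumes window <= segment_len <= n)."""
    trace = np.asarray(trace, dtype=float)
    n = trace.size
    starts = list(range(0, n - segment_len + 1, hop))
    if starts[-1] + segment_len < n:
        starts.append(n - segment_len)  # make sure the tail is covered
    out = np.zeros(n)
    weight = np.zeros(n)
    for start in starts:
        stop = start + segment_len
        out[start:stop] += ssa_denoise(trace[start:stop], window, rank)
        weight[start:stop] += 1.0
    return out / weight
```

Because each SVD in the segment-wise variant operates on a much smaller trajectory matrix, and because `rank` and `window` could in principle be chosen per segment, such a scheme can follow spectral characteristics that change along the trace.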
Concretely, the proposed technique provides a ∼2× maximum SNR improvement for about the same number of leakage traces (data complexity), as compared to the SSA-based approach. However, the main improvement is in the pre-processing evaluation time. The time complexity of the SSA-based pre-processing technique depends on the SVD, which is generally quadratic in time as a function of the number of leakage time samples, n. The time complexity of the OVSSA-based pre-processing technique depends on SVDs over chunked leakage traces (fewer samples), with a parameter Z; i.e., n/Z chunks. That is, the time complexity improvement is generally O(n/Z), which was shown to be very significant in our experiments, although we failed to fully achieve the suggested gain of 1000/201 ≈ 5. This is due to software implementation variations, server computing power variations and neglected arithmetic. As discussed above, Z depends on the spectral characteristics of the leakage over time, and for round-based cryptographic implementations the n/Z factor is expected to yield significant improvements, as exemplified.
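The O(n/Z) factor follows from a short calculation. The derivation below takes the text's own premises at face value (SVD cost quadratic in the number of samples, and n/Z chunks of Z samples each); ignoring the extra work caused by the overlap between chunks is an additional simplifying assumption, and the symbols T_SSA and T_OVSSA are introduced here only for illustration.

```latex
T_{\mathrm{SSA}}(n) \propto n^{2},
\qquad
T_{\mathrm{OVSSA}}(n, Z) \propto \frac{n}{Z} \cdot Z^{2} = nZ,
\qquad
\frac{T_{\mathrm{SSA}}(n)}{T_{\mathrm{OVSSA}}(n, Z)} \approx \frac{n}{Z}
= \frac{1000}{201} \approx 4.98 .
```

With n = 1000 and Z = 201 this gives the suggested gain of about 5, against the measured gain of ∼4× reported above.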

Conclusions
In this paper, we presented several advances in two SCA contexts. In the single-trace context, we improved upon existing SSA-based techniques by exploiting informative variations in spectral properties over time, stemming from the cryptographic implementation. By adapting overlapped-SSA and further optimizing over subsequent processing-related key parameters, we achieved a significant gain compared to SSA and to no pre-processing, both in terms of (shorter) computation time and (higher) information gain, i.e., SNR. In the multi-trace context, we proposed a profiling strategy for BPF optimization based on a computationally attractive objective function, which was shown to be efficient for unprotected implementations and, albeit with reduced effectiveness, for weakly protected ones. In addition, we proposed a differently optimized filter, exploiting frequency-domain SNR-based coefficient thresholding, which is slightly more computationally demanding. The simulation results of our extensive empirical examination show a significant performance improvement over a set of implementations embedded with countermeasures, on both hardware and software platforms. In all the implementations and platforms we considered, our proposed shaped filter achieved the highest SNR levels, while the BPF-based filters suffered from reduced effectiveness when applied to embedded countermeasures of varying protection levels.

Data Availability Statement:
The data presented in this study are available in the article.

Conflicts of Interest:
The authors declare no conflict of interest.