An Automated End-to-End Side Channel Analysis Based on Probabilistic Model

Abstract: In this paper, we propose a new automated way to find the secret exponent from a single power trace. We segment the power trace into subsignals that are directly related to recovering the secret exponent. Unlike previous approaches, the proposed method needs no sliding reference window, no templates and no correlation coefficients. Our method detects change points in the power trace to locate the operations and is robust to unexpected noise addition. We first model a change point detection problem to capture the subsignals irrelevant to the secret and solve it with Markov Chain Monte Carlo (MCMC), which yields a globally optimal solution. After separating the relevant and irrelevant parts of the signal, we extract features from the segments and group the segments into clusters to find the key exponent. Using a single power trace corresponds to the weakest attacker model, in which there is very little chance of acquiring as many power traces as needed to break the key. We empirically show improved accuracy even in the presence of a high level of noise.


Introduction
Many side channel analysis attacks have succeeded in recovering secret keys by analyzing power traces generated from devices. However, these attacks come with many assumptions and limitations. Some of them require as many power traces as needed to recover the secret. Others recover the secret from the overall power trace(s) but not the exact locations on the trace from which each bit of the secret was recovered.
One of the well known approaches is to find a reference window and apply a peak detecting algorithm [1,2]. However, the success of this approach heavily depends on the selection of a "good" reference window and on the performance of the peak detecting algorithm, since we cannot search all windows in polynomial time. Therefore, we need an automated approach that is feasible in polynomial time and does not require human intervention to succeed. Side channel analysis with machine learning has gained much interest [3-6].
Our work suggests a new approach to finding keys in a Bayesian manner. Our contributions to side channel analysis and signal processing are as follows:

• We exploit only one single power trace and recover the secret.

• We suggest methods to compute the probability of the locations from which the secret came and to find globally optimal solutions in a Monte Carlo manner.

• We suggest a methodology that is more robust than ad-hoc attacks in the presence of noise.
In correlation analysis [2,7-11], a set of power traces is used to find a correlation between the power trace and the key guess. Our method assumes a weaker attacker, whose chance of acquiring more than one power trace is slight. Analyses with clustering methods [12-14] have also been studied. Recently, horizontal attacks [7] have succeeded with clustering algorithms [15,16]. These attacks exploiting clustering algorithms and horizontal attacks have so far fixed the dimension (length) of each segment/subsignal of the power trace to trace_length/key_bits. However, when treating time series, the dimension of data segments has to be chosen with care: there is no guarantee that operations are executed with an exact period of trace_length/key_bits. Our work does not assume that the operations are executed periodically but rather estimates the start and end points of the operation executions. Moreover, we suggest a method to extract features from time series segments of different lengths.

Notations
• y_t: the t-th signal point (time location). The total signal length is T, so we write y_{1:T} = (y_1, y_2, ..., y_T).
• r_t: the t-th random variable, which indicates whether the t-th point is a change point or not. That is, r_t = 1 if t is a change point and 0 otherwise.
• K_r: the number of change points. That is, K_r = Σ_{t=1}^T r_t.
• τ_k: the k-th change point. As K_r is the total number of change points, 1 ≤ k ≤ K_r, and trivially τ_1 = 1 and τ_{K_r+1} = T + 1.
• D: the dimension of a feature.
• φ_k: the feature extracted from the k-th segment, with dimension D.
• c_k: the cluster assigned to the k-th segment.

Problem Definition
Our goal is to divide the power trace into operation-relevant segments and assign a cluster to each segment so that we can figure out which operation was executed. Figure 1 shows the whole process. We define each step separately, rather than one global model that both segments the time series and searches for clusters (i.e., some function h(·) with c* = h(y_{1:T})), due to the high complexity of such a model, if it exists.

Change Point Detection
We have found from the power trace that there is an idle period, a piecewise constant, between the operations of binary exponentiation, square and multiplication. Exploring this period is the most important part of the whole work, since only segments divided at the exact locations of operations will have similar patterns. We model this as a change point detection problem with an unknown number of change points [17]; however, we do not adopt reversible jump MCMC [18,19]. In this subproblem, which embeds the change point detection algorithm, we only find the idle periods, producing incomplete segments. We then build complete (operation-relevant) segments by combining the incomplete segments in Section 2.2.2. The reason for dividing the work into two steps is that we have no information about the key or the shape of the power trace of each operation, and a change point detection problem with an unknown number of change points is already complex enough without also modelling constants and unknown shapes in the same model. For detecting piecewise constants, we can define by Bayes' theorem the posterior distribution of the change points r_{1:T}, p(r_{1:T} | y_{1:T}, θ), and our goal is to infer r_{1:T} from this posterior.

Merging Segments
As mentioned above, in this subproblem we merge incomplete segments into complete segments whose start and end points indicate the start and end points of an operation. The goal here is building the merging function g(·) with r^M_{1:T} = g(r_{1:T}, y_{1:T}).
Only merging, not splitting, is allowed in g(·), so K_{r^M} ≤ K_r and {τ^M_k | 1 ≤ k ≤ K_{r^M}} ⊆ {τ_k | 1 ≤ k ≤ K_r} hold.

Extracting Features from Time Series Segments
Previous correlation power analysis attacks have used power traces, or parts of power traces, cut to the same length. However, the segments are not guaranteed to have the same length. Therefore, the model or algorithm for extracting information relevant to each operation must be able to treat subsignals of different lengths. In this part, given segments of different lengths, we extract the features that form the input to the clustering part, which needs a fixed dimension.
We collect the features into a K_r-by-D matrix Φ = [φ_1, ..., φ_{K_r}]^T, where each row φ_k = [φ_{(k,1)}, ..., φ_{(k,D)}]^T is the feature vector of the k-th segment. The goal of this subproblem is building the feature extractor f(·), which must cope with segments of different lengths.

Clustering Features
The last step is clustering all the segments using the extracted features. Clustering is an unsupervised machine learning approach that assigns each data point to a cluster based on similarity (or distance). After all the segments are assigned clusters, we can find the key exponent. Our goal is identifying three clusters of features. In general, if the number of clusters is not known, it must be optimized; from this point of view, the number of clusters should be chosen carefully to obtain highly accurate performance, but we can simply fix it at three, since there are only three patterns we look for: square, multiplication and the idle period between operations. Once we identify the clusters, deciding which operation or period corresponds to each cluster becomes trivial.

Preprocessing
There are many countermeasures to side channel analysis, including random noise addition. To deal with random noise addition, we subsample each power trace with a median filter, which decreases the effect of noise while reflecting the trend of the power signal. We use a stride equal to the window length, which reduces the length of the power trace to analyze. Since the magnitude of the power trace is a positive value, we use the absolute value of the power trace: the i-th filtered point is the median of |y| over the i-th window, for 1 ≤ i ≤ ⌊N_P/N_w⌋, where N_P is the length of the original power trace and N_w is the window size. Figure 2 shows the effect of preprocessing the power traces.
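The preprocessing step above can be sketched as follows. This is a minimal sketch: the function and parameter names are ours, and we assume the absolute value is taken before filtering.

```python
import numpy as np

def preprocess(trace, n_w=1000):
    """Downsample a power trace with a median filter whose stride
    equals its window length n_w, after taking absolute values."""
    trace = np.abs(np.asarray(trace, dtype=float))
    n = (len(trace) // n_w) * n_w            # drop the ragged tail
    windows = trace[:n].reshape(-1, n_w)     # one row per window
    return np.median(windows, axis=1)        # length floor(N_P / N_w)
```

With N_w = 1000, a trace of N_P = 1,657,401 points (the length used in the experiments) shrinks to 1657 points.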

Posterior Distribution
As mentioned above, the model here detects the piecewise constant part from the time series.
The observation model is y_t = m_k + ε_t for τ_k ≤ t ≤ τ_{k+1} − 1, where the random variable m_k ∼ N(µ, V) is the mean of the time series between τ_k and τ_{k+1} − 1 and ε_t ∼ N(0, σ²). The likelihood that y_{1:T} is observed given r_{1:T}, m_{1:K_r} and θ is p(y_{1:T} | r_{1:T}, m_{1:K_r}, θ). By conjugacy of the exponential family [20], the posterior distribution of m_{1:K_r} is again Gaussian. If we plug the probability of m_{1:K_r} into Equation (10), we get by identification p(m_{1:K_r} | y_{1:T}, r_{1:T}, θ) × p(y_{1:T} | r_{1:T}, θ) = p(y_{1:T}, m_{1:K_r} | r_{1:T}, θ). Therefore, from Equation (11) we can obtain p(y_{1:T} | r_{1:T}, θ) with m_{1:K_r} marginalized out. The trivial solution to this problem is to make every point a change point, so that the sum of errors becomes least. To avoid this situation, we have to model the number of change points K_r; we model the prior controlling the number of change points as a Bernoulli distribution on each r_t.
By Bayes' theorem, we can model the posterior distribution of the random variable r_{1:T} by putting Equations (11) and (13) together. We work with the joint distribution of r_{1:T} and y_{1:T}, p(r_{1:T}, y_{1:T} | θ), since the evidence of the observation, p(y_{1:T} | θ), is intractable to compute.
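As a concrete, simplified illustration, the joint log-probability can be evaluated per sample as below. This sketch plugs the segment mean in for m_k instead of using the exact conjugate marginal (in the spirit of the substitution made later in the MCMC section); `lam`, `sigma` and the function name are illustrative, not from the paper.

```python
import numpy as np

def log_joint(r, y, lam=0.03, sigma=1.0):
    """Simplified log p(r_{1:T}, y_{1:T} | theta): a Gaussian likelihood
    per segment with the segment mean plugged in for m_k (the exact
    model marginalizes m_k analytically), plus an independent
    Bernoulli(lam) prior on each r_t that penalizes change points."""
    r, y = np.asarray(r), np.asarray(y, dtype=float)
    bounds = np.concatenate(([0], np.flatnonzero(r), [len(y)]))
    ll = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b > a:                                  # skip empty segments
            seg = y[a:b]
            ll -= 0.5 * np.sum((seg - seg.mean()) ** 2) / sigma**2
    k = int(r.sum())
    return ll + k * np.log(lam) + (len(r) - k) * np.log(1 - lam)
```

On a flat signal, a sample with a spurious change point scores lower than one without, because of the Bernoulli penalty; this is exactly what rules out the trivial every-point-a-change-point solution.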

Markov Chain Monte Carlo
We use Markov Chain Monte Carlo (MCMC) to find the global optimal solution [21,22] for the model we designed. In Reference [19], m_k is also a random variable to infer, and reversible jump MCMC is applied to cope with the changing dimension (the number of m_k's to estimate); our work instead considers only r_{1:T}, replaces m_k with the segment mean ȳ_k, and simulates only the change points.
We detect the piecewise constants by computing the expectation of the random variable r_{1:T}: E[r_{1:T}] = r̄_{1:T} ≈ r̂_{1:T} = (1/n) Σ_{s=1}^n r^s_{1:T}, where r̄_{1:T} is the true mean of r_{1:T} and r̂_{1:T} is its Monte Carlo estimate from samples r^s_{1:T} ∼ p(r_{1:T}). The approximation becomes exact as n → ∞.
Instead of generating samples from the intractable posterior distribution p(r_{1:T} | y_{1:T}, θ), we use a function from which it is easy to draw samples, namely a proposal function (Metropolis-Hastings, Algorithm 1). We can think of two proposal functions suited to the change point detection problem. Given one sample, either a change point is removed and two segments are merged, or a change point is born and a segment is split; in either case, this flips one r_t, 0 → 1 or 1 → 0. The other proposal function is a swap between a 0 and a 1.

Algorithm 1 Metropolis-Hastings
procedure MH(p(·), q(·), r^0_{1:T}, n)  ▷ posterior, proposal, initial sample, number of samples
  for s = 1, 2, ..., n do
    draw r*_{1:T} ∼ q(· | r^{s−1}_{1:T})
    α = [p(r*_{1:T}) q(r^{s−1}_{1:T} | r*_{1:T})] / [p(r^{s−1}_{1:T}) q(r*_{1:T} | r^{s−1}_{1:T})]
    with probability min(1, α), set r^s_{1:T} = r*_{1:T}; otherwise set r^s_{1:T} = r^{s−1}_{1:T}
Since both proposals are symmetric, the acceptance ratio reduces to α = p(r*_{1:T}) / p(r^{s−1}_{1:T}): in the flip case the proposal probability depends only on the length of the sample, and in the swap case the number of change points does not change (K_{r^{s−1}} = K_{r*}). Figure 3 shows the r̂_{1:T} obtained from the MCMC step. However, this model only detects piecewise constants, so the non-constant parts (i.e., drastic changes) all have a high probability of being change points.
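A runnable sketch of Algorithm 1 with the two symmetric proposals (flip and swap) might look like the following. Here `log_p` stands for the log of the joint p(r_{1:T}, y_{1:T} | θ), and the 50/50 mixing of the two proposals is our choice, not specified in the text.

```python
import numpy as np

def metropolis_hastings(log_p, r0, n, seed=0):
    """Metropolis-Hastings over binary change-point vectors r_{1:T}.
    Proposal 1 flips one random position (a change point is born or
    removed); proposal 2 swaps a random 1 with a random 0.  Both are
    symmetric, so the acceptance ratio reduces to p(r*)/p(r^{s-1})."""
    rng = np.random.default_rng(seed)
    r = np.array(r0, dtype=int)
    samples = []
    for _ in range(n):
        r_new = r.copy()
        ones, zeros = np.flatnonzero(r == 1), np.flatnonzero(r == 0)
        if rng.random() < 0.5 or len(ones) == 0 or len(zeros) == 0:
            r_new[rng.integers(len(r))] ^= 1       # flip: 0 <-> 1
        else:
            r_new[rng.choice(ones)] = 0            # swap a 1 and a 0
            r_new[rng.choice(zeros)] = 1
        if np.log(rng.random()) < log_p(r_new) - log_p(r):
            r = r_new                              # accept r* as r^s
        samples.append(r.copy())
    return np.mean(samples, axis=0)                # Monte Carlo estimate of E[r]
```

The returned vector is the estimate r̂_{1:T}; thresholding it gives the detected change points.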

Merging Segments
In Reference [23], it is shown that by controlling parameters and adopting various models, more can be achieved than just detecting piecewise constants. However, we assume as little as possible about the data, which is why we adopt a separate merging step. The change points obtained from Section 3.2 indicate locations where the mean of the signal changes. Therefore, when operations are executed, the operation part is likely split into many segments separated by several change points. We detect whether segments come from an operation part or an idle part and merge the segments from the operation part. We consider the following two properties of the idle part.

• Whether the length of the segment is long enough to be a segment
• Whether the segments suspected to be idle periods lie on a similar power level
Details of merging segments are given in Algorithm 2 below. Figure 4 shows the merged segments.

Extracting Features from Time Series Segments
In this part, we extract features of fixed dimension from the segments. That is, given the segment y_{τ_k:τ_{k+1}−1}, we compute a feature φ_k of fixed dimension. We apply two approaches to extract the features; this part remains open to further research.

Polynomial Least Square Coefficients
The first approach we apply is polynomial fitting. The polynomial model has the form y = β_0 + β_1 x + · · · + β_{D−1} x^{D−1} + ε, where β_d is the d-th parameter, which describes the influence of x^d on y, and β = [β_0, β_1, ..., β_{D−1}]^T. The solution to polynomial fitting with minimum least squared error exists in closed form: β̂ = (X^T X)^{−1} X^T y, where the n-by-D matrix X = [x_1, x_2, ..., x_n]^T is the concatenation of n data samples. Given each segment y_{τ_k:τ_{k+1}−1}, let X_k = [x_{τ_k}, x_{τ_k+1}, ..., x_{τ_{k+1}−1}]^T with x_t = [1, t, t², ..., t^{D−1}]^T for τ_k ≤ t ≤ τ_{k+1} − 1. We can then estimate β̂_k for each segment of the time series and take it as the D-by-1 feature: φ_k = β̂_k. We visualize the coefficients by reconstructing the power traces in Figure 5.
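Per segment, the closed-form fit above amounts to an ordinary least squares problem, sketched below. The function name is ours, the time index is taken locally within the segment, and `np.linalg.lstsq` is used in place of the explicit normal equations for numerical stability.

```python
import numpy as np

def poly_features(segment, D=4):
    """Fit a degree-(D-1) polynomial to one segment by least squares and
    return the D coefficients beta_hat as the fixed-dimension feature,
    regardless of the segment's length."""
    t = np.arange(len(segment), dtype=float)
    X = np.vander(t, N=D, increasing=True)   # rows x_t = [1, t, ..., t^{D-1}]
    beta, *_ = np.linalg.lstsq(X, np.asarray(segment, dtype=float), rcond=None)
    return beta                               # the D-by-1 feature phi_k
```

Because the output dimension is D no matter how long the segment is, segments of different lengths become directly comparable.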

Histogram
The second approach is making a histogram of each segment. All histograms share the same bin scale: the size of one bin is (max y_{1:T} − min y_{1:T}) / D. Once we normalize each histogram to sum to 1, it gives a distribution over power levels for each segment.
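A sketch of the histogram features, with all segments sharing the same D bins over [min y, max y] (names are ours):

```python
import numpy as np

def histogram_features(segments, y, D=60):
    """Normalized histogram per segment: all segments share the same
    D bins spanning [min(y), max(y)], so the features are comparable
    and each row sums to 1 (a distribution over power levels)."""
    edges = np.linspace(np.min(y), np.max(y), D + 1)   # bin width (max-min)/D
    feats = []
    for a, b in segments:
        h, _ = np.histogram(y[a:b], bins=edges)
        feats.append(h / h.sum())
    return np.array(feats)                              # K_r-by-D matrix
```

Normalizing each row to sum to 1 is what makes the symmetrized KL divergence of the clustering step well defined.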

Clustering Features
In this part, we finally cluster the features so that we can match each segment to the operation executed when the segment was generated. We apply the K-means clustering algorithm. Based on a pre-defined distance measure and number of clusters, K-means repeats, until convergence, assigning data points to clusters by distance and recomputing the mean of each cluster [20]. We set the number of clusters to K = 3, based on the number of operations (square, multiply and, optionally, the idle period). For the coefficient features of Section 3.4.1 we use Euclidean distance; for the histogram features of Section 3.4.2, both Euclidean distance and the symmetrized divergence Distance(φ_i, φ_j) = KL(φ_i‖φ_j) + KL(φ_j‖φ_i) (only for normalized histograms) are defined as distance measures. The performance of K-means, however, is affected by the initial points, so we run the algorithm multiple times and evaluate each run [24]. Each run must be evaluated using only the data itself, so that no other information is reflected and the K-means result is judged fairly. We adopt the Davies-Bouldin index (DB index) to evaluate each run and choose the best performing clusters. Desirable clusters, with high inter-cluster distance and low intra-cluster distance, produce a low DB index [25]. Figure 6 shows the recovered key exponent.
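The run-selection loop can be sketched with a plain NumPy K-means and a hand-rolled DB index, kept dependency-free here; a library implementation (e.g., scikit-learn) would serve equally well. All names are ours.

```python
import numpy as np

def kmeans(X, k, rng):
    """Plain K-means: random init from the data, then alternate between
    assigning points to the nearest mean and recomputing the means."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(50):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

def davies_bouldin(X, labels, centers):
    """DB index: average over clusters of the worst (s_i + s_j)/d_ij,
    where s_i is the mean intra-cluster distance and d_ij the distance
    between centers; lower means compact, well-separated clusters."""
    k = len(centers)
    s = np.array([np.mean(np.linalg.norm(X[labels == i] - centers[i], axis=1))
                  for i in range(k)])
    worst = [max((s[i] + s[j]) / np.linalg.norm(centers[i] - centers[j])
                 for j in range(k) if j != i) for i in range(k)]
    return float(np.mean(worst))

def best_clustering(X, k=3, n_runs=100, seed=0):
    """Run K-means many times and keep the run with the lowest DB index;
    only the data itself is used, never the true key exponent."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_runs):
        labels, centers = kmeans(X, k, rng)
        if len(np.unique(labels)) < k:        # degenerate run, skip
            continue
        score = davies_bouldin(X, labels, centers)
        if best is None or score < best[0]:
            best = (score, labels)
    return best[1]
```

Selecting on the DB index is what makes the pipeline fully unsupervised: the key exponent is only read off afterwards from the sequence of cluster labels.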

Experiments and Results
The experiments were conducted under the environment below:

• The window size N_w is 1000; N_P depends on the input data and in our experiment is 1,657,401.
As mentioned, λ controls the number of change points. The best guess about the key exponent, without any knowledge, is half 1s and half 0s. The number of change points is then 48 (= 2 × (16 + 8)), since a 0 bit leads only to a square operation and a 1 bit leads to a square and a multiplication. The best guess for µ is the mean of the time series. Though σ and V should be optimized for the best inference, we experimentally chose their values among candidates sampled during the MCMC process.
Our approach is evaluated with the criteria below:
• Up to which level of noise (Signal to Noise Ratio, SNR) added and inserted to the signal the approach still works
• The quality of the clusters measured against external information
The first criterion shows the robustness of the approach to noise, as adding noise is often suggested as a countermeasure to side channel analysis. For the second criterion, the external information (information not used for building the clusters) we adopt is the very key exponent we want to estimate.

Comparison with Naive Peak Detection
This experiment compares the proposed approach to a naive correlation peak detecting method. Figure 7a was obtained by computing the covariance with the entire signal and a sliding window of size 100; the window starts at the first part of the signal, and its size was selected as 100 in a 'naive' way. As seen in Figure 7b, peaks were accepted only if they satisfied two thresholds: the covariance is positive and the minimum distance between two nearest peaks is larger than 20 on the time scale. Without these thresholds, too many irrelevant peaks were found. Our approach performed better at finding the locations of the executed operations and the locations of the inserted noise.

Noise Level
We have experimented with 16 levels of noise, incrementing the ratio of the standard deviation of the noise to that of the raw signal linearly by 0.2 up to 3.0. Table 1 shows how consistent our work is when different levels of noise are added. From the table, we see that when the standard deviation ratio reaches 1.8, K_{r_{1:T}} changes; from that point on, the number of change points keeps changing, although at some noise levels it remains 44. When noise is added, the change points move and sometimes their number also changes. For comparing τ_{1:K}, we chose only the standard deviation ratios 0.2-1.6, since beyond those the number of change points has already changed. The average absolute error of τ_{1:K} compared to the signal without noise is computed as the mean over k of |τ_k(noisy) − τ_k(clean)|. Table 2 shows that the average absolute error is no more than 3, and for noise levels 0.2, 0.4 and 0.6 it is less than 1.

External Information
Next, we compared the clusters with external information, the actual key exponent. This is not a part of our approach, since the actual key exponent is used; it rather evaluates our approach by comparing our estimate with the label. Table 3 shows the average accuracy of 100 runs of the K-means algorithm. We used polynomial coefficients up to the 3rd degree (D = 4), and for the histogram we used D = 60. Clustering the polynomial least square coefficient features is not affected much by the noise level, whereas clustering the histogram features is relatively more affected by noise. We made a confusion matrix of the actual key exponents and the clusters assigned to the segments, and defined the accuracy of each cluster in terms of the elements C_{i,j} of this confusion matrix.
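The accuracy formula itself is elided in the text; a standard per-cluster reading of the confusion matrix, which we assume here, is the fraction of each row that lands on the diagonal after pairing each cluster with its operation:

```python
import numpy as np

def cluster_accuracy(C):
    """Assumed per-cluster accuracy from a confusion matrix C
    (rows: true operations, columns: assigned clusters, diagonal:
    matches after pairing): C[i, i] / sum_j C[i, j]."""
    C = np.asarray(C, dtype=float)
    return np.diag(C) / C.sum(axis=1)
```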

Recovering Key Exponent
Our approach should be feasible without the external information; that is, we should be able to distinguish the more accurate clustering runs from the others. Figure 9 shows examples of desirable and undesirable clustering runs. We sorted the 100 DB indices in ascending order and picked the 5, 10 and 15 lowest DB indices and their corresponding clusters. Table 4 shows the average accuracy of these 5, 10 and 15 runs of the K-means algorithm. We see a large improvement in most cases, especially for the histogram features. This means that even without external information, we can distinguish good clusterings from bad ones and find the key exponent.

Data Scale
In this part, we empirically checked the time complexity of our approach. We set the time series length T = N_P/N_w differently in each case, ran each case 10 times and box-plotted the results in Figure 10. The figure shows that the time spent in each case is linear in the time series length T, so if the raw data have a higher sampling rate or a larger key bit size, our approach takes proportionally longer.

Discussion
• For a longer key bit length: If we can assume that the environment that generated the power trace is consistent over the whole time, so that certain patterns exist, we can apply our methods to the first part of the trace and extract patterns from it. For the rest, using these patterns, we can find the key exponents faster. In this manner, we solve the remaining part of the problem in a supervised way, whereas our approach is totally unsupervised. This reduces the time spent on analysis even when the key length is relatively long.

• Weakness: One major weakness is that our approach is based on finding piecewise constant parts. If the idle part changes drastically with a magnitude bigger than σ, or has another specific shape, a different model must be adopted.

• Further applications: The proposed approach can be applied to other related applications. A discrete power system is one example [26,27]; problems of systems with different power generation models and different assumptions can be solved with the proposed approach.

Conclusions
In this work, we suggested a probabilistic model-based side channel analysis. We modelled a change point detection problem to detect the piecewise constants that are not directly related to finding keys, and merged incomplete segments into key-relevant complete segments. We solved this problem with an MCMC approach to find a globally optimal solution. From each segment, we extracted features of a fixed dimension and assigned a cluster to each segment. We showed that these clusters are highly related to the key exponent of the power trace and that the approach works consistently even in the presence of noise. We evaluated our approach with the criteria of robustness to noise level and accuracy of the recovered key. The source code for our work is available at github.com/JeonghwanH/binEXP_CPD.