A Survey of Side-Channel Leakage Assessment

As more threatening side-channel attacks (SCAs) are being proposed, the security of cryptographic products is seriously challenged. This has prompted both academia and industry to evaluate the security of these products. The security assessment is divided into two styles: attacking-style assessment and leakage detection-style assessment. In this paper, we will focus specifically on the leakage detection-style assessment. Firstly, we divide the assessment methods into Test Vector Leakage Assessment (TVLA) and its optimizations and summarize the shortcomings of TVLA. Secondly, we categorize the various optimization schemes for overcoming these shortcomings into three groups: statistical tool optimizations, detection process optimizations, and decision strategy optimizations. We provide concise explanations of the motivations and processes behind each scheme, as well as compare their detection efficiency. Through our work, we conclude that there is no single optimal assessment scheme that can address all shortcomings of TVLA. Finally, we summarize the purposes and conditions of all leakage detection methods and provide a detection strategy for actual leakage detection. Additionally, we discuss the current development trends in leakage detection.


Introduction
The pervasive nature of information technology has permeated all aspects of work and life. As malicious information security incidents like "Eternal Blue" [1] and "Bvp47" [2] continue to emerge, information security has garnered significant attention. Cryptographic products are products that utilize cryptographic technology, and security is their fundamental attribute. However, various cryptographic analysis technologies, such as traditional cryptographic analysis and SCAs, can impact the security of these products. Traditional cryptographic analysis mainly includes techniques like differential cryptanalysis [3], linear cryptanalysis [4], correlation analysis [5], etc. On the other hand, SCA techniques encompass power analysis attacks [6] (such as simple power analysis (SPA) [7], differential power analysis (DPA) [6], correlation power analysis (CPA) [8], mutual information analysis (MIA) [9], and deep-learning-based SCA [10][11][12][13][14][15][16]), timing attacks [17], fault-based attacks [18], cache attacks [19], etc. Consequently, evaluating the security of cryptographic products has become a crucial task. Two popular security certification standards, namely, Common Criteria (CC) [20] and FIPS [21], have been established to assess the security of cryptographic products. These security certification standards employ two distinct assessment methods: evaluation-style testing (also known as attacking-style assessment) and conformance-style testing (also known as leakage detection-style assessment) [22].
The attacking-style assessment mainly uses various SCA techniques to recover key information and evaluate the products against SCAs. The attacking-style assessors can directly obtain the key by executing SCAs. The evaluation results facilitate the calculation of security metrics for encryption algorithms, the identification of vulnerabilities within these algorithms, and the implementation of algorithm protection strategies. The attacking-style assessment offers rigorous evaluation intensity and easily interpretable evaluation results, and it holds great significance as a method for evaluating side-channel security. However, the use of the attacking-style assessment in primary security assessments is considered unsuitable due to its reliance on SCA technologies, which require evaluators skilled in advanced SCA techniques. This reliance increases both the time and sample complexity of the assessment, subsequently slowing down the assessment speed. Consequently, it fails to keep pace with the rapid cycle of product innovation [23][24][25][26][27]. Instead, the leakage detection-style assessment is proposed as a preliminary method to evaluate the security of cryptographic products. The leakage detection-style assessment primarily relies on information theory or statistical hypotheses to analyze side-channel measurements and determine the presence of any side-channel leakage. The objective of the leakage detection-style assessment is to evaluate whether the device can pass the testing rather than to recover the key itself. This type of assessment can be categorized into two main types: leakage assessment based on information theory and assessment based on statistical hypothesis. In recent years, the leakage assessment based on statistical hypotheses has been widely adopted due to its superior efficiency and effectiveness [27].
Various works focusing on leakage assessment using statistical hypotheses have been proposed [23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38].
In this paper, we study the works of side-channel leakage assessment and provide succinct accounts of motivations and detection processes behind these assessment methods. We also compare the efficiency of different detection methods and discuss the future development of leakage assessments.
The main contributions of this paper are as follows: (1) We analyze the works of side-channel leakage assessment and classify the leakage detection-style assessment works into two categories: the technology of TVLA and optimizations of TVLA. Additionally, we identify the shortcomings of TVLA. Because TVLA's flaws concern its statistical tool, detection process, and decision strategy, we divide TVLA's optimization schemes into three groups: optimizations of the statistical tool, the detection process, and the decision strategy. Furthermore, we provide a brief description of the motivation and detection process for each optimization and compare their detection efficiency. (2) Due to the lack of a unified and comprehensive leakage detection assessment method that can address all of TVLA's shortcomings, as well as the variation in optimization methods based on detection purposes and conditions, we present a summary on how to select a suitable leakage detection assessment method depending on specific detection purposes and conditions. Moreover, considering the current state of leakage detection assessment, we discuss potential future trends in this field.
The structure of this paper is as follows. Section 2 provides a brief description of the process, methods, metrics, and shortcomings of the attacking-style assessment. Section 3 focuses on the leakage detection-style assessment and describes the development process, metrics, and detection objectives. Section 4 mainly focuses on the leakage detection assessment based on statistical hypotheses and describes TVLA and its optimization methods. Section 5 highlights the relationship between the attacking-style assessment and leakage detection-style assessment. In Section 6, the current status and future development trends of leakage detection-style assessment are discussed. Finally, Section 7 presents the conclusion of this review.

The Process of Attacking-Style Assessment
The security certification standard serves as a reference criterion for the attacking-style assessment. The evaluators can choose any SCA method from the list of threats to conduct SCAs and assess the product's security level. Figure 1 illustrates the process of the attacking-style assessment. In the attacking-style assessment, the evaluator assumes the role of an attacker with prior knowledge of the implementation of cryptographic algorithms.


The Methods of Attacking-Style Assessment
Because the attacking-style assessment heavily relies on SCA technologies, the evaluators must have a proficient understanding of SCA technologies. SCA techniques can be categorized into two groups: profiled attack (PA) and non-profiled attack (NPA). The attackers using PAs must first construct a device model before conducting SCAs. The current methods for PAs include Template Attack (TA) [39][40][41] and the profiled attack-based deep learning [14,42,43]. On the other hand, NPA does not require a model and solely relies on side-channel measurements for conducting SCAs. Common NPA methods include DPA [7], CPA [8], MIA [9], etc.

The Profiled Attack
(1) The Template Attack
Time series are the typical representation of side-channel measurements. The attackers utilize these measurements along with their model to conduct side-channel power analysis. TAs are one of the earliest approaches to PAs [39][40][41]. In TAs, it is assumed that the attacker has access to a device identical to the target device, enabling them to encrypt any plaintext and collect the corresponding side-channel power traces. TAs include two stages: the stage of constructing the template and the stage of matching the template. In the stage of constructing the template, the main objective is to extract trace features and build the templates. Assuming we have n traces designated as L = {l_1, l_2, ..., l_n}, we divide L into K groups noted as L_1, L_2, ..., L_K, where L_i is the i-th group, associated with the value k_i. The template of k_i is noted as (µ_i, C_i), where µ_i represents the mean vector and C_i is the covariance matrix.
During the stage of matching the template, the attacker utilizes the traces to calculate the probability between the trace and the templates (µ i , C i ).
If P(trace; (µ_j, C_j)) > P(trace; (µ_i, C_i)) for all i ≠ j, then we get k_guess = k_j.
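The two stages can be sketched on simulated traces; the three-point leakage model, the noise level, and the class count below are illustrative assumptions, not part of any standard:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                       # number of template classes k_i

def draw(k, n):
    """n profiling traces of class k, 3 sample points each (toy leakage model)."""
    mu = np.array([k, 2.0 * k, 0.0])
    return mu + rng.normal(0.0, 0.3, (n, 3))

# Stage 1, constructing the templates: one (mean vector, covariance matrix) per class.
templates = [(Lk.mean(axis=0), np.cov(Lk, rowvar=False))
             for Lk in (draw(k, 500) for k in range(K))]

# Stage 2, matching: pick the class whose Gaussian template assigns the
# attack trace the highest log-likelihood.
def log_likelihood(trace, mu, C):
    d = trace - mu
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + d @ np.linalg.solve(C, d))

trace = draw(2, 1)[0]                       # one attack trace of class k = 2
k_guess = max(range(K), key=lambda k: log_likelihood(trace, *templates[k]))
print(k_guess)
```

The log-likelihood comparison is equivalent to the probability comparison above, since the logarithm is monotonic.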
(2) The profiled attack based on deep learning
In recent years, deep learning technology has emerged as a popular alternative method in PAs. Specifically, the PA based on deep learning [14,42,43] utilizes the multi-layer perceptron and convolutional neural networks to construct the templates and conduct SCAs.
The PA based on deep learning requires two independent trace sets [42]. One set is the training set, which is used to construct the template. Attackers need to have the keys, plaintexts, and traces for the training set. During the training process, the attackers use a minimum loss function to train the model, allowing the template to achieve better results. The other one is the validation set, which is used to carry out attacks.
Compared to traditional approaches like TAs, the PA based on deep learning can overcome the noise assumption and offers a more efficient and simplified process without extensive preprocessing. However, there are two shortcomings. Firstly, the metrics of deep learning are challenging to apply to SCA scenarios and may provide misleading results [42,44]. Secondly, the effectiveness of PAs based on deep learning will decrease significantly when facing imbalanced data.
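The two-set workflow can be sketched with a softmax (logistic-regression) classifier standing in for a deep network, trained on simulated Hamming-weight leakage; the leakage model, trace dimensions, and training parameters are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
HW = np.array([bin(v).count("1") for v in range(16)])  # 4-bit Hamming weights

def simulate(values, noise=0.5):
    """4-point traces: every point leaks the Hamming weight plus Gaussian noise."""
    return HW[values].astype(float)[:, None] + rng.normal(0.0, noise, (len(values), 4))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Training set: keys/plaintexts known, so the class label of each trace is known.
train_vals = rng.integers(0, 16, 4000)
X = np.hstack([simulate(train_vals), np.ones((4000, 1))])   # bias column
y = np.eye(5)[HW[train_vals]]                               # HW classes 0..4

# Train by gradient descent on the cross-entropy loss (the "minimum loss
# function" step; a linear stand-in for the multi-layer networks used in practice).
W = np.zeros((5, 5))
for _ in range(1000):
    W -= 0.1 * X.T @ (softmax(X @ W) - y) / len(X)

# Attack set: 50 traces of one secret value; accumulate per-class log-likelihoods.
secret = 11                                                 # HW(11) = 3
A = np.hstack([simulate(np.full(50, secret)), np.ones((50, 1))])
class_guess = int(np.argmax(np.log(softmax(A @ W) + 1e-12).sum(axis=0)))
print(class_guess)
```

Summing log-probabilities over many attack traces is what makes even a weak per-trace classifier effective, which is why the validation/attack set is kept separate from training.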

The Non-Profiled Attack
The fundamental assumption of NPAs is that the attacker cannot obtain the same device as the target device but has access to an unlimited number of side-channel traces.
(1) Differential Power Analysis
In NPAs, the key information is extracted by a distinguisher. Let X = (X_1, ..., X_D) be the input, where X_i represents the i-th group. The attacker collects the traces while the device encrypts the D groups of data. The trace of X_i is denoted as l_i = (l_{i,1}, ..., l_{i,n_l}), where n_l represents the length of the trace. For each candidate key in the guessing key space, the intermediate value matrix V is mapped to a hypothetical leakage matrix H, and the correlation coefficient between the column h_i of H and the column l_j of L is calculated. The correlation coefficients r_{i,j} are stored in the K × n_l matrix R = (r_{i,j}), i = 1, ..., K, j = 1, ..., n_l. The coefficient r_{i,j} assesses the linear correlation between l_j and h_i; as r_{i,j} increases, the level of correspondence between l_j and h_i also increases. By identifying the highest value of r_{i,j}, the attacker can successfully retrieve the key.
(2) Correlation Power Analysis
Correlation Power Analysis (CPA) is a variant of DPA that exploits the correlation between the power traces L and the leakage model F to conduct the attack. Assuming the leakage model is denoted by F, we map the intermediate value matrix V = f(X, K) to the leakage value noted as G_K = F(V). Here, X represents the input, K denotes the key space, and f stands for the cryptographic function. Typically, F represents the leakage model based on either the Hamming weight or Hamming distance. Similarly, G_K corresponds to the Hamming weights or Hamming distances of V.
The correlation coefficient between the power traces L and G_K is as follows:

ρ(L, G_K) = (E(L · G_K) − E(L) · E(G_K)) / (σ_L · σ_{G_K}),

where E(L), E(G_K), E(L · G_K) are the expectations of L, G_K, L · G_K and σ_L, σ_{G_K} are the standard deviations of L, G_K, respectively. The attackers iterate through the space of possible keys, calculating the correlation coefficient between G_K and L to determine whether a key guess is correct. The correlation coefficient of the correct key guess is higher than that of an incorrect key guess. The attackers identify the guess key with the maximum correlation coefficient as the correct key, using the maximum likelihood estimation method.
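A minimal CPA sketch on simulated traces may make this concrete; the 4-bit PRESENT S-box, the Hamming-weight model, and the noise level are illustrative assumptions, and each key guess's row of the correlation matrix R is reduced to its largest |r|:

```python
import numpy as np

rng = np.random.default_rng(3)
SBOX = np.array([0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                 0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2])  # PRESENT 4-bit S-box
HW = np.array([bin(v).count("1") for v in range(16)])       # Hamming-weight model F

true_key = 0x9
X = rng.integers(0, 16, 2000)                               # plaintext nibbles
leak = HW[SBOX[X ^ true_key]].astype(float)
L = leak[:, None] + rng.normal(0.0, 1.0, (2000, 10))        # 10-point traces

def pearson(g, traces):
    """Correlation of the leakage hypothesis g with every trace point."""
    g = g - g.mean()
    t = traces - traces.mean(axis=0)
    return (g @ t) / np.sqrt((g * g).sum() * (t * t).sum(axis=0))

# One row of the matrix R per key guess; keep each row's largest |r|.
scores = [np.abs(pearson(HW[SBOX[X ^ k]].astype(float), L)).max()
          for k in range(16)]
key_guess = int(np.argmax(scores))
print(hex(key_guess))
```

The correct guess produces the leakage hypothesis that actually entered the traces, so its correlation dominates all wrong-key rows.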

(3) Mutual Information Analysis
The attacker of MIA uses mutual information or entropy to assess the correlation between the leakage and the intermediate value or between the intermediate value and side-channel measurements in [9]. The fundamental concepts of entropy and mutual information are as follows. Let X = (X_1, X_2, ..., X_n) be the set of discrete random variables; then, the entropy of X is

H(X) = −∑_x p(x) log p(x),

where p(x) is the probability distribution of the variables.
Let X = (X_1, X_2, ..., X_n), Y = (Y_1, Y_2, ..., Y_n) be sets of discrete random variables; then, the conditional entropy H(X|Y) of X under Y is

H(X|Y) = −∑_{x,y} p(x, y) log p(x|y),

and the mutual information between X and Y is I(X; Y) = H(X) − H(X|Y). Let K, X, and V be the key, plaintext, and intermediate value, respectively. The correct key is denoted as k_c, and the leakage function L(V), with V = f(X, K), is continuous. For a given key guess k, the intermediate values can be obtained as M(X, k). When the attack is successful, arg max_{k∈K} |D(M(X, k), L)| = k_c, where D is the distinguisher.
Different keys result in different values of mutual information. The average mutual information is associated with the conditional entropy H(L|M(X, k)). A stronger correlation between the measured L and M(X, k) indicates that the guessed key k_guess is the correct key.
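The same simulated setting can illustrate MIA with a histogram-based estimator of I(M; L); the S-box, noise level, and bin count are assumptions for illustration, not prescriptions from [9]:

```python
import numpy as np

rng = np.random.default_rng(4)
SBOX = np.array([0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                 0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2])  # PRESENT 4-bit S-box
HW = np.array([bin(v).count("1") for v in range(16)])

true_key = 0x5
X = rng.integers(0, 16, 5000)
leakage = HW[SBOX[X ^ true_key]] + rng.normal(0.0, 0.7, 5000)

def mutual_information(m, l, bins=8):
    """Histogram estimate of I(M; L) in bits, for discrete M and continuous L."""
    edges = np.histogram_bin_edges(l, bins)
    lb = np.clip(np.digitize(l, edges) - 1, 0, bins - 1)
    joint = np.zeros((int(m.max()) + 1, bins))
    np.add.at(joint, (m, lb), 1.0)                 # joint histogram p(m, l)
    joint /= joint.sum()
    pm, pl = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pm @ pl)[nz])).sum())

# The distinguisher D: the key guess maximizing the estimated mutual information.
mi = [mutual_information(HW[SBOX[X ^ k]], leakage) for k in range(16)]
mia_key = int(np.argmax(mi))
print(hex(mia_key))
```

Unlike CPA, this distinguisher needs no linearity assumption: any statistical dependence between M(X, k) and L raises the estimated mutual information.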
The development of the attacking-style assessment relies entirely on SCA methods. Essentially, the attacking-style assessment comprises attacks on the targeted devices. Consequently, the assessment itself is considered as an attack.

The Metrics of Attacking-Style Assessment
Because the assessment is itself an attack, the attack metrics serve as the assessment metrics for the attacking-style assessment. Distinguisher scores are commonly employed to sort the candidate keys k_guess in an SCA. The position of the correct key k_c in the sorted results is called the key ranking, noted as rank(k_c | L, X). The metrics based on the key ranking are defined as follows.
(1) The number of samples: Find the minimum positive integer N such that, when the sample size |L| ≥ N, the correct key is ranked first, i.e., rank(k_c | L, X) = 1.
(2) Success rate: The success rate of the side-channel attack is the probability of successfully recovering the correct key.
(3) Guessing entropy: The guessing entropy is the mathematical expectation of the key ranking, GE_{L,X} = E(rank(k_c | L, X)).
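A small sketch of these key-ranking metrics, computed from hypothetical distinguisher scores (the score values and the 4-key space are invented for illustration):

```python
import numpy as np

def key_rank(scores, k_c):
    """rank(k_c | L, X): 1-based position of the correct key when candidates
    are sorted by descending distinguisher score."""
    order = np.argsort(scores)[::-1]
    return int(np.flatnonzero(order == k_c)[0]) + 1

# Three hypothetical experiments over a 4-key space with correct key k_c = 3:
score_sets = [[0.1, 0.2, 0.9, 0.8],
              [0.1, 0.2, 0.3, 0.9],
              [0.4, 0.1, 0.2, 0.9]]
ranks = np.array([key_rank(s, k_c=3) for s in score_sets])

success_rate = float((ranks == 1).mean())       # fraction of experiments with rank 1
guessing_entropy = float(ranks.mean())          # E[rank(k_c | L, X)]
print(success_rate, guessing_entropy)
```

The number-of-samples metric would then be the smallest trace count at which the rank first reaches (and stays at) 1.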

The Advantages and Shortcomings of Attacking-Style Assessment
With the advancement of side-channel technology, there has been a rise in the proposal of various side-channel attack methods. Evaluators can select an appropriate side-channel attack method to evaluate cryptographic algorithms, considering distinct encryption implementations and attack conditions. Through evaluating the actual results of side-channel attacks, one can determine the security level of the encryption implementation. The attacking-style assessment technique is characterized by its high level of aggression and extensive evaluation, allowing evaluators to directly acquire information about key and design weaknesses. The attacking-style assessment enables a thorough comprehension of system vulnerabilities and weaknesses. Through simulating real-world attack scenarios, it offers valuable insights into the efficacy of security measures and aids in identifying potential areas for improvement. It facilitates the identification of previously unknown vulnerabilities. The capability to analyze attacks in a controlled environment enables the development and implementation of efficient countermeasures. Thus, the attacking-style assessment is better suited for conducting comprehensive analysis and evaluation of cryptographic algorithms, subsequently facilitating vulnerability analysis and the establishment of protective measures. The attacking-style assessment is widely utilized in the security assessment of cryptographic products. However, there are several limitations to its applicability.
First, because new SCA methods are frequently proposed, it is crucial to periodically update the list of attack methods. However, CC standards generally apply to high-security products like bank smart cards and passport ICs. The time and computational complexity involved in security evaluation make it difficult to keep pace with the innovation cycle of new security products [26,45,46].
Second, the evaluators of attacking-style assessments must possess exceptional expertise in SCA methods and measurement techniques [23].
Third, due to the different principles of SCAs, evaluators need to perform multiple SCAs to calculate security metrics. The increasing number of attacks inevitably leads to an increase in computational and time complexity [31,39]. Therefore, the attacking-style assessment is not suitable for the primary security evaluation of cryptographic algorithms. Instead, the leakage detection-style assessment is proposed as a preliminary method to evaluate the security of cryptographic products.

The Goals of Leakage Detection-Style Assessment
The leakage detection-style assessment is conducted by the laboratory to provide the security certification, or conducted by the designers during the design period to highlight and address the potential issues before the product is launched into the market. Consequently, different stages of assessment have different goals. There are four different intentions [47]: certifying vulnerability, certifying security, demonstrating an attack, and highlighting vulnerabilities.
(1) Certifying vulnerability: This involves identifying at least one leakage point in the traces. It is crucial to minimize the false positive rate.
(2) Certifying security: The goal here is to find no leakages after thorough testing. In this case, the false negatives become a concern. It is important to design the tests with "statistical power" to ensure a large enough sample size for detecting effects with reasonable probability. Moreover, all possible intermediates and relevant higher-order combinations of points should be targeted.
(3) Demonstrating an attack: The objective is to map a leaking point (or tuple) to its associated intermediate state(s) and perform an attack. The reporting of the outcomes or projections derived from these attacks is of interest. The false positives are undesirable as they represent wasted efforts of the attacker.
(4) Highlighting vulnerabilities: The purpose is to map all exploitable leakage points to intermediate states in order to guide designers in securing the device. This has similarities to certifying security, as both require exhaustive analysis. The false negatives are of greater concern than false positives, as false negatives indicate unfixed side-channel vulnerabilities.

The Process of Leakage Detection-Style Assessment
In leakage detection, the evaluators have the ability to control both input and output variables and obtain the side-channel measurements. The process of leakage detection-style assessment primarily encompasses four stages: collecting power traces, categorizing power traces, calculating the statistical moment, and determining leakage. The process of leakage detection-style assessment is shown in Figure 2.

Compared with the attacking-style assessment, the leakage detection-style assessment uses the statistical hypothesis or information theory to detect whether there is leakage. The evaluator does not consider specific attack methods, leakage models, or the implementation of encryption algorithms and does not require accurate extraction of leakage characteristics or the key recovery. Consequently, the leakage detection-style assessment becomes an ideal method for primary security assessments due to the lowered technical threshold and reduced evaluation time.

The Development of Leakage Detection-Style Assessment
In this paper, we categorize the development of leakage detection-style assessment into three stages: the phase of proposing the leakage assessment concept, the phase of forming leakage assessment methods, and the phase of optimizing the leakage assessment methods. Figure 3 provides a visual representation of these development stages.
In the phase of proposing the leakage assessment concept, Coron et al. (2001) proposed the idea of security assessment [48]. A failed result indicates the presence of side-channel leakage, whereas a pass result does not imply that there is no leakage but rather indicates that the leakage has not been identified at the specified confidence level α. In [49], a framework for the side-channel security assessment was presented, categorizing security assessments into attacking-style and leakage detection-style assessments. This paper primarily concentrates on the leakage detection-style assessment.
In the phase of forming leakage assessment methods, two groups of leakage detection-style assessment methods have emerged based on different theories: the information theory and the statistical hypothesis. In 2010 and 2011, Chothia proposed two leakage assessment methods: one based on discrete mutual information (DMI) [50] and the other based on continuous mutual information (CMI) [51]. Both the DMI and CMI methods utilize information entropy as a testing tool to assess the possibility of side-channel leakage. Additionally, Gilbert et al. utilized the statistical hypothesis as a testing tool in their work [52]. They divided the traces into two groups and performed leakage assessment on the Advanced Encryption Standard (AES) using a t-test [52]. In 2013, based on the research by Gilbert et al. [52], Becker et al. proposed the Test Vector Leakage Assessment (TVLA) technique [53]. They divided the traces into two groups: fixed plaintext traces vs. random plaintext traces. Then, they performed Welch's t-tests on these two groups to detect any mean difference between the trace sets of fixed plaintext and random plaintext. In 2013, Mather et al. conducted a comparison of the detection efficiency between DMI and CMI with TVLA [27]. The results demonstrated that TVLA outperformed DMI and CMI. Later work extended TVLA [23] and studied leakage detection in various scenarios. The non-specific TVLA technology has been widely applied in leakage assessment and is commonly regarded as a preliminary assessment technique in both industry and academia due to its simplicity, efficiency, and versatility.

However, in order to address the shortcomings of TVLA [22-40,47] (the detailed description of TVLA's drawbacks can be found in Section 4), improve the detection efficiency and the reliability of results, and quantify side-channel vulnerability, researchers have proposed a variety of leakage assessment schemes to enhance TVLA.
In the phase of optimizing leakage assessment methods, there are three main aspects to consider. Firstly, the optimization of statistical tools aims to replace the t-test of TVLA with alternative statistical tools (such as the paired t-test [26], χ 2 -tests [28], Hotelling T 2 -tests [29], KS tests [30], ANOVA [31], deep learning [36,37], etc.). Secondly, the optimization of the detection process involves improving the current TVLA detection process or proposing a new one [32][33][34]. Finally, the optimization of the decision strategy focuses on introducing a new decision strategy, such as the HD strategy [35], to determine the stage of leakage.

The Leakage Assessment Based on Statistical Hypothesis
This section provides an overview of two perspectives on leakage assessment technologies: the TVLA technology and the TVLA's optimization schemes.

The Test Vector Leakage Assessment
In 2013, Cryptography Research, Inc. (CRI) introduced the TVLA as a standardized approach for detecting side-channel leakages. This section emphasizes the process of detection, the metrics used for detection, and discusses the limitations of TVLA.

The TVLA Technology
(1) The detection process of TVLA
The detection process of TVLA is as follows. In the stage of collecting power traces, N different inputs X are used to collect traces L from the execution process. Let l = (l_1, ..., l_i, ..., l_{n_l}) be a trace, where l_i represents the measurement at the i-th point, and n_l is the length of the trace.
In the stage of categorizing power traces, the trace set L is divided into two groups: the fixed plaintext trace set L_A and the random plaintext trace set L_B. It is assumed that L_A and L_B obey the normal distributions N(µ_A, σ²_A) and N(µ_B, σ²_B), respectively. The cardinality, sample mean, and sample variance of L_A are denoted as n_A, x̄_A, s²_A, while those of L_B are denoted as n_B, x̄_B, s²_B. In the stage of calculating the statistical moment, the null hypothesis H_0 states that there is no side-channel leakage, while the alternative hypothesis H_1 suggests the presence of leakage. Welch's t-test is employed to determine the mean difference between L_A and L_B. Under the assumption of H_0, we calculate the statistical moment T_i and the probability P_i of accepting H_0.
In the stage of determining leakage, if |T i | > |T th |, then the null hypothesis H 0 is rejected, and it can be concluded that there is side-channel leakage.
(2) The statistical tool of TVLA
The statistical tool Welch's t-test is used to test the mean difference between L_A and L_B in TVLA. The null hypothesis H_0 and alternative hypothesis H_1 in Welch's t-test [33] are H_0: µ_A = µ_B (no leakage) and H_1: µ_A ≠ µ_B (leakage). The statistical moment T_µ and degrees of freedom v of the t-test are calculated as follows:

T_µ = (x̄_A − x̄_B) / sqrt(s²_A/n_A + s²_B/n_B),

v = (s²_A/n_A + s²_B/n_B)² / ((s²_A/n_A)²/(n_A − 1) + (s²_B/n_B)²/(n_B − 1)).

The probability p of accepting hypothesis H_0 is calculated from the probability density function (PDF) of Student's t-distribution:

f(t, v) = Γ((v + 1)/2) / (sqrt(vπ) Γ(v/2)) · (1 + t²/v)^(−(v+1)/2), p = 2 ∫_{|T_µ|}^{∞} f(t, v) dt,

where Γ is the gamma function. The threshold 4.5 [54,55] is employed to determine the acceptance of H_0: if |T_µ| > 4.5, the hypothesis H_0 is rejected. When v > 1000 and |T_µ| ≥ 4.5, p < 10⁻⁵ [33], and this implies that the probability of accepting hypothesis H_0 is less than 0.00001, while the probability of rejecting hypothesis H_0 is greater than 0.99999, indicating the presence of side-channel leakage in the device.
(3) The decision strategy of TVLA
Because TVLA is a leakage assessment method for univariate data, when the evaluator obtains traces of length n_L, TVLA is applied to each sample point of the traces. The assessor obtains n_L detection results, and the assessment decision is made by combining these results. The min-P strategy is a common decision strategy used in TVLA. Let L = [l_1, ..., l_{n_L}], where l_i = (l_{i,1}, ..., l_{i,j}, ..., l_{i,N}) collects the N measurements at sample point i, and T_µ^i represents the statistical moment value at point i.
When TVLA is applied to long traces, the assessor actually conducts multiple (n_L) tests. If any of the tests rejects H_0, this indicates the presence of side-channel leakage. In other words, leakage is reported when max_{1≤i≤n_L} |T_µ^i| ≥ T_th, or equivalently when the minimum p-value is less than the threshold α_th. This means that the min-P strategy of TVLA uses only one test result (the minimum p-value) to make a decision regarding long traces.
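The four stages above can be sketched as a fixed-versus-random test on simulated traces; the trace dimensions, the injected leak at sample point 7, and the effect size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_l = 5000, 20

# Stage 1-2: collect and categorize traces. Here both sets are simulated noise,
# with a small data-dependent shift injected at sample point 7 of the fixed set.
LA = rng.normal(0.0, 1.0, (n, n_l))      # fixed-plaintext traces
LA[:, 7] += 0.15                         # the leak
LB = rng.normal(0.0, 1.0, (n, n_l))      # random-plaintext traces

# Stage 3: Welch's t statistic and degrees of freedom at every sample point.
def welch_t(a, b):
    va, vb = a.var(0, ddof=1) / len(a), b.var(0, ddof=1) / len(b)
    t = (a.mean(0) - b.mean(0)) / np.sqrt(va + vb)
    v = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, v

t, v = welch_t(LA, LB)

# Stage 4: min-P / max-|t| decision against the 4.5 threshold.
leaks = np.flatnonzero(np.abs(t) > 4.5)
print(leaks)
```

With 5000 traces per set, the injected 0.15 shift yields |t| well above 4.5 at point 7 while every other point stays below the threshold with overwhelming probability.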

The Assessment Metrics of TVLA
Due to the fact that TVLA involves the statistical analysis of random variables, it is possible for the detection results to contain errors. Therefore, it becomes imperative to evaluate the detection results. Commonly used metrics are employed to assess the effectiveness of detection methods.
(1) The number of samples: The minimum sample size required to exceed the threshold for statistical moments is an assessment metric used in TVLA. This metric is frequently utilized to compare the detection effectiveness among various assessment methods. In identical conditions, a smaller sample size indicates a higher degree of assessment effectiveness [24][25][26][27][28][29][30][31].
(2) The false positive and false negative: The two types of errors commonly encountered in hypothesis testing are false positives [24] and false negatives [47]. A false positive occurs when the null hypothesis is true, but it is rejected by a t-test, leading to an incorrect conclusion. In the context of leakage detection, a false positive refers to a situation where the device does not have any leaks, yet the TVLA results indicate otherwise. Conversely, a false negative denotes a Type II error, which occurs when the null hypothesis is false, but the t-test fails to reject it. The rate of Type II errors is denoted as β. During leakage assessment, the assessor aims to control the false positive rate at the specified significance level α.
(3) The effect size: The effect size ζ is an indicator [22,47] employed to assess the effectiveness of leakage detection.
Cohen's d [56,57] is a commonly used effect size for comparing differences among groups, mainly applied to t-tests to compare the standardized difference between two means. Cohen's d is computed as

d = (x̄_A − x̄_B) / s,  s = √( ((n_A − 1)s²_A + (n_B − 1)s²_B) / (n_A + n_B − 2) ),

where x̄_A, x̄_B are the sample means, s²_A, s²_B are the sample variances, and n_A, n_B are the cardinalities. Cohen established thresholds as criteria for judging the effect size [58]: according to his classification, the effect size is considered "small" when d ≤ 0.2 and "large" when d ≥ 0.8.
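As an illustration, Cohen's d with the pooled standard deviation can be computed in a few lines. This is a generic sketch (the data values are made up for illustration), not code from any of the surveyed works.

```python
import math

def cohens_d(a, b):
    """Cohen's d with the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    # unbiased sample variances
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    # pooled standard deviation
    s = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / s

d = cohens_d([2, 4, 6], [1, 3, 5])  # means 4 and 3, pooled s = 2 -> d = 0.5
```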
(4) The power: The power is an essential metric that indicates the ability of the assessor to detect a difference at the significance level α, and it is denoted 1 − β. The power should not be less than 75% and is typically required to reach 80% or 90%. The relationship among the variances σ²_1, σ²_2, the power 1 − β, the significance level α, the effect size ζ, and the number of samples N is

N = (σ²_1 + σ²_2)(T_{α/2} + T_β)² / ζ²,  (14)

where ζ = µ_1 − µ_2, and T_{α/2} and T_β are the corresponding quantiles of the test statistic's distribution. Equation (14) allows us to obtain any one of the significance level, effect size, or power from the others [47].
The observed power: Because power is a theoretical parameter that cannot be directly obtained from samples, the observed power (OP) was proposed in [59] to evaluate the reliability of assessment results for given α and N. The OP can be considered an approximation of the power. If the probability of accepting the null hypothesis is P > α, the assessor can determine the reliability of the assessment results by checking whether OP is greater than 0.8. Additionally, OP can serve as a means to compare the effectiveness of different assessment methods while keeping the sample size N constant.
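As a sketch of how Equation (14) is used in practice, the snippet below solves it for N under a normal approximation, using the standard normal quantiles z_{α/2} and z_β in place of T_{α/2} and T_β. The parameter values are illustrative only.

```python
import math
from statistics import NormalDist

def samples_needed(sigma1, sigma2, alpha, power, zeta):
    """Per-group sample size from the normal approximation of Equation (14):
    N = (sigma1^2 + sigma2^2) * (z_{alpha/2} + z_{beta})^2 / zeta^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)            # power = 1 - beta
    n = (sigma1 ** 2 + sigma2 ** 2) * (z_alpha + z_beta) ** 2 / zeta ** 2
    return math.ceil(n)

# medium effect (zeta = 0.5, unit variances) at alpha = 0.05 and 80% power
n = samples_needed(1.0, 1.0, alpha=0.05, power=0.80, zeta=0.5)
```

This recovers the textbook result that roughly 63 samples per group are needed to detect a medium effect with 80% power at the 5% significance level.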

The Drawbacks of TVLA
Generally, TVLA is based on hypothesis testing to determine whether the side-channel measurements expose any secret information, which helps simplify the security assessment process [25,60–62]. However, this complexity reduction is accompanied by an increase in false positives or false negatives [23,36]. The drawbacks of TVLA are as follows:
(1) Difficulty interpreting negative outcomes
Formally, a statistical hypothesis test can either reject the null hypothesis or fail to reject it; it cannot prove or accept the null hypothesis. If the statistical hypothesis test fails to reject the null hypothesis, the security assessment agency must demonstrate that its assessment method is fair. Unfortunately, due to sample limitations, time constraints, varying levels of expertise, and poor equipment quality, the fairness of a leakage assessment may be undermined. As a result, when negative outcomes occur, it becomes challenging to explain and provide evidence for the fairness of TVLA.
(2) Unreliability of positive outcomes
TVLA is commonly utilized for univariate analysis; however, in actual detection scenarios, multiple tests are necessary. To illustrate this, assume that the probability of a false positive in a single test is α, the length of the traces is n_l, and the n_l tests are independent. The probability that at least one of the tests rejects the null hypothesis is then α_all = 1 − (1 − α)^{n_l}. Ding has emphasized in [63] that the threshold of 4.5 for TVLA corresponds to a false positive rate of approximately α ≈ 0.00001: with 1000 independent tests, α_all = 0.0068, while with 1,000,000 independent tests, α_all = 0.9987. Therefore, products that generate long power traces are more likely to be deemed vulnerable than those with shorter traces, and the positive outcomes may be unreliable.
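The family-wise rate α_all is easy to reproduce numerically. The sketch below derives α from the 4.5 threshold (two-sided normal tail) and approximately reproduces Ding's figures; small rounding differences from the published values are expected.

```python
from statistics import NormalDist

# Two-sided tail probability of the 4.5 threshold under a standard normal
alpha = 2 * (1 - NormalDist().cdf(4.5))        # ~6.8e-6, i.e. roughly 0.00001

def family_wise_rate(alpha, n_tests):
    """Probability that at least one of n_tests independent tests
    falsely rejects H0: alpha_all = 1 - (1 - alpha)^n_tests."""
    return 1 - (1 - alpha) ** n_tests

a_small = family_wise_rate(alpha, 1_000)       # ~0.0068
a_large = family_wise_rate(alpha, 1_000_000)   # ~0.999
```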
(3) Impossibility of achieving exhaustive coverage
Ideally, an evaluator would prefer to rule out any possible sensitive dependency across all distributional forms, for all points and tuples of points, considering all target functions and intermediate states, before declaring a target device secure. However, even with the best efforts, there are still limitations. Moreover, enhancing the scope of detection requires extensive tests, which increase the Type I error rate, the computational complexity, and the sample complexity. Consequently, achieving exhaustive coverage through TVLA is not possible.
(4) The multivariate problems of TVLA
TVLA assumes that the sample points in a trace are independent of each other, and it is a univariate test. In reality, numerous examples contradict this assumption, especially for protected algorithm implementations, where the leakage is mostly observed to be multivariate or horizontal [29,37,47]. Therefore, it is crucial to consider the correlation between multiple variables in leakage assessment.

(5) The fewer trace groups and dependence on the mean statistical moment
The simplicity and efficiency of TVLA depend on a reduced number of groups, the fixed-vs.-random trace sets, and the mean statistical moment [64]. However, when the leakage does not manifest in the mean statistical moment [32], or when mean differences exist among multiple groups, there is a risk of both false positives and false negatives [29–33].
(6) The normality assumption
TVLA assumes that the power traces follow a normal distribution, while in reality many examples contradict this assumption. In particular, for protected algorithm implementations, combining functions are generally used to preprocess the side-channel traces in TVLA, and the distribution of the samples no longer follows a normal distribution after preprocessing [59].
(7) The shortcomings of certifying vulnerability
The assessor using TVLA can only answer whether there is a side-channel leakage; TVLA cannot provide information on the specific location of the leakage or how to exploit it for key recovery. The results obtained from TVLA are therefore insufficient for certifying vulnerability or deducing the relationship between the detected leakage and an attack [22].

The Optimizations of TVLA
To address the shortcomings of TVLA mentioned above, researchers have proposed various optimized assessment methods. This section summarizes the optimization methods in three aspects: the optimization of statistical tools, the optimization of assessment processes, and the optimization of decision strategies.

The Optimization of the Statistical Tool
The statistical tools play a crucial role in leakage assessments as they significantly impact the detection results. The researchers have attempted to enhance the detection efficiency and reliability of results in TVLA by utilizing alternative statistical tools instead of Welch's t-test when calculating statistical moments. This section summarizes the optimization methods for statistical tools.

(1) The paired t-test
The motivation: In [26], Adam Ding found that environmental noise can adversely affect the results of t-tests in actual assessments. In the worst case, a device with leaks could pass the test solely because the environmental noise is strong enough to mask them. To mitigate this impact, Ding proposed a side-channel leakage detection based on paired t-tests in [26], where Welch's t-test is replaced with the paired t-test to eliminate the influence of environmental noise on the results.
The method: In the stage of collecting power traces, a fixed input sequence is proposed to effectively minimize environmental noise. This sequence consists of repetitions of ABBA, such as ABBAABBA...ABBAABBA, so that each pair of neighbouring traces experiences minimal environmental variation and can be compared with a paired t-test. In the stage of determining leakage, the same threshold and decision strategy as TVLA are used.
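A minimal sketch of the paired t statistic on per-pair differences follows; the data are toy values, not a real measurement campaign.

```python
import math
from statistics import mean, stdev

def paired_t(fixed, rnd):
    """Paired t statistic on per-pair differences: t = d_bar / (s_d / sqrt(n)).
    Pairing neighbouring measurements (e.g. via the ABBA pattern) cancels
    slowly varying environmental noise shared by each pair."""
    diffs = [a - b for a, b in zip(fixed, rnd)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# toy example: five paired measurements
t = paired_t([5, 7, 9, 11, 13], [4, 5, 6, 7, 8])  # diffs 1..5 -> t ~ 4.243
```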
(2) The χ²-test
The motivation: In TVLA, comparing only two groups and using the simple mean statistical moment can increase the risk of false negatives [28,49] or fail to detect leakages [40] when the leakage does not occur at the mean statistical moment. In 2018, Moradi proposed a side-channel leakage detection based on the χ²-test [28], in which Welch's t-test is replaced with the χ²-test to detect whether multiple trace groups originate from the same population. Furthermore, in the χ²-test, the frequencies of the side-channel measurements are stored in histograms, which are analyzed to identify any distribution differences among the trace groups.
The method: In the stages of collecting power traces and determining leakage, the same method as TVLA is adopted. During the stage of categorizing power traces, the traces are divided into r type groups, and each type contains c sets. The frequencies of the side-channel measurements stored in the c sets form an r × c contingency table, in which F_{i,j} represents the frequency of the i-th type group and the j-th set, and the total number of samples is N = Σ_{i=1}^{r} Σ_{j=1}^{c} F_{i,j}. In the stage of calculating the statistical moment, the null hypothesis H_0 of the χ²-test states that all power traces come from the same population. The statistical moment T_{χ²} and the degrees of freedom v are obtained by (16):

T_{χ²} = Σ_{i=1}^{r} Σ_{j=1}^{c} (F_{i,j} − E_{i,j})² / E_{i,j},  E_{i,j} = (Σ_{j'=1}^{c} F_{i,j'})(Σ_{i'=1}^{r} F_{i',j}) / N,  v = (r − 1)(c − 1).  (16)

The probability P of accepting H_0 is obtained by (17):

P = 1/(2^{v/2} Γ(v/2)) ∫_{T_{χ²}}^{∞} x^{v/2−1} e^{−x/2} dx,  (17)

where Γ(·) is the gamma function.
In the stage of determining leakage, use the same threshold and decision strategy as TVLA.
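The contingency-table statistic of Equation (16) can be sketched as follows; the table values are purely illustrative.

```python
def chi2_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an
    r x c contingency table of measurement frequencies."""
    r, c = len(table), len(table[0])
    row = [sum(table[i]) for i in range(r)]
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]
    n = sum(row)
    t = 0.0
    for i in range(r):
        for j in range(c):
            expected = row[i] * col[j] / n   # E_{i,j} under H0
            t += (table[i][j] - expected) ** 2 / expected
    return t, (r - 1) * (c - 1)

t_stat, v = chi2_statistic([[10, 20], [20, 10]])  # t ~ 6.667, v = 1
```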

(3) The KS test
The motivation: When the leakage does not occur at the mean statistical moment, TVLA is not the optimal choice. Consequently, Zhou X proposed a side-channel leakage detection based on the Kolmogorov-Smirnov (KS) test in [30]. The KS test is a nonparametric statistical test used to determine whether two groups of traces originate from the same population by quantifying the distance between the cumulative distribution functions of the two groups.
The method: In the stages of collecting power traces, categorizing power traces, and determining leakage, adopt the same method as TVLA. In the stage of calculating the statistical moment, the null hypothesis H_0 of the KS test assumes that L_A and L_B come from the same population, and the alternative hypothesis H_1 states that they come from different populations. The probability P of accepting H_0 is

P = 2 Σ_{j=1}^{∞} (−1)^{j−1} e^{−2 j² Z²},

where Z = D_{n_A,n_B} (√J + 0.12 + 0.11/√J) and J = n_A·n_B/(n_A + n_B). D_{n_A,n_B} represents the maximum distance between the empirical cumulative distribution functions of L_A and L_B, D_{n_A,n_B} = sup_x |L_{A,n_A}(x) − L_{B,n_B}(x)|.
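A self-contained sketch of the two-sample KS distance and the asymptotic p-value approximation above follows; the infinite series is truncated, which is harmless because its terms decay extremely fast, and the sample values are illustrative.

```python
import bisect
import math

def ks_test(a, b):
    """Two-sample KS distance D and the asymptotic p-value
    P = 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 j^2 Z^2)."""
    a, b = sorted(a), sorted(b)
    n_a, n_b = len(a), len(b)
    # sup-distance of the empirical CDFs, evaluated at every sample point
    d = max(abs(bisect.bisect_right(a, x) / n_a - bisect.bisect_right(b, x) / n_b)
            for x in a + b)
    j_eff = n_a * n_b / (n_a + n_b)
    z = d * (math.sqrt(j_eff) + 0.12 + 0.11 / math.sqrt(j_eff))
    p = 2 * sum((-1) ** (j - 1) * math.exp(-2 * j * j * z * z)
                for j in range(1, 101))
    return d, p

d, p = ks_test([1, 2, 3, 4], [5, 6, 7, 8])  # disjoint samples -> d = 1.0
```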

(4) Hotelling T 2 -test
The motivation: TVLA is based on sampling assumptions, and its detection efficiency depends strongly on parameters such as the signal-to-noise ratio (SNR), the degree of dependency, and the density. The correct interpretation of leakage detection results requires prior knowledge of these parameters, which evaluators often lack, posing a non-trivial challenge. To address this issue, Bronchain, O. proposed using the Hotelling T²-test instead of Welch's t-test in [36]. Additionally, they explored the concept of multivariate detection, which can exploit the differences between multiple informative points in a trace more effectively than concurrent univariate t-tests.
The method: In the stages of collecting power traces and categorizing power traces, adopt the same method as TVLA. In the stage of calculating the statistical moment, the null hypothesis H_0 of the Hotelling T²-test is

H_0: µ_A = µ_B;

then, the statistical moment T² is

T² = (n_A·n_B/(n_A + n_B)) (x̄_A − x̄_B)ᵀ C⁻¹ (x̄_A − x̄_B),

where n_A, n_B are the cardinalities of the trace sets L_A and L_B, the length of the traces is n_l, and the pooled covariance matrix C is

C = ((n_A − 1)C_A + (n_B − 1)C_B)/(n_A + n_B − 2).

The statistical moment T² follows the Fisher distribution with degrees of freedom (n_l, n_A + n_B − 2); then, the probability of accepting hypothesis H_0 is

P = I_{v_2/(v_2 + v_1 T²)}(v_2/2, v_1/2),

where I is the regularized incomplete beta function, and v_1 = n_l and v_2 = n_A + n_B − 2 are the degrees of freedom.
In the stage of determining leakage, use the same threshold and decision strategy as TVLA.

(5) ANOVA
The motivation: TVLA and the χ²-test require a large number of traces to distinguish leakage points from non-leakage points, and the paired t-test requires careful selection of inputs and low-noise measurements. Wei Yang proposed a novel leakage detection method using analysis of variance (ANOVA) in [31]. In ANOVA, the traces are categorized into multiple groups, and variance analysis is employed to identify the differences among these groups.
The method: The methods of collecting power traces and determining leakage are adopted from TVLA. In the stage of categorizing power traces, the power trace set L is divided into r groups. Let N and n_i be the cardinalities of L and L_i, while x̄ represents the sample mean of L and x̄_i the sample mean of L_i. In the stage of calculating the statistical moment, the null hypothesis H_0 of the ANOVA test assumes that all group traces come from the same population; then, the statistical moment T_F of ANOVA is calculated as follows:

T_F = [ Σ_{i=1}^{r} n_i (x̄_i − x̄)² / (r − 1) ] / [ Σ_{i=1}^{r} Σ_{j=1}^{n_i} (x_{ij} − x̄_i)² / (N − r) ],

which follows the F distribution with degrees of freedom (r − 1, N − r). The probability P of accepting hypothesis H_0 is presented in Equation (27) and is obtained from this F distribution.
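The statistic T_F can be sketched directly from its definition; the three groups below are toy data.

```python
def anova_f(groups):
    """One-way ANOVA statistic T_F: between-group mean square over
    within-group mean square, with degrees of freedom (r-1, N-r)."""
    r = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (r - 1)) / (ss_within / (n - r))

f = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])  # -> 13.0
```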

(6) The deep learning leakage assessment
The motivation: TVLA is conducted under the assumptions that each sample point is independent and that the leakage occurs at the mean statistical moment. In reality, many examples contradict these assumptions, and unaligned traces can also impact the detection results. TVLA is therefore inadequate for addressing horizontal and multivariate leakage, as well as unaligned traces. To solve this issue, Moos T proposed the method of Deep Learning Leakage Assessment (DL-LA) [37].
The method: DL-LA maintains the basic idea of TVLA, namely discriminating between two groups of traces, and enhances the side-channel leakage assessment by training a neural network as a distinguisher. In the stage of collecting power traces, the same trace collection method as TVLA is implemented. In the stage of categorizing power traces, the trace set L is divided into a training set and a validation set, which do not intersect. The mean µ and standard deviation δ of the training set are calculated, and X̃_i^j = (X_i^j − µ_i)/δ_i is used to standardize both the training set and the validation set, where j indexes the trace and i indexes the sample points of a trace. In the stage of calculating the statistical moment, the assessor begins by training the neural network distinguisher on the training set and then validates its accuracy on the validation set. If the accuracy of the neural network distinguisher exceeds that of a random-guess distinguisher, the network can be used to discriminate the two groups of traces. The construction of the neural network distinguisher is as follows.
The network is built using Python's Keras library, with TensorFlow serving as the backend. It comprises four fully connected layers, consisting of neurons with outputs of 120, 90, 50, and 2, respectively. The ReLU function is utilized as the activation function for the input layer and each inner layer, while softmax serves as the activation function for the final layer. The four layers are separated by Batch Normalization layers. Once the neural network distinguisher is constructed, the assessor employs it to conduct the leakage assessment. The null hypothesis H_0 states that the distinguisher can do no better than randomly dividing the traces into two groups, so the total number of correct classifications X on a validation set of M traces follows the binomial distribution X ∼ Binom(M, 0.5). The probability that a purely random distinguisher correctly classifies at least S_X traces is

P(X ≥ S_X) = Σ_{k=S_X}^{M} C(M, k) (1/2)^M.
In the stage of determining leakage, the threshold P_th = 10⁻⁵ is set; if P(X ≥ S_X) < P_th, the hypothesis H_0 is rejected, indicating the presence of side-channel leakage.
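The binomial tail probability used by DL-LA is straightforward to compute exactly; M and S_X below are illustrative values, not results from the original experiments.

```python
from math import comb

def dl_la_p_value(m, s_x):
    """P(X >= s_x) for X ~ Binom(m, 0.5): the probability that a purely
    random distinguisher classifies at least s_x of m validation traces
    correctly."""
    return sum(comb(m, k) for k in range(s_x, m + 1)) / 2 ** m

p = dl_la_p_value(10, 8)   # (45 + 10 + 1) / 1024 ~ 0.0547
leak = p < 1e-5            # reject H0 only for a very small p-value
```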
In summary, the various optimization methods (in Table 1) mentioned above are all aimed at optimizing the statistical tool of TVLA. However, it should be noted that each optimization method only addresses certain shortcomings of TVLA. Consequently, there is currently no optimal statistical tool available that can effectively solve all shortcomings associated with the t-test used in TVLA. Therefore, when conducting leakage detection, it is essential to select a statistical tool that is appropriate for the specific environment. If addressing environmental noise, the paired t-test is recommended in place of the t-test. For horizontal and multivariate leakage, the Hotelling T 2 -test or DL-LA are suggested. Furthermore, for multi-group traces or non-mean statistical moment leakage, the χ 2 -test, ANOVA, and KS test are recommended alternatives to the t-test. Hence, it is highly recommended to select the appropriate statistical tool based on the characteristics of the detection environment and the nature of the leakage. This ensures accurate and reliable results.

Table 1. The comparison between the optimized statistical tools and the t-test.

Tool: The paired t-test [26]
TVLA's shortcoming addressed: The environmental noise negatively affects the results of TVLA.
Comparison with the t-test: The paired t-test performs better than the t-test in a noisy environment.

Tool: The χ²-test [28]
TVLA's shortcoming addressed: TVLA has only two classifications; the detection results rely on the mean statistical moment.
Comparison with the t-test: When the leakage does not occur at the mean statistical moment, the χ²-test is better than the t-test.

Tool: The KS test [30]
TVLA's shortcoming addressed: The detection results rely on the mean statistical moment.
Comparison with the t-test: When the leakage does not occur at the mean statistical moment or the statistical parameters are transformed, the KS test is more robust than the t-test.

Tool: The Hotelling T²-test [36]
TVLA's shortcoming addressed: TVLA cannot be used for multivariate leakage; TVLA is based on an independence assumption.
Comparison with the t-test: For multivariate leakage, the Hotelling T²-test can improve the detection efficiency compared with the t-test.

Tool: ANOVA [31]
TVLA's shortcoming addressed: TVLA compares only two trace groups.
Comparison with the t-test: When the traces are divided into more groups, the detection efficiency of the ANOVA test is better than that of the t-test.

Tool: DL-LA [37]
TVLA's shortcoming addressed: TVLA is not suitable for multivariate leakage, horizontal leakage, or unaligned power traces.
Comparison with the t-test: For multivariate, horizontal leakage, or unaligned power traces, DL-LA is better than the t-test.

The Optimization of the Leakage Assessment Process
In TVLA, the efficiency and accuracy of the detection results depend on the leakage assessment process. With this in mind, researchers have proposed a series of suggestions to accelerate the detection process [32,33] or have proposed a novel leakage assessment process [34]. This section summarizes the optimization methods for the assessment process.
(1) The optimization of TVLA's detection process
Melissa Azouaoui studied the literature on leakage assessment in [33] and considered the leakage assessment process of TVLA as a combination or iteration of three steps: measurement and preprocessing, leakage detection and mapping, and leakage exploitation. She then examined whether optimality guarantees exist for each of these steps.
For measurement and preprocessing, the setting up of measurement devices depends on the expertise [65]. The preprocessing is also similar, and currently the main methods include filtering the noise [64,66] and aligning the power traces [67,68]. The best methods of setting up the measurement devices and preprocessing should be as open and repeatable as possible. Although there are some methods for setting up the measurement devices in FIPS 140 and ISO, there is currently no guaranteed optimal approach for measurement and preprocessing.
For leakage detection and mapping, the statistical hypothesis is commonly employed for comparing the distribution or statistical moment. Despite the existence of numerous methods for leakage detection, consensus on their fairness and optimality has yet to be reached. Moreover, the "budget" of traces plays a significant role in leakage detection, yet there is presently no established threshold for the optimal number of "budget" traces [29].
For leakage exploitation, the process is typically divided into three stages: modeling, information extraction, and information processing. During the modeling stage, the evaluator utilizes the traces to estimate the optimal model for the implementation. However, as the number of shares increases in a masking scheme, the cost increases, and the independence of samples affects the modeling phase. Currently, obtaining the optimal model remains an unresolved issue, and there is no optimal method.
During the actual leakage assessment, there are risks associated with all the aforementioned steps, and currently, there is no guarantee for an optimal leakage assessment process.
(2) A novel framework for explainable leakage assessment
Because the detection results of TVLA cannot directly certify exploitability, the current approach consists of utilizing a specific attack to verify the detection outcomes. Based on this, Gao Si and Elisabeth Oswald introduced a novel leakage assessment process in [34], referred to as "the process of Gao Si" in this paper. The leakage assessment process of [34] is outlined below.
Step 1: Non-specific detection via key-dependent models. Consider two nested key-dependent models. The full model L_{cf} fits a model as a function of the key K to the observed data,

L_{cf}(K) = Σ_j β_j µ_j(K),

while the null model L_0 contains only a constant term, β_0, which represents the case where there is no dependency on K. The coefficients β_j are estimated from the traces via least-squares estimation. The F-test is used to test H_0 (both L_{cf} and L_0 explain the observed data equally well) versus H_1 (L_{cf} explains the data better than L_0). If the F-test finds enough evidence to reject H_0, we conclude that this point's leakage relies on K_c, because K_c is a part of K; consequently, the measurements depend on K.
Step 2: Degree analysis. By further restricting the degree of the key-dependent model, we determine how large the key guess is required to be to exploit an identified leakage. We obtain the reduced model L_{cr}, and the F-test is used again to test H_0 (L_{cr} and L_{cf} explain the data equally well) versus H_1 (L_{cf} explains the data better than L_{cr}).
The F-test compares a full model L_f(X) = Σ_{j∈T_f} β_j µ_j(X) with a reduced model L_r(X) = Σ_{j∈J_r} β_j µ_j(X), where J_r ⊂ T_f. The statistical moment of the F-test is

F = [ (RSS_r − RSS_f)/(n_f − n_r) ] / [ RSS_f/(N − n_f) ],  RSS = Σ_{i=1}^{N} (l^{(i)} − L̂(x^{(i)}))²,

where L̂(x^{(i)}) is the fitted value of L_{cf} or L_{cr} at x^{(i)}, n_f and n_r are the numbers of model parameters of L_f and L_r, and N is the sample size of L. The statistical moment F obeys the F distribution with degrees of freedom (n_f − n_r, N − n_f). The threshold of the F-test is F_th = Q_F(1 − α; n_f − n_r, N − n_f), where Q_F is the quantile function of the central F distribution. If F ≥ F_th, the null hypothesis H_0 is rejected: L_f(X) explains the data better than L_r(X), and the leakage contains information about the key. If there is enough evidence to reject H_0, we conclude that a model with only g or fewer key bytes suffices to explain the measurements. By successively reducing g, we can therefore determine the maximum key guess that is required to explain the side-channel measurements.
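Given the residual sums of squares of the two fitted models, the F statistic reduces to a one-liner; the RSS values and model sizes below are hypothetical.

```python
def nested_f(rss_full, rss_reduced, n_full, n_reduced, n_samples):
    """F statistic for comparing nested least-squares models:
    F = ((RSS_r - RSS_f)/(n_f - n_r)) / (RSS_f/(N - n_f))."""
    num = (rss_reduced - rss_full) / (n_full - n_reduced)
    den = rss_full / (n_samples - n_full)
    return num / den

# hypothetical fit: the full model (5 terms) reduces the residual sum of
# squares from 120 to 100 on 105 samples
f = nested_f(rss_full=100.0, rss_reduced=120.0,
             n_full=5, n_reduced=3, n_samples=105)  # -> 10.0
```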
Step 3: Subkey identification By using the technique of further restricting the reduced model, we can narrow down precisely which specific key bytes are required to explain the identified leakage.
Step 4: Converting to specific attacks If an evaluation regime requires an evaluator to demonstrate an actual attack targeting an identified leakage point, a relatively straightforward connection to a concrete profiled attack can be established.
In summary, although optimization methods exist for the measurement process and preprocessing process, there is currently no optimal leakage detection process. In actual leakage detection, the detection process of TVLA is still used to detect leakages. The process of Gao Si is a new detection process to demonstrate that the discovered leakages are key-related and can be exploited by attacks. The approach is a small step towards establishing precise attack vectors for confirmatory attacks.

The Optimization of TVLA's Decision Strategy
(1) The decision strategy of HC
For long traces, the detection result in TVLA is obtained using the min-P strategy. This strategy relies solely on the minimum p-value to make a decision about leakage, disregarding all other p-values. Ding, A.A. proposed the higher-criticism (HC) strategy in [35], which takes the information from all p-values into account.
The null hypothesis H_0: there is no leakage point in the trace. The alternative hypothesis H_1: there is at least one leakage point in the trace. Let the length of the traces be n_l; there are then n_l p-values, denoted p(1), ..., p(n_l), and the HC strategy is as follows.
Step 1: Sorting the p-values in ascending order, so that p(1) ≤ p(2) ≤ ... ≤ p(n_l).
Step 2: Calculating the normalized distance HC_{n_l,i} of each p-value with Formula (33):

HC_{n_l,i} = √n_l · (i/n_l − p(i)) / √(p(i)(1 − p(i))).  (33)

Step 3: Calculating the statistical moment of the HC strategy with (34):

HC_{n_l,max} = max_{1≤i≤n_l/2} HC_{n_l,i}.  (34)

Step 4: Comparing the statistical moment HC_{n_l,max} with the threshold th_{HC,n_l,α} at significance level α. If HC_{n_l,max} > th_{HC,n_l,α}, the null hypothesis is rejected, indicating the presence of side-channel leakages. The threshold th_{HC,n_l,α} represents the 1 − α quantile of the statistical moment HC_{n_l,max} under the null hypothesis. For large n_l, the threshold can be approximated using the connection to a Brownian bridge, for example with the calculation formula provided by Li and Siegmund [68].
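The HC statistic itself is simple to compute once the per-point p-values are available; the four p-values below are illustrative.

```python
import math

def hc_statistic(p_values):
    """Higher-criticism statistic: sort the p-values ascending, compute the
    normalized distances HC_{n,i}, and take the maximum over i <= n/2."""
    p = sorted(p_values)
    n = len(p)
    hc = [math.sqrt(n) * (i / n - p[i - 1]) / math.sqrt(p[i - 1] * (1 - p[i - 1]))
          for i in range(1, n // 2 + 1)]
    return max(hc)

# one very small p-value among four dominates the statistic
hc = hc_statistic([0.4, 0.001, 0.6, 0.2])  # ~ 15.76
```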
In summary, the HC strategy can combine multiple leakage points to enhance the efficiency of leakage detection. Thus, it can serve as a viable alternative to TVLA's min-P strategy.

The Summary of TVLA's Optimization Schemes
Researchers have proposed various optimization schemes for TVLA, which primarily aim to address its inherent limitations. Currently, there is no comprehensive and universally applicable statistical tool or detection process that can effectively address all the identified limitations. Suitable detection methods for different detection purposes and conditions are summarized in Figure 4. Therefore, assessors should perform the following process in actual leakage detection.
Firstly, select the leakage detection process based on the purpose of detection. If the aim is to only discover a side-channel leakage, the recommended process is TVLA. If the aim is to detect and utilize the leakages, it is recommended to choose the process of Gao Si.
Secondly, if the TVLA process is chosen, it is recommended to select an appropriate statistical tool based on the evaluator's prior knowledge of the device. If the evaluator has no prior knowledge about the device, a t-test is used initially to detect whether there is a univariate, first-order leakage. If a leakage is detected, the leakage detection process is stopped. Otherwise, the χ²-test, KS test, and ANOVA test are used to detect univariate, high-order leakages, while the Hotelling T²-test and DL-LA are used to test for the presence of multivariate or horizontal leakage. If the evaluator has prior knowledge of the device, the leakage type (univariate or multivariate) is determined based on this knowledge. Then, the detection environment (high noise or low noise) and the alignment of the power traces are determined. For univariate, first-order leakage (mean statistical moment) and low-noise environments, the t-test is more efficient. For univariate, first-order leakage and high-noise environments, the paired t-test has better detection efficiency. For univariate leakage that does not occur at the mean statistical moment, the χ²-test, KS test, and ANOVA test have better detection performance than the t-test. For multivariate and horizontal leakages, it is recommended to choose the Hotelling T²-test and DL-LA, with DL-LA being more effective when the traces are unaligned.
Finally, in the determination stage, the HC strategy is recommended. Although many leakage detection methods have been proposed, TVLA remains the mainstream detection method; the t-test is therefore generally regarded as the mainstream detection tool, with other detection tools considered supplementary. However, current detection methods cannot definitively conclude that there is no leakage; they can only provide evidence that no leakage has been detected.

Quantification of Side Channel Vulnerability
The inability of TVLA's detection results to quantify the vulnerability of side channels raises the question of how to establish a relationship between attacking-style assessment and leakage detection-style assessment. Debapriya made an attempt to derive this relationship between TVLA and SCA in [22]: given the intermediate variables and the leakage model, the success rate (SR) can be calculated directly from TVLA, effectively connecting CC and FIPS 140-3.
The derivation of the relationship between TVLA and SR proceeds as follows. Let L = f(X, k) be the normalized leakage model, where E(L) = 0 and Var(L) = E(L²) = 1. Y represents the measurements and is modeled as Y = εL + N, where ε is the scale factor and N ∼ N(0, σ²) represents the noise. Taking the S-box as an example, the n-bit Hamming weight model can be expressed as f(X, k) = (2/√n)(HW(sbox(X ⊕ k)) − n/2). Firstly, TVLA is linked to the normalized inter-class variance (NICV): if the side-channel traces are divided into q groups, the NICV is the ratio of the variance of the group means to the total variance of Y, and the TVLA statistic can be expressed through it. Secondly, the NICV is linked to the SNR through NICV = 1/(1 + 1/SNR), with SNR = ε²/σ². Finally, the SR is exported using the SNR: the SR is expressed through the cumulative distribution function Φ_[S](µ) of a multivariate normal distribution whose mean vector is built from κ = (κ(k_c, k_{g_1}), ..., κ(k_c, k_{g_{2ⁿ−1}}))ᵀ with κ(k_c, k_g) = E[(l(X, k_c) − l(X, k_g))²], and whose covariance matrix C has entries of the form κ**(k_c, k_{g_i}, k_{g_j}) = 4E[(l(X, k_c) − E(l(X, k_c)))²(l(X, k_c) − l(X, k_{g_i}))(l(X, k_c) − l(X, k_{g_j}))], where k_c is the correct key and the k_{g_i}, 1 ≤ i ≤ 2ⁿ − 1, are the key guesses. Figure 5 presents the process of hybrid side-channel testing, which consists of non-specific TVLA → specific TVLA → SR → evaluation results. First, the evaluator performs a non-specific TVLA on the target device, followed by the calculation of SR using specific TVLA. The side-channel vulnerabilities of the target device are thus assessed without the need for actual attacks. If the SR is below the security limit, the device is considered safe; otherwise, it is considered vulnerable to SCA.
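The NICV-SNR link can be checked numerically with a toy simulation. The sketch below assumes, as a simplification, a plain Hamming-weight leakage of a uniform byte rather than an actual S-box output (an S-box is a permutation, so the distributions involved are unchanged); all parameter values are illustrative.

```python
import random

random.seed(0)
SIGMA = 2.0  # noise standard deviation (illustrative)

def hw(v):
    """Hamming weight of a byte."""
    return bin(v).count("1")

# simulate noisy Hamming-weight leakage: Y = HW(x) + N, N ~ N(0, SIGMA^2)
samples = {x: [hw(x) + random.gauss(0, SIGMA) for _ in range(200)]
           for x in range(256)}

all_y = [y for ys in samples.values() for y in ys]
mean_y = sum(all_y) / len(all_y)
group_means = {x: sum(ys) / len(ys) for x, ys in samples.items()}

# NICV = Var(E[Y|X]) / Var(Y), estimated empirically
var_between = sum((m - mean_y) ** 2 for m in group_means.values()) / 256
var_total = sum((y - mean_y) ** 2 for y in all_y) / len(all_y)
nicv = var_between / var_total

# Var(HW) over uniform bytes is 2, so SNR = 2 / SIGMA^2 = 0.5 and the
# predicted NICV = 1 / (1 + 1/SNR) = 1/3
snr = 2 / SIGMA ** 2
predicted = 1 / (1 + 1 / snr)
```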
where Φ [ ] ( ) is the cumulative distributive function of the multivariate normal distribution with mean vector and covariance , is the correct key, and with 1 ≤ ≤ 2 − 1. Figure 5 presents the process of hybrid side-channel testing. The process consists of non-specific TVLA → specific TVLA → SR → evaluation results. First, the evaluator performs a non-specific TVLA on the target device, followed by the calculation of SR using specific TVLA. The side-channel vulnerabilities of the target device are assessed without the need for actual attacks. If SR is below the security limit, the device is considered safe; otherwise, it is considered vulnerable to SCA. Although this method aims to establish the relationship between TVLA and SR, it provides limited bridging between CC and FIPS. However, this method is based on the assumptions regarding intermediate variables and leakage models, resulting in difficulties in practical detection for evaluators without professional knowledge of these models and values.
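The TVLA → NICV → SNR chain described above can be illustrated on simulated univariate traces. The sketch below is not the method of [22] itself; the trace counts, noise level, and Hamming-weight classes are hypothetical, and it only shows how an evaluator might estimate the three quantities from a single point of interest per trace.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical random-input set: the signal is the Hamming-weight class
# (0..8 for an 8-bit intermediate), measured as Y = L + N.
hw = rng.integers(0, 9, size=n)
random_set = hw + rng.normal(0.0, 2.0, size=n)

# Hypothetical fixed-input set: a fixed plaintext whose intermediate has HW = 6.
fixed_set = 6.0 + rng.normal(0.0, 2.0, size=n)

# Welch's t-statistic (non-specific, fixed-vs-random TVLA); |t| > 4.5 flags leakage.
def welch_t(a, b):
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

t = welch_t(fixed_set, random_set)

# NICV = Var(E(Y|X)) / Var(Y), estimated by grouping traces on the HW class.
class_mean = {v: random_set[hw == v].mean() for v in np.unique(hw)}
cond_mean = np.array([class_mean[v] for v in hw])
nicv = cond_mean.var() / random_set.var()

# SNR follows from NICV = 1 / (1 + 1/SNR).
snr = nicv / (1.0 - nicv)
print(f"t = {t:.1f}, NICV = {nicv:.3f}, SNR = {snr:.2f}")
```

With these simulated parameters the true SNR is Var(HW)/σ² ≈ 6.67/4 ≈ 1.67, and the estimate recovered through NICV lands close to that value, while the fixed-vs-random t-statistic far exceeds the ±4.5 threshold.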


Discussion
Based on the research on leakage assessment works, we discuss leakage assessment from two perspectives: the current status and future development trend.
Regarding the current status of leakage assessment, firstly, although many leakage assessment methods have been proposed to address specific shortcomings of TVLA (multivariate or high-order leakage, noise issues, or the independence of traces), there is currently no unified, standardized method to guide assessors through the assessment step by step. Secondly, in actual leakage detection, evaluators or evaluation agencies often spend a significant amount of time collecting traces, regardless of the detection method used. The goal is to evaluate the security of products, identify potential vulnerabilities, and guide further design toward the required security level. However, current leakage detection focuses solely on discovering leakages and does not uncover vulnerabilities or guide the design process. From a cost perspective, the return on investment of leakage detection is therefore low.
In terms of the future development trend of leakage assessment, researchers are currently attempting to link non-specific leakage detection results with key guessing in order to address the issue of unusable detection results. This approach would allow the detection results to reveal information about the key or the causes of leakage, which can aid designers in constructing attacks or defenses. We therefore anticipate the formation of a unified and fair method for security assessment. Additionally, in recent years, deep learning technology has been applied to leakage detection. Compared to traditional leakage assessment techniques, deep learning-based leakage assessment offers simplicity and a high level of statistical confidence, but it does not yet provide superior security metrics. Further exploration is therefore needed on how to effectively apply deep learning technology to side-channel leakage assessment.

Conclusions
In this paper, we conducted a comprehensive study of leakage detection-style assessment. These assessment methodologies can be classified into two categories: TVLA and its optimizations. We identified the drawbacks of TVLA and categorized the optimization schemes aimed at addressing them into three groups: statistical tool optimization, detection process optimization, and decision strategy optimization. We gave succinct descriptions of the motivations and detection processes of each scheme and compared their detection efficiency. Based on our classification and summary, we concluded that there is no single optimal scheme that can effectively address all the shortcomings of TVLA; different optimization schemes are proposed for specific purposes and detection conditions. We summarized the purposes and conditions of all TVLA optimizations and proposed a selection strategy for leakage detection-style assessment schemes.

According to the selection strategy, the detection process should be chosen based on the specific detection purpose. For discovering side-channel leakages, TVLA is recommended, while the detection process of Gao Si is suitable for discovering and utilizing leakages. The appropriate statistical tool should then be selected. The t-test and paired t-test are suited to detecting univariate, first-order leakage: in low-noise environments the t-test is more suitable, whereas in high-noise environments the paired t-test demonstrates better detection efficiency. If the leakage does not occur at the mean statistical moment, the χ²-test, KS test, and ANOVA test outperform the t-test for detecting univariate, high-order leakages. For testing multivariate or horizontal leakages, the Hotelling T²-test and DA-LA are recommended; if the traces are unaligned, DA-LA is more effective. Lastly, the HC strategy is recommended for the determination stage.
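As a concrete illustration of the t-test versus paired t-test trade-off noted above, the sketch below simulates an interleaved fixed-vs-random acquisition in which each pair of traces shares the same slow environmental drift (all signal and noise values are hypothetical). The paired statistic cancels the common-mode noise and reaches the detection threshold far more efficiently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical interleaved acquisition: each fixed/random pair is taken
# back-to-back, so both traces share the same slow environmental drift.
drift = rng.normal(0.0, 5.0, size=n)
fixed_set = 6.0 + drift + rng.normal(0.0, 1.0, size=n)
random_set = 4.0 + drift + rng.normal(0.0, 1.0, size=n)

# Welch's t-test treats the two sets as independent: the drift inflates
# both variances and weakens the statistic.
def welch_t(a, b):
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

# Paired t-test works on per-pair differences: the shared drift cancels.
def paired_t(a, b):
    d = a - b
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

t_plain = welch_t(fixed_set, random_set)
t_pair = paired_t(fixed_set, random_set)
print(f"Welch t = {t_plain:.1f}, paired t = {t_pair:.1f}")
```

In this high-noise setting both statistics exceed the ±4.5 threshold, but the paired t-statistic is several times larger, which is exactly why the paired test needs fewer traces when acquisition noise is dominated by common-mode drift.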
Based on the current status of leakage detection-style assessment, we also discussed the development trend of leakage detection. Researchers are increasingly interested in linking leakage detection with key guessing to make detection results actionable, with the aim of establishing a unified and fair method for security assessment.