A Comprehensive Study of the Key Enumeration Problem

In this paper, we will study the key enumeration problem, which is connected to the key recovery problem posed in the cold boot attack setting. In this setting, an attacker with physical access to a computer may obtain noisy data of a cryptographic secret key of a cryptographic scheme from main memory via this data remanence attack. Therefore, the attacker would need a key-recovery algorithm to reconstruct the secret key from its noisy version. We will first describe this attack setting and then pose the problem of key recovery in a general way and establish a connection between the key recovery problem and the key enumeration problem. The latter problem has already been studied in the side-channel attack literature, where, for example, the attacker might procure scoring information for each byte of an Advanced Encryption Standard (AES) key from a side-channel attack and then want to efficiently enumerate and test a large number of complete 16-byte candidates until the correct key is found. After establishing such a connection between the key recovery problem and the key enumeration problem, we will present a comprehensive review of the most outstanding key enumeration algorithms to tackle the latter problem, for example, an optimal key enumeration algorithm (OKEA) and several nonoptimal key enumeration algorithms. Also, we will propose variants to some of them and make a comparison of them, highlighting their strengths and weaknesses.


Introduction
A side-channel attack may be defined as any attack by which an attacker is able to obtain private information of a cryptographic algorithm from its implementation instead of exploiting weaknesses in the implemented algorithm itself. Most of these attacks are based on a divide-and-conquer approach through which the attacker obtains ranking information about the chunks of the secret key and then uses such information to construct key candidates for that key. This secret key is the result of the concatenation of all the key parts, while a chunk candidate is a possible value of a key part that is chosen because the attack suggests a good probability for that value to be correct. In this paper, we focus on one particular side-channel attack, known as the cold boot attack. This is a data remanence attack in which the attacker is able to read sensitive data from computer memory after the data have supposedly been deleted. More specifically, by exploiting the data remanence property of dynamic random-access memories (DRAMs), an attacker with physical access to a computer may procure noisy data of a secret key from main memory via this attack vector. Hence, after obtaining such data, the attacker's main task is to recover the secret key from its noisy version. As the literature reviewed in Section 2 shows, the research effort, after the initial work showing the practicability of cold boot attacks [1], has focused on designing tailor-made algorithms for efficiently recovering keys from noisy versions for a range of different cryptographic schemes whilst exploring the limits of how much noise can be tolerated.

1. We present the key recovery problem in a general way and establish a connection between the key recovery problem and the key enumeration problem.

2. We describe the most outstanding key enumeration algorithms methodically and in detail and also propose variants to some of them. The algorithms included in this study are an optimal key enumeration algorithm (OKEA); a bounded-space near-optimal key enumeration algorithm; a simple stack-based, depth-first key enumeration algorithm; a score-based key enumeration algorithm; a key enumeration algorithm using histograms; and a quantum key enumeration algorithm. For each studied algorithm, we describe its inner functioning, showing its functional and qualitative features, such as memory consumption, amenability to parallelization, and scalability.

3. Finally, we make an experimental comparison of all the implemented algorithms, drawing special attention to their strengths and weaknesses. In our comparison, we benchmark all the implemented algorithms by running them in a common scenario to measure their overall performance.

Note that the goal of this research work is not only to study the key enumeration problem and its connection to the key recovery problem but also to show the gradual development of designing key enumeration algorithms, i.e., our review also focuses on pointing out the most important design principles to look at when designing key enumeration algorithms. Therefore, our review examines the most outstanding key enumeration algorithms methodically, by describing their inner functioning, the algorithm-related data structures, and the benefits and drawbacks of using such data structures. In particular, this careful examination shows us that, by properly using data structures and by making the restriction on the order in which the key candidates are enumerated less strict, we may devise better key enumeration algorithms in terms of overall performance, scalability, and memory consumption. This observation is substantiated in our experimental comparison.

This paper is organised as follows. In Section 2, we will first describe the cold boot attack setting and the attack model we will use throughout this paper. In Section 3, we will describe the key recovery problem in a general way and establish a connection between the key recovery problem and the key enumeration problem. In Section 4, we will examine several key enumeration algorithms to tackle the key enumeration problem methodically and in detail, e.g., an optimal key enumeration algorithm (OKEA), a bounded-space near-optimal key enumeration algorithm, a quantum key enumeration algorithm, and variants of other key enumeration algorithms. In Section 5, we will make a comparison of them, highlighting their strengths and weaknesses. Finally, in Section 6, we will draw some conclusions and give some future research lines.

Cold Boot Attacks
A cold boot attack is a type of data remanence attack by which sensitive data are read from a computer's main memory after supposedly having been deleted. This attack relies on the data remanence property of DRAMs, which allows an attacker to retrieve memory contents that remain readable in the seconds to minutes after power has been removed. Since this attack was first described in the literature by Halderman et al. nearly a decade ago [1], it has received significant attention. More specifically, in this setting, an attacker with physical access to a computer can retrieve content from a running operating system after performing a cold reboot to restart the machine, i.e., without shutting down the operating system in an orderly manner. Since the operating system was shut down improperly, it will skip file system synchronization and other activities that would occur on an orderly shutdown. Therefore, following a cold reboot, such an attacker may use a removable disk to boot a lightweight operating system and then copy the data stored in memory to a file. Alternatively, such an attacker may remove the memory modules from the original computer and quickly place them in a compatible computer under the attacker's control, which is then booted in order to access the memory content. The attacker may then analyse the data dumped from memory to find sensitive information, such as cryptographic keys contained in it [1]. This task may be performed by making use of various forms of key finding algorithms [1]. Unfortunately for such an attacker, the bits in memory will degrade once the computer's power is interrupted. Therefore, if the adversary retrieves any data from the computer's main memory after the power is cut off, the extracted data will probably have random bit variations. That is, the data will be noisy, i.e., differing from the original data.
The lapse of time for which memory cell values are maintained while the machine is off depends on the particular memory type and the ambient temperature. In fact, the research paper [1] reported the results of multiple experiments that show that, at normal operating temperatures (25.5 °C to 44.1 °C), there is little corruption within the first few seconds but this phase is then followed by a quick decay. Nevertheless, by employing cooling techniques on the memory chips, the period of mild corruption can be extended. For instance, by spraying compressed air onto the memory chips, they conducted an experiment at −50 °C and showed that less than 0.1% of bits degrade within the first minute. At temperatures of approximately −196 °C, attained by the use of liquid nitrogen, less than 0.17% of bits decay within the first hour. Remarkably, once power is switched off, the memory will be divided into regions and each region will have a "ground state", which is associated with a bit. In a 0 ground state, the 1 bits will eventually decay to 0 bits while the probability of a 0 bit switching to a 1 bit is very small but not vanishing (a common probability is circa 0.001 [1]). When the ground state is 1, the opposite is true.
From the above discussion, it follows that only a noisy version of the original key may be retrievable from main memory once the attacker discovers the location of the data in it, so the main task of the attacker then is to tackle the mathematical problem of recovering the original key from a noisy version of that key. Therefore, the centre of interest of the research community after the initial work pointing out the feasibility of cold boot attacks [1] has been to develop bespoke algorithms for efficiently recovering keys from noisy versions of those keys for a range of different cryptographic schemes whilst exploring the limits of how much noise can be tolerated.
Heninger and Shacham [2] focused on the case of RSA keys, introducing an efficient algorithm based on Hensel lifting to exploit redundancy in the typical RSA private key format. This work was followed up by Henecka, May, and Meurer [3] and by Paterson, Polychroniadou, and Sibborn [4], with both research papers also paying particular attention to the mathematically highly structured RSA setting. The latter research paper, in particular, indicated the asymmetric nature of the error channel intrinsic to the cold boot setting and presented the problem of key recovery for cold boot attacks in an information theoretic manner.
On the other hand, Lee et al. [5] were the first to discuss cold boot attacks in the discrete logarithm setting. They assumed that an attacker had access to the public key $g^x$, a noisy version of the private key $x$, and that such an attacker knew an upper bound for the number of errors in the private key. Since the latter assumption might not be realistic and the attacker did not have access to further redundancy, their proposed algorithm would likely be unable to recover keys in the true cold boot scenario, i.e., only assuming a bit-flipping model. This work was followed up by Poettering and Sibborn [6]. They exploited redundancies present in the in-memory private key encodings from two elliptic curve cryptography (ECC) implementations from two Transport Layer Security (TLS) libraries, OpenSSL and PolarSSL, and introduced cold boot key-recovery algorithms that were applicable to the true cold boot scenario.
Other research papers have explored cold boot attacks in the symmetric key setting, including Albrecht and Cid [7], who centred on the recovery of symmetric encryption keys in the cold boot setting by employing polynomial system solvers, and Kamal and Youssef [8], who applied SAT solvers to the same problem.
Finally, recent research papers have explored cold boot attacks on post-quantum cryptographic schemes. The paper by Albrecht et al. [9] evaluated schemes based on the ring and module variants of the Learning with Errors (LWE) problem. In particular, they looked at two cryptographic schemes: the Kyber key encapsulation mechanism (KEM) and New Hope KEM. Their analysis focused on two encodings to store LWE keys. The first encoding stores polynomials in coefficient form directly in memory, while the second encoding performs a number theoretic transform (NTT) on the key before storing it. They showed that, at a 1% bit-flip rate, a cold boot attack on Kyber KEM parameters had a cost of $2^{43}$ operations when the second encoding is used for key storage compared to $2^{70}$ operations with the first encoding. On the other hand, the paper by Paterson et al. [10] focused on cold boot attacks on NTRU. In particular, the authors of the research paper [10] were the first to use a combination of key enumeration algorithms to tackle the key recovery problem. Their cold boot key-recovery algorithms were applicable to the true cold boot scenario and exploited redundancies found in the in-memory private key representations from two popular NTRU implementations. This work was followed up by that of Villanueva-Polanco [11], which studied cold boot attacks against the strongSwan implementation of the BLISS signature scheme and presented key-recovery algorithms based on key enumeration algorithms for the in-memory private key encoding used in this implementation.

Cold Boot Attack Model
Our cold boot attack model assumes that the adversary can procure a noisy version of the encoding of a secret key used to store it in memory. We further assume that the corresponding public parameters are known exactly, without noise. We do not take into consideration here the significant problem of how to discover the exact place or position of the appropriate region of memory in which the secret key bits are stored, though this would be a consideration of great significance in practical attacks. Our goal is then to recover the secret key. Note that it is sufficient to obtain a list of key candidates in which the true secret key is located, since we can always test a candidate by executing known algorithms linked to the scheme we are attacking.
We assume throughout that a 0 bit of the original secret key will flip to a 1 with probability α = P(0 → 1) and that a 1 bit of the original private key will flip with probability β = P(1 → 0). We do not assume that α = β; indeed, in practice, one of these values may be very small (e.g., 0.001) and relatively stable over time while the other increases over time. Furthermore, we assume that the attacker knows the values of α and β and that they are fixed across the region of memory in which the private key is located. These assumptions are reasonable in practice: one can estimate the error probabilities by looking at a region where the memory stores known values, for example, where the public key is located, and where the regions are typically large.
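The asymmetric bit-flip channel and the estimation of its error probabilities from a region with known contents can be illustrated with a small simulation. The following is only a sketch under the assumptions of this model (independent flips, fixed α and β); the function names `add_cold_boot_noise` and `estimate_flip_rates` are hypothetical and not taken from any implementation discussed here.

```python
import random


def add_cold_boot_noise(bits, alpha, beta, rng=None):
    """Flip each 0 bit to 1 with probability alpha and each 1 bit to 0 with probability beta."""
    rng = rng or random.Random()
    noisy = []
    for b in bits:
        flip_prob = alpha if b == 0 else beta
        noisy.append(b ^ 1 if rng.random() < flip_prob else b)
    return noisy


def estimate_flip_rates(known_bits, observed_bits):
    """Estimate (alpha, beta) by comparing a known region with its noisy image."""
    zeros = sum(1 for b in known_bits if b == 0)
    ones = len(known_bits) - zeros
    flips_01 = sum(1 for k, o in zip(known_bits, observed_bits) if k == 0 and o == 1)
    flips_10 = sum(1 for k, o in zip(known_bits, observed_bits) if k == 1 and o == 0)
    alpha_hat = flips_01 / zeros if zeros else 0.0
    beta_hat = flips_10 / ones if ones else 0.0
    return alpha_hat, beta_hat
```

In practice, the known region (for example, the bytes of the public key) would need to be large for the estimates to be reliable, which is consistent with the remark above that such regions are typically large.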

Some Definitions
We define an array $A$ as a data structure consisting of a finite sequence of values of a specified type, i.e., $A = [a_0, \ldots, a_{n_A-1}]$. The length of an array, $n_A$, is established when the array is created. After creation, its length is fixed. Each item in an array is called an element, and each element is accessed by its numerical index, i.e., $A[i] = a_i$, with $0 \le i < n_A$. Let $A_0 = [a^0_0, \ldots, a^0_{n_0-1}]$ and $A_1 = [a^1_0, \ldots, a^1_{n_1-1}]$ be two arrays of elements of a specified type. The associative operation is defined as follows.
Both a list $L$ and a table $T$ are defined as a resizable array of elements of a specified type. Given a list $L = [e_0, \ldots, e_{n_l-1}]$, this data structure supports the following methods.
• The method $L$.size() returns the number of elements in this list, i.e., the value $n_l$.
• The method $L$.add($e_{n_l}$) appends the specified element $e_{n_l}$ to the end of this list, i.e., $L = [e_0, e_1, \ldots, e_{n_l}]$ after this method returns.
• The method $L$.get($j$), with $0 \le j < L$.size(), returns the element at the specified position $j$ in this list, i.e., $e_j$.
• The method $L$.clear() removes all the elements from this list. The list will be empty after this method returns, i.e., $L = []$.

Problem Statement
Let us suppose that a noisy version of the encoding of the secret key $r = b_0 b_1 b_2 \ldots b_{W-1}$ can be represented as a concatenation of $N = W/w$ chunks, each of $w$ bits. Let us name the chunks $r_0, r_1, \ldots, r_{N-1}$ so that $r_i = b_{i \cdot w} b_{i \cdot w+1} \ldots b_{i \cdot w+(w-1)}$. Additionally, we suppose there is a key-recovery algorithm that constructs key candidates $c$ for the encoding of the secret key and that these key candidates $c$ can also be represented by concatenations of chunks $c_0, c_1, \ldots, c_{N-1}$ in the same way.
The method of maximum likelihood (ML) estimation then suggests picking as $c$ the value that maximizes $P(c \mid r)$. Using Bayes' theorem, this can be rewritten as

$$P(c \mid r) = \frac{P(r \mid c)\,P(c)}{P(r)}.$$

Note that $P(r)$ is a constant and that $P(c)$ is also a constant, independent of $c$. Therefore, the ML estimation suggests picking as $c$ the value that maximizes

$$P(r \mid c) = (1-\alpha)^{n_{00}}\,\alpha^{n_{01}}\,\beta^{n_{10}}\,(1-\beta)^{n_{11}},$$

where $n_{00}$ denotes the number of positions where both $c$ and $r$ contain a 0 bit and where $n_{01}$ denotes the number of positions where $c$ contains a 0 bit and $r$ contains a 1 bit, etc. Equivalently, we may maximize the log of these probabilities, viz.

$$\log P(r \mid c) = n_{00}\log(1-\alpha) + n_{01}\log\alpha + n_{10}\log\beta + n_{11}\log(1-\beta).$$

Therefore, given a candidate $c$, we can assign it a score, namely $S_r(c) := \log P(r \mid c)$.
Assuming that each of the, at most, $2^w$ candidate values for chunk $c_i$, $0 \le i < N$, can be enumerated, then its own score also can be calculated as

$$S_{r_i}(c_i) = n^i_{00}\log(1-\alpha) + n^i_{01}\log\alpha + n^i_{10}\log\beta + n^i_{11}\log(1-\beta),$$

where the $n^i_{ab}$ values count occurrences of bits across the $i$th chunks $c_i$ and $r_i$. Therefore, we have $S_r(c) = \sum_{i=0}^{N-1} S_{r_i}(c_i)$. Hence, we may assume we have access to $N$ lists of chunk candidates, where each list contains up to $2^w$ entries. A chunk candidate is defined as a 2-tuple of the form (score, value), where the first component score is a real number (candidate score) while the second component value is an array of $w$-bit strings (candidate value). The question then becomes: can we design efficient algorithms that traverse the lists of chunk candidates to combine chunk candidates $c_i$, obtaining complete key candidates $c$ having high total scores obtained by summation? This question has been previously addressed in the side-channel analysis literature, with a variety of different algorithms being possible to solve this problem and the related problem known as key rank estimation [12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. Let .value $\ldots$ $c^{i_n}_{j_n}$.value). Note that when $i_0 = 0, i_1 = 1, \ldots, i_{N-1} = N-1$, $c$ will be a full key candidate.
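As an illustration of how the per-chunk scores and the lists of chunk candidates might be computed, consider the following sketch. It assumes that 0 < α, β < 1, that the noisy key length is a multiple of w, and that combining chunk candidates means adding their scores and concatenating their value arrays; the helper names `chunk_score`, `build_chunk_lists`, and this particular `combine` are illustrative choices, not the authors' code.

```python
import math
from itertools import product


def chunk_score(c_bits, r_bits, alpha, beta):
    """Log-likelihood log P(r_i | c_i) of a chunk candidate under the bit-flip model."""
    log_p = {
        (0, 0): math.log(1 - alpha),  # candidate 0, observed 0
        (0, 1): math.log(alpha),      # candidate 0, observed 1
        (1, 0): math.log(beta),       # candidate 1, observed 0
        (1, 1): math.log(1 - beta),   # candidate 1, observed 1
    }
    return sum(log_p[(c, r)] for c, r in zip(c_bits, r_bits))


def build_chunk_lists(noisy_bits, w, alpha, beta):
    """Return one list of (score, value) chunk candidates per chunk, best score first."""
    lists = []
    for start in range(0, len(noisy_bits), w):
        r_i = tuple(noisy_bits[start:start + w])
        candidates = [
            (chunk_score(c, r_i, alpha, beta), [c])  # value: array holding one w-bit string
            for c in product((0, 1), repeat=w)
        ]
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        lists.append(candidates)
    return lists


def combine(*chunks):
    """Assumed semantics: add the scores and concatenate the value arrays."""
    score = sum(s for s, _ in chunks)
    value = [v for _, values in chunks for v in values]
    return (score, value)
```

Enumerating all $2^w$ values per chunk is only feasible for small w (for example, w = 8); for larger chunks, each list would have to be truncated to its best entries.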

Definition 1.
The key enumeration problem entails traversing the $N$ lists $L_i$, $0 \le i < N$, while picking a chunk candidate $c^i_{j_i}$ from each $L_i$ to generate full key candidates $c = \mathrm{combine}(c^{i_0}_{j_0}, \ldots, c^{i_n}_{j_n})$. Moreover, we call an algorithm generating full key candidates $c$ a key enumeration algorithm (KEA).
Note that the key enumeration problem has been stated in a general way; however, there are many other variants to this problem. These variants relate to the manner in which the key candidates are generated by a key enumeration algorithm.
A different version of the key enumeration problem is enumerating key candidates $c$ such that their total accumulated scores follow a specific order. For example, for many side-channel scenarios, it is necessary to enumerate key candidates $c$ starting at the one having the highest score, followed by the one having the second highest score, and so on. In these scenarios, we need a key enumeration algorithm to enumerate high-scoring key candidates in decreasing order based on their total accumulated scores. For example, such an algorithm would allow us to find the top $M$ highest scoring candidates in decreasing order, where $1 \le M \ll 2^W$. Such an algorithm is known as an optimal key enumeration algorithm.
Another version of the same problem is enumerating all the key candidates $c$ such that their total accumulated scores satisfy a specified property rather than a specific order. For example, for some side-channel scenarios, it would be useful to enumerate all key candidates whose total accumulated scores lie in an interval $[B_1, B_2]$. In these scenarios, the key enumeration algorithm has to enumerate all key candidates whose total accumulated scores lie in that interval; such an enumeration need not be performed in a specified order, but it does need to ensure that all fitting key candidates will have been generated once it has completed. That is, the algorithm will generate, in any order, all the key candidates whose total accumulated scores satisfy the condition. For example, such an algorithm would allow us to find the top $M$ highest scoring candidates in any order if the interval is well defined. Such an algorithm is commonly known as a nonoptimal key enumeration algorithm.
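To make the interval variant concrete, the following sketch enumerates, in no particular order, every full key candidate whose total score lies in $[B_1, B_2]$. It is a plain depth-first search with pruning on the best and worst scores still attainable from the remaining chunks, written only for illustration; it is not one of the algorithms reviewed later, and the name `enumerate_in_interval` is hypothetical. Each list is assumed to be non-empty and to hold (score, value) pairs.

```python
def enumerate_in_interval(lists, b1, b2):
    """Yield (score, values) for all full candidates with b1 <= score <= b2."""
    n = len(lists)
    # max_tail[i] / min_tail[i]: best / worst total score attainable from chunks i..n-1.
    max_tail = [0] * (n + 1)
    min_tail = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        max_tail[i] = max_tail[i + 1] + max(s for s, _ in lists[i])
        min_tail[i] = min_tail[i + 1] + min(s for s, _ in lists[i])

    def dfs(i, score, values):
        if i == n:
            if b1 <= score <= b2:
                yield (score, list(values))
            return
        for s, v in lists[i]:
            new_score = score + s
            # Prune: even the best completion stays below b1, or the worst exceeds b2.
            if new_score + max_tail[i + 1] < b1 or new_score + min_tail[i + 1] > b2:
                continue
            values.append(v)
            yield from dfs(i + 1, new_score, values)
            values.pop()

    yield from dfs(0, 0, [])
```

Because the search keeps only the current partial candidate on its stack, its memory use is linear in the number of chunks, which hints at why relaxing the ordering requirement can reduce memory consumption.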
We note that the key enumeration problem arises in other contexts, for example, in the area of statistical cryptanalysis. In particular, the problem of merging two lists of subkey candidates was encountered by Junod and Vaudenay [27]. The small cardinality of the lists ($2^{13}$) was such that the simple approach that consists of merging and sorting the lists of subkeys was tractable. Another related problem is list decoding of convolutional codes by means of the Viterbi algorithm [28]. However, such algorithms are usually designed to output a small number of most likely candidates determined a priori, whilst our focus is on algorithms able to perform long enumerations, i.e., only those key enumeration algorithms designed to be able to perform enumerations of $2^{30}$ or more key candidates.

Key Enumeration Algorithms
In this section, we review several key enumeration algorithms. Since our target is algorithms able to perform long enumerations, our review procedure consisted of examining only those research works presenting key enumeration algorithms designed to be able to perform enumerations of $2^{30}$ or more key candidates. Basically, we reviewed research proposals mainly from the side-channel literature methodically and in detail, starting from the research paper by Veyrat-Charvillon et al. [18], which was the first to look closely at the conquer part in side-channel analysis with the goal of testing several billions of key candidates. Particularly, its authors noted that none of the key enumeration algorithms proposed in the research literature until then were scalable, requiring novel algorithms to tackle the problem. Hence, they presented an optimal key enumeration algorithm that has inspired other more recent proposals.
Broadly speaking, optimal key enumeration algorithms [18,28] tend to consume more memory and to be less efficient while generating high-scoring key candidates, whereas nonoptimal key enumeration algorithms [12-17,26,29] are expected to run faster and to consume less memory. Table 1 shows a preliminary taxonomy of the key enumeration algorithms to be reviewed in this section. Each algorithm will be detailed and analyzed below according to its overall performance, scalability, and memory consumption.

An Optimal Key Enumeration Algorithm
We study the optimal key enumeration algorithm (OKEA) that was introduced in the research paper [18]. We will firstly give the basic idea behind the algorithm by assuming the encoding of the secret key is represented as two chunks; hence, we have access to two lists of chunk candidates.

Setup
Let $L_0 = [c^0_0, \ldots, c^0_{m_0-1}]$ and $L_1 = [c^1_0, \ldots, c^1_{m_1-1}]$ be the two lists, respectively. Each list is in decreasing order based on the score component of its chunk candidates. Let us define an extended candidate as a 4-tuple of the form $C := (c^0_{j_0}, c^1_{j_1}, j_0, j_1)$ and its score as $c^0_{j_0}.\text{score} + c^1_{j_1}.\text{score}$. Additionally, let $Q$ be a priority queue that will store extended candidates in decreasing order based on their score.
This data structure $Q$ supports three methods. Firstly, the method $Q$.poll() retrieves and removes the head from this queue $Q$ or returns null if this queue is empty. Secondly, the method $Q$.add($e$) inserts the specified element $e$ into the priority queue $Q$. Thirdly, the method $Q$.clear() removes all the elements from the queue $Q$. The queue will be empty after this method returns. By making use of a heap, we can support any priority-queue operation on a set of size $n$ in $O(\log_2 n)$ time.
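A heap-backed priority queue with this interface might look as follows. This is a minimal sketch using Python's standard `heapq` module, which implements a min-heap; scores are therefore negated so that the highest score is at the head. The class name `MaxScoreQueue` is hypothetical, and, unlike the interface described above, `add` takes the score as a separate argument for convenience.

```python
import heapq
import itertools


class MaxScoreQueue:
    """Priority queue whose head is the element with the highest score."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so payloads are never compared

    def add(self, score, item):
        # heapq is a min-heap, so the negated score is stored.
        heapq.heappush(self._heap, (-score, next(self._counter), item))

    def poll(self):
        """Retrieve and remove the head, or return None if the queue is empty."""
        if not self._heap:
            return None
        neg_score, _, item = heapq.heappop(self._heap)
        return (-neg_score, item)

    def clear(self):
        self._heap.clear()

    def __len__(self):
        return len(self._heap)
```

Both `add` and `poll` run in time logarithmic in the number of stored elements, matching the bound stated above; `clear` is linear in this particular sketch.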
Furthermore, let $X$ and $Y$ be two vectors of bits that grow as needed. These are employed to track an extended candidate $C$ in $Q$. $C$ is in $Q$ only if both $X_{j_0}$ and $Y_{j_1}$ are set to 1. By default, all bits in a vector initially have the value 0.

Basic Algorithm
At the initial stage, queue $Q$ will be created. Next, the extended candidate $(c^0_0, c^1_0, 0, 0)$ will be inserted into the priority queue and both $X_0$ and $Y_0$ will be set to 1. In order to generate a new key candidate, the routine nextCandidate, defined in Algorithm 1, should be executed.
Let us assume that $m_0, m_1 > 1$. First, the extended candidate $(c^0_0, c^1_0, 0, 0)$ will be retrieved and removed from $Q$, and then, $X_0$ and $Y_0$ will be set to 0. The two if blocks of instructions will then be executed, meaning that the extended candidates $(c^0_1, c^1_0, 1, 0)$ and $(c^0_0, c^1_1, 0, 1)$ will be inserted into $Q$. Moreover, the entries $X_0$, $X_1$, $Y_0$, and $Y_1$ will be set to 1, while the other entries of $X$ and $Y$ will remain as 0. The routine nextCandidate will then return $c_{0,0} = \mathrm{combine}(c^0_0, c^1_0)$, which is the highest score key candidate, since $L_0$ and $L_1$ are in decreasing order. At this point, the two extended candidates $(c^0_1, c^1_0, 1, 0)$ and $(c^0_0, c^1_1, 0, 1)$ (both in $Q$) are the only ones that can have the second highest score. Therefore, if Algorithm 1 is called again, the first instruction will retrieve and remove the extended candidate with the second highest score, say $(c^0_0, c^1_1, 0, 1)$, from $Q$ and then the second instruction will set $X_0$ and $Y_1$ to 0. The first if condition will be attempted, but this time, it will be false since $X_1$ is set to 1. However, the second if condition will be satisfied, and therefore, $(c^0_0, c^1_2, 0, 2)$ will be inserted into $Q$ and the entries $X_0$ and $Y_2$ will be set to 1. The method will then return $c_{0,1} = \mathrm{combine}(c^0_0, c^1_1)$, which is the second highest score key candidate.

Algorithm 1 outputs the next highest-scoring key candidate from $L_0$ and $L_1$.
function nextCandidate()
    $(c^0_{j_0}, c^1_{j_1}, j_0, j_1) \leftarrow Q$.poll();
    $X_{j_0} \leftarrow 0$; $Y_{j_1} \leftarrow 0$;
    if $(j_0 + 1) < L_0$.size() and $X_{j_0+1} = 0$ then
        $c^0_{j_0+1} \leftarrow L_0$.get($j_0 + 1$);
        $Q$.add($(c^0_{j_0+1}, c^1_{j_1}, j_0 + 1, j_1)$);
        $X_{j_0+1} \leftarrow 1$; $Y_{j_1} \leftarrow 1$;
    end if
    if $(j_1 + 1) < L_1$.size() and $Y_{j_1+1} = 0$ then
        $c^1_{j_1+1} \leftarrow L_1$.get($j_1 + 1$);
        $Q$.add($(c^0_{j_0}, c^1_{j_1+1}, j_0, j_1 + 1)$);
        $X_{j_0} \leftarrow 1$; $Y_{j_1+1} \leftarrow 1$;
    end if
    return $c_{j_0,j_1} = \mathrm{combine}(c^0_{j_0}, c^1_{j_1})$;
end function

At this point, the two extended candidates $(c^0_1, c^1_0, 1, 0)$ and $(c^0_0, c^1_2, 0, 2)$ (both in $Q$) are the only ones that can have the third highest score.
As for why, we know that the algorithm has generated $c_{0,0}$ and $c_{0,1}$ so far. Since $L_0$ and $L_1$ are in decreasing order, we have that either $c_{0,0}.\text{score} \ge c_{0,1}.\text{score} \ge c_{1,0}.\text{score} \ge c_{0,2}.\text{score}$ or $c_{0,0}.\text{score} \ge c_{0,1}.\text{score} \ge c_{0,2}.\text{score} \ge c_{1,0}.\text{score}$. Also, any other extended candidate yet to be inserted into $Q$ cannot have the third highest score for the same reason. Consider, for example, $(c^0_1, c^1_1, 1, 1)$: this extended candidate will be inserted into $Q$ only if $(c^0_1, c^1_0, 1, 0)$ has been retrieved and removed from $Q$. Therefore, if Algorithm 1 is executed again, it will return the third highest scoring key candidate and have the extended candidate with the fourth highest score placed at the head of $Q$. In general, the manner in which this algorithm travels through the $m_0 \times m_1$ matrix of key candidates guarantees to output key candidates in a decreasing order based on their total accumulated score, i.e., this algorithm is an optimal key enumeration algorithm.
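The two-list procedure walked through above can be sketched in Python as a generator. This is an illustrative transcription under the assumptions that both lists hold (score, value) pairs sorted by decreasing score and that the heap stores index pairs rather than full extended candidates; the function name `okea_two_lists` is hypothetical.

```python
import heapq


def okea_two_lists(L0, L1):
    """Yield (score, (value0, value1)) for all pairs in non-increasing score order."""
    if not L0 or not L1:
        return
    X = [0] * len(L0)  # X[j0] = 1 while some queued candidate uses index j0 of L0
    Y = [0] * len(L1)  # Y[j1] = 1 while some queued candidate uses index j1 of L1
    heap = [(-(L0[0][0] + L1[0][0]), 0, 0)]  # negated scores: heapq is a min-heap
    X[0] = Y[0] = 1
    while heap:
        neg_score, j0, j1 = heapq.heappop(heap)
        X[j0] = 0
        Y[j1] = 0
        # Successor in L0, inserted only if index j0 + 1 is not already tracked.
        if j0 + 1 < len(L0) and X[j0 + 1] == 0:
            heapq.heappush(heap, (-(L0[j0 + 1][0] + L1[j1][0]), j0 + 1, j1))
            X[j0 + 1] = 1
            Y[j1] = 1
        # Successor in L1, inserted only if index j1 + 1 is not already tracked.
        if j1 + 1 < len(L1) and Y[j1 + 1] == 0:
            heapq.heappush(heap, (-(L0[j0][0] + L1[j1 + 1][0]), j0, j1 + 1))
            X[j0] = 1
            Y[j1 + 1] = 1
        yield (-neg_score, (L0[j0][1], L1[j1][1]))
```

Since the function is a generator, a caller can stop after the first M candidates, for example with `itertools.islice(okea_two_lists(L0, L1), M)`, without paying for the rest of the enumeration.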
Regarding how fast queue $Q$ grows, let $N^s_Q$ be the number of extended candidates in $Q$ after the function nextCandidate has been called $s \ge 0$ times. Clearly, we have that $N^0_Q = 1$, since $Q$ only contains the extended candidate $(c^0_0, c^1_0, 0, 0)$ after initialisation. Also, $N^{m_0 \cdot m_1}_Q = 0$ because, after $m_0 \cdot m_1$ calls to the function, there will be no more key candidates to be enumerated. Note that, during the execution of the function nextCandidate, an extended candidate will be removed from $Q$ and two new extended candidates might be inserted into $Q$. Considering the way in which an extended candidate is inserted into the queue, $Q$ may contain at most one element in each row and column at any stage; hence, $N^s_Q \le \min(m_0, m_1)$ for $0 \le s \le m_0 \cdot m_1$.

Complete Algorithm
Note that Algorithm 1 works properly if both input lists are in decreasing order. Hence, it may be generalized to a number of lists greater than 2 by employing a divide-and-conquer approach, which works by recursively breaking down the problem into two or more subproblems of the same or related type until these become simple enough to be solved directly. The solutions to the subproblems are then combined to give a solution to the original problem [30]. To explain the complete algorithm, let us consider the case when there are five chunks as an example. We have access to five lists of chunk candidates $L_i$, $0 \le i < 5$, each of which has a size of $m_i$. We first call initialise(0, 4), as defined in Algorithm 2. This function will build a tree-like structure from the five given lists (see Figure 1).
In a node $N_{i,\ldots,f}$, $N_{i,\ldots,q}$ and $N_{q+1,\ldots,f}$ are the children nodes, $Q_{i,\ldots,f}$ is a priority queue, $X_{i,\ldots,f}$ and $Y_{i,\ldots,f}$ are bit vectors, and $L_{i,\ldots,f}$ is a list of chunk candidates. Additionally, this data structure supports the method size(), which returns the maximum number of chunk candidates that this node can generate. This method is easily defined in a recursive way: if $N_{i,\ldots,f}$ is a leaf node, then the method will return $L_{i,\ldots,f}$.size(); otherwise, the method will return $N_{i,\ldots,q}$.size() $\times$ $N_{q+1,\ldots,f}$.size(). To avoid computing this value each time this method is called, a node will internally store the value once it has been computed for the first time. Hence, the method will only return the stored value from the second call onwards. Furthermore, the function getCandidate($N_{i,\ldots,f}$, $j$), as defined in Algorithm 3, returns the $j$th best chunk candidate (the chunk candidate whose score rank is $j$) from the node $N_{i,\ldots,f}$.
In order to generate the first $N$ best key candidates from the root node $R$, with $R := N_{0,\ldots,4}$, we simply run nextCandidate($R$), as defined in Algorithm 4, $N$ times. This function internally calls the function getCandidate with suitable parameters each time it is required. Calling getCandidate($N_{i,\ldots,f}$, $j$) may cause this function to internally invoke nextCandidate($N_{i,\ldots,f}$) to generate ordered key candidates from the inner node $N_{i,\ldots,f}$ on the fly. Therefore, any inner node $N_{i,\ldots,f}$ should keep track of the chunk candidates returned by getCandidate($N_{i,\ldots,f}$, $j$) when called by its parent; otherwise, the $j$ best chunk candidates from $N_{i,\ldots,f}$ would have to be generated each time such a call is made, which is inefficient. To keep track of the returned chunk candidates, each node $N_{i,\ldots,f}$ updates its internal list $L_{i,\ldots,f}$ (see lines 5 to 7 in Algorithm 3).
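The recursive structure described above can be sketched as follows. This is not a transcription of Algorithms 2 to 4, whose listings are not reproduced here; it is a compact illustration under several assumptions: the index range is split at its midpoint (the split in Figure 1 may differ), Python sets stand in for the bit vectors $X$ and $Y$, candidate values are Python lists so that combining is list concatenation, and at least two lists are given. The names `Node`, `get`, `next_candidate`, and `enumerate_best` are hypothetical.

```python
import heapq


class Node:
    """Leaf: wraps one sorted list. Inner node: lazily merges its two children."""

    def __init__(self, lists, i, f):
        self.rows, self.cols, self.heap = set(), set(), []
        if i == f:  # leaf node: all candidates are already known and sorted
            self.left = self.right = None
            self.cache = list(lists[i])
            self.total = len(self.cache)
            return
        q = (i + f) // 2
        self.left, self.right = Node(lists, i, q), Node(lists, q + 1, f)
        self.cache = []  # candidates already handed to the parent
        self.total = self.left.size() * self.right.size()
        if self.total > 0:
            self._push(0, 0)

    def size(self):
        return self.total

    def _push(self, j0, j1):
        score = self.left.get(j0)[0] + self.right.get(j1)[0]
        heapq.heappush(self.heap, (-score, j0, j1))  # min-heap, so negate
        self.rows.add(j0)
        self.cols.add(j1)

    def get(self, j):
        """Return the j-th best candidate of this node, generating it on demand."""
        while len(self.cache) <= j:
            self.cache.append(self.next_candidate())
        return self.cache[j]

    def next_candidate(self):
        _, j0, j1 = heapq.heappop(self.heap)
        self.rows.discard(j0)
        self.cols.discard(j1)
        if j0 + 1 < self.left.size() and (j0 + 1) not in self.rows:
            self._push(j0 + 1, j1)
        if j1 + 1 < self.right.size() and (j1 + 1) not in self.cols:
            self._push(j0, j1 + 1)
        a, b = self.left.get(j0), self.right.get(j1)
        return (a[0] + b[0], a[1] + b[1])


def enumerate_best(lists, count):
    """Return up to `count` full key candidates in non-increasing score order."""
    assert len(lists) >= 2, "sketch assumes at least two chunk lists"
    root = Node(lists, 0, len(lists) - 1)
    return [root.next_candidate() for _ in range(min(count, root.size()))]
```

Note that the root is driven through `next_candidate` directly, so it does not cache its output, whereas every inner node caches what it has returned to its parent; this caching is the source of the memory growth analysed in the next subsection.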
Algorithm 2 creates and initialises each node of the tree-like structure.

Memory Consumption
Let us suppose that the encoding of a secret key is $W = 2^{a+b}$ bits in size and that we set $w = 2^a$; therefore, $N = 2^b$. Hence, we have access to $N$ lists $L_i$, $0 \le i < 2^b$, each of which has $m_i$ chunk candidates. Suppose we would like to generate the first $N$ best key candidates. We first invoke initialise(0, $N - 1$) (Algorithm 2). This call will create a tree-like structure with $b + 1$ levels starting at 0.

• The root node R.

This tree will have $2^{0} + 2^{1} + \cdots + 2^{b} = 2^{b+1} - 1$ nodes. Let $M_k$ be the number of bits consumed by chunk candidates stored in memory after calling the function nextCandidate with R as a parameter k times. A chunk candidate at level $0 \le \lambda \le b$ is of the form $(score, [e_0, \ldots, e_{2^{b-\lambda}-1}])$, with score being a real number and the $e_l$ being bit strings. Let $B_{\lambda}$ be the number of bits a chunk candidate at level λ occupies in memory.
First note that invoking initialise(0, N − 1) causes each internal node's list to grow, since

1. At creation of the leaf nodes $L_i$ (lines 2 to 4), each node $L_i$ is created by setting its internal list to the list of chunk candidates $L_i$ and by setting its other components to null.

2. At creation of both R and the nodes $N^{i_d}_{\lambda}$, for $0 < \lambda < b - 1$ and $0 \le i_d < 2^{\lambda}$, the execution of the function getCandidate (lines 9 to 10) makes their corresponding left child (right child) store a new chunk candidate in its internal list.

Suppose the best key candidate is about to be generated; then nextCandidate(R) will be executed for the first time. This routine will remove the extended candidate $(c^{x}_{0}, c^{y}_{0}, 0, 0)$ from R's priority queue. If it enters the first if (lines 4 to 8), it will make the call getCandidate($N^{0}_{1}$, 1) (line 5), which may cause each node of the left sub-tree, except for the leaf nodes, to store at most one new chunk candidate in its internal list. Hence, retrieving the chunk candidate $c^{x}_{1}$ may cause at most $2^{\lambda-1}$ chunk candidates per level λ, 1 ≤ λ < b, to be stored. Likewise, if it enters the second if (lines 9 to 13), it will call the function getCandidate($N^{1}_{1}$, 1) (line 10), which may cause each node of the right sub-tree, except for the leaf nodes, to store at most one new chunk candidate in its internal list. Therefore, retrieving the chunk candidate $c^{y}_{1}$ (line 10) may cause at most $2^{\lambda-1}$ chunk candidates per level λ, 1 ≤ λ < b, to be stored. Therefore, after generating the best key candidate, $p^{(1)}_{\lambda} \le 2^{\lambda}$ new chunk candidates per level λ, 1 ≤ λ < b, will be stored in memory.

Let us assume that k − 1 key candidates have already been generated; therefore, $M_{k-1}$ bits are consumed by chunk candidates in memory. Let us now suppose the k th best key candidate is about to be generated; then, the method nextCandidate(R) will be executed for the k th time. This routine will remove the best extended candidate $(c^{x}_{j_x}, c^{y}_{j_y}, j_x, j_y)$ from R's priority queue. It will then attempt to insert two new extended candidates into R's priority queue. As seen previously, retrieving the chunk candidate $c^{x}_{j_x+1}$ may cause at most $2^{\lambda-1}$ chunk candidates per level λ, 1 ≤ λ < b, to be stored. Likewise, retrieving the chunk candidate $c^{y}_{j_y+1}$ may also cause at most $2^{\lambda-1}$ chunk candidates per level λ, 1 ≤ λ < b, to be stored. Therefore, after generating the k th best key candidate, $p^{(k)}_{\lambda} \le 2^{\lambda}$ new chunk candidates per level λ, 1 ≤ λ < b, will be stored in memory; hence, $M_k \le M_{k-1} + \sum_{\lambda=1}^{b-1} 2^{\lambda} B_{\lambda}$ bits are consumed by chunk candidates stored in memory.

It follows that, if N key candidates are generated, then $M_N \le M_0 + N \sum_{\lambda=1}^{b-1} 2^{\lambda} B_{\lambda}$ bits are consumed by chunk candidates stored in memory, in addition to the extended candidates stored internally in the priority queues of the nodes R and $N^{i_d}_{\lambda}$. Therefore, this algorithm may consume a large amount of memory if it is used to generate a large number of key candidates, which may be problematic.

A Bounded-Space Near-Optimal Key Enumeration Algorithm
We will next describe a key enumeration algorithm introduced in the research paper [13]. This algorithm builds upon OKEA and can enumerate a large number of key candidates without exceeding the available space. The trade-off is that the enumeration order is only near-optimal rather than optimal, as it is in OKEA. We will first give the basic idea behind the algorithm by assuming that the encoding of the secret key is represented as two chunks; hence, we have access to two lists of chunk candidates.

Basic Algorithm
Let $L_0 = [c^{0}_{0}, c^{0}_{1}, \ldots, c^{0}_{m_0-1}]$ and $L_1 = [c^{1}_{0}, c^{1}_{1}, \ldots, c^{1}_{m_1-1}]$ be the two lists, and let ω > 0 be an integer such that $\omega \mid m_0$ and $\omega \mid m_1$. Each list is in decreasing order based on the score component of its chunk candidates. Let us set $m_{min} = \min(m_0, m_1)$ and define $R_{k_0,k_1}$ as where $k_0, k_1$ are positive integers. The key space is divided into layers $layer^{\omega}_{k}$ of width ω. Figure 2 depicts each layer as a different shade of blue. Formally, The remaining layers are defined as follows.

Figure 2. Geometric representation of the key space divided into layers of width ω = 3.
The ω-layer key enumeration algorithm: Divide the key space into layers of width ω. Then, go over the layers $layer^{\omega}_{k}$, one by one, in increasing order of k. For each $layer^{\omega}_{k}$, enumerate its key candidates by running OKEA within $layer^{\omega}_{k}$. More specifically, for each $layer^{\omega}_{k}$, $1 \le k \le m_{min}/\omega$, the algorithm inserts the two corners, i.e., the extended candidates $(c^{0}_{(k-1)\cdot\omega}, c^{1}_{0}, (k-1)\cdot\omega, 0)$ and $(c^{0}_{0}, c^{1}_{(k-1)\cdot\omega}, 0, (k-1)\cdot\omega)$, into the data structure Q. The algorithm then proceeds to extract extended candidates and to insert their successors as usual, but it does not exceed the boundaries of $layer^{\omega}_{k}$ when selecting components of candidates. For the remaining layers, if any, the algorithm inserts only one corner, either the extended candidate $(c^{0}_{(k-1)\cdot\omega}, c^{1}_{0}, (k-1)\cdot\omega, 0)$ or the extended candidate $(c^{0}_{0}, c^{1}_{(k-1)\cdot\omega}, 0, (k-1)\cdot\omega)$, into the data structure Q and then proceeds as usual while not exceeding the boundaries of the layer. Figure 2 also shows the extended candidates (represented as the smallest squares in a strong shade of blue within a layer) to be inserted into Q when a certain layer is about to be enumerated.
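The two-list, layer-by-layer enumeration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [13]: the function name layer_enumerate is hypothetical, each list is reduced to its scores, and a Python set (seen) stands in for the bit vectors that OKEA uses to avoid inserting a candidate twice.

```python
import heapq


def layer_enumerate(a, b, omega):
    """Yield (score, x, y) layer by layer.

    a and b are score lists sorted in decreasing order whose lengths are
    multiples of omega. Inside one layer the order is optimal; across
    layers it is only near-optimal.
    """
    m0, m1 = len(a), len(b)
    for k in range(1, max(m0, m1) // omega + 1):
        lo, hi = (k - 1) * omega, k * omega

        def in_layer(x, y):
            # Inside the square of side k*omega (clipped to the key space)
            # but outside the square of side (k-1)*omega.
            return x < min(hi, m0) and y < min(hi, m1) and max(x, y) >= lo

        # One or two corners, depending on the shape of the layer.
        corners = {c for c in ((lo, 0), (0, lo)) if in_layer(*c)}
        heap = [(-(a[x] + b[y]), x, y) for (x, y) in corners]
        heapq.heapify(heap)
        seen = set(corners)
        while heap:
            neg, x, y = heapq.heappop(heap)
            yield -neg, x, y
            # Insert the successors that stay within the current layer.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if in_layer(nx, ny) and (nx, ny) not in seen:
                    seen.add((nx, ny))
                    heapq.heappush(heap, (-(a[nx] + b[ny]), nx, ny))
```

Because successors never leave the current layer, the queue holds only candidates of that layer at any time, which is the property that bounds the space used by the enumeration.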

Complete Algorithm
When the number of chunks is greater than 2, the algorithm applies a recursive decomposition of the problem (similar to OKEA). Whenever a new chunk candidate is inserted into the candidate set, its value is obtained by applying the enumeration algorithm to the lower level. We give an example to convey the idea of the general algorithm. Let us suppose the encoding of the secret key is divided into 4 chunks; then, we have access to 4 lists of chunk candidates, each of which is of size m i with ω | m i .
To generate key candidates, we need to generate the two lists of chunk candidates for the lower level, L 0,1 and L 2,3 , on the fly as far as required. For this, we maintain a set of next potential candidates for each dimension, Q 0,1 and Q 2,3 , so that each next chunk candidate obtained from Q 0,1 (or Q 2,3 ) is stored in the list L 0,1 (or L 2,3 ). Because the enumeration is performed by layers, the sizes of the data structures Q 0,1 and Q 2,3 are bounded by 2 · ω. However, this is not the case for the lists L 0,1 and L 2,3 , which grow as the number of enumerated candidates grows, hence becoming problematic as seen in Section 4.1.4.
To handle this, each $layer^{\omega}_{k}$ is partitioned into squares of size ω × ω. The algorithm still enumerates the key candidates in $layer^{\omega}_{1}$ first, then in $layer^{\omega}_{2}$, and so on, but in each $layer^{\omega}_{k}$, the enumeration will be square-by-square. Figure 3 depicts the geometric representation of the key enumeration within $layer^{3}_{3}$, where a square (strong shade of blue) within a layer represents the square being processed by the enumeration algorithm. More specifically, for given nonnegative integers I and J, let us define $S^{\omega}_{I,J}$ as
$$S^{\omega}_{I,J} = \{(j_x, j_y) \mid I \cdot \omega \le j_x < (I+1) \cdot \omega,\ J \cdot \omega \le j_y < (J+1) \cdot \omega\}.$$
Let us set $m_{min} = \min(m_0 \cdot m_1, m_2 \cdot m_3)$; hence, for $1 \le k \le m_{min}/\omega$, $layer^{\omega}_{k}$ is the union of the squares $S^{\omega}_{I,J}$ with max(I, J) = k − 1. The remaining layers, if any, are also partitioned in a similar way.

The in-layer algorithm then proceeds as follows. For each $layer^{\omega}_{k}$, $1 \le k \le m_{min}/\omega$, the in-layer algorithm first enumerates the candidates in the two corner squares $S = S^{\omega}_{k-1,0} \cup S^{\omega}_{0,k-1}$ by applying OKEA on S. At some point, one of the two squares is completely enumerated. Assume this is $S^{\omega}_{k-1,0}$. At this point, the only square that contains the next key candidates after $S^{\omega}_{k-1,0}$ is the successor $S^{\omega}_{k-1,1}$. Therefore, when one of the squares is completely enumerated, its successor is inserted in S, as long as S does not contain a square in the same row or column. For the remaining layers, if any, the in-layer algorithm first enumerates the candidates in the square $S = S^{\omega}_{k-1,0}$ (or $S^{\omega}_{0,k-1}$) by applying OKEA on it. Once the square is completely enumerated, its successor is inserted in S, and so on. This in-layer partition into squares reduces the space complexity, since instead of storing the full list of chunk candidates of the lower levels, only the chunk candidates relevant for enumerating the two current squares are stored.
Because this in-layer algorithm enumerates at most two squares at any time in a layer, the tree-like structure is no longer a binary tree. A node $N_{i,\ldots,f}$ is now extended to an 8-tuple of the form $(N^{0}_{i,\ldots,q}, N^{1}_{i,\ldots,q}, N^{0}_{q+1,\ldots,f}, N^{1}_{q+1,\ldots,f}, Q_{i,\ldots,f}, X_{i,\ldots,f}, Y_{i,\ldots,f}, L_{i,\ldots,f})$, where $N^{b}_{i,\ldots,q}$ and $N^{b}_{q+1,\ldots,f}$ for b = 0, 1 are the children nodes used to enumerate at most two squares in a particular layer, $Q_{i,\ldots,f}$ is a priority queue, $X_{i,\ldots,f}$ and $Y_{i,\ldots,f}$ are bit vectors, and $L_{i,\ldots,f}$ is a list of chunk candidates. Hence, the function that initialises the tree-like structure is adjusted to create the two additional children for a given node (see Algorithm 5).
Algorithm 5 creates and initialises each node of the tree-like structure.
if S I,J is completely enumerated then 6: if I = J or (I > last J and J = last J ) or (J > last I and I = last I ) then 9: if (j x + 1) < (last I + 1) · ω then 10: end if 15: if (j y + 1) < (last J + 1) · ω then 16: end if 21: else 22: if no candidates in same row/column as Successor(S I,J ) then 23: (c x k , c y l , k, l) ← getHighestScoreCandidate(Successor(S I,J )); 24: .., f l ← 1; 26: end if 27: end if 28: else 29: if (j x + 1, j y ) ∈ S I,J and X i,..., f j x +1 is set to 0 then 30: , j x + 1, 2); 31: .., f j y ← 1; 33: end if 34: if (j x , j y + 1) ∈ S I,J and X i,..., f j y +1 is set to 0 then 35: if I = J then 36:

The function getCandidate(N i,..., f , j, sw) is also adjusted so that each node's internal list L i,..., f has at most ω chunk candidates at any stage of the algorithm (see Algorithm 6). This function internally makes the call to restart(N i,..., f ) if sw = 0. The call to restart(N i,..., f ) causes N i,..., f to restart its enumeration, i.e., after restart(N i,..., f ) has been invoked, calling nextCandidate(N i,..., f ) will return the first chunk candidate from N i,..., f . Also, the function getHighestScoreCandidate($S^{\omega}_{I,J}$) returns the highest-scoring extended candidate from the square $S^{\omega}_{I,J}$. Note that this function is called to get the highest-scoring extended candidate from the successor of $S^{\omega}_{I,J}$. At this point, the content of the internal list of N

The original authors of the research paper [13] suggest having OKEA run in parallel per square within a layer, but this has a negative effect on the algorithm's near-optimality property and even on its overall performance, since there are squares within a layer that are strongly dependent on others, i.e., for the algorithm to enumerate the successor square, say, $S_{I,J+1}$ within a layer, it requires information that is obtained during the enumeration of $S_{I,J}$. Hence, this strategy may incur extra computation and is also difficult to implement.

Variant
As a variant of this algorithm, we propose to slightly change the definition of a layer. Here, a layer consists of all the squares within a secondary diagonal, as shown in Figure 4. The variant will follow the same process as the original algorithm, i.e., enumeration layer by layer starting at the first secondary diagonal. Within each layer, it will first enumerate the two corner squares $S = S_{k-1,0} \cup S_{0,k-1}$ by applying OKEA on S. Once one of the two squares is enumerated, let us say $S_{k-1,0}$, its successor $S_{k-2,1}$ will be inserted in S as long as such an insertion is possible. The algorithm will continue the enumeration by applying OKEA on the updated S, and so on. This algorithm is motivated by the intuition that enumerating secondary diagonals may improve the quality of the order of the output key candidates, i.e., the order may be closer to optimal. This variant, however, may have a disadvantage in the multidimensional case because it strongly depends on having all the previously enumerated chunk candidates of both dimensions x and y stored. To illustrate this, let us suppose that the square $S_{k-2,1}$ is to be inserted. Then, the algorithm needs to insert its highest-scoring extended candidate, $(c^{x}_{(k-2)\cdot\omega}, c^{y}_{\omega}, (k-2)\cdot\omega, \omega)$, into the queue. Hence, the algorithm needs to have both $c^{x}_{(k-2)\cdot\omega}$ and $c^{y}_{\omega}$ readily accessible when needed. This implies the need to store them when they are being enumerated (in previous layers). By comparison, the original algorithm only requires the ω previously generated chunk candidates of both dimensions x and y to be stored, which is advantageous in terms of memory consumption.

A Simple Stack-Based, Depth-First Key Enumeration Algorithm
We next present a memory-efficient, nonoptimal key enumeration algorithm that generates key candidates whose total scores lie within a given interval [B 1 , B 2 ]. It is based on the algorithm introduced by Martin et al. in the research paper [16]. We note that the original algorithm is fairly efficient at generating a new key candidate; however, its overall performance may be negatively affected by its use of memory, since it was originally designed to store every newly generated key candidate and to test the candidates only once the enumeration has completed. Our variant, however, makes use of a stack (a last-in-first-out queue) during the enumeration process, which maintains the state of the algorithm. Each newly generated key candidate may be tested immediately, and there is no need for candidates to be stored for future processing.
Our variant essentially performs a depth-first search in an undirected graph G built from the N lists of chunk candidates $L_i = [c^{i}_{0}, c^{i}_{1}, \ldots, c^{i}_{m_i-1}]$. This graph G has $\sum_{i=0}^{N-1} m_i$ vertices, each of which represents a chunk candidate. Each vertex $v^{i}_{j}$ is connected to the vertices $v^{i+1}_{k}$, $0 \le k < m_{i+1}$. At any vertex $v^{i}_{j}$, the algorithm will check whether $c^{i}_{j}$.score plus an accumulated score is within the given interval [B 1 , B 2 ]. If so, it will select the chunk candidate $c^{i}_{j}$ for the chunk i and travel forward to the vertex $v^{i+1}_{0}$; otherwise, it will continue exploring and attempt to travel to the vertex $v^{i}_{j+1}$. When there is no suitable chunk candidate for the current chunk i, it will travel backwards to a vertex from the previous chunk, $v^{i-1}_{k}$, $0 \le k < m_{i-1}$.
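As can be noted, this variant uses a simple backtracking strategy. In order to speed up the pruning process, we will make use of two precomputed tables, minArray and maxArray. The entry minArray[i] (maxArray[i]) holds the global minimum (maximum) value that can be reached from chunk i to chunk N − 1. In other words,
$$\mathrm{minArray}[i] = \sum_{k=i}^{N-1} \min_{0 \le j < m_k} c^{k}_{j}.\mathrm{score}, \qquad \mathrm{maxArray}[i] = \sum_{k=i}^{N-1} \max_{0 \le j < m_k} c^{k}_{j}.\mathrm{score},$$
with minArray[N] = maxArray[N] = 0.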
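The two pruning tables can be computed in a single backward pass over the chunks. The sketch below is an illustration only: the function name build_bounds is hypothetical, each list is reduced to the scores of its chunk candidates, and the extra trailing entry (index N, equal to 0) is an assumption that makes the lookups at i + 1 valid for the last chunk.

```python
def build_bounds(lists):
    """Return (min_array, max_array), each of length N + 1.

    lists[i] holds the scores of the chunk candidates of chunk i.
    min_array[i] (max_array[i]) is the smallest (largest) total score
    reachable from chunk i to chunk N - 1; entry N is 0.
    """
    n = len(lists)
    min_array = [0] * (n + 1)
    max_array = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        min_array[i] = min(lists[i]) + min_array[i + 1]
        max_array[i] = max(lists[i]) + max_array[i + 1]
    return min_array, max_array
```

When each list is already sorted by score, min(lists[i]) and max(lists[i]) can be replaced with the last and first entries, so that each table entry costs constant time.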

Setup
We now introduce a couple of tools and the notation that we will use to describe the algorithm. S will denote a stack. This data structure supports two basic methods [30]. Firstly, the method S.pop() removes the element at the top of the stack and returns that element. Secondly, the method S.push(e) pushes e onto the top of the stack. The stack S will store 4-tuples of the form (score, i, j, indices), where score is the accumulated score at any stage of the algorithm, i and j are the indices of the chunk candidate c i j , and indices is an array of nonnegative integers holding the indices of the selected chunk candidates, i.e., the chunk candidate c k indices[k] is assigned to chunk k for each k, 0 ≤ k ≤ i.

Complete Algorithm
Firstly, at the initialisation stage, the 4-tuple (0, 0, 0, []) will be inserted into the stack S. The main loop of this algorithm will call the function nextCandidate(S, B 1 , B 2 ), defined in Algorithm 8, as long as the stack S is not empty. Specifically, the main loop will call this function to obtain a key candidate whose score is in the range [B 1 , B 2 ]. Algorithm 8 will then attempt to find such a candidate and, once it has found one, it will return the candidate to the main loop (at this point, S may not be empty). The main loop will get the key candidate, process or test it, and continue calling the function nextCandidate(S, B 1 , B 2 ) as long as S is not empty. Because of the use of the stack S, the state of Algorithm 8 will not be lost; therefore, each time the main loop calls it, it will return a new key candidate whose score lies in the interval [B 1 , B 2 ]. The main loop will terminate once all possible key candidates whose scores are within the interval [B 1 , B 2 ] have been generated, which will happen once the stack is empty.

S.push((aScore, i, j + 1, indices));
6: end if
7: uScore ← aScore + c i j .score;
8: maxS ← uScore + maxArray[i + 1];
9: minS ← uScore + minArray[i + 1];
10: if maxS ≥ B 1 and minS ≤ B 2 then
11: if uScore ≤ B 2 then
12: if i = N − 1 then
13: if B 1 ≤ uScore then
14: indices ← indices [j];
Suppose now that the algorithm is about to execute the k th while iteration during which the first valid key candidate will be returned. Therefore, $N^{k-1}_{S} = 1 + (-1 + l_1) + (-1 + l_2) + (-1 + l_3) + (-1 + l_4) + \cdots + (-1 + l_{k-1}) \le N$. During the execution of the k th while iteration, a 4-tuple will be removed and only one new 4-tuple will be considered for insertion in the stack. Therefore, we have that $N^{k}_{S} \le N^{k-1}_{S} \le N$. Applying a similar reasoning, we have $N^{n}_{S} \le N$ for n > k.
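The stack-based enumeration can be sketched as follows. This is a minimal reading of the procedure described in this subsection rather than a transcription of Algorithm 8: the function names next_candidate and enumerate_interval are hypothetical, each list is reduced to its scores, and the pruning uses only the test on maxS and minS (which at the last chunk reduces to B 1 ≤ uScore ≤ B 2 because the tables end with a 0 entry).

```python
def next_candidate(stack, lists, min_array, max_array, b1, b2):
    """Return the next (score, indices) with b1 <= score <= b2, or None."""
    n = len(lists)
    while stack:
        a_score, i, j, indices = stack.pop()
        # Remember the sibling vertex so the search can resume there later.
        if j + 1 < len(lists[i]):
            stack.append((a_score, i, j + 1, indices))
        u_score = a_score + lists[i][j]
        max_s = u_score + max_array[i + 1]
        min_s = u_score + min_array[i + 1]
        # Prune paths that can no longer end inside [b1, b2].
        if max_s >= b1 and min_s <= b2:
            if i == n - 1:
                return u_score, indices + [j]
            stack.append((u_score, i + 1, 0, indices + [j]))
    return None


def enumerate_interval(lists, b1, b2):
    """Yield every key candidate whose total score lies in [b1, b2]."""
    n = len(lists)
    min_array = [0] * (n + 1)
    max_array = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        min_array[i] = min(lists[i]) + min_array[i + 1]
        max_array[i] = max(lists[i]) + max_array[i + 1]
    stack = [(0, 0, 0, [])]
    while stack:
        cand = next_candidate(stack, lists, min_array, max_array, b1, b2)
        if cand is not None:
            yield cand  # a caller would test the candidate here
```

Each iteration pops one tuple and pushes at most one sibling and one child, so the stack never holds more than one pending tuple per chunk, in line with the bound discussed above.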

Parallelization
One of the most interesting features of the previous algorithm is that it is parallelizable. As a parallelization method, the original authors suggested running instances of the algorithm over different disjoint intervals [16]. This method is effective and has a potential advantage, since the different instances will produce nonoverlapping lists of key candidates, with the instance searching over the first interval producing the most likely key candidates. However, it is not efficient, since each instance will inevitably repeat a lot of the work done by the other instances. Here, we propose another parallelization method that partitions the search space to avoid the repetition of work.
Suppose that we want to have t parallel, independent tasks T 1 , T 2 , T 3 , . . . , T t to search over a given interval in parallel. Let L i = [c i 0 , c i 1 , . . . , c i m i −1 ] be the list of chunk candidates for chunk i, 0 ≤ i ≤ N − 1. We first assume that t ≤ m 0 , where m 0 is the size of L 0 . In order to construct these tasks, we partition L 0 into t disjoint, roughly equal-sized sublists L 0 j , 1 ≤ j ≤ t. We set each task T j to perform its enumeration over the given interval but only consider the lists of chunk candidates L 0 j , L 1 , . . . , L N −1 . Note that the previous strategy can be easily generalised for m 0 < t ≤ ∏ N −1 k=0 m k . Indeed, first, find the smallest integer l, with 0 < l < N − 1, such that ∏ l−1 k=0 m k < t ≤ ∏ l k=0 m k . We then construct the list of chunk candidates L 0,...,l as follows. For each (l + 1)-tuple (c 0 j 0 , c 1 j 1 , . . . , c l j l ), with c k j k ∈ L k , 0 ≤ j k < m k , 0 ≤ k ≤ l, the chunk candidate c j 0 ,j 1 ,...,j l is constructed by calculating c j 0 ,j 1 ,...,j l .score = ∑ l k=0 c k j k .score and by setting c j 0 ,j 1 ,...,j l .value = [c 0 j 0 .value, . . . , c l j l .value], and then, c j 0 ,j 1 ,...,j l is added to L 0,...,l . We then partition L 0,...,l into t disjoint, roughly equal-sized sublists L 0,...,l j , 1 ≤ j ≤ t and finally set each task T j to perform its enumeration over the given interval but only consider the lists of chunk candidates L 0,...,l j , L l+1 , . . . , L N −1 . Note that the workload assigned to each enumerating task is a consequence of the selected method for partitioning the list L 0,...,l .
Additionally, both parallelization methods can be combined by partitioning the given interval [B 1 , B 2 ] into n s disjoint subintervals and by searching each such subinterval with t k tasks, hence amounting to ∑ n s k=1 t k enumerating tasks.
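For the case t ≤ m 0 , the construction of the per-task inputs can be sketched as follows. The function names partition and task_inputs are hypothetical, and splitting L 0 into contiguous sublists is only one possible choice; as noted above, the partitioning method determines the workload of each task.

```python
def partition(items, t):
    """Split items into t disjoint, roughly equal-sized contiguous sublists."""
    q, r = divmod(len(items), t)
    parts, start = [], 0
    for j in range(t):
        end = start + q + (1 if j < r else 0)  # first r parts get one extra
        parts.append(items[start:end])
        start = end
    return parts


def task_inputs(lists, t):
    """Return, for each task T_j, the chunk lists it enumerates over.

    Assumes t <= len(lists[0]); only the first list is partitioned.
    """
    return [[sub] + lists[1:] for sub in partition(lists[0], t)]
```

Each task then runs the interval enumeration on its own input, so no key candidate is visited by two tasks.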

Threshold Algorithm
Algorithm 8 shares some similarities with the algorithm Threshold introduced in the research paper [14], since Threshold also makes use of an array (partialSum) similar to the array minArray to speed up the pruning process. However, Threshold works with nonnegative integer values (weights) rather than scores. Threshold restricts the scores to weights such that the smallest weight is the likeliest score by making use of a function that converts scores into weights [14]. Threshold then enumerates all the key candidates whose accumulated total weight lies in a range of the form [0, W t ), where W t is a parameter. To do so, it performs a process similar to that of Algorithm 8, using its precomputed table (partialSum) to avoid useless paths, hence improving the pruning process. This enumeration process performed by Threshold is described in Algorithm 9.
According to its designers, this algorithm may perform a nonoptimal enumeration to a depth of $2^{40}$ if some adjustments are made in the data structure L used to store the key candidates. However, its primary drawback is that it must always start enumerating from the most likely key. Consequently, whilst the simplicity and relatively strong time complexity of Threshold are desirable, in a parallelized environment, it can only serve as the first enumeration algorithm (or can only be used in the first search task). Threshold, therefore, was not implemented and, hence, is not included in the comparison made in Section 5.

for j = 0 to m i do
3: newW ← w + c i j .score;
4: if (newW + partialSum[i]) > W t ) then
5: break;
6: else
7: if i = N − 1 then
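The pruning idea behind Threshold can be sketched as follows. This is a hedged illustration, not the code of [14]: the function name threshold is hypothetical, each list is reduced to nonnegative integer weights sorted in increasing order, partial_sum[i] is assumed to hold the sum of the smallest weights of chunks i + 1 to N − 1, and the comparison uses ≥ so that the accepted totals lie in [0, W t ) as stated in the text.

```python
def threshold(weights, w_t):
    """Return all (total_weight, indices) with total_weight < w_t.

    weights[i] lists the weights of chunk i in increasing order.
    """
    n = len(weights)
    # partial_sum[i]: least weight still to be added after chunk i.
    partial_sum = [0] * n
    for i in range(n - 2, -1, -1):
        partial_sum[i] = partial_sum[i + 1] + weights[i + 1][0]

    out = []

    def rec(i, w, idx):
        for j, wt in enumerate(weights[i]):
            new_w = w + wt
            # Weights only grow with j, so no later j can succeed either.
            if new_w + partial_sum[i] >= w_t:
                break
            if i == n - 1:
                out.append((new_w, idx + [j]))
            else:
                rec(i + 1, new_w, idx + [j])

    rec(0, 0, [])
    return out
```

The early break is what ties the enumeration to the most likely key: every explored path starts from the smallest weights of each chunk.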

A Weight-Based Key Enumeration Algorithm
In this subsection, we will describe a nonoptimal enumeration algorithm based on the algorithm introduced in the research paper [12]. This algorithm differs from the original algorithm in the manner in which it builds a precomputed table (iRange) and uses it during execution to construct key candidates whose total accumulated score is equal to a given accumulated score. This algorithm shares similarities with the stack-based, depth-first key enumeration algorithm described in Section 4.5 because both algorithms essentially perform a depth-first search in the undirected graph G. However, this algorithm controls pruning by the accumulated total score that a key candidate must reach to be accepted. To achieve this, the scores are restricted to positive integer values (weights), which may be derived from a correlation value in a side-channel analysis attack.

This algorithm starts off by generating all key candidates with the largest possible accumulated total weight W 1 and then proceeds to generate all key candidates whose accumulated total weight is equal to the second largest possible accumulated total weight W 2 , and so forth, until it generates all key candidates with the minimum possible accumulated total weight W N . To find a key candidate with a weight equal to a certain accumulated weight, this algorithm makes use of a simple backtracking strategy, which is efficient because impossible paths can be pruned early. The pruning is controlled by the accumulated weight that must be reached for the solution to be accepted. To achieve a fast decision process during backtracking, this algorithm precomputes tables for the minimal and maximal accumulated total weights that can be reached by completing a path to the right, like the tables minArray and maxArray introduced in Section 4.5.

Additionally, this algorithm precomputes an additional table, iRange, which holds lists of chunk candidate indices. The algorithm uses these indices to construct a chunk candidate with an accumulated score w from chunk i to chunk N − 1.
In order to compute this .size() > 0. This helps in constructing a key candidate with an accumulated score w from chunk 0 to chunk N − 1. In particular, TWeights may be set to [W 1 , W 2 , . . . , W N ], i.e., the array containing all possible accumulated scores that can be reached from chunk 0 to chunk N − 1.
Furthermore, the order in which the elements in the array TWeights are arranged is important. For this array [W 1 , W 2 , . . . , W N ], for example, the algorithm will first enumerate all key candidates with accumulated weight W 1 and then all those with accumulated weight W 2 and so on. This guarantees a certain quality, since good key candidates will be enumerated earlier than worse ones. However, key candidates with the same accumulated weight will be generated in no particular order, so a lack of precision in converting scores to weights will lead to some decrease of quality.
Algorithm 11 enumerates key candidates for given weights.

We will now analyse Algorithm 11. Suppose that w ∈ TWeights; hence, iRange[0][w].size() > 0. The algorithm will then set k[0] to (0, e , and then set cw to w (lines 3 to 5). We claim that the main while loop (lines 6 to 23) at each iteration will compute k[i] for 0 ≤ i ≤ N − 1 such that the key candidate c constructed at line 12 will have an accumulated score w.
Let us set cw 0 = w.  Since there may be more than one key candidate with an accumulated score w, the second inner while loop (lines 14 to 19) will backtrack to a chunk 0 ≤ i < N , from which a new key candidate with accumulated score w can be constructed. This is done by simply moving backwards (line 15) and updating cw i+1 to cw

1. If there is such an i, then the instruction at line 21 1)). This means that the updated value for the second component of k[i] will be a valid index in L i , so c i k[i].e2 will be the new chunk candidate for chunk i. Then, the first inner while loop (lines 7 to 11) will again execute and compute the indices for the remaining chunk candidates in the lists L i+1 , . . . , L N −1 such that the resulting key candidate will have the accumulated score w.

2. Otherwise, if i < 0, then the main while loop (lines 6 to 23) will end and w will be set to a new value from TWeights, since all key candidates with an accumulated score w have just been enumerated.
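The role of the precomputed table can be illustrated with the sketch below. It is an interpretation under stated assumptions rather than a transcription of Algorithms 10 and 11: the function names build_irange and enumerate_weight are hypothetical, each list is reduced to integer weights, the table is stored as one dictionary per chunk instead of an array indexed by weight, and the backtracking is written recursively instead of with the explicit array k.

```python
def build_irange(weights):
    """irange[i][w] lists the indices j of chunk i from which the total
    weight w can be completed using chunks i..N-1."""
    n = len(weights)
    irange = [dict() for _ in range(n + 1)]
    irange[n][0] = [0]  # empty suffix: only weight 0 is reachable
    for i in range(n - 1, -1, -1):
        for j, wt in enumerate(weights[i]):
            for rest in irange[i + 1]:
                irange[i].setdefault(wt + rest, []).append(j)
    return irange


def enumerate_weight(weights, irange, w):
    """Yield every index tuple whose chunk weights sum to exactly w."""
    n = len(weights)

    def rec(i, cw, idx):
        if i == n:
            yield tuple(idx)
            return
        # Only indices that can still reach the remaining weight cw.
        for j in irange[i].get(cw, []):
            yield from rec(i + 1, cw - weights[i][j], idx + [j])

    yield from rec(0, w, [])
```

Because every index taken from the table is guaranteed to lead to a full candidate, the search never enters a dead end; in this sketch, the keys of irange[0] are exactly the accumulated weights that can appear in TWeights.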

Parallelization
Suppose we would like to have t tasks T 1 , T 2 , T 3 , · · · , T t executed in parallel to enumerate key candidates of which the accumulated total weights are equal to those in the array TWeights. We can split the array TWeights into t disjoint sub-arrays TWeights i and then set each task T i to run Algorithm 11 through the sub-array TWeights i . As an example of a partition algorithm to distribute the workload among the tasks, we set the sub-array TWeights i to contain elements with indices congruent to i mod t from TWeights. Additionally, note that, if we have access to the number of candidates to be enumerated for each score in the array TWeights beforehand, we may design a partition algorithm for distributing the workload among the tasks almost evenly.
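The example partition by indices modulo t can be sketched in one line. The function name split_round_robin is hypothetical, and the tasks are numbered from 0 here to match Python's zero-based indexing.

```python
def split_round_robin(t_weights, t):
    """Sub-array i receives the elements whose index is congruent to i mod t."""
    return [t_weights[i::t] for i in range(t)]
```

Slicing preserves the original order inside each sub-array, so each task still enumerates its better weights first.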

Run Times
We assume each list of chunk candidates L i = [c i 0 , c i 1 , . . . , c i m i −1 ], 0 ≤ i < N , is in decreasing order based on the score component of its chunk candidates. Regarding the run time for computing the tables maxArray and minArray, note that each entry of the table minArray(maxArray) can be computed as explained in Section 4.5. Therefore, the run time of such an algorithm is Θ(N ).
Regarding the run time for computing iRange, we will analyse Algorithm 10. This algorithm is composed of three For blocks. For each i, 0 ≤ i < N , the For loop from line 4 to line 15 will be executed r i times, where r i = maxArray[i] − minArray[i] + 1. For each iteration, the innermost For block (lines 6 to 11) will execute simple instructions m i times. Therefore, once the innermost block has finished, its run time will be T 3 · m i + C 3 , where T 3 and C 3 are constants. Then, the if block (lines 12 to 14) will be attempted and its run time will be C 2 , where C 2 is another constant. Therefore, the run time for an iteration of the For loop (lines 4 to 15) will be T 3 · m i + C 2 + C 3 . Therefore, the run time of Algorithm 10 is ∑ N −1 i=0 r i (T 3 · m i + C 2 + C 3 ). More specifically, As noted, this run time depends heavily on r i = maxArray[i] − minArray[i] + 1. Now, the size of the range [minArray[i], maxArray[i]] relies on the scaling technique used to get a positive integer from a real number. The more accurate the scaling technique is, the more different integer scores there will be. Hence, if we use an accurate scaling technique, we will probably get larger r i .
We will now analyse the run time for Algorithm 11 to generate all key candidates whose total accumulated weight is w. Let us assume there are N w key candidates whose total accumulated score is equal to w.
First, the run time for instructions at lines 3 to 5 is constant. Therefore, we will only focus on the while loop (lines 6 to 23). In any iteration, the first inner while loop (lines 7 to 11) will execute and compute the indices for the remaining chunk candidates in the lists L i , . . . , L N −1 , with i starting at any number in [0, N − 2], such that the resulting key candidate will have the accumulated score w. Therefore, its run time is at most C · (N − 1), where C is a constant, i.e., it is O(N ). The instruction at line 12 will combine all chunks from 0 to N − 1, and hence, its run time is also O(N ). The next instruction Test(c) will test c, and its run time will depend on the scenario in which the algorithm is being run. Let us assume its run time is O(T(N )), where T is a function.
Regarding the second inner while loop (lines 14 to 19), this loop will backtrack to a chunk i with 0 ≤ i < N , from which a new key candidate with accumulated score w can be constructed. This is done by simply moving backwards while computing some simple operations. Therefore, the run time for the second inner while loop is at most D · (N − 1), where D is a constant, i.e., it is O(N ). Therefore, the run time for generating all key candidates of which the total accumulated score is w will be O(N w · (N + T(N ))).

Memory Consumption
Besides the precomputed tables, it is easy to see that Algorithm 11 makes use of negligible memory while enumerating key candidates. Indeed, testing key candidates is done on the fly to avoid storing them during enumeration. However, the table iRange may have many entries.
Let N e be the number of entries of the  L (i,w) is non-empty. After the iteration for i has been executed, the table iRange will have |W i | new entries, each of which will point to a non-empty list, with 0 < |W i | ≤ r i . Therefore, N e = 1 + ∑ N −1 i=0 |W i | after Algorithm 10 has completed its execution. Note that |W i | may increase if the range [minArray[i], maxArray[i]] is large. The size of this interval relies on the scaling technique used to get a positive integer from a real number. The more accurate the scaling technique is, the more different integer scores there will be. Hence, if we use an accurate scaling technique, we will probably get larger r i , making it likely for |W i | to increase. Therefore, the table iRange may have many entries.
Regarding the number of bits used in memory to store the table iRange, let us suppose that an integer is stored in B int bits and that a pointer is stored in B p bits. Once Algorithm 10 has completed its execution, we know that iRange[i][w] will point to the list L (i,w) , with 0 ≤ i ≤ N and w ∈ W i . Moreover, by definition, we know that the list L (N ,0) will be the list [0], while any other list L (i,w) , 0 ≤ i < N and w ∈ W i , will have n (i,w) entries, with 1 ≤ n (i,w) ≤ m i . Therefore, the number of bits iRange occupies in memory after Algorithm 10 has completed its execution is

A Key Enumeration Algorithm using Histograms
In this subsection, we will describe a nonoptimal key enumeration algorithm introduced in the research paper [17].

Setup
We now introduce a couple of tools that we will use to describe the sub-algorithms used in the algorithm of the research paper [17], using the following notations: H will denote a histogram, N b will denote a number of bins, b will denote a bin, and x will denote a bin index.

Linear Histograms
The function $H_i$ = createHist($L_i$, $N_b$) creates a standard histogram from the list of chunk candidates $L_i$ with $N_b$ linearly spaced bins.
Given a list of chunk candidates $L_i$, the function createHist will first calculate both the minimum score min and the maximum score max among all the chunk candidates in $L_i$. It will then partition the interval $I = [\min, \max]$ into $N_b$ subintervals $I_0 = [\min, \min + \sigma)$, $I_1 = [\min + \sigma, \min + 2\sigma)$, ..., $I_{N_b-1} = [\min + (N_b - 1)\sigma, \max]$, where $\sigma = (\max - \min)/N_b$. It will then proceed to build the list $L_{H_i}$ of size $N_b$.
The entry $0 \le x < N_b$ of $L_{H_i}$ will point to a list that contains all chunk candidates from $L_i$ whose scores lie in $I_x$. The returned standard histogram $H_i$ is therefore stored as the list $L_{H_i}$, whose entries point to lists of chunk candidates. For a given bin index $x$, $L_{H_i}$.get(x) outputs the list of chunk candidates contained in the bin of index $x$ of $H_i$. Therefore, $H_i[x]$ = $L_{H_i}$.get(x).size() is the number of chunk candidates in the bin of index $x$ of $H_i$. The run time for createHist($L_i$, $N_b$) is $\Theta(m_i + N_b)$, since the list $L_i$ is traversed a constant number of times and $N_b$ bins are created.
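The binning step can be illustrated with a minimal Python sketch. The function name create_hist, the (score, value) tuple layout, and the handling of the degenerate case in which all scores are equal are assumptions made for this illustration and are not taken from [17].

```python
def create_hist(chunk_candidates, num_bins):
    """Group (score, value) chunk candidates into num_bins linearly spaced bins."""
    scores = [score for score, _ in chunk_candidates]
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins  # sigma in the text
    bins = [[] for _ in range(num_bins)]
    for cand in chunk_candidates:
        if width == 0:
            x = 0  # assumption: if all scores are equal, use the first bin
        else:
            # the maximum score is clamped into the last (closed) interval
            x = min(int((cand[0] - lo) / width), num_bins - 1)
        bins[x].append(cand)
    return bins


def counts(hist):
    """Return H_i[x] for every bin index x."""
    return [len(b) for b in hist]


L0 = [(0.0, [0]), (0.2, [1]), (0.5, [2]), (1.0, [3])]
H0 = create_hist(L0, 4)
print(counts(H0))
```

Each bin keeps the candidates themselves, so that get(x) can later return them, while counts gives the integer view used by the convolution.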

Convolution
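This is the usual convolution algorithm which computes $H_{1:2}$ = conv($H_1$, $H_2$) from two histograms $H_1$ and $H_2$ of sizes $n_1$ and $n_2$, respectively, where $H_{1:2}[k] = \sum_{i+j=k} H_1[i] \cdot H_2[j]$. To this end, each histogram $H_j$ is seen as the coefficient representation of the polynomial $P_j(x) = H_j[0] + H_j[1]x + \cdots + H_j[n_j - 1]x^{n_j - 1}$ for $j = 1, 2$. In order to get $H_{1:2}$, we multiply the two polynomials of degree-bound $n = \max(n_1, n_2)$ in time $\Theta(n \log n)$, with both the input and output representations in coefficient form [30]. The convoluted histogram $H_{1:2}$ is therefore stored as a list of integers.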
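As an illustration of what conv computes, the following Python sketch multiplies the two coefficient lists directly. This schoolbook version runs in quadratic time and is only meant to show the result; it is not the FFT-based multiplication referred to in the text.

```python
def conv(h1, h2):
    """Convolve two histograms given as lists of bin counts."""
    out = [0] * (len(h1) + len(h2) - 1)
    for i, a in enumerate(h1):
        for j, b in enumerate(h2):
            # every pair of bins (i, j) contributes to the bin of index i + j
            out[i + j] += a * b
    return out


H_01 = conv([2, 0, 1, 1], [1, 1])
print(H_01)
```

The sum of the entries of the result equals the product of the numbers of chunk candidates in the two input histograms, i.e., the number of combined candidates.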

Getting the Size of a Histogram
The method size() returns the number of bins of a histogram. This method simply returns L.size(), where L is the underlying list used to represent the histogram.

Getting Chunk Candidates from a Bin
Given a standard histogram H i and an index 0 ≤ x < H i .size(), the method H i .get(x) outputs the list of all chunk candidates contained in the bin of index x of H i , i.e., this method simply returns the list L H i .get(x).

Complete Algorithm
This key enumeration algorithm uses histograms to represent scores, and the first step of the key enumeration is a convolution of histograms modelling the distribution of the N lists of scores. This step is detailed in Algorithm 12.
Algorithm 12 computes standard and convoluted histograms. Based on this first step, this key enumeration algorithm allows enumerating key candidates that are ranked between two bounds $R_1$ and $R_2$. In order to enumerate all keys ranked between the bounds $R_1$ and $R_2$, the corresponding indices of bins of $H_{0:N-1}$ have to be computed, as described in Algorithm 13. It simply sums the number of key candidates contained in the bins, starting from the bin containing the highest scoring key candidates, until we exceed $R_1$ and $R_2$, and returns the corresponding indices $x_{start}$ and $x_{stop}$. Once the bin counts accumulated in $cnt_{start}$ have reached $R_1$, steps 8 to 15 of Algorithm 13 read:

    8: $x_{start}$ ← start;
    9: while $cnt_{start} < R_2$ do
    10: start ← start − 1;
    11: $cnt_{start}$ ← $cnt_{start}$ + $H_{0:N-1}$[start];
    12: end while
    13: $x_{stop}$ ← start;
    14: return $x_{start}$, $x_{stop}$;
    15: end function

Given the list of histograms of scores H and the indices of bins of $H_{0:N-1}$ between which we want to enumerate, the enumeration simply consists of performing a backtracking over all the bins between $x_{start}$ and $x_{stop}$. More precisely, during this phase, we recover the bins of the initial histograms (i.e., before convolution) that were used to build a bin of the convoluted histogram $H_{0:N-1}$. For a given bin $b$ with index $x$ of $H_{0:N-1}$, we have to run through all the non-empty bins $b_0, \ldots, b_{N-1}$ of indices $x_0, \ldots, x_{N-1}$ of $H_0, \ldots, H_{N-1}$ such that $x_0 + \cdots + x_{N-1} = x$. Each $b_i$ will then contain at least one and at most $m_i$ chunk candidates of the list $L_i$ that we must enumerate. This leads to storing a table kf of $N$ entries, each of which points to a list of chunk candidates. The list pointed to by the entry kf[i] holds at least one and at most $m_i$ chunk candidates contained in the bin $b_i$ of the histogram $H_i$. Any combination of these $N$ lists, i.e., picking an entry from each list, results in a key candidate.
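The computation of the bin indices performed by Algorithm 13 can be sketched in Python as follows. The function name compute_bounds, the initial values of start and cnt, and the guard start > 0 (which stops the loops when fewer than $R_2$ key candidates exist) are assumptions made for this illustration.

```python
def compute_bounds(h_conv, r1, r2):
    """Return (x_start, x_stop) for ranks r1 <= r2 over the convoluted histogram."""
    start = len(h_conv)
    cnt = 0
    # walk down from the highest-scoring bin until r1 key candidates are covered
    while cnt < r1 and start > 0:
        start -= 1
        cnt += h_conv[start]
    x_start = start
    # keep walking down until r2 key candidates are covered
    while cnt < r2 and start > 0:
        start -= 1
        cnt += h_conv[start]
    x_stop = start
    return x_start, x_stop


print(compute_bounds([2, 2, 1, 2, 1], 1, 4))
```

Because whole bins are taken, the number of key candidates lying in the bins from $x_{stop}$ to $x_{start}$ may exceed $R_2 - R_1 + 1$.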
Algorithm 14 describes more precisely this bin decomposition process. This algorithm simply follows a recursive decomposition. That is, in order to enumerate all the key candidates within a bin $b$ of index $x$ of $H_{0:N-1}$, it looks for non-empty bins of $H_{N-1}$ and $H_{0:N-2}$ whose indices sum to $x$, makes kf[N − 1] point to the chunk candidates of the former, and recurses on the latter. Steps 10 to 19 of Algorithm 14, covering the end of the base case and the recursive case, read:

    10: x ← x − 1;
    11: end while
    12: else
    13: x ← $H_i$.size() − 1;
    14: while (x ≥ 0) and (x + $H_{0:i-1}$.size()) ≥ $x_{bin}$ do
    15: if $H_i$[x] > 0 and $H_{0:i-1}$[$x_{bin}$ − x] > 0 then
    16: kf[i] ← $H_i$.get(x);
    17: DecomposeBin(H, i − 1, $x_{bin}$ − x, kf);
    18: end if
    19: x ← x − 1;

Suppose we would like to have $t$ tasks $T_1, T_2, T_3, \ldots, T_t$ executing in parallel to enumerate key candidates that are ranked between two bounds $R_1$ and $R_2$. We can then calculate the indices $x_{start}$ and $x_{stop}$ and create the array $X = [x_{start}, x_{start} - 1, \ldots, x_{stop}]$. We then partition the array $X$ into $t$ disjoint sub-arrays $X_i$ and finally set each task $T_i$ to call the function decomposeBin for all the bins of $H_{0:N-1}$ with indices in $X_i$.
As has been noted previously, the algorithm employed to partition the array $X$ directly allows efficient parallel key enumeration, where the amount of computation performed by each task may be well balanced. An example of a partition algorithm that could almost evenly distribute the workload among the tasks is as follows:

1. Set $i$ to 0.
2. If $X$ is non-empty, pick an index $x$ in $X$ such that $H_{0:N-1}[x]$ is the maximum number or else return $X_1, X_2, \ldots, X_t$.
3. Remove $x$ from the array $X$, and add it to the array $X_{i+1}$.
4. Update $i$ to $(i + 1) \bmod t$, and go back to Step 2.
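The four steps above can be sketched in Python as follows. The function name partition_bins and the tie-breaking rule (the first index with the maximum count is picked) are assumptions made for this illustration.

```python
def partition_bins(indices, h_conv, t):
    """Distribute bin indices over t tasks, heaviest bins first, round-robin."""
    remaining = list(indices)
    parts = [[] for _ in range(t)]
    i = 0
    while remaining:
        # Step 2: pick the index whose bin holds the most key candidates
        x = max(remaining, key=lambda idx: h_conv[idx])
        # Step 3: move it to the sub-array of the current task
        remaining.remove(x)
        parts[i].append(x)
        # Step 4: move on to the next task
        i = (i + 1) % t
    return parts


h_conv = [2, 2, 1, 2, 1]
parts = partition_bins([4, 3, 2, 1], h_conv, 2)
print(parts, [sum(h_conv[x] for x in p) for p in parts])
```

In this small example, both tasks receive bins holding three key candidates in total.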
Algorithm 15 processes table kf. Besides the precomputed histograms, which are stored as arrays in memory, it is easy to see that this algorithm makes use of negligible memory (only table kf) while enumerating key candidates. Additionally, it is important to note that each time the function processKF is called, it will need to generate all key candidates obtained by picking chunk candidates from the N lists pointed to by the entries of kf and to process all of them immediately, since the table kf may have changed. This implies that, if the processing of key candidates is left to be done after the complete enumeration has finished, each version of the table kf would need to be stored, which, again, might be problematic in terms of memory consumption.
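The interplay between the bin decomposition and the processing of the table kf can be sketched in Python as follows. This is a simplified variant written for illustration: the recursion goes down to the histogram $H_0$ instead of treating the last two histograms together, convs[i] is assumed to hold the bin counts of $H_{0:i}$ (with convs[0] holding those of $H_0$), and process_kf merely collects key candidates instead of testing them.

```python
from itertools import product


def process_kf(kf, out):
    """Generate every key candidate obtained by picking one entry per list in kf."""
    for combo in product(*kf):
        out.append(tuple(value for _score, value in combo))


def decompose_bin(hists, convs, i, x_bin, kf, out):
    """Enumerate the key candidates in bin x_bin of H_{0:i}."""
    if i == 0:
        if 0 <= x_bin < len(hists[0]) and hists[0][x_bin]:
            kf[0] = hists[0][x_bin]
            process_kf(kf, out)  # kf is consumed immediately, before it changes
        return
    for x in range(len(hists[i]) - 1, -1, -1):
        rest = x_bin - x
        if hists[i][x] and 0 <= rest < len(convs[i - 1]) and convs[i - 1][rest] > 0:
            kf[i] = hists[i][x]
            decompose_bin(hists, convs, i - 1, rest, kf, out)


# two chunks; each bin is a list of (score, value) chunk candidates
H0 = [[(0.0, "a"), (0.2, "b")], [], [(0.5, "c")], [(1.0, "d")]]
H1 = [[(0.1, "x")], [(0.9, "y")]]
hists = [H0, H1]
convs = [[2, 0, 1, 1], [2, 2, 1, 2, 1]]
found = []
decompose_bin(hists, convs, 1, 3, [None, None], found)
print(found)
```

The only state kept during the enumeration is the table kf itself, which is overwritten as the recursion proceeds; this is why the key candidates must be processed as soon as they are generated.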
Regarding how many bits in memory the precomputed histograms consume, we will analyse Algorithm 12. First, note that, for a given list of chunk candidates $L_i$ and $N_b$, the function createHist($L_i$, $N_b$) will return the standard histogram $H_i$. This standard histogram will be stored as the list $L_{H_i}$ of size $N_b$. An entry $x$ of $L_{H_i}$ will point to a list of chunk candidates. The total number of chunk candidates held by all the lists pointed to by the entries of $L_{H_i}$ is $m_i$. Therefore, the number of bits to store the list $L_{H_i}$ is $B_p \cdot N_b + B_c \cdot m_i$, where $B_p$ is the number of bits to store a pointer and $B_c$ is the number of bits to store a chunk candidate (score, [e]). The total number of bits to store all lists $L_{H_i}$, $0 \le i < N$, is

$$N \cdot B_p \cdot N_b + B_c \cdot \sum_{i=0}^{N-1} m_i. \qquad (1)$$

Concerning the convoluted histograms, let us first look at $H_{0:1}$ = conv($H_0$, $H_1$). We know that $H_{0:1}$ is stored as a list of integers and that these entries can be seen as the coefficients of the polynomial resulting from multiplying the two polynomials of degree $N_b - 1$ whose coefficients are the entries of $H_0$ and $H_1$, respectively. Therefore, the list of integers used to store $H_{0:1}$ has $2 \cdot N_b - 1$ entries. Following a similar reasoning, we can conclude that the list of integers used to store $H_{0:2}$ = conv($H_2$, $H_{0:1}$) has $3 \cdot N_b - 2$ entries. Therefore, for a given $i$, $1 \le i \le N - 1$, the list of integers used to store $H_{0:i}$ has $(i + 1) \cdot N_b - i$ entries. The total number of entries of all the convoluted histograms $H_{0:1}, H_{0:2}, \ldots, H_{0:N-1}$ is

$$\sum_{i=1}^{N-1} \left((i + 1) \cdot N_b - i\right) = (N_b - 1) \cdot \frac{(N - 1) N}{2} + (N - 1) \cdot N_b.$$

As expected, the total number of entries strongly depends on the values $N_b$ and $N$. If an integer is stored in $B_{int}$ bits, then the number of bits for storing all the convoluted histograms is

$$B_{int} \cdot (N_b - 1) \cdot \frac{(N - 1) N}{2} + B_{int} \cdot (N - 1) \cdot N_b. \qquad (2)$$
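The two memory estimates can be checked numerically with a short Python sketch. The function names are hypothetical; the default values $B_p = 32$, $B_c = 64$, and $B_{int} = 32$ are the ones used later in the experimental section, and the example takes $N = 16$ lists of 8 chunk candidates each with $N_b = 60$.

```python
def standard_hist_bits(n_chunks, n_bins, sizes, b_p=32, b_c=64):
    """Bits needed for all standard histograms: N * B_p * N_b + B_c * sum(m_i)."""
    return n_chunks * b_p * n_bins + b_c * sum(sizes)


def convoluted_hist_bits(n_chunks, n_bins, b_int=32):
    """Bits needed for H_{0:1}, ..., H_{0:N-1}; H_{0:i} has (i + 1) * N_b - i entries."""
    entries = sum((i + 1) * n_bins - i for i in range(1, n_chunks))
    return b_int * entries


print(standard_hist_bits(16, 60, [8] * 16))
print(convoluted_hist_bits(16, 60))
```

For these parameters, the first value agrees with $512 \cdot N_b + 8192$ and the second with $3840 \cdot (N_b - 1) + 480 \cdot N_b$.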

Equivalence with the Path-Counting Approach
The stack-based key enumeration algorithm and the score-based key enumeration algorithm can also be used for rank computation (instead of enumerating each path, the rank version counts each path). Similarly, the histogram algorithm can also be used for rank computation by simply summing the size of the corresponding bins in $H_{0:N-1}$. These two approaches were believed to be distinct from each other. However, Martin et al. in the research paper [31] showed that both approaches are mathematically equivalent, i.e., they both compute the exact same rank when choosing their discretisation parameter correspondingly. Particularly, the authors showed that the binning process in the histogram algorithm is equivalent to the "map to weight" float-to-integer conversion used prior to their path counting algorithm (Forest) by choosing the algorithms' discretisation parameter carefully. Additionally, in this paper, a performance comparison between their enumeration versions was carried out. The practical experiments indicated that Histogram performs best for low discretisation and that Forest wins for higher parameters.

Variant
A recent paper by Grosso [26] introduced a variant of the previous algorithm. Basically, the author of [26] makes a small adaptation of Algorithm 14 to take into account the tree-like structure used by their rank estimation algorithm. Also, the author claims this variant has an advantage over the previous one when the memory needed to store histograms is too large.

A Quantum Key Search Algorithm
In this subsection, we will describe a quantum key enumeration algorithm introduced in the research paper [29] for the sake of completeness. This algorithm is constructed from a nonoptimal key enumeration algorithm, which uses the key rank algorithm given by Martin et al. in the research paper [16] to return a single key candidate (the r th ) with a weight in a particular range. We will first describe the key rank algorithm. This algorithm restricts the scores to positive integer values (weights) such that the smallest weight is the likeliest score by making use of a function that converts scores into weights [16].
Assuming the scores have already been converted to weights, the rank algorithm first constructs a matrix $b$ of size $N \times W_2$ for a given range $[W_1, W_2)$ as follows. For $i = N - 1$ and $0 \le w < W_2$, the entry $b_{i,w}$ contains the number of chunk candidates such that their score plus $w$ lies in the given range. Therefore, $b_{i,w}$ is given by the number of chunk candidates $c^i_j$, $0 \le j < m_i$, such that

$$W_1 \le w + c^i_j.\mathrm{score} < W_2.$$

On the other hand, for $i = N - 2, N - 3, \ldots, 0$, and $0 \le w < W_2$, the entry $b_{i,w}$ contains the number of chunk candidates that can be constructed from the chunk $i$ to the chunk $N - 1$ such that their total score plus $w$ lies in the given range. Therefore, $b_{i,w}$ may be calculated as follows:

$$b_{i,w} = \sum_{\substack{0 \le j < m_i \\ w + c^i_j.\mathrm{score} < W_2}} b_{i+1,\, w + c^i_j.\mathrm{score}}.$$
Algorithm 16 describes precisely the manner in which the matrix b is computed. Once matrix b is computed, the rank algorithm will calculate the number of key candidates in the given range by simply returning b 0,0 . Note that b 0,0 , by construction, contains the number of chunk candidates, with initial weight 0, that can be constructed from the chunk 0 to the chunk N − 1 such that their total weight lies in the given range. Algorithm 17 describes the rank algorithm.
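The construction of the matrix $b$ and the rank computation can be sketched in Python as follows. The function names and the representation of the input (weights[i][j] is the non-negative integer weight of the $j$-th candidate of chunk $i$) are assumptions made for this illustration.

```python
def build_matrix(weights, w1, w2):
    """Return b, where b[i][w] counts the ways to complete chunks i..N-1
    so that the total weight, starting from w, lies in [w1, w2)."""
    n = len(weights)
    b = [[0] * w2 for _ in range(n)]
    # last chunk: count candidates that land inside the range
    for w in range(w2):
        b[n - 1][w] = sum(1 for c in weights[n - 1] if w1 <= w + c < w2)
    # remaining chunks, from N - 2 down to 0
    for i in range(n - 2, -1, -1):
        for w in range(w2):
            for c in weights[i]:
                if w + c < w2:
                    b[i][w] += b[i + 1][w + c]
    return b


def rank(weights, w1, w2):
    """Number of key candidates whose total weight lies in [w1, w2)."""
    return build_matrix(weights, w1, w2)[0][0]


weights = [[0, 1, 3], [0, 2], [1, 2]]
print(rank(weights, 0, 4), rank(weights, 4, 8))
```

In this toy instance there are $3 \cdot 2 \cdot 2 = 12$ key candidates, whose total weights range from 1 to 7.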
Algorithm 16 creates the matrix $b$. Its main loop, steps 11 to 14, reads:

    11: for i = N − 2 to 0 do
    12: for w = 0 to $W_2$ − 1 do
    13: for j = 0 to $m_i$ − 1 do
    14: if w + $c^i_j$.score < $W_2$ then

With the help of Algorithm 17, an algorithm for requesting particular key candidates is introduced, which is described in Algorithm 18. It returns the $r$-th key candidate with weight between $W_1$ and $W_2$. Note that the correctness of the function getKey follows from the correctness of $b$ and that the algorithm is deterministic, i.e., given the same $r$, it will return the same key candidate $k$. Also, note that the $r$-th key candidate does not have to be the $r$-th most likely key candidate in the given range.
Equipped with the getKey algorithm, the authors of [29] introduced a nonoptimal key enumeration algorithm to enumerate and test all key candidates in the given range. This algorithm works by repeatedly calling the function getKey to obtain a key candidate in the given range until there are no more key candidates in that range. Each obtained key candidate $k$ is tested by using a testing function T returning either 1 or 0. Algorithm 19 precisely describes how this nonoptimal key enumeration algorithm works.
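One way to realise getKey and the enumeration loop built on top of it is sketched below in Python. The 1-based indexing of $r$, the use of None for $\perp$, and the representation of a key candidate as a list of candidate indices are assumptions made for this illustration; the matrix construction follows the recurrence given for $b$.

```python
def build_matrix(weights, w1, w2):
    """b[i][w]: completions of chunks i..N-1 whose total weight from w is in [w1, w2)."""
    n = len(weights)
    b = [[0] * w2 for _ in range(n)]
    for w in range(w2):
        b[n - 1][w] = sum(1 for c in weights[n - 1] if w1 <= w + c < w2)
    for i in range(n - 2, -1, -1):
        for w in range(w2):
            b[i][w] = sum(b[i + 1][w + c] for c in weights[i] if w + c < w2)
    return b


def get_key(b, weights, w1, w2, r):
    """Return the r-th (1-based) key with weight in [w1, w2), or None if none is left."""
    if r < 1 or r > b[0][0]:
        return None
    n = len(weights)
    key, w = [], 0
    for i in range(n - 1):
        for j, c in enumerate(weights[i]):
            if w + c >= w2:
                continue
            below = b[i + 1][w + c]  # keys reachable through candidate j
            if r <= below:
                key.append(j)
                w += c
                break
            r -= below  # skip all keys that go through candidate j
    for j, c in enumerate(weights[n - 1]):
        if w1 <= w + c < w2:
            r -= 1
            if r == 0:
                key.append(j)
                break
    return key


def key_search(weights, w1, w2, test):
    """Enumerate keys in [w1, w2) until test accepts one; None if none is accepted."""
    b = build_matrix(weights, w1, w2)
    r = 1
    while True:
        k = get_key(b, weights, w1, w2, r)
        if k is None or test(k):
            return k
        r += 1


weights = [[0, 1, 3], [0, 2], [1, 2]]
b = build_matrix(weights, 0, 4)
keys = [get_key(b, weights, 0, 4, r) for r in range(1, b[0][0] + 1)]
print(keys)
```

As noted in the text, the order in which get_key returns key candidates is deterministic but does not follow their likelihood.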

Combining the function keySearch with techniques for searching over partitions independently, the authors of the research paper [29] introduced a key search algorithm, described in Algorithm 20. The function KS works by partitioning the search space into sections whose sizes follow a geometrically increasing sequence using a size parameter $a = O(1)$. This parameter is chosen such that the number of loop iterations is balanced with the number of keys verified per block.

Steps 4 to 14 of Algorithm 19 (function keySearch) read:

    4: while True do
    5: k ← getKey(b, $W_1$, $W_2$, r);
    6: if k = ⊥ then
    7: break;
    8: end if
    9: if T(k) = 1 then
    10: break;
    11: end if
    12: r ← r + 1;
    13: end while
    14: return k;

Steps 5 to 16 of Algorithm 20 (function KS) read:

    5: Choose $W_e$ such that rank(0, $W_e$) is approx e;
    6: while $W_1$ ≤ $W_e$ do
    7: k ← keySearch($W_1$, $W_2$, T);
    8: if k ≠ ⊥ then
    9: return k;
    10: end if
    11: step ← step + 1;
    12: $W_1$ ← $W_2$;
    13: Choose $W_2$ such that rank($W_1$, $W_2$) is approx $a^{step}$;
    14: end while
    15: return ⊥;
    16: end function
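The geometric partition of the search space used by KS can be illustrated with the following Python sketch. For brevity, rank is computed here by brute force over all key candidates, which is only feasible for toy instances; the function names, the unit-step search for the bounds, and the rule that a block is closed as soon as it holds at least $a^{step}$ key candidates are assumptions made for this illustration.

```python
from itertools import product


def rank(weights, w1, w2):
    """Brute-force count of keys whose total weight lies in [w1, w2)."""
    return sum(1 for combo in product(*weights) if w1 <= sum(combo) < w2)


def geometric_blocks(weights, e, a=2):
    """Split [0, W_e) into ranges whose ranks grow roughly like a**step."""
    w_max = sum(max(ws) for ws in weights) + 1
    # smallest W_e such that about e keys have weight below it
    w_e = 0
    while w_e < w_max and rank(weights, 0, w_e) < e:
        w_e += 1
    blocks, w1, step = [], 0, 0
    while w1 < w_e:
        w2 = w1 + 1
        # widen the block until it holds about a**step keys
        while w2 < w_e and rank(weights, w1, w2) < a ** step:
            w2 += 1
        blocks.append((w1, w2))
        w1, step = w2, step + 1
    return blocks


weights = [[0, 1, 3], [0, 2], [1, 2]]
blocks = geometric_blocks(weights, e=12)
print(blocks, [rank(weights, w1, w2) for w1, w2 in blocks])
```

Each block returned by geometric_blocks would then be handed to keySearch in turn, stopping as soon as a key candidate passes the test.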
Having introduced the function KS, the authors of the research paper [29] transformed it into a quantum key search algorithm that heavily relies on Grover's algorithm [32]. This is a quantum algorithm to solve the following problem: Given a black box function which returns 1 on a single input $x$ and 0 on all other inputs, find $x$. Note that, if there are $N$ possible inputs to the black box function, the classical algorithm uses $O(N)$ queries to the black box function since the correct input might be the very last input tested. However, in a quantum setting, a version of Grover's algorithm solves the problem using $O(N^{1/2})$ queries, with certainty [32,33]. Algorithm 21 describes the quantum search algorithm, which achieves a quadratic speedup over the classical key search (Algorithm 20) [29]. However, it would require significant quantum memory and a deep quantum circuit, making its practical application in the near future rather unlikely.

Comparison of Key Enumeration Algorithms
In this section, we will make a comparison of the previously described algorithms. We will show some results regarding their overall performance by computing some measures of interest.

Implementation
All the algorithms discussed in this paper were implemented in Java. This is because the Java platform provides the Java Collections Framework to handle data structures, which reduces programming effort, increases speed of software development and quality, and is reasonably performant. Furthermore, the Java platform also easily supports concurrent programming, providing high-level concurrency application programming interfaces (APIs).

Scenario
In order to make a comparison, we will consider a common scenario in which we will run the key enumeration algorithms to measure their performance. Particularly, we generate a random secret key encoded as a bit string of 128 bits, which is represented as a concatenation of 16 chunks, each of 8 bits.
We use a bit-flipping model, as described in Section 3.2, and set α and β to 0.01 and 0.01, respectively. We then create an original key $k$ (AES key) by picking a random value for each chunk $i$, where $0 \le i < 16$. Once this key $k$ has been generated, its bits will be flipped according to the values α and β to obtain a noisy version of it, $r$. We then use the procedure described in Section 3.2 to assign a score to each of the 256 possible candidate values for each chunk $i$. Therefore, once this procedure has ended its execution, there will be 16 lists, each having 256 chunk candidates.
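The generation of a random instance can be sketched in Python as follows. The sketch rests on assumptions that are not spelled out in this section: α is taken to be the probability that a 0 bit is read as 1 and β the probability that a 1 bit is read as 0, and the score of a chunk candidate is taken to be its log-likelihood given the noisy bits, which stands in for the scoring procedure of Section 3.2. All function names are hypothetical.

```python
import math
import random


def to_bits(value):
    """Big-endian list of the 8 bits of a chunk value."""
    return [(value >> (7 - i)) & 1 for i in range(8)]


def add_noise(bits, alpha, beta, rng):
    """Flip each 0 with probability alpha and each 1 with probability beta."""
    return [b ^ 1 if rng.random() < (alpha if b == 0 else beta) else b for b in bits]


def chunk_score(candidate, noisy_chunk, alpha, beta):
    """Log-likelihood of observing noisy_chunk if the true chunk is candidate."""
    score = 0.0
    for c, r in zip(to_bits(candidate), noisy_chunk):
        if c == 0:
            p = alpha if r == 1 else 1 - alpha
        else:
            p = beta if r == 0 else 1 - beta
        score += math.log(p)
    return score


rng = random.Random(2019)
alpha = beta = 0.01
key = [rng.randrange(256) for _ in range(16)]            # 16 chunks of 8 bits
noisy = [add_noise(to_bits(k), alpha, beta, rng) for k in key]
lists = [[(chunk_score(v, noisy[i], alpha, beta), [v]) for v in range(256)]
         for i in range(16)]
print(len(lists), len(lists[0]))
```

With this scoring, the candidate that matches the noisy chunk bit by bit always receives the highest score, and each additional differing bit lowers the score.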
These 16 lists are then given to an auxiliary algorithm that does the following. For $0 \le i < 16$, this algorithm outputs $2^e$ chunk candidates for the chunk $i$, with $1 \le e \le 8$, ensuring that the original chunk candidate for this chunk is one of the $2^e$ chunk candidates. That is, the secret key $k$ is one out of all the $2^{16 \cdot e}$ key candidates. Therefore, we finally have access to 16 lists, each having $2^e$ chunk candidates, on which we run each of the key enumeration algorithms. Additionally, on execution, the key candidates generated by a particular key enumeration algorithm are not "tested" but rather "verified" by comparing them to the known key. Note that this is done only for the sake of testing these algorithms; in practice, such an auxiliary algorithm may not be available, and the key candidates have to be tested rather than verified.

Results per Algorithms
In order to measure the key enumeration algorithms' overall performance, we simply generate multiple random instances of the scenario. Once a random instance has been generated, each key enumeration algorithm is run for a fixed number of key candidates. For each run of any algorithm, some statistics are collected, particularly the elapsed time to enumerate a fixed number of key candidates. This was done on a machine with an Intel Xeon CPU E5-2667 v2 running at 3.30 GHz with 8 cores. The set of simulations is run by setting $e$ to 3. Therefore, each list has a size of 8 chunk candidates.
By running the optimal key enumeration algorithm (OKEA) from Section 4.1, we find the following issues: it is only able to enumerate at most $2^{30}$ key candidates, and its overall performance decreases as the number of key candidates to enumerate increases. In particular, the number of key candidates considered per millisecond per core ranges from 2336 in a $2^{20}$ enumeration through 1224 in a $2^{25}$ enumeration to 582 in a $2^{30}$ key enumeration. The main reason for this is that its memory usage grows rapidly as the number of key candidates to generate increases. Indeed, using terminology from Section 4.1.4, we have $W = 128 = 2^7$, $w = 8 = 2^3$, so $a = 3$, $b = 4$, so this instance of OKEA creates a tree composed of the root node $R$, the internal nodes $N^{id}_{\lambda}$ for $0 < \lambda \le 3$, $0 \le id < 2^{\lambda}$, and the leaf nodes $L_i$ for $0 \le i < 16$.
A chunk candidate is a 2-tuple of the form (score, value), where score is a float and value is an integer array. Both a float variable and an integer variable are stored in 32 bits. Now, at level 4, the value has only one entry; therefore, $B_4 = 32 + 32 = 64$. At level 3, the value has 2 entries; therefore, $B_3 = 32 + 2(32) = 96$. At level 2, the value has 4 entries; therefore, $B_2 = 5(32) = 160$. Finally, at level 1, the value has 8 entries; therefore, $B_1 = 9(32) = 288$. After $N$ key candidates have been generated, the number of bits $M_N$ used to store chunk candidates by the algorithm is determined by these per-level sizes and by the number of chunk candidates held at each level. We also need to include the number of bits used to store extended candidates internally in each priority queue $N^{id}_{\lambda}$.Q for $0 < \lambda \le 3$, $0 \le id < 2^{\lambda}$ and the priority queue $R$.Q. Therefore, we conclude that, despite all the efforts made for implementing this algorithm in an ingenious way, the algorithm's scalability is mostly affected by its inherent design rather than by a particular implementation.
On the other hand, the bounded-space key enumeration algorithm (BSKEA) with ω = 4, described in Section 4.2, is able to enumerate $2^{30}$, $2^{33}$, and $2^{36}$ key candidates. However, it has a dramatic decrease in its overall performance as the number of key candidates to enumerate increases, similar to OKEA's behaviour. In particular, it is able to enumerate about 4800 key candidates per millisecond per core on average in a $2^{30}$ enumeration, but this value drops to about 1820 key candidates on average in a $2^{36}$ enumeration. The possible reasons for this behaviour are its intrinsic design, its memory consumption, and its implementation. The variant of the bounded-space key enumeration algorithm, introduced in Section 4.4.2, has the same problem as OKEA, i.e., its overall performance (hence, its scalability) is degraded by its excessive memory consumption and it is only able to enumerate up to $2^{30}$ key candidates.
Regarding the key enumeration algorithm using histograms from Section 4.8, we first analyse the algorithm computing the histograms, i.e., Algorithm 12, and the algorithm computing $x_{start}$, $x_{stop}$. These two algorithms were run 100 times for $N_b = 10, 20, \ldots, 100$, $R_1 = 1$, and $R_2 = 2^{30}$. We notice that the run time increases as $N_b$ increases, especially for Algorithm 12, as Figure 5 shows. On the other hand, the other algorithm shows some negligible variations in its run time. Moreover, as expected, we note that the parameter $N_b$ makes the number of bins of $H_{0:N-1}$ increase; therefore, setting this parameter to a proper value helps to ensure that the number of key candidates enumerated while running through the bins between $x_{start}$ and $x_{stop}$ is closer to $R_2 - R_1 + 1 = 2^{30} = 1{,}073{,}741{,}824$. Table 2 shows the number of bins of $H_{0:N-1}$ and the total number of key candidates to be enumerated between bounds $x_{start}$, $x_{stop}$ on average.

Concerning the memory consumed by the arrays used to store histograms, we know that the total number of bits to store all lists $L_{H_i}$, $0 \le i < 16$, is given by Equation (1) from Section 4.8.4. We set $B_p$, which is the number of bits to store a pointer, to 32 bits and set $B_c$, the number of bits to store a chunk candidate (score, value), to 64. Therefore, $N \cdot B_p \cdot N_b + B_c \cdot \sum_{i=0}^{N-1} m_i = 512 \cdot N_b + 8192$. Now, the number of bits for storing all the convoluted histograms is given by Equation (2) from Section 4.8.4. We set $B_{int} = 32$; therefore, $32 \cdot (N_b - 1) \cdot \frac{(15)(16)}{2} + (32 \cdot 15) \cdot N_b = 3840 \cdot (N_b - 1) + 480 \cdot N_b$. Table 3 shows the number of bits for storing both standard histograms and convoluted histograms for values $N_b = 10, 30, 50, 70$, and 100.

We now report results concerning the enumeration algorithm of KEA with histograms, i.e., Algorithm 14. To run this algorithm, we first set the parameter $R_1$ to 1, $R_2$ to $2^z$, where $z = 30, 33, 36$, and $N_b$ to 60.
Once the pre-computation algorithms have ended their execution, we run Algorithm 14 for each bin index in the range calculated by Algorithm 13. We find that this algorithm is able to enumerate $2^{30}$, $2^{33}$, and $2^{36}$ key candidates and that its enumeration rate is between 3500 and 3800 key candidates per millisecond per core. Additionally, as seen, its memory consumption is low.
Concerning the stack-based key enumeration algorithm from Section 4.5, we first compute suitable values for $B_1$ and $B_2$ by employing the convoluted histogram $H_{0:N-1}$ generated by Algorithm 12. We then run Algorithm 8 with parameters $B_1$ and $B_2$ but limit the enumeration over this interval to not exceed the number of key candidates to enumerate; this number is obtained from the previous enumeration. We find that this algorithm is able to enumerate $2^{30}$, $2^{33}$, and $2^{36}$ key candidates and that its enumeration rate is between 3300 and 3500 key candidates per millisecond per core.
Regarding its memory consumption, the stack-based key enumeration algorithm only uses two precomputed arrays, minArray and maxArray, both of which have N + 1 = 17 double entries. Additionally, as pointed out in Section 4.5.3, at any stage of the algorithm, there are at most 16 4-tuples stored in the stack S. Note that a 4-tuple consists of a double entry, two int entries, and an entry holding an int array indices. This array, indices, may have at most 16 entries, each holding an integer value. Therefore, its memory consumption is low.
Lastly, concerning the score-based key enumeration algorithm from Section 4.7, we first run its pre-computation algorithms, i.e., the algorithms for computing the tables minArray, maxArray, and iRange. As was pointed out in Section 4.7.4, the size of table iRange, hence the run time for calculating it, depends heavily on the scaling technique used to get a positive integer (weight) from a real number (score). We particularly use $\mathrm{score} \cdot 10^s$ with $s = 4$ to get an integer score (weight) from a real-valued score. We find that the table iRange has around 15,066 entries on average. Each of these entries points to a list of integers with about 4 entries on average. Therefore, the number of bits to store this table is $64 + (32 \cdot 5)(15{,}066) = 2{,}410{,}624$ on average. Furthermore, we run Algorithm 11 but limit it to not exceed the number of key candidates to enumerate. As a result, we find that this algorithm can enumerate between 2600 and 3000 key candidates per millisecond per core.

Discussion
From the results discussed in Section 5, it can be seen that all key enumeration algorithms except for the optimal key enumeration algorithm (OKEA) and the variant of BSKEA have a much better overall performance and are able to enumerate a higher number of key candidates. In particular, we find that all of them are able to enumerate $2^{30}$, $2^{33}$, and $2^{36}$ key candidates, while OKEA and the variant of BSKEA are only able to enumerate up to $2^{30}$. Their poor performance is caused by their excessive consumption of memory. In particular, OKEA is the most memory-consuming algorithm, which degrades its overall performance and scalability. In general, scalability is low in optimal key enumeration algorithms [18,28], considering that not too many candidates can be enumerated as a result of the exponential growth in their memory consumption. However, by relaxing the restriction on the order in which the key candidates will be enumerated, we are able to design nonoptimal key enumeration algorithms with better overall performance and scalability. In particular, relaxing this restriction on the order allows for the construction of parallelizable and memory-efficient key enumeration algorithms, as was evinced in this paper and the results previously described. Moreover, all the algorithms save for OKEA [12–17] as described in this paper are nonoptimal ones, and their respective descriptions and empirical results show that they are expected to have a better overall performance and to consume far fewer computational resources. Table 4 briefly summarises some qualitative and functional attributes of the described algorithms.
Additionally, note that, when an array is used to store a private key and each entry of this array uses more bits than are needed to represent the small set of values it can take, this redundancy, together with the small number of candidates per chunk, allows us to generate more "reliable" scores for the candidates per chunk (which would make the key enumeration algorithms find the correct key after enumerating far fewer candidates). From an implementer's point of view, this may be mitigated by reducing the redundancy used to store a particular private key.

Conclusions
In this paper, we investigated the key enumeration problem, since there is a connection between the key enumeration problem and the key recovery problem. The key enumeration problem arises in the side-channel attack literature, where, for example, the attacker might procure scoring information for each byte of an AES key from a power analysis attack [34] and then want to efficiently enumerate and test a large number of complete 16-byte candidates until the correct key is found.
In summary, we first stated the key enumeration problem in a general way and then studied and analysed several algorithms to solve this problem, such as the optimal key enumeration algorithm (OKEA); the bounded-space near-optimal key enumeration algorithm; the simple stack-based, depth-first key enumeration algorithm; the score-based key enumeration algorithm; and the key enumeration algorithm using histograms. For each studied algorithm, we described its inner functioning, showing its functional and qualitative features, such as memory consumption, amenability to parallelization, and scalability. Furthermore, we proposed variants of some of them and implemented all of them in Java. We then made an experimental comparison of all of them, drawing special attention to their strengths and weaknesses.
As future research, it would be interesting to find cryptanalysis scenarios to which we could apply key enumeration algorithms together with other techniques. For example, we can think of evaluating the post-quantum cryptographic schemes submitted to the second round of the National Institute of Standards and Technology (NIST) post-quantum cryptography standardization process in the cold boot attack setting [10]. Furthermore, we can think of exploring the use of key enumeration algorithms in cache attacks to achieve full key recovery when insufficient information is gathered [35].
Funding: This research was funded by Colciencias grant number 568 and the APC was funded by Universidad del Norte.

Conflicts of Interest:
The authors declare no conflict of interest.