Improvements on Making BKW Practical for Solving LWE †

Abstract: The learning with errors (LWE) problem is one of the main mathematical foundations of post-quantum cryptography. One of the main groups of algorithms for solving LWE is the Blum–Kalai–Wasserman (BKW) algorithm. This paper presents new improvements of BKW-style algorithms for solving LWE instances. We target minimum concrete complexity, and we introduce a new reduction step where we partially reduce the last position in an iteration and finish the reduction in the next iteration, allowing non-integer step sizes. We also introduce a new procedure in the secret recovery by mapping the problem to binary problems and applying the fast Walsh-Hadamard transform. The complexity of the resulting algorithm compares favorably with all other previous approaches, including lattice sieving. We additionally show the steps of implementing the approach for large LWE problem instances. We provide two implementations of the algorithm, one RAM-based approach that is optimized for speed, and one file-based approach that overcomes RAM limitations by using file-based storage.


Introduction
Since a large-scale quantum computer can efficiently solve both the integer factoring problem and the discrete logarithm problem [1], public-key cryptography needs to be based on other underlying mathematical problems. In post-quantum cryptography, the research area studying such replacements, lattice-based problems are the most promising candidates. In the NIST post-quantum standardization competition, 5 out of 7 finalists and 2 out of 8 alternates are lattice-based [2].
The learning with errors problem (LWE), introduced by Regev in [3], is the main problem in lattice-based cryptography. It has a theoretically very interesting average-case to worst-case reduction to standard lattice-based problems. It has many cryptographic applications, including, but not limited to, the design of fully homomorphic encryption schemes (FHE). An interesting special case of LWE is the learning parity with noise problem (LPN), introduced in [4], which has interesting applications in light-weight cryptography.
Considerable cryptanalytic effort has been made when it comes to algorithms for solving LWE. These can be divided into three categories: lattice-reduction, algebraic methods, and combinatorial methods. The algebraic methods were introduced by Arora and Ge in [5], and further considered in [6]. For very small noise, these methods perform very well, but otherwise the approach is inefficient. The methods based on lattice-reduction are The combinatorial algorithms are all based on the Blum-Kalai-Wasserman (BKW) algorithm, and these will be the focus of this paper.
For surveys on the concrete and asymptotic complexity of solving LWE, see [7][8][9], respectively. In essence, BKW-style algorithms have a better asymptotic performance than lattice-based approaches for parameter choices with large noise. Unlike lattice-based approaches, BKW-style algorithms pay a penalty when the number of samples is limited (like in the Darmstadt challenges and many cryptographic schemes). A recent example of a scheme allowing for a very large number of LWE samples can be found in [10].

Related Work
The BKW algorithm was originally developed as the first subexponential algorithm for solving the LPN problem [11]. In [12], the algorithm was improved, introducing new concepts like LF2 and the use of the fast Walsh-Hadamard transform (FWHT) for the distinguishing phase. A new distinguisher using subspace hypothesis testing was introduced in [13,14].
The BKW algorithm was first applied to the LWE problem in [15]. This idea was improved in [16], where the idea of lazy modulus switching (LMS) was introduced. The idea was improved in [17,18], where [17] introduced so called coded-BKW steps. The idea of combining coded-BKW or LMS with techniques from lattice sieving [19] leads to the next improvement [20]. This combined approach was slightly improved in [9,21]. The distinguishing part of the BKW algorithm for solving LWE was improved by using the fast Fourier transform (FFT) in [22]. One drawback of BKW is its high memory-usage. To remedy this, time-memory trade-offs for the BKW algorithm were recently studied in [23][24][25]. A recent fast implementation of the BKW algorithm for solving LPN is in [26].

Contributions
In this paper, we introduce a new BKW-style algorithm, including the following.
• A generalized reduction step that we refer to as smooth-LMS, allowing us to use non-integer step sizes. These steps allow us to use the same time, space, and sample complexity in each reduction step of the algorithm, which improves performance compared to previous work.
• A binary-oriented method for the guessing phase, transforming the LWE problem into an LPN problem. While the previous FFT method guesses a few positions of the secret vector and finds the correct one, this approach instead finds the least significant bits of a large number of positions using the FWHT. This method allows us to correctly distinguish the secret with a larger noise level, generally leading to an improved performance compared to the FFT-based method. In addition, the FWHT is much faster in implementation.
• Concrete complexity calculations for the proposed algorithm, showing the lowest known complexity for some parameter choices selected as in the Darmstadt LWE Challenge instances, but with an unrestricted number of samples.
• Two implementations of the algorithm that follow two different strategies for memory management. One is fast, light, and uses solely RAM. The other follows a file-based strategy to overcome the memory limitations imposed by using only RAM. The number of file reads and writes is minimized by carefully ordering the operations of the algorithm. Simulation results on solving larger instances are presented and verify the previous theoretical arguments.

Organization
We organize the rest of the paper as follows. We introduce some necessary background in Section 2. In Section 3, we cover previous work on applying the BKW algorithm to the LWE problem. Then, in Section 4, we introduce our new Smooth-LMS reduction method. Next, in Section 5, we go over our new binary-oriented guessing procedure. Sections 6 and 7 cover the complexity analysis and implementation of our algorithm, respectively. Section 8 describes our experimental results using the implementation. Finally, the paper is concluded in Section 9.

Notation
Throughout the paper, we use the following notations.
• We write log(·) for the base 2 logarithm.
• In the n-dimensional Euclidean space $\mathbb{R}^n$, by the norm of a vector $x = (x_1, x_2, \ldots, x_n)$, we consider its $L_2$-norm, defined as $\|x\| = \sqrt{x_1^2 + \cdots + x_n^2}$. The Euclidean distance between vectors x and y in $\mathbb{R}^n$ is defined as $\|x - y\|$.
• Elements in $\mathbb{Z}_q$ are represented by the set of integers in $[-\frac{q-1}{2}, \frac{q-1}{2}]$.
• For an [N, k] linear code, N denotes the code length, and k denotes the dimension.

The LWE and LPN Problems
The LWE problem [3] is defined as follows.
Definition 1. Let n be a positive integer, q a prime, and let $\mathcal{X}$ be an error distribution selected as the discrete Gaussian distribution on $\mathbb{Z}_q$ with variance $\sigma^2$. Fix s to be a secret vector in $\mathbb{Z}_q^n$, chosen from some distribution (usually the uniform distribution). Denote by $L_{s,\mathcal{X}}$ the probability distribution on $\mathbb{Z}_q^n \times \mathbb{Z}_q$ obtained by choosing $a \in \mathbb{Z}_q^n$ uniformly at random, choosing an error $e \in \mathbb{Z}_q$ from $\mathcal{X}$ and returning $(a, z) = (a, \langle a, s \rangle + e)$ in $\mathbb{Z}_q^n \times \mathbb{Z}_q$. The (search) LWE problem is to find the secret vector s given a fixed number of samples from $L_{s,\mathcal{X}}$.
The definition above gives the search LWE problem, as the problem description asks for the recovery of the secret vector s. Another version is the decision LWE problem, in which case the problem is to distinguish between samples drawn from L s,X and a uniform distribution on Z n q × Z q . Let us also define the LPN problem, which is a binary special case of LWE.
Definition 2. Let k be a positive integer, let x be a secret binary vector of length k and let $X \sim \mathrm{Ber}_\eta$ be a Bernoulli distributed error with parameter $\eta > 0$. Let $L_{x,X}$ denote the probability distribution on $\mathbb{F}_2^k \times \mathbb{F}_2$ obtained by choosing g uniformly at random, choosing $e \in \mathbb{F}_2$ from X and returning $(g, z) = (g, \langle g, x \rangle + e)$. The (search) LPN problem is to find the secret vector x given a fixed number of samples from $L_{x,X}$.
Just like for LWE, we can analogously define decision LPN. Previously, analyses of algorithms solving the LWE problem have used two different approaches. One is to calculate the number of operations needed to solve a certain instance with a particular algorithm, and then compare the different complexity results. The other is asymptotic analysis. Solvers for the LWE problem with suitable parameters are expected to have fully exponential complexity, bounded by $2^{cn}$ as n tends to infinity, where the value of c depends on the algorithms and the parameters of the involved distributions. In this paper, we focus on the complexity computed as the number of arithmetic operations in $\mathbb{Z}_q$, for solving particular LWE instances (and we do not consider the asymptotics).

Discrete Gaussian Distributions
We define the discrete Gaussian distribution over $\mathbb{Z}$ with mean 0 and variance $\sigma^2$, denoted $D_{\mathbb{Z},\sigma}$, as the probability distribution obtained by assigning a probability proportional to $\exp(-x^2/(2\sigma^2))$ to each $x \in \mathbb{Z}$. Then, the discrete Gaussian distribution $\mathcal{X}$ over $\mathbb{Z}_q$ with variance $\sigma^2$ (also denoted $\mathcal{X}_\sigma$) can be defined by folding $D_{\mathbb{Z},\sigma}$ and accumulating the value of the probability mass function over all integers in each residue class modulo q. It makes sense to consider the noise level as $\alpha$, where $\sigma = \alpha q$. We also define the rounded Gaussian distribution on $\mathbb{Z}_q$. This distribution samples values by sampling values from the continuous Gaussian distribution with mean 0 and variance $\sigma^2$, rounding to the closest integer, and then folding the result to the corresponding value in $\mathbb{Z}_q$. We denote it by $\bar{\Psi}_{\sigma,q}$.
If two independent $X_1$ and $X_2$ are drawn from $\mathcal{X}_{\sigma_1}$ and $\mathcal{X}_{\sigma_2}$, respectively, we make the heuristic assumption that their sum is drawn from $\mathcal{X}_{\sqrt{\sigma_1^2 + \sigma_2^2}}$. We make the corresponding assumption for the rounded Gaussian distribution.
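As an illustration of the definitions above, the following is a minimal sketch of how LWE samples with rounded Gaussian noise could be generated. The helper names (`centered`, `rounded_gaussian`, `lwe_samples`) and the toy parameters are assumptions made for this example and are not taken from the implementations discussed in the paper.

```python
import random

def centered(x, q):
    """Representative of x mod q in [-(q-1)/2, (q-1)/2], for odd q."""
    x %= q
    return x - q if x > (q - 1) // 2 else x

def rounded_gaussian(sigma, q, rng):
    """Sample a continuous Gaussian, round to the closest integer, fold into Z_q."""
    return centered(round(rng.gauss(0.0, sigma)), q)

def lwe_samples(s, m, q, sigma, rng):
    """Return m pairs (a, z) with z = <a, s> + e mod q."""
    out = []
    for _ in range(m):
        a = [centered(rng.randrange(q), q) for _ in s]
        e = rounded_gaussian(sigma, q, rng)
        z = centered(sum(ai * si for ai, si in zip(a, s)) + e, q)
        out.append((a, z))
    return out

# Toy parameters (hypothetical): q = 1601, sigma = alpha * q with alpha = 0.005.
rng = random.Random(1)
q, n, sigma = 1601, 10, 8.0
s = [centered(rng.randrange(q), q) for _ in range(n)]
samples = lwe_samples(s, 5, q, sigma, rng)
```

Python's `random.gauss` stands in for the continuous Gaussian here; a cryptographic implementation would need a different sampler.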

The LWE Problem Reformulated
Assume that m samples $(a_1, z_1), (a_2, z_2), \ldots, (a_m, z_m)$ are collected from the LWE distribution $L_{s,\mathcal{X}}$, where $a_i \in \mathbb{Z}_q^n$, $z_i \in \mathbb{Z}_q$. Let $z = (z_1, z_2, \ldots, z_m)$ and $y = (y_1, y_2, \ldots, y_m) = sA$. We have
$$z = sA + e,$$
where $A = [a_1^T\ a_2^T \cdots a_m^T]$, $z_i = y_i + e_i = \langle s, a_i \rangle + e_i$ and $e_i \stackrel{\$}{\leftarrow} \mathcal{X}$. The search LWE problem is a decoding problem, where A serves as the generator matrix for a linear code over $\mathbb{Z}_q$ and z is a received word. Finding the secret vector s is equivalent to finding the codeword $y = sA$ for which the Euclidean distance $\|y - z\|$ is minimal. In the sequel, we adopt the notation $a_i = (a_{i1}, a_{i2}, \ldots, a_{in})$.
Using this transformation, each entry in the secret vector s is now distributed according to X . The fact that entries in s are small is a very useful property in several of the known reduction algorithms for solving LWE.
The noise distribution X is usually chosen as the discrete Gaussian distribution or the rounded Gaussian Distribution from Section 2.3.

Sample Amplification
In some versions of the LWE problem, such as the Darmstadt Challenges [29], the number of available samples is limited. To get more samples, sample amplification can be used. For example, assume that we have M samples $(a_1, b_1), (a_2, b_2), \ldots, (a_M, b_M)$. Then, we can form new samples, using an index set I of size k, as
$$\left( \sum_{j \in I} \pm a_j, \ \sum_{j \in I} \pm b_j \right).$$
Given an initial number of samples M, we can produce up to $2^{k-1} \binom{M}{k}$ samples. This comes at the cost of increasing the noise level (standard deviation) to $\sqrt{k} \cdot \sigma$. This also increases the sample dependency.
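The amplification procedure can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, index sets are drawn at random rather than enumerated, and the demo uses noiseless samples so that the linear relation can be checked directly.

```python
import random
from math import comb

def amplify(samples, k, q, count, rng):
    """Form `count` new samples, each a +/- combination of k distinct originals."""
    n = len(samples[0][0])
    new = []
    for _ in range(count):
        idx = rng.sample(range(len(samples)), k)
        # Fixing the first sign avoids counting a combination and its negation twice.
        signs = [1] + [rng.choice((1, -1)) for _ in range(k - 1)]
        a = [sum(sg * samples[i][0][j] for sg, i in zip(signs, idx)) % q
             for j in range(n)]
        b = sum(sg * samples[i][1] for sg, i in zip(signs, idx)) % q
        new.append((a, b))
    return new

def max_amplified(M, k):
    """Upper bound 2^(k-1) * binom(M, k) on the number of distinct combinations."""
    return 2 ** (k - 1) * comb(M, k)

# Demo with noiseless toy samples (assumption: noise omitted to make checking easy).
rng = random.Random(0)
q, s = 101, [3, 7, 11]
originals = []
for _ in range(10):
    a = [rng.randrange(q) for _ in s]
    originals.append((a, sum(x * y for x, y in zip(a, s)) % q))
new_samples = amplify(originals, 3, q, 20, rng)
```

With real samples, each new sample carries the sum of k noise terms, which is where the factor $\sqrt{k}$ in the standard deviation comes from.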

Iterating and Guessing
BKW-style algorithms work by combining samples in many steps in such a way that we reach a system of equations over $\mathbb{Z}_q$ of the form $z = sA + E$, where $E = (E_1, E_2, \ldots, E_m)$ and the entries $E_i$, $i = 1, 2, \ldots, m$, are sums of not too many original noise terms, say $E_i = \sum_{j=1}^{2^t} e_{i_j}$, where t is the number of iterations. The process also reduces the norm of the column vectors in A to be small. Let $n_i$, $i = 1, 2, \ldots, t$, denote the number of reduced positions in step i and let $N_i = \sum_{j=1}^{i} n_j$. If $n = N_t$, then every reduced equation is of the form
$$z_i = \langle s, a_i \rangle + E_i, \quad (2)$$
for $i = 1, 2, \ldots, m$. The right hand side can be approximated as a sample drawn from a discrete Gaussian, and if the standard deviation is not too large, then the sequence of samples $z_1, z_2, \ldots$ can be distinguished from a uniform distribution. We will then need to determine the number of required samples to distinguish between the uniform distribution on $\mathbb{Z}_q$ and $\mathcal{X}_\sigma$. Relying on standard theory from statistics, using either previous work [30] or Bleichenbacher's definition of bias [31], we can find that the required number of samples is roughly
$$m = C \cdot e^{2\pi \left( \frac{\sigma \sqrt{2\pi}}{q} \right)^2}, \quad (3)$$
where C is a small positive constant, whose value was studied in [32]. Initially, an optimal but exhaustive distinguisher was used [33]. While minimizing the sample complexity, it was slow and limited the number of positions that could be guessed. This basic approach was improved in [22], using the FFT. This was, in turn, a generalization of the corresponding distinguisher for LPN, which used the FWHT [12]. It was shown in [32] that the FFT distinguisher matches the sample complexity of the optimal distinguisher.

Plain BKW
The basic BKW algorithm was originally developed for solving LPN in [11]. It was first applied to LWE in [15]. The reduction part of this approach means that we reduce a fixed number b of positions in the column vectors of A to zero in each step. In each iteration, the dimension of A is decreased by b and after t iterations the dimension has decreased by bt.

Coded-BKW and LMS
LMS was introduced in [16] and improved in [18]. Coded-BKW was introduced in [17]. Both methods reduce positions in the columns of A to a small magnitude, but not to zero, allowing the reduction of more positions per step. In LMS this is achieved by mapping samples to the same category if the n i considered positions give the same result when integer divided by a suitable parameter p. In coded-BKW this is instead achieved by mapping samples to the same category if they are close to the same codeword in an [n i , k i ] linear code, for a suitable value k i . Samples mapped to the same category give rise to new samples by subtracting them. The main idea [17,18] is that positions in later iterations do not need to be reduced as much as the first ones, giving different n i values in different steps.

LF1, LF2, Unnatural Selection
Each step of the reduction part of the BKW algorithm consists of two parts. First, samples are mapped to categories depending on their values on the currently relevant $n_i$ positions. Next, pairs of samples within the categories are added/subtracted to reduce the current $n_i$ positions to form a new generation of samples. This can be done in a couple of different ways.
Originally, this was done using what is called LF1. Here, we pick a representative from each category and form new samples by adding/subtracting samples to/from this sample. This approach makes the final samples independent, but also gradually decreases the sample size. In [12], the approach called LF2 was introduced. Here, we add/subtract every possible pair within each category to form new samples. This approach requires only 3 samples within each category to form a new generation of the same size. The final samples are no longer independent, but experiments have shown that this effect is negligible.
In [16], unnatural selection was introduced. The idea is to produce more samples than needed from each category, but only keep the best samples, typically the ones with minimum norm on the current $N_i$ positions in the columns of A.
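The difference between LF1 and LF2 can be illustrated with a small sketch for a plain BKW step that cancels the first b positions. The function names and the toy data are assumptions for this example; samples are negated when needed so that a sample and its negation land in the same category.

```python
from collections import defaultdict
from itertools import combinations

def canonical(sample, b, q):
    """Negate the sample if needed so that a and -a share one category key."""
    a, z = sample
    head = tuple(x % q for x in a[:b])
    neg = tuple(-x % q for x in a[:b])
    if neg < head:
        a, z, head = [-x % q for x in a], -z % q, neg
    return head, (a, z)

def sort_into_categories(samples, b, q):
    cats = defaultdict(list)
    for smp in samples:
        key, smp = canonical(smp, b, q)
        cats[key].append(smp)
    return cats

def sub(s1, s2, q):
    return ([(x - y) % q for x, y in zip(s1[0], s2[0])], (s1[1] - s2[1]) % q)

def lf1(cats, q):
    """Subtract one representative per category from all other members."""
    return [sub(s, g[0], q) for g in cats.values() for s in g[1:]]

def lf2(cats, q):
    """Subtract every pair within each category."""
    return [sub(s1, s2, q) for g in cats.values() for s1, s2 in combinations(g, 2)]

# Noiseless toy data (assumption) so that the relation z = <a, s> can be checked.
q, b, s = 11, 1, [3, 1, 4, 1]
samples = []
for i in range(30):
    a = [(5 * i + 3 * j * j + i * j) % q for j in range(4)]
    samples.append((a, sum(x * y for x, y in zip(a, s)) % q))
cats = sort_into_categories(samples, b, q)
out1, out2 = lf1(cats, q), lf2(cats, q)
```

A category with g samples yields g − 1 new samples under LF1 and g(g − 1)/2 under LF2, which is why 3 samples per category suffice for LF2 to keep the sample count constant.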

Coded-BKW with Sieving
When using coded-BKW or LMS, the previously reduced $N_{i-1}$ positions of the columns of A increase in magnitude, with an average factor $\sqrt{2}$ in each reduction step. This problem was addressed in [20] by using unnatural selection to only produce samples that kept the magnitude of the previous $N_{i-1}$ positions small. Instead of testing all possible pairs of samples within the categories, this procedure was sped up using the lattice sieving techniques of [19]. This approach was slightly improved in [9,21].

BKW-Style Reduction Using Smooth-LMS
In this section, we introduce a new reduction algorithm solving the problem of having the same complexity and memory usage in each iteration of a BKW-style reduction. The novel idea is to use simple LMS to reduce a certain number of positions and then partially reduce one extra position. This allows for balancing the complexity among the steps, and hence to reduce more positions in total.

A New BKW-Style Step
Assume having a large set of samples written as before in the form z = sA + e mod q. Assume also that the entries of the secret vector s are drawn from some restricted distribution with small standard deviation (compared to the alphabet size q). If this is not the case, the transformation from Section 3.2 should be applied. Moreover, in case the later distinguishing process involves some positions to be guessed or transformed, we assume that this has been already considered, and all positions in our coming description should be reduced.
The goal of this BKW-type procedure is to make the norms of the column vectors of A small by adding and subtracting equations in a number of steps. Having expressions of the form $z_i = \langle s, a_i \rangle + E_i \bmod q$, if we can reach a case where $\|a_i\|$ is not too large, then $\langle s, a_i \rangle + E_i$ can be considered as a random variable drawn from a discrete Gaussian distribution $\mathcal{X}_\sigma$. Furthermore, $\mathcal{X}_\sigma \bmod q$ can be distinguished from a uniform distribution over $\mathbb{Z}_q$, if $\sigma$ is not too large.
Now, let us describe the new reduction procedure. Fix the number of reduction steps to be t. We will also fix a maximum list size to be $2^v$, meaning that A can have at most $2^v$ columns. In each iteration i, we are going to reduce some positions to be upper limited in magnitude by $C_i$, for $i = 1, \ldots, t$. Namely, the positions that are fully treated in iteration i will only have values in the set $\{-C_i + 1, \ldots, 0, 1, \ldots, C_i - 1\}$ of size $2C_i - 1$. We do this by dividing up the q possible values into intervals of length $C_i$. We also adopt the notation $\beta_i = q/C_i$, which describes the number of intervals that we divide the positions up into. We assume that $\beta_i > 2$.

First Step
In the first iteration, assume that we have stored A. We first compute the required compression starting in iteration 1 by computing $C_1$ (we will explain how later). We then evaluate how many positions $n_1$ can be fully reduced by computing $n_1 = \lfloor v / \log \beta_1 \rfloor$. The position $n_1 + 1$ can be partially reduced to be in an interval of size $C_1'$ fulfilling $\beta_1' \cdot \beta_1^{n_1} \cdot 3/2 \le 2^v$, where $\beta_1' = q/C_1'$. Now, we do an LMS step that "transfers between iterations" in the following way.
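We run through all the columns of A. For column i, we simply denote it as $x = (x_1, x_2, \ldots, x_n)$, and we compute (here div means integer division)
$$k_j = \begin{cases} x_j \ \mathrm{div}\ C_1, & x_1 \ge 0, \\ -x_j \ \mathrm{div}\ C_1, & x_1 < 0, \end{cases} \quad j = 1, \ldots, n_1,$$
$$k_{n_1+1} = \begin{cases} x_{n_1+1} \ \mathrm{div}\ C_1', & x_1 \ge 0, \\ -x_{n_1+1} \ \mathrm{div}\ C_1', & x_1 < 0. \end{cases}$$
The vector $K_i = (k_1, k_2, \ldots, k_{n_1+1})$ is now an index to a sorted list L, storing these vectors. (The point of inverting all position values if $x_1 < 0$ is to make sure that samples that are reduced when added are given the same index.)
After the samples within each category have been combined, the absolute value of position $n_1 + 1$ is $< C_1'$.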

Next Steps
We now describe all the next iterations, numbered as $l = 2, 3, \ldots, t$. Iteration l will involve positions from $N_{l-1} + 1 = \sum_{i=1}^{l-1} n_i + 1$ to $N_l + 1$. The very first position has possibly already been partially reduced, and its absolute value is $< C_{l-1}'$, so the interval for possible values is of size $2C_{l-1}' - 1$. Assume that the desired interval size in iteration l is $C_l$. In order to achieve the corresponding reduction factor $\beta_l$, we split this interval into $\beta_l'' = (2C_{l-1}' - 1)/C_l$ subintervals. We then compute how many positions $n_l$ can be fully reduced by computing $n_l = \lfloor (v - \log \beta_l'') / \log \beta_l \rfloor$. The position $N_l + 1$ can finally be partially reduced to be in an interval of size $C_l'$ fulfilling $\beta_l' \cdot \beta_l^{n_l - 1} \cdot \beta_l'' \cdot 3/2 \le 2^v$, where $\beta_l' = q/C_l'$.
Similar to iteration 1, we run through all the columns of A. For each column i in the matrix A, denoted as x, we do the following. For each vector position in $\{N_{l-1} + 1, \ldots, N_l + 1\}$, we compute (here div means integer division)
$$k_j = \begin{cases} x_j \ \mathrm{div}\ C_l, & x_{N_{l-1}+1} \ge 0, \\ -x_j \ \mathrm{div}\ C_l, & x_{N_{l-1}+1} < 0, \end{cases} \quad j = N_{l-1} + 1, \ldots, N_l,$$
$$k_{N_l+1} = \begin{cases} x_{N_l+1} \ \mathrm{div}\ C_l', & x_{N_{l-1}+1} \ge 0, \\ -x_{N_l+1} \ \mathrm{div}\ C_l', & x_{N_{l-1}+1} < 0. \end{cases}$$
The vector $K = (k_{N_{l-1}+1}, \ldots, k_{N_l+1})$ is again an index to a sorted list L, keeping track of columns. (Here, the point of inverting all position values if $x_{N_{l-1}+1} < 0$ is to make sure that samples that get reduced when added are given the same index. For example, $(x_{N_{l-1}+1}, x_{N_{l-1}+2}, \ldots, x_{N_l+1})$ and $(-x_{N_{l-1}+1}, -x_{N_{l-1}+2}, \ldots, -x_{N_l+1})$ are mapped to the same category.) So again, we assign $L(K) = L(K) \cup \{i\}$. After we have inserted all column indices into the list L, we go to the combining part.
As in the first step, we build a new A as follows. Run through all indices K and if |L(K)| ≥ 2 combine every pair of vectors by adding/subtracting them to form a column in the new matrix A. Stop when the number of new columns has reached 2 v .
For the last iteration, since $N_t$ is the last row of A, one applies the same step as above, but without reducing the extra position. After t iterations, one gets equations of the form (2), where the vectors $a_i$ in A have reduced norm.
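The indexing and combining parts of one smooth-LMS iteration can be sketched as follows. This is a simplified illustration under several assumptions: only the columns of A are handled (the z values would be combined in the same way), "div" is taken to be Python's floor division, the first position of the block is treated with the same interval length as the other fully reduced ones, and all names and parameters are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

def centered(x, q):
    """Representative of x mod q in [-(q-1)/2, (q-1)/2], for odd q."""
    x %= q
    return x - q if x > (q - 1) // 2 else x

def smooth_lms_key(x, start, n_full, C, C_part):
    """Index of column x: n_full positions use interval length C,
    one extra position uses the coarser length C_part."""
    sgn = -1 if x[start] < 0 else 1      # invert everything if the first entry is negative
    key = [(sgn * x[j]) // C for j in range(start, start + n_full)]
    key.append((sgn * x[start + n_full]) // C_part)
    return tuple(key), sgn

def smooth_lms_step(columns, start, n_full, C, C_part, max_out, q):
    L = defaultdict(list)
    for x in columns:
        key, sgn = smooth_lms_key(x, start, n_full, C, C_part)
        L[key].append([sgn * v for v in x])   # store the sign-normalized column
    new = []
    for group in L.values():
        for u, w in combinations(group, 2):
            if len(new) == max_out:           # stop at the maximum list size
                return new
            new.append([centered(a - b, q) for a, b in zip(u, w)])
    return new

rng = random.Random(0)
q = 1601
cols = [[rng.randint(-800, 800) for _ in range(5)] for _ in range(2000)]
out = smooth_lms_step(cols, 0, 2, 400, 800, 1000, q)
```

Two columns in the same category agree on the interval of every indexed position, so their difference has magnitude below C on the fully reduced positions and below C_part on the partially reduced one.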

Smooth-Plain BKW
The procedure described above also applies to plain BKW steps. For example, if in the first iteration one sets $C_1 = 1$ and $C_1' > 1$, then each column vector x of A will be reduced such that $x_1 = \ldots = x_{n_1} = 0$ and $x_{n_1+1} \in \{-C_1' + 1, \ldots, C_1' - 1\}$. Thus, one can either continue with another smooth-plain BKW step by setting also $C_2 = 1$ in the second iteration, or switch to smooth-LMS. In both cases, we have the advantage of having $x_{n_1+1}$ already partially reduced. Using these smooth-plain steps, we can reduce a couple of extra positions in the plain pre-processing steps of the BKW algorithm.

How to Choose the Interval Sizes C i
To achieve as small a norm of the vectors as possible, we would like the variance of all positions to be equally large after completing all iterations. Assume that a position x takes values uniformly in the set $\{-(C - 1)/2, \ldots, 0, 1, \ldots, (C - 1)/2\}$, for $C > 0$. Thus, we have that $\mathrm{Var}[x] = (C - 1)(C + 1)/12$. Assuming C is somewhat large, we approximately get $\mathrm{Var}[x] = C^2/12$. When subtracting/adding two such values, the variance increases to $2\mathrm{Var}[x]$ in each iteration. Therefore, a reduced position will have an expected growth of $\sqrt{2}$. For this reason, we choose a relation for the interval sizes of the form
$$C_i = 2^{-(t-i)/2} C_t, \quad i = 1, \ldots, t - 1.$$
This makes the variance of each position roughly the same after completing all iterations. In particular, our vectors $a_i$ in A are expected to have norm at most $\sqrt{n} C_t / \sqrt{12}$, and $C_t$ is determined according to the final noise allowed in the guessing phase. Ignoring the pre-processing step with smooth-plain BKW steps, the maximum dimension n that can be reduced is then $n = N_t = \sum_{i=1}^{t} n_i$.

Example 1.
Let $q = 1601$ and $\alpha = 0.005$, so $\sigma = \alpha q \approx 8$. Let us compute how many positions can be reduced using $2^v = 2^{28}$ list entries. The idea is that the variance of the right hand side in (2) should be minimized by making the variance of the two terms roughly equal. The error part $E_i$ is the sum of $2^t$ initial errors, so its variance is $\mathrm{Var}[E_i] = 2^t \sigma^2$. In order to be able to distinguish the samples according to (3), we set $\mathrm{Var}[E_i] < q^2/2$. This will give us the number of iterations possible as $2^t \sigma^2 \approx q^2/2$ or $2^t \approx 1601^2/(2 \cdot 8^2)$, leading to $t = 14$. Now, we bound the variance of the scalar product part of (2) also to be $< q^2/2$, so $n \sigma^2 C_t^2/12 \approx q^2/2$, leading to $C_t^2 \approx 12 q^2/(2 n \sigma^2)$ and $C_t^2 \approx 12 \cdot 1601^2/(2n \cdot 8^2)$, or $C_t \approx 80$ if $n < 38$. Then, one chooses $C_{t-1} = C_t/\sqrt{2} \approx 57$, and so on.
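The arithmetic of this example can be reproduced with a few lines of code. The sketch below only restates the bounds used in the example (both variance terms kept below $q^2/2$); the variable names are chosen for illustration.

```python
import math

q, sigma, n = 1601, 8, 38

# Number of iterations: 2^t * sigma^2 ~ q^2 / 2.
t = math.floor(math.log2(q**2 / (2 * sigma**2)))

# Final interval size: n * sigma^2 * C_t^2 / 12 ~ q^2 / 2.
C_t = round(math.sqrt(12 * q**2 / (2 * n * sigma**2)))

# Previous interval size, smaller by a factor sqrt(2).
C_prev = round(C_t / math.sqrt(2))

print(t, C_t, C_prev)
```

The unrounded value of $C_t$ for $n = 38$ is about 79.5, which the example rounds to 80.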

On Optimizing C l Values
The choice of the parameter $C_l$ within a smooth-LMS step can be optimized in order to achieve a lower number of categories in the next step. For example, consider $q = 1601$ and $C_l = 250$. The number of categories for a single position would be $\lfloor q/(2C_l + 1) \rfloor = 3$. Clearly, the same result can be obtained if one chooses $C_l = 200$. The difference is that with this second choice of $C_l$, for the same cost (linear in the number of categories), one gets more reduced samples at the end of the step. Therefore, a lower number of categories is required for the next step.

A Binary Partial Guessing Approach
In this section, we propose a new way of reducing the guessing step to a binary version. In this way, we are able to efficiently use the FWHT to guess many entries in a small number of operations. In Section 6 we do the theoretical analysis and show that this indeed leads to a more efficient procedure than all previous ones.

From LWE to LPN
First, we need to introduce a slight modification to the original system of equations before the reduction part. Assume that we have turned the distribution of s to be the noise distribution, through the standard transformation described in Section 3.2. The result after this is written as before z = sA + e.
Now, we perform a multiplication by 2 to each equation, resulting in
$$z' = sA' + 2e,$$
since, when multiplied with a known value, we can compute the result modulo q. Next, we apply the reduction steps and make the values in $A'$ as small as possible by performing BKW-like steps. In our case, we apply the smooth-LMS step from the previous section, but any other reduction method like coded-BKW with sieving would be possible. If $A' = [a_1'^T\ a_2'^T \cdots a_m'^T]$, the output of this step is a matrix where the Euclidean norm of each $a_i'$ is small. The result is written as
$$z' = sA' + 2E, \quad (6)$$
where $E = (E_1, E_2, \ldots, E_m)$ and $E_i = \sum_{j=1}^{2^t} e_{i_j}$ as before.
Finally, we transform the entire system to the binary case by considering
$$z_0' = s_0 A_0' + e \bmod 2, \quad (7)$$
where $z_0'$ is the vector of least significant bits in $z'$, $s_0$ the vector of least significant bits in s, $A_0' = (A' \bmod 2)$ and e denotes the binary error introduced.
We can now examine the error $e_j$ in position j of e. In (6), we have equations of the form $z_j' = \sum_i s_i a_{ij}' + 2E_j$ in $\mathbb{Z}_q$, which can be written in integer form as
$$z_j' = \sum_i s_i a_{ij}' + 2E_j + k_j \cdot q. \quad (8)$$
Now, if $|\sum_i s_i a_{ij}' + 2E_j| < q/2$, then $k_j = 0$. In this case, (8) can be reduced mod 2 without error and $e_j = 0$. In general, the error is computed as $e_j = k_j \bmod 2$. So, one can compute a distribution for $e_j = k_j \bmod 2$ by computing $P(k_j = x)$. It is possible to compute such a distribution, either by making a general approximation or precisely for each specific position j using the known values $a_j'$ and $z_j'$. Note that the distribution of $e_j$ depends on $z_j'$. Note also that if $a_j'$ is reduced to a small norm and the number of steps t is not too large, then it is quite likely that $|\sum_i s_i a_{ij}' + 2E_j| < q/2$, leading to $P(e_j = 0)$ being large.
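The relation between the wrap-around count $k_j$ and the binary error $e_j$ can be checked numerically. The following sketch uses hypothetical helper names and toy values, and it assumes that the secret, the reduced column, and the accumulated noise are all known, which is of course only the case in a simulation.

```python
def centered(x, q):
    """Representative of x mod q in [-(q-1)/2, (q-1)/2], for odd q."""
    x %= q
    return x - q if x > (q - 1) // 2 else x

def binary_sample(s, a, E, q):
    """Return (z, k, e, ok) for one reduced equation z = <s, a> + 2E mod q."""
    t = sum(si * ai for si, ai in zip(s, a)) + 2 * E   # value over the integers
    z = centered(t, q)                                  # observed value in Z_q
    k = (z - t) // q                                    # z = t + k * q exactly
    e = k % 2                                           # binary error
    lhs = z % 2
    rhs = (sum((si % 2) * (ai % 2) for si, ai in zip(s, a)) + e) % 2
    return z, k, e, lhs == rhs

q = 1601
small = binary_sample([1, -2, 3], [10, -7, 4], 5, q)        # no wrap-around
large = binary_sample([1, -2, 3], [500, -300, 200], 5, q)   # one wrap-around
```

Because q is odd, each wrap-around modulo q flips the parity, which is why the binary equation holds exactly once $e_j = k_j \bmod 2$ is added.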
For the binary system, we finally need to find the secret value $s_0$. Either
1. there are no errors (or almost no errors), corresponding to $P(e_j = 0) \approx 1$. Then, one can solve for $s_0$ directly using Gaussian elimination (or possibly some information set decoding algorithm in the case of a few possible errors).
2. or the noise is larger. The binary system of equations corresponds to the situation of a fast correlation attack [35], or secret recovery in an LPN problem [11]. Thus, one may apply an FWHT to recover the binary secret values.

Guessing s 0 Using the FWHT
The approach of using the FWHT to find the most likely s 0 in the binary system in (7) comes directly from previous literature on fast correlation attacks [36].
Let k denote an n-bit vector $(k_0, k_1, \ldots, k_{n-1})$ (also considered as an integer) and consider a sequence $X_k$, $k = 0, 1, \ldots, N - 1$, $N = 2^n$. It can, for example, be a sequence of occurrence values in the time domain, e.g., $X_k$ = the number of occurrences of $X = k$. The Walsh-Hadamard transform is defined as
$$\hat{X}_w = \sum_{k=0}^{N-1} X_k \cdot (-1)^{w \cdot k},$$
where $w \cdot k$ denotes the bitwise dot product of the binary representation of the n-bit indices w and k. There exists an efficient method (FWHT) to compute the WHT in time $O(N \log N)$.
Given the matrix $A_0'$, we define $X_k = \sum_{j \in J} (-1)^{z_j'}$, where J is the set of all columns of the matrix $A_0'$ that equal k. Then, one computes $\max_w |\hat{X}_w|$, and we have that $s_0$ corresponds to $\bar{w}$, such that $|\hat{X}_{\bar{w}}| = \max_w |\hat{X}_w|$. In addition, $\hat{X}_w$ is simply the (biased) sum of the noise terms.
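The guessing procedure can be sketched in a few lines. This is a toy illustration with hypothetical names and parameters: binary columns are encoded as integers, the binary noise is simulated as Bernoulli noise with parameter 0.2, and hard decisions are used.

```python
import random

def fwht(v):
    """Fast Walsh-Hadamard transform of a list whose length is a power of two."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v

def guess_s0(samples, n):
    """samples: pairs (a_bits, z_bit), with a_bits an n-bit integer column."""
    X = [0] * (1 << n)
    for a_bits, z_bit in samples:
        X[a_bits] += (-1) ** z_bit          # X_k = sum over columns equal to k
    X_hat = fwht(X)
    return max(range(1 << n), key=lambda w: abs(X_hat[w]))

def parity(x):
    return bin(x).count("1") % 2

# Simulated binary system (assumption: independent Bernoulli(0.2) errors).
rng = random.Random(3)
n, secret = 8, 0b10110101
samples = []
for _ in range(400):
    a = rng.randrange(1 << n)
    e = 1 if rng.random() < 0.2 else 0
    samples.append((a, parity(a & secret) ^ e))
guess = guess_s0(samples, n)
```

For the correct candidate w, every term of the transform equals $(-1)^{e_j}$, so the sum is biased away from zero, while for wrong candidates the terms behave like random signs.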

Soft Received Information
The bias of $\hat{X}_w$ actually depends on the value of $z_j'$. So, a slightly better approach is to use "soft received information" by defining $X_k = \sum_{j \in J} (-1)^{z_j'} \cdot \epsilon_{z_j'}$, where $\epsilon_{z_j'}$ is the bias corresponding to $z_j'$. For each $x \in \{-(q - 1)/2, \ldots, (q - 1)/2\}$, the bias $\epsilon_x$ can be efficiently pre-computed so that its evaluation does not affect the overall complexity of the guessing procedure.

Hybrid Guessing
One can use a hybrid approach to balance the overall complexity between the reduction and guessing phases, since the overall complexity of the algorithm is additive in the two. Indeed, it is possible to leave some rows of the matrix A unreduced and apply an exhaustive search over the corresponding positions in combination with the previously described guessing step. Moreover, we remark that this exhaustive search can easily benefit from parallelization.

Even Selection
When transforming the system to the binary system (7), we can zero out some positions to get an easier problem. This can be achieved by ensuring that, when reducing $A'$, some specific rows of $A'$ are small (for a specific entry $a_j'$ in a vector and the corresponding value $s_j$ in the secret, we should have $a_j' \cdot s_j < q$) and additionally have even coefficients, making sure to have enough samples left for the guessing phase. In this way, we cancel out the corresponding entries of s when reducing modulo 2 and get a smaller binary system of equations.

Retrieving the Original Secret
Once $s_0$ is correctly guessed, it is possible to obtain a new LWE problem instance with the secret half as big as follows. Write $s = 2s' + s_0$. Define $\hat{A} = 2A$ and $\hat{z} = z - s_0 A$. Then, we have that
$$\hat{z} = s' \hat{A} + e. \quad (10)$$
The entries of $s'$ are half as large as the entries of s; therefore, (10) is an easier problem than (5). One can apply the procedure described above to (10) and guess the new binary secret $s_1$, i.e., the least significant bits of $s'$. The cost of doing this will be significantly smaller, as a shorter secret translates to computationally easier reduction steps. Thus, computationally speaking, the LWE problem can be considered solved once we manage to guess the least significant bits of s. Given the list of binary vectors $s_0, s_1, s_2, \ldots, s_d$, it is easy to retrieve the original secret s.
Generally, if $s_i^d = 0$, then $s_i \ge 0$, and $(s_i^0, s_i^1, \ldots, s_i^d)$ is nothing else than its binary representation. Conversely, if $s_i^d = 1$, then $s_i < 0$. To compute its magnitude in this case, one must look again at $(s_i^0, s_i^1, \ldots, s_i^d)$ and consider that negative entries of s cannot be reduced any further than −1. Namely, if for example $s_k$ is negative and at step $j < d$ is reduced to −1, then $s_k^j = s_k^{j+1} = \cdots = s_k^d = 1$.
The following example shows how to retrieve the original secret s once the list of least significant bits vectors s 0 , s 1 , ..., s d has been guessed.
Example 2. Assume that the secret has length $n = 10$ and that its entries' distribution has standard deviation $\sigma = 3$. Then, performing the above procedure $\lceil \log_2 4\sigma \rceil = 4$ times, with high probability we have reduced the secret as much as possible. In the following example, one can note that $s_3$ determines the sign of s. Therefore, the magnitude of $s_i$ is retrieved by looking at $s_0, s_1, s_2$. Note that, if one performs one more iteration, the corresponding binary secret $s_4$ will be the same as $s_3$, because we already reached the maximum reduction of s possible.
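The recovery of s from the bit vectors can be sketched as follows. The sketch assumes that each step takes the least significant bit as $s \bmod 2 \in \{0, 1\}$ and continues with $s' = (s - s_0)/2$, so that the bits form a two's complement representation with $s_d$ as the sign bit; the function names and the sample secret are illustrative only.

```python
def peel_bits(s, d):
    """Return [s_0, s_1, ..., s_d], peeling one least significant bit per step."""
    bits = []
    for _ in range(d + 1):
        s0 = [x % 2 for x in s]                          # least significant bits
        bits.append(s0)
        s = [(x - b) // 2 for x, b in zip(s, s0)]        # s = 2 s' + s_0
    return bits

def rebuild(bits):
    """Recover s from its bit vectors; the last vector carries the sign."""
    d = len(bits) - 1
    n = len(bits[0])
    return [sum(bits[j][i] << j for j in range(d)) - (bits[d][i] << d)
            for i in range(n)]

# Toy secret of length 10 with entries in [-8, 7], so d = 3 suffices.
secret = [3, -1, 0, -5, 2, -3, 7, -8, 1, -2]
bits = peel_bits(secret, 3)
recovered = rebuild(bits)
```

A negative entry that has reached −1 keeps producing the bit 1 in all later steps, in line with the description above.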

Analysis of the Algorithm and Its Complexity
In this section, we describe in detail the newly-proposed algorithm called BKW-FWHT with smooth reduction (BKW-FWHT-SR).

The Algorithm
The main steps of the new BKW-FWHT-SR algorithm are described in Algorithm 1. We start by changing the distribution of the secret vector with the secret-noise transformation [27], if necessary.

Algorithm 1 BKW-FWHT with smooth reduction (main framework).
Input: Matrix $A$ with $n$ rows and $m$ columns, received vector $z$ of length $m$, and algorithm parameters $t_1, t_2, t_3, n_{limit}, \sigma_{set}$.
Step 0: Use Gaussian elimination to change the distribution of the secret vector;
Step 1: Use $t_1$ smooth-plain BKW steps to remove the bottom $n_{pbkw}$ entries;
Step 2: Use $t_2$ smooth-LMS steps to reduce $n_{cod1}$ more entries;
Step 3: Perform the multiplying-2 operations;
Step 4: Use $t_3$ smooth-LMS steps to reduce the remaining $n_t \leq n_{limit}$ entries;
Step 5: Transform all the samples to the binary field and recover the partial secret key by the FWHT. We can exhaustively guess some positions.
The general framework is similar to the coded-BKW with sieving procedure proposed in [20]. In our implementation, we replaced the coded-BKW with sieving steps by the smooth-LMS steps discussed before, for ease of implementation.
The difference in the new algorithm is that, after certain reduction steps, we multiply each reduced sample by 2, as described in Section 5. We then continue reducing the remaining positions and perform the mod 2 operations to transform the entire system to the binary case. We thus obtain a list of LPN samples and solve the corresponding LPN instance via known techniques, such as the FWHT or partial guessing.
At a high level, we aim to feed an LWE instance into the LWE-to-LPN transform developed in Section 5 and to solve the resulting instance with a solver for LPN. To optimize the performance, we first perform some reduction steps to obtain a new LWE instance with reduced dimension but larger noise. We then feed the obtained instance to the LWE-to-LPN transform.

The Complexity of Each Step
From now on, we assume that the secret is already distributed as the noise distribution, or that the secret-noise transform is performed. We use the LF2 heuristics and assume that the sample size is unchanged before and after each reduction step. We now start with smooth-plain BKW steps and let $l_{red}$ be the number of positions already reduced.

Smooth-Plain BKW Steps
Given $m$ initial samples, we could on average have $\frac{2m}{3}$ categories for one plain BKW step in the LF2 setting. (The number of categories is halved compared with the LF2 setting for LPN. The difference is that we can both add and subtract samples for LWE.) Instead, we can assume $2^{b_0}$ categories, and thus, the number of samples $m$ is $1.5 \times 2^{b_0}$. Let $C_{pbkw}$ be the cost of all smooth-plain BKW steps, whose initial value is set to 0. If a step starts with a position that has never been reduced before, we can reduce $l_p$ positions, where $l_p = \lfloor \frac{b}{\log_2(q)} \rfloor$. Otherwise, when the first position is partially reduced in the previous step, and we need $\beta'$ categories to further reduce this position, we can in total fully reduce $l_p$ positions, where $l_p = 1 + \lfloor \frac{b - \log_2(\beta')}{\log_2(q)} \rfloor$. For this smooth-plain BKW step, we compute

$$C_{pbkw} \mathrel{+}= (n + 1 - l_{red}) \cdot m + C_{d,pbkw},$$

where $C_{d,pbkw} = m$ is the cost of modulus switching for the last partially reduced position in this step. We then update the number of reduced positions, $l_{red} \mathrel{+}= l_p$.
After iterating $t_1$ times, we have computed $C_{pbkw}$ and $l_{red}$. We will continue updating $l_{red}$ and denote by $n_{pbkw}$ the length reduced by the smooth-plain BKW steps.
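The bookkeeping for the smooth-plain BKW steps can be sketched as a short loop. The Python function below is only an illustration of the update rules for $C_{pbkw}$ and $l_{red}$; the name `smooth_plain_bkw_cost` is hypothetical, and we assume (this is our reading, not a statement from the text) that the bits needed to finish a partially reduced position are $\log_2(q)$ minus the bits already spent on it.

```python
import math


def smooth_plain_bkw_cost(n, q, b, t1):
    """Accumulate the cost and the number of reduced positions over t1 steps.

    Assumptions (hypothetical, for illustration): every step uses 2^b
    categories, m = 1.5 * 2^b samples, and the modulus-switching cost
    C_d,pbkw = m is charged in every step.
    """
    log_q = math.log2(q)
    m = 1.5 * 2 ** b
    cost = 0.0
    l_red = 0
    carried = 0.0  # bits already spent on a partially reduced position
    for _ in range(t1):
        # Cost uses l_red before it is updated, as in the text.
        cost += (n + 1 - l_red) * m + m
        if carried < 1e-9:
            # The step starts on a position never reduced before.
            l_p = math.floor(b / log_q)
            carried = b - l_p * log_q
        else:
            # log2(beta'): bits still needed to finish the partial position.
            log_beta = log_q - carried
            l_p = 1 + math.floor((b - log_beta) / log_q)
            carried = (b - log_beta) - (l_p - 1) * log_q
        l_red += l_p
    return cost, l_red
```

For instance, with the toy values $q = 16$ and $b = 6$, two steps spend 12 bits in total and therefore fully reduce three positions of 4 bits each, which a single step of integer size could not do.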

Smooth-LMS Steps before the Multiplication of 2
We assume that the final noise contribution from each position reduced by LMS is similar, bounded by a preset value $\sigma_{set}$. Since the noise variable generated in the $i$-th ($0 \leq i \leq t_2 - 1$) smooth-LMS step will be added $2^{t_2+t_3-i}$ times and multiplied by 2, we let

$$\sigma_{set}^2 = \frac{2^{t_2+t_3-i} \cdot 4\, C_{i,LMS1}^2}{12},$$

where $C_{i,LMS1}$ is the length of the interval after the LMS reduction in this step. We use $\beta_{i,LMS1}$ categories for one position, where $\beta_{i,LMS1} = \frac{q}{C_{i,LMS1}}$. Similar to smooth-plain BKW steps, if this step starts with a new position, we can reduce $l_p$ positions, where $l_p = \lfloor \frac{b}{\log_2(\beta_{i,LMS1})} \rfloor$. Otherwise, when the first position is partially reduced in the previous step and we need $\beta_{p,i,LMS1}$ categories to further reduce this position, we can in total fully reduce $l_p$ positions, where $l_p = 1 + \lfloor \frac{b - \log_2(\beta_{p,i,LMS1})}{\log_2(\beta_{i,LMS1})} \rfloor$. Let $C_{LMS1}$ be the cost of smooth-LMS steps before the multiplication of 2, which is initialized to 0. For this step, we compute $C_{LMS1} \mathrel{+}= (n + 1 - l_{red}) \cdot m$, and then update the number of reduced positions, $l_{red} \mathrel{+}= l_p$. After iterating $t_2$ times, we have computed $C_{LMS1}$ and $l_{red}$. We expect that $l_{red} = n - n_t$ ($n_t \leq n_{limit}$) positions have been fully reduced and will continue updating $l_{red}$.
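To make the parameter choice concrete, the sketch below solves the variance relation for the interval length of one smooth-LMS step before the multiplication of 2. It assumes that the noise of one reduced position is uniform over an interval of length $C$ (variance $C^2/12$), added $2^{t_2+t_3-i}$ times and scaled by 2 (variance times 4); the function name `lms1_step_parameters` is hypothetical.

```python
import math


def lms1_step_parameters(q, b, sigma_set, t2, t3, i):
    """Return (interval length, categories per position, positions per step).

    Assumption: sigma_set^2 = 2^(t2+t3-i) * 4 * C^2 / 12, solved for C.
    The result is only meaningful when 1 <= C < q.
    """
    additions = 2 ** (t2 + t3 - i)
    interval = sigma_set * math.sqrt(3.0 / additions)
    beta = q / interval
    # Case of a step that starts on a fresh (not partially reduced) position.
    positions = math.floor(b / math.log2(beta))
    return interval, beta, positions
```

Later steps (larger $i$) are added fewer times, so they can afford a longer interval and hence fewer categories per position.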

Smooth-LMS Steps after the Multiplication of 2
The formulas are similar to those for smooth-LMS steps before the multiplication of 2. The difference is that the noise term is no longer multiplied by 2, so we have

$$\sigma_{set}^2 = \frac{2^{t_3-i}\, C_{i,LMS2}^2}{12},$$

for $0 \leq i \leq t_3 - 1$. Moreover, we need to track the $a$ vector of length $n_t$ for the later distinguisher, which adds to the cost of these steps. We also need to count the cost of multiplying samples by 2 and of the mod 2 operations, and the LMS decoding cost, which are

$$C_{mulMod} = 2 \cdot (n_t + 1) \cdot m, \qquad C_{dec} = (n - n_{pbkw} + t_2 + t_3) \cdot m.$$

FWHT Distinguisher and Partial Guessing
After the LWE-to-LPN transformation, we have an LPN problem with dimension $n_t$ and $m$ samples. We perform partial guessing on $n_{guess}$ positions, and use the FWHT to recover the remaining $n_{FWHT} = n_t - n_{guess}$ positions. The cost is

$$C_{distin} = 2^{n_{guess}} \cdot \left((n_{guess} + 1) \cdot m + n_{FWHT} \cdot 2^{n_{FWHT}}\right).$$
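For readers unfamiliar with the FWHT step, the following Python sketch shows the textbook way of recovering an LPN secret with a Walsh-Hadamard transform, together with the cost formula above. It uses hard bits only and no partial guessing, so it is a simplified stand-in rather than the soft-information distinguisher of Section 5; the names `solve_lpn_fwht` and `distinguisher_cost` are hypothetical.

```python
def fwht(values):
    """In-place fast Walsh-Hadamard transform; the length must be a power of 2."""
    h = 1
    while h < len(values):
        for start in range(0, len(values), 2 * h):
            for k in range(start, start + h):
                x, y = values[k], values[k + h]
                values[k], values[k + h] = x + y, x - y
        h *= 2
    return values


def solve_lpn_fwht(samples, n_bits):
    """Return the most likely secret for samples (a, z), a an n_bits-bit mask.

    After the transform, entry s holds sum_j (-1)^(z_j + <a_j, s>), which is
    largest for the secret that agrees with most samples.
    """
    table = [0] * (1 << n_bits)
    for a, z in samples:
        table[a] += 1 - 2 * z          # contributes (-1)^z
    fwht(table)
    return max(range(len(table)), key=lambda s: table[s])


def distinguisher_cost(n_guess, n_fwht, m):
    """C_distin = 2^n_guess * ((n_guess + 1) * m + n_fwht * 2^n_fwht)."""
    return 2 ** n_guess * ((n_guess + 1) * m + n_fwht * 2 ** n_fwht)
```

Filling the table costs one pass over the $m$ samples, and the transform costs $n_{FWHT} \cdot 2^{n_{FWHT}}$ additions, which matches the two terms inside the bracket of $C_{distin}$.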

The Data Complexity
We now discuss the data complexity of the new FWHT distinguisher. In the integer form, we have the following equation,

$$z_j = \sum_{i} s_i a_{ij} + 2E_j + k_j \cdot q.$$

If $|\sum_{i} s_i a_{ij} + 2E_j| < q/2$, then $k_j = 0$, and the equation can be reduced mod 2 without error. In general, the error is $e_j = k_j \bmod 2$.
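A small numerical illustration of this reduction follows. It assumes an odd modulus $q$ and that $z_j$ is taken as the representative of smallest absolute value; the helper names and the toy numbers are hypothetical.

```python
def centered(x, q):
    """Representative of x mod q in the interval (-q/2, q/2), for odd q."""
    r = x % q
    return r - q if r > q // 2 else r


def to_lpn_sample(a, z, q):
    """Reduce a sample (a, z) with small integer entries to the binary field."""
    return [ai % 2 for ai in a], centered(z, q) % 2


def lpn_error_bit(s, a, z, q):
    """Error bit of the binary sample with respect to the secret's LSBs."""
    a_bits, z_bit = to_lpn_sample(a, z, q)
    parity = sum((si % 2) * ai for si, ai in zip(s, a_bits)) % 2
    return z_bit ^ parity


q = 1601
s = [3, -2, 1]
a = [2, -1, 4]
inner = sum(si * ai for si, ai in zip(s, a))   # equals 12

small = (inner + 2 * 5) % q     # |12 + 10| < q/2, so k_j = 0
large = (inner + 2 * 450) % q   # |12 + 900| > q/2, so the value wraps once
```

With these numbers, `lpn_error_bit(s, a, small, q)` is 0, while `lpn_error_bit(s, a, large, q)` is 1 because the value wraps around the odd modulus exactly once.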
We employ a smart FWHT distinguisher with soft received information, as described in Section 5. From [37], we know the sample complexity can be approximated as

$$m \approx \frac{4 \ln(2^{n_t})}{\mathbb{E}_{z=t,\,t \in \mathbb{Z}_q}\left[D(e_{z=t} \| U_b)\right]}.$$

For different values of $z_j$, the distribution of $e_j$ is different. The maximum bias is achieved when $z_j = 0$. In this sense, we can compute the divergence as

$$\mathbb{E}_{z=t,\,t \in \mathbb{Z}_q}\left[D(e_{z=t} \| U_b)\right] = \sum_{t \in \mathbb{Z}_q} \Pr[z = t] \cdot D(e_{z=t} \| U_b),$$

where $e_z$ is the Bernoulli variable conditioned on the value of $z$ and $U_b$ is the uniform distribution over the binary field.
Following the previous research [15], we approximate the noise $\sum_i s_i a_{ij} + 2E_j$ as discrete Gaussian with standard deviation $\sigma_f$. If $\sigma_f$ is large, the probability $\Pr[z = t]$ is very close to $1/q$. Then, the expectation $\mathbb{E}_{z=t,\,t \in \mathbb{Z}_q}[D(e_{z=t} \| U_b)]$ can be approximated as $D(X_{\sigma_f,2q} \| U_{2q})$, i.e., the divergence between a discrete Gaussian with the same standard deviation and a uniform distribution over an alphabet of size $2q$. We numerically computed that the approximation is rather accurate when the noise is sufficiently large (see Table 1). In conclusion, we use the formula

$$m \approx \frac{4 \ln(2^{n_t})}{D(X_{\sigma_f,2q} \| U_{2q})}$$

to estimate the data complexity of the new distinguisher. It remains to control the overall variance $\sigma_f^2$. Since we assume that the noise contribution from each position reduced by LMS is the same and the multiplication of 2 doubles the standard deviation, we can derive

$$\sigma_f^2 = 4 \cdot 2^{t_1+t_2+t_3} \sigma^2 + \sigma^2 \sigma_{set}^2 (n - n_{pbkw}).$$

Note: The final noise is a combination of three parts: the noise from the LWE problem, the LMS steps before the multiplication by 2, and the LMS steps after the multiplication by 2. The final partial key recovery problem is equivalent to distinguishing a discrete Gaussian from uniform with the alphabet size doubled. We see that with the multiplication by 2, the variances of the first and the second noise parts are increased by a factor of 4, but the last noise part does not expand. This intuitively explains the gain of the new binary distinguisher.
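The estimate can be evaluated numerically along the following lines. This Python sketch makes several assumptions that are not fixed by the text: the discrete Gaussian is folded onto $\mathbb{Z}_{2q}$ by summing over a truncated range, the divergence is the Kullback-Leibler divergence measured in nats, and all function names are hypothetical.

```python
import math


def wrapped_gaussian_pmf(sigma, modulus, tails=8):
    """Discrete Gaussian with parameter sigma folded onto Z_modulus."""
    reach = int(tails * sigma) + modulus
    weights = [0.0] * modulus
    for x in range(-reach, reach + 1):
        weights[x % modulus] += math.exp(-x * x / (2.0 * sigma * sigma))
    total = sum(weights)
    return [w / total for w in weights]


def divergence_from_uniform(pmf):
    """Kullback-Leibler divergence D(P || U) in nats (assumed unit)."""
    size = len(pmf)
    return sum(p * math.log(p * size) for p in pmf if p > 0.0)


def final_variance(sigma, sigma_set, t1, t2, t3, n, n_pbkw):
    """sigma_f^2 = 4 * 2^(t1+t2+t3) * sigma^2 + sigma^2 * sigma_set^2 * (n - n_pbkw)."""
    return (4 * 2 ** (t1 + t2 + t3) * sigma ** 2
            + sigma ** 2 * sigma_set ** 2 * (n - n_pbkw))


def sample_complexity(n_t, sigma_f, q):
    """m ~ 4 ln(2^n_t) / D(X_{sigma_f, 2q} || U_{2q})."""
    d = divergence_from_uniform(wrapped_gaussian_pmf(sigma_f, 2 * q))
    return 4.0 * n_t * math.log(2.0) / d
```

As expected, a larger final standard deviation brings the folded Gaussian closer to uniform, shrinks the divergence, and increases the number of samples required.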

In Summary
We have the following theorem to estimate the complexity of the attack.

Numerical Estimation
We numerically estimate the complexity of the new algorithm BKW-FWHT-SR (shown in Table 2). It improves on the known approaches when the noise rate (represented by α) becomes larger. We should note that, compared with the previous BKW-type algorithms, the implementation is much easier, though the complexity gain might be mild.

Table 2. Estimated time complexity comparison (in $\log_2(\cdot)$) for solving LWE instances in the TU Darmstadt LWE challenge [29]. Here, an unlimited number of samples is assumed. The last columns show the complexity estimation from the LWE estimator [7]. "ENU" indicates that the enumeration cost model is employed and "Sieve" indicates that the sieving cost model is used. Bold-faced numbers are the smallest among the estimations with these different approaches.

Implementations
We produced two different C implementations of the BKW algorithm, as presented in this manuscript. These mainly differ in memory management. The first one, referred to as RBBL (RAM-based BKW for LWE), is light, fast, and relies only on RAM usage. However, this turned out to be a limiting factor for solving hard LWE instances. Therefore, the second implementation, referred to as FBBL (file-based BKW for LWE), follows a design strategy that can be seen as a transition away from such a limiting factor. The main idea is to use a file-based approach to store the samples, moving the memory constraint from RAM to disk capacity. More details are given in the next sections. Both implementations are available as open-source libraries at https://github.com/FBBL, accessed on 25 October 2021. Some simulation results are presented in Section 8.

RAM-Based Implementation
Storing samples directly in RAM allows a simple and fast implementation. The purpose of RBBL is to achieve good results in solving relatively small LWE instances and to provide a comparison against the file-based solution FBBL. RBBL supports smooth-LMS reduction steps, including smooth-plain BKW steps. The FWHT-based guessing method introduced in Sections 4 and 5 is implemented in both its plain and hybrid versions.

Memory and Sample Organization
The samples are stored on the heap within a single array, allowing for a single (and faster) memory allocation. For each category, we allocate enough memory to fit a fixed number of samples. Therefore, if the number of new samples for a certain category exceeds the category capacity, the excess samples are discarded.
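The layout described above can be illustrated with a small Python class. This is a sketch of the idea only, not the RBBL data structure, which is written in C; the class name `CategoryStore`, its fields, and the choice of 64-bit entries are assumptions made for the example.

```python
from array import array


class CategoryStore:
    """Fixed-capacity sample storage for all categories in one flat array."""

    def __init__(self, num_categories, capacity, sample_len):
        self.capacity = capacity
        self.sample_len = sample_len
        # One allocation holding every slot of every category.
        self.data = array("q", [0]) * (num_categories * capacity * sample_len)
        self.count = [0] * num_categories
        self.discarded = 0

    def add(self, category, sample):
        """Store the sample, or discard it if the category is already full."""
        used = self.count[category]
        if used == self.capacity:
            self.discarded += 1
            return False
        start = (category * self.capacity + used) * self.sample_len
        self.data[start:start + self.sample_len] = array("q", sample)
        self.count[category] = used + 1
        return True

    def samples(self, category):
        """Return the samples currently stored in one category."""
        base = category * self.capacity * self.sample_len
        return [
            self.data[base + k * self.sample_len:
                      base + (k + 1) * self.sample_len].tolist()
            for k in range(self.count[category])
        ]
```

Because every category owns a fixed slice of the array, the position of a slot is computed by index arithmetic alone, at the price of losing samples that overflow a category.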

Parallelization
RBBL supports parallelization using the POSIX Threads utility. Within the steps of the reduction phase, each thread gets assigned to different sets of categories and performs sums and subtractions of samples independently. A system of mutexes prevents memory corruption caused by two or more threads writing to the same category, and therefore to the same memory cells, at the same time. The guessing phase benefits from parallelization when using the hybrid guesser. All the possible combinations of entries of the bruteforce positions are equally distributed among the available threads, which perform an FWHT for each one of them. Finally, one chooses the guess paired with the highest probability, as explained in Section 5.2.

File-Based Implementation
A file-based implementation is needed when the amount of memory required to store the samples exceeds the available RAM. FBBL supports most known BKW reduction steps, FFT and FWHT-based guessing methods, and hybrid guessing approaches. Different reduction steps can be combined arbitrarily, and the implementation also allows full recovery of the initial secret, including the initial transform and guessing all parts of the secret vector (if all intermediate reduction results have been stored and are compatible). A key success factor in the software design was to avoid unnecessary reliance on RAM, so we have employed file-based storage where necessary and practically possible. We describe below how we dealt with some interesting performance issues.

File-Based Sample Storage
Samples are stored in a file, sorted according to their category. The necessary space for a fixed number of samples is reserved in the file for each category. Then, a storage writer writes the samples from RAM to the file in the space reserved for their categories (possibly discarding some samples if a category is full, or leaving some empty space otherwise). A category mapping, unique for each reduction type, defines which category index a given sample belongs to. (In this section, a category is defined slightly differently from the rest of the paper. A category together with its adjacent category is what we simply refer to as a category in the rest of the paper.)
When samples are combined in the reduction step, we can either subtract or add them together. The subtraction of samples is performed within a single category, while addition needs to take two samples from adjacent categories. For example, considering how plain-BKW reduction works for just one position with modulus q, there are q different categories, one for each position value. Samples in categories a and q − a are adjacent, since they cancel out (at that position) when added together. The category mapping is constructed so that adjacent categories are stored as neighbors on file, in order to maintain a sequential access pattern, avoiding the need for random accesses on file.
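For the one-position plain-BKW example above, one possible category mapping that places the values $a$ and $q - a$ next to each other is sketched below. The specific numbering is an assumption chosen for illustration (the mapping used in FBBL may differ), and `category_index` is a hypothetical name.

```python
def category_index(a, q):
    """Index for position value a (mod odd q) so that a and q - a are neighbours.

    Value 0 gets index 0; the pair (a, q - a) with 1 <= a <= (q - 1) / 2
    gets the consecutive indices 2a - 1 and 2a.
    """
    a %= q
    if a == 0:
        return 0
    half = (q - 1) // 2
    if a <= half:
        return 2 * a - 1
    return 2 * (q - a)
```

With this numbering, reading the file from beginning to end visits each pair of adjacent categories consecutively, which is what keeps the access pattern sequential.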
Consider one BKW reduction step. A reduction step takes a sample file as input and produces a new sample file as output. The fast dual SSDs of our target machine are used for efficiency here, reading from the source file and writing to the destination file. Sample reading is performed sequentially from the beginning to the end of the file. This is straightforward: we read a pair of adjacent categories into memory, process them (the reduction), and write the resulting samples to another file. Writing samples to the destination file is more elaborate: we use as much of the available RAM as possible as a buffer and flush it to disk with one sequential write to the destination file. A flush is needed whenever the buffer fills up, so, depending on the problem parameters and the amount of RAM and disk space used, a reduction step generally requires multiple flushes.

Optional Sample Amplification
We support optional sample amplification. That is, if a problem instance has a limited number of initial samples (e.g., the Darmstadt LWE challenge), then it is possible to combine several of these to produce new samples (more, but with higher noise).
While this is very straightforward in theory, we have noticed considerable performance effects when this recombination is performed naïvely. For example, combining triplets of initial samples using a nested loop is problematic in practice for some instances, since some initial samples become over-represented: some samples are used more often than others when implemented this way.
We have solved this by using a linear feedback shift register to efficiently and pseudorandomly distribute the selection of initial samples more evenly.
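The idea of driving the selection with a linear feedback shift register can be sketched as follows. The register below is the standard 16-bit Fibonacci LFSR with taps 16, 14, 13, 11, which has maximal period $2^{16} - 1$; the way indices are derived from its state, the seed, and the function names are assumptions for illustration and do not describe the FBBL code.

```python
def lfsr16(state):
    """One step of the 16-bit Fibonacci LFSR with taps 16, 14, 13, 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def amplified_triples(num_initial, count, seed=0xACE1):
    """Yield `count` triples of distinct initial-sample indices.

    The seed must be non-zero. A 16-bit register repeats after 65535 steps,
    so a real implementation would need a wider register for large counts.
    """
    state = seed
    produced = 0
    while produced < count:
        picks = []
        for _ in range(3):
            state = lfsr16(state)
            picks.append(state % num_initial)
        if len(set(picks)) == 3:      # skip triples that reuse a sample
            produced += 1
            yield tuple(picks)
```

Each triple selects three initial samples to be combined into a new one. Unlike a nested loop over indices, consecutive triples do not share a fixed first or second sample.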

Employing Meta-Categories
For some LWE problem instances, using a very high number of categories with few samples in each is a good option. This can be problematic to handle in an implementation, but we have used meta-categories to handle this situation. For example, using plain BKW reduction steps with modulus q and three positions, we end up with q 3 different categories. With q large, an option is to use only two out of the three position values in a vector to first map it into one out of q 2 different meta-categories. When processing the (meta-)categories, one then needs an additional pre-processing in the form of a sorting step in order to divide the samples into their respective (non-meta) categories (based on all three position values), before proceeding as per usual.
We have used this implementation trick to, for example, implement plain BKW reduction for three positions. One may think of the process as brute-forcing one out of three positions in the reduction step.
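The two-level organization can be sketched in a few lines of Python. The sketch ignores the pairing of adjacent categories and simply shows the coarse mapping on two position values followed by the sorting step on the third; the function names are hypothetical.

```python
from collections import defaultdict


def meta_category(sample, q):
    """Coarse index built from the first two position values only."""
    return (sample[0] % q) * q + (sample[1] % q)


def split_meta_category(samples, q):
    """Sort the samples of one meta-category by their third position value."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[sample[2] % q].append(sample)
    return buckets


def group_by_meta_category(samples, q):
    """First pass: distribute all samples over the q^2 meta-categories."""
    groups = defaultdict(list)
    for sample in samples:
        groups[meta_category(sample, q)].append(sample)
    return groups
```

Only $q^2$ storage areas have to be managed, while the finer split into $q^3$ categories is recreated in RAM each time a meta-category is processed.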

Parallelization
The hybrid FWHT-based guessing method benefits from parallelization similarly to the hybrid guesser in the RBBL implementation. To every available thread we assign an equal number of possible guesses for the brute-forced part. Each thread independently runs an FWHT, and the most likely guess among all threads is chosen as the solution. Due to our focus on storage-related sample processing, we have not yet parallelized the BKW steps. This would require more effort than in the RBBL case, since one would want to parallelize the writing-to-file part too. We plan to address this in future work.

A Novel Idea for Fast Storage Writing
Here, we introduce a new approach to handle file-based sample storage, which is particularly efficient for SSDs. That is, we describe a procedure for writing samples to physical storage in an efficient way for large LWE instances with many samples. It is intended to be used in future implementations.

Intuition
The main idea is to utilize the fact that disk access has a much lower time penalty for SSD disks than for classical mechanical disks. In other words, we make use of the fact that SSDs act more like random access memory.
On a high level, the idea is to bunch several categories together into a well-chosen number of meta-categories. This can be done in a very general way, simply by grouping the category indices into intervals of suitable length (depending on available memory). We then make a rough sorting into these meta-categories, where we keep the samples unsorted (within their respective meta-category). For every such meta-category, we utilise a separate file handle, so we have, say, $w$ different write positions on disk. The gain here is that we can simply employ the rough sorting and then directly flush each newly created sample (or batches of them if buffering is employed) into their respective file by appending. As mentioned, this technique works better for SSD-type disks than for mechanical ones, because a mechanical disk incurs a penalty for physically moving the write head for every append operation. Of course, this can partially be dealt with by buffering the outputs, but this is a trade-off between the performance gain of the algorithm itself and the performance penalty of moving between the $w$ different write positions, which explains why the solution is more interesting for SSD storage.

Technical Description
Assume we have a table $T^{ram}$ in RAM and a larger table on disk denoted $T^{disk}$. Then, one splits the list of size $2^v$ into $m = 2^v/M$ different parts, where $M$ is related to the maximum RAM that can be used. Clearly, one such part matches the maximum RAM size, i.e., one part contains as many entries as can be contained in RAM.
Define a simple map $\varphi(K) = p$, where $p \in \{0, 1, \ldots, m - 1\}$. A table $T^{ram}$ in RAM is created and contains $m$ storage units denoted $T^{ram}_p$, $p \in \{0, 1, \ldots, m - 1\}$. When reading column $i$ from $A$, one computes the corresponding index $K$ according to Equation (4) for the column $x$, and then the file part by $\varphi(K) = p$. The column vector $x$ is then stored in the storage unit $T^{ram}_p$. Once the table $T^{ram}$ is full, we append the parts to the larger table on disk $T^{disk}$, again containing $m$ storage units denoted $T^{disk}_p$, $p \in \{0, 1, \ldots, m - 1\}$. That is, we store the content of $T^{ram}_p$ in $T^{disk}_p$, appended to the previous content, for $p = 0, 1, \ldots, m - 1$. The table $T^{ram}$ is then empty and ready to start over again and read columns of $A$.
In the combining step, we read the content of one $T^{disk}_p$ to RAM. However, because the content is not fully sorted, we now address the table $T^{ram}$ by the $K$ values instead. Assume that $T^{ram}$ now is sorted according to $K$.
We step through $T^{ram}$, and for all entries with the same $K$ value, we create all combinations of differences between these vectors. We write them to a new $A$ in successive order.
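The write and combine phases can be sketched with ordinary files as follows. This Python illustration stores samples as text lines, reopens each part in append mode at every flush instead of keeping $w$ handles open, and takes plain integer differences without modular reduction; the key function, the file naming, and all function names are assumptions made for the example.

```python
import os
import tempfile
from collections import defaultdict


def part_path(directory, p):
    """File that plays the role of one storage unit of the on-disk table."""
    return os.path.join(directory, f"part{p}.txt")


def rough_sort_to_disk(samples, key, max_key, num_parts, directory, ram_limit):
    """Bucket samples by key interval in RAM and append full tables to disk."""
    part_len = -(-max_key // num_parts)          # ceiling division
    ram = [[] for _ in range(num_parts)]
    held = 0

    def flush():
        for p, unit in enumerate(ram):
            if unit:
                with open(part_path(directory, p), "a") as handle:
                    for sample in unit:
                        handle.write(" ".join(map(str, sample)) + "\n")
                unit.clear()

    for sample in samples:
        ram[key(sample) // part_len].append(sample)
        held += 1
        if held == ram_limit:                    # the RAM table is full
            flush()
            held = 0
    flush()


def combine_part(path, key):
    """Read one part, group it by key in RAM, and form all pairwise differences."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        rows = [tuple(int(tok) for tok in line.split()) for line in handle]
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    combined = []
    for k in sorted(groups):
        group = groups[k]
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                combined.append(tuple(x - y for x, y in zip(group[i], group[j])))
    return combined
```

A caller would typically create the directory with `tempfile.TemporaryDirectory()`, run `rough_sort_to_disk` once, and then call `combine_part` on each part in turn so that only one part resides in RAM at a time.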

Other Implementation Aspects
Some more things to consider implementation-wise are the following.

Strict Unnatural Selection
There are advantages to performing very strict unnatural selection in the last reduction step, drastically reducing the total number of samples. First of all, this allows us to reduce more positions and/or reduce the positions to a much lower magnitude. Secondly, having far fewer samples speeds up the FWHT distinguisher, allowing us to brute-force guess more positions.
We leave investigating the idea of applying aggressive unnatural selection in each step, in other words coded-BKW with sieving, for future implementation work.

Skipping the All-0s Guess When Using the FWHT Distinguisher
Consider the guess where all the LSBs of the secret are equal to 0. For that guess, (9) simplifies to a sum that depends only on the values $z_j$. Since even values of $z_j$ are more common than odd values, this sum is biased, meaning that the FWHT is much more likely to choose this guess. Since it is exceptionally unlikely that this guess is the correct one, we can improve the distinguisher by simply discarding this guess.

FWHT Distinguisher When the RAM Is a Limitation
Suppose that the FWHT distinguisher is applied on $n$ positions, where the $2^n$ corresponding values are too many to store in RAM. It is possible to do a binary brute-force search over $r$ of the position values of (9) and calculate an FWHT with only $n - r$ bits for each of the $2^r$ possible bit sequences. This approach reduces the space complexity of the algorithm from $O(2^n)$ to $O(2^{n-r})$. The time complexity of the normal FWHT distinguisher, including the cost of processing the $m$ samples to calculate the $X_k$ values, is $O(m + n \cdot 2^n)$.
When brute-forcing $r$ positions, we need to iterate over all $m$ samples and calculate a scalar product of $r$ positions, and calculate an FWHT of $n - r$ bits, for each of the $2^r$ guesses. This leads to a time complexity of $O(2^r (rm + (n - r) \cdot 2^{n-r}))$.
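The trade-off can be illustrated on plain binary samples. The Python sketch below brute-forces the top $r$ bits of the secret and runs a transform of size $2^{n-r}$ for each guess, so that only one small table is alive at a time. It works with hard bits rather than with the soft values of (9), and the name `hybrid_fwht_guess` is hypothetical.

```python
def fwht(values):
    """In-place fast Walsh-Hadamard transform; the length must be a power of 2."""
    h = 1
    while h < len(values):
        for start in range(0, len(values), 2 * h):
            for k in range(start, start + h):
                x, y = values[k], values[k + h]
                values[k], values[k + h] = x + y, x - y
        h *= 2
    return values


def hybrid_fwht_guess(samples, n_bits, r):
    """Guess an n_bits-bit secret using tables of size 2^(n_bits - r) only.

    samples: list of (a, z), with a an n_bits-bit mask and z in {0, 1}.
    """
    low_bits = n_bits - r
    low_mask = (1 << low_bits) - 1
    best_score, best_secret = None, None
    for guess in range(1 << r):                       # 2^r brute-force guesses
        table = [0] * (1 << low_bits)
        for a, z in samples:                          # scalar product on r bits
            parity = bin((a >> low_bits) & guess).count("1") & 1
            table[a & low_mask] += 1 - 2 * (z ^ parity)
        fwht(table)                                   # (n - r) * 2^(n - r) work
        for low, score in enumerate(table):
            if best_score is None or score > best_score:
                best_score = score
                best_secret = (guess << low_bits) | low
    return best_secret
```

The outer loop is embarrassingly parallel, which is also what the hybrid guessers of RBBL and FBBL exploit when distributing guesses over threads.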

On the Minimum Population Size
It turns out that it is possible to use slightly fewer samples than a worst-case analysis would imply. The samples after a reduction step are not completely evenly spread out among the categories. The overproduction of categories with extra samples overcompensates for the lack of production from categories with few samples. The uneven spread of samples is due to two factors.
• By design, a small fraction of the categories will get fewer samples on average. • Even ignoring the first point, the spread will, due to randomness, not be perfectly even.
Let us analyze the situation in detail. Assume that we have $N$ categories, $M$ samples and that the probability of a sample being mapped to category $i$ is $p_i$, where $\sum_{i=1}^{N} p_i = 1$. Let $X_i$ denote the number of samples in category $i$ and $Y_i = X_i (X_i - 1)/2$ denote the number of samples produced in category $i$. With this notation, the expected number of new samples produced is

$$E\left[\sum_{i=1}^{N} Y_i\right] = \frac{1}{2} \sum_{i=1}^{N} \left(E[X_i^2] - E[X_i]\right).$$

The number of samples in category $i$ is distributed as $X_i \sim B(M, p_i)$. Using well-known formulas for the mean and the variance of the binomial distribution, we get

$$E\left[\sum_{i=1}^{N} Y_i\right] = \frac{1}{2} \sum_{i=1}^{N} \left(M p_i (1 - p_i) + M^2 p_i^2 - M p_i\right) = \frac{M(M-1)}{2} \sum_{i=1}^{N} p_i^2.$$

Setting the production equal to $M$ and solving for $M$ gives us

$$M = 1 + \frac{2}{\sum_{i=1}^{N} p_i^2}.$$

Notice here that the further away from uniformly random the $p_i$ values are, the smaller we need $M$ to be. Assuming that all categories have $p_i = 1/N$, we get $M = 2N + 1$. We can thus keep the sample size constant using $2N + 1$ samples, gaining a factor of 2/3 over a worst-case analysis, which assumes that we need $3N$ samples. Notice that we gain from this effect even if we use larger values of $M$ than needed, and choose the best samples using unnatural selection.
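The expected production can be checked empirically. The following Python sketch compares the closed-form expression $\frac{M(M-1)}{2}\sum_i p_i^2$ with a simple Monte Carlo experiment for uniformly distributed categories; the function names and the toy sizes are chosen for illustration only.

```python
import random


def expected_production(M, probs):
    """Closed form for the expected number of pairs: M(M-1)/2 * sum p_i^2."""
    return M * (M - 1) / 2.0 * sum(p * p for p in probs)


def stable_population(probs):
    """Population size M for which the expected production equals M."""
    return 1.0 + 2.0 / sum(p * p for p in probs)


def simulate_production(M, N, trials, seed=1):
    """Average number of within-category pairs for uniform categories."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * N
        for _ in range(M):
            counts[rng.randrange(N)] += 1
        total += sum(c * (c - 1) // 2 for c in counts)
    return total / trials
```

For $N = 200$ uniform categories, `stable_population` gives $M = 401 = 2N + 1$, and the simulated production with 401 samples stays close to 401, well below the $3N = 600$ samples of the worst-case analysis.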

Experimental Results
In this section, we report some of the experimental results obtained in solving real LWE instances with varying parameters. Our main goal was to confirm our theory and to show that BKW algorithms can be used in practice to solve relatively large instances. For the case of FBBL, there is still room to run more optimized code and possibly to make better parameter choices. However, the results show that the BKW algorithm is practical, and that its performance is on a comparable scale to that of lattice-based approaches.
We considered two different scenarios. In the first case, we assumed for each LWE instance to have access to an arbitrarily large number of samples. Here, we create the desired amount of samples ourselves (we used rounded Gaussian noise for simplicity of implementation). In the second case, we considered instances with a limited number of samples. An LWE problem is considered solved when the binary secret is correctly guessed, for reasons explained in Section 5.3.

Target Machine
For our file-based experiments, we assembled a machine, referred to as machine A, to achieve a high speed in file reading/writing. We used an ASUS PRIME X399-A motherboard, a 4.0 GHz Ryzen Threadripper 1950X processor, and 128 GiB of 2666 MHz DDR4 RAM. For storage, we used a separate (slow) SSD Samsung 860 QVO for the operating system (Windows 10), an Ultrastar HE12 12TB SATA mechanical disk, and dual (fast) SSDs SAMSUNG 970 EVO Plus 2TiB NVMe M.2 internal. While the machine is built from standard parts with a limited budget, we have primarily attempted to maximize the amount of RAM and the size and read/write speeds of the fast SSDs for the overall ability to solve large LWE problem instances.
For the RAM-based experiments, we switched to a machine equipped with a faster processor. We used a desktop with processor 3.60 GHz Intel Core i7-7700 CPU, running Linux Mint 20 and with 32 GB of RAM. We will refer to this second machine as machine B.

Unlimited Number of Samples
We targeted the parameter choices of the TU Darmstadt challenges [29]. For each instance, we generated as many initial samples as needed according to our estimations. In Example 3, we present our parameter choices for one of these. In Table 3, we report the details of the largest solved instances. One can see that RBBL achieves considerably faster results than FBBL, both when comparing them on the same machine B, and when FBBL runs on machine A. This gives us an idea of how much the results obtained using FBBL on larger LWE instances could be improved if using RBBL on a machine with more RAM available.

Example 3. Let us consider an LWE instance with n = 40, q = 1601 and σ = 0.005 · q. To successfully guess the secret, we first performed 8 smooth-plain BKW steps, reducing 17 positions to zero. We used the following parameters. n i = 2, C i = 1, for i = 1, . . . , 8, (C 1 , C 2 , C 3 , C 4 , C 5 , C 6 , C 7 , C 8 ) = (280, 80, 20, 5, 1, 178, 41, 9).
These parameters are chosen in such a way that the number of categories is ≈15M in the early stages and ≈23M at most. (The number of categories here is double what is explained in previous sections, since opposite samples are put in different categories in the implementation.) We started with 16M samples, which guaranteed that we would end up with enough samples for guessing the right solution. The last position is brute-forced and therefore left untouched at the last reduction step.

Limited Number of Samples
We solved the original TU Darmstadt LWE challenge instance [29] with parameters n = 40, α = 0.005 and the number of samples limited to m = 1600. We did this by forming 140 million samples using sample amplifying with triples of samples, taking 6 steps of smooth-plain BKW on 14 entries, followed by 6 steps of smooth-LMS on 25 entries. The final position was left to brute force. The overall running time, obtained with FBBL on machine A, was 55 min.

Conclusions and Future Work
We introduced a novel and easy approach to implementing a BKW reduction step, which allows balancing the complexity among the iterations, and an FWHT-based guessing procedure able to correctly guess the secret with relatively large noise level. Together with a file-based approach of storing samples, the above define a new BKW algorithm specifically designed to solve practical LWE instances, where the available RAM is typically a limiting factor.
With an implementation of the file-based algorithm, we managed to solve 6 challenges with Darmstadt challenge-type parameters, but with an unlimited number of samples, 3 more challenges than in the conference version of the paper [38]. For the 3 previously solved challenges, we made substantial improvements in runtime.
We also managed to solve the easiest Darmstadt challenge, in its original form.
Furthermore, we implemented a fully RAM-based version of the new algorithm, to compare against the file-based approach, in settings where the available RAM was not a limiting factor. We leave it to future work to experiment with such an implementation on harder LWE instances, using machines with more RAM available.
We did parallelize the FWHT and the RAM-based version of the algorithm, but we leave more parallelization work and other optimization work for the future.
While we managed to substantially improve the implementation results of the conference version of the paper [38], we believe that significant improvements to the algorithm can still be made to reduce the gap compared to lattice-based techniques for solving LWE. For example, it remains to investigate the concrete improvement of employing the sieving aspect of coded-BKW with sieving [20]. Moreover, the investigation of the specific design of the BKW algorithm for handling the problem of few initial samples is left for future work.