1. Introduction
In coding theory, linear codes are used to detect and correct errors in messages transmitted over noisy channels. A linear code C over a finite field $\mathbb{F}_q$ is a k-dimensional subspace of the vector space $\mathbb{F}_q^n$; its elements are called codewords. Key parameters of a code include its length n, dimension k, and minimum distance d. With each linear code C, we associate two matrices: a generator matrix and a parity-check matrix. A matrix G with k rows and n columns is called a generator matrix of a linear code C if the rows of G form a basis of C. An $(n-k) \times n$ matrix H such that $C = \{x \in \mathbb{F}_q^n : Hx^T = 0\}$ is called a parity-check matrix for C.
Let us consider the full vector space $\mathbb{F}_q^n$. It is partitioned into disjoint subsets of the form $v + C = \{v + c : c \in C\}$, where $v \in \mathbb{F}_q^n$. These subsets are called cosets of the linear code C. For any vector $v \in \mathbb{F}_q^n$, we define its syndrome as $s(v) = Hv^T$. There is a one-to-one correspondence between cosets of C and syndrome vectors in $\mathbb{F}_q^{n-k}$.
The covering radius R of a linear code C is defined as
$$R = \max_{x \in \mathbb{F}_q^n} \min_{c \in C} d(x, c),$$
where $d(x, c)$ denotes the Hamming distance, i.e., R is the smallest integer such that every vector in the vector space lies within Hamming distance R of some codeword. The covering radius R can also be interpreted as the smallest integer such that every syndrome is a linear combination of at most R columns of H.
The covering radius is a classical and fundamental parameter in coding theory. While the minimum distance d quantifies the error-correction and error-detection capabilities of a code, the covering radius R measures how well the code covers the ambient space and provides complementary information about the global geometry of the code. In classical error detection, the minimum distance d guarantees that any error pattern of weight less than d can be detected. However, in the general decoding and detection setting, the received word may be arbitrary and not necessarily close to a valid codeword. In this broader context, the covering radius plays a crucial role. A small covering radius guarantees that every received word lies within a bounded distance from the code, which implies that undetectable error patterns are confined to well-structured regions of the ambient space.
In erasure channels, some symbols of a transmitted codeword are erased, while the remaining symbols are received correctly [1]. Decoding from erasures amounts to determining whether the known positions uniquely determine a codeword. A small covering radius guarantees that every received word, including those resulting from erasures, is close to at least one codeword. Moreover, the covering radius is directly connected to the structure of coset leaders and syndrome decoding, which are fundamental tools in erasure recovery for linear codes.
Locally recoverable codes (LRCs) have received significant attention due to their applications in distributed storage systems [2]. An LRC is a code in which each coordinate can be recovered by accessing only a small number of other coordinates. Although locality is a coordinate-wise property and the covering radius is a global parameter, there exists a strong conceptual connection between the two. Codes with small covering radius induce highly regular partitions of the ambient space. Such structural properties are closely related to the existence of multiple recovery sets for individual symbols, which is a key requirement in the construction of locally recoverable codes.
The covering radius also plays a central role in several other areas, such as lossy compression and vector quantization [3] and certain information-hiding schemes [4]. Codes with small covering radius often exhibit strong symmetry and regularity properties, such as complete regularity or uniform packing, which are of independent combinatorial interest [5].
Despite its importance, computing the covering radius is known to be NP-hard. This makes the development of efficient algorithms, and in particular parallel algorithms, highly relevant both theoretically and practically. Efficient computation of the covering radius enables the structural analysis of codes used in distributed storage and data protection, as well as the exploration of new code families with strong geometric and spectral properties. Therefore, improving the computational feasibility of determining the covering radius directly supports both the theory and the practice of modern coding techniques.
There are different methods for calculating the covering radius of a linear code. The naive approach follows the definition directly and computes the distance from each vector to the code C. Another method uses the parity-check matrix of the code and is based on Lemma 1.1 from [6]. The covering radius can also be calculated using fast transforms [7]. Some results on the problem of computing the covering radius can be found in [8,9,10,11,12]. Specialized techniques can be used to calculate the covering radius of some classes of codes, such as MDS and near-MDS codes [13,14], Melas codes [15], and Zetterberg codes [16,17]. Heuristic and probabilistic methods can also be used. However, they do not give an exact value for R.
We present a method to parallelize the algorithm for computing the exact covering radius of a linear code using its parity-check matrix. This problem is at least as hard as the problem of computing the minimum distance of a code. Parallel algorithms for computing the minimum distance of a binary code are presented in [18,19]. This work is also a continuation of a previously developed parallelization of the algorithm for calculating the covering radius using the cosets of the code. A disadvantage of that algorithm is that it generates the full vector space $\mathbb{F}_q^n$ to compute the covering radius. The algorithm considered in this paper only keeps track of the generated syndromes, which, however, requires a larger amount of memory. This algorithm is more appropriate for codes with larger dimensions. Any parallel implementation needs an optimized sequential algorithm as its starting point. The basis of the algorithm is the systematic generation of vectors in $\mathbb{F}_q^{n-k}$ by forming linear combinations of l columns of the matrix H for increasing values of l. Each linear combination of l columns of H can be represented as an n-dimensional vector with weight l. Thus, the algorithm can be represented as generating all vectors $v \in \mathbb{F}_q^n$ of weight at most l. There are different approaches to generating all vectors in a given set. In the case when $q = 2$, we can use a Gray code for the generation. When $q > 2$, the classical Gray code cannot be used directly, and therefore some modifications are needed. In this case, a linear combination can be generated by adding a vector to, or subtracting a vector from, the previous linear combination, so two operations are needed to generate a linear combination. Another approach is to use a helper matrix that contains a linear combination of s columns of H in row s. Using such a matrix, we can generate the next linear combination with a single addition, at the expense of greater memory complexity. Such an algorithm is less researched than the Gray code methods. This is the most efficient method for generating linear combinations, as far as we know. To calculate the covering radius, we also need to keep track of the already generated linear combinations, which correspond to syndromes of the given code. If l is the smallest value for which all syndromes are generated as linear combinations of not more than l columns of H, then l is the covering radius of the code.
The focus of the current work consists of an in-depth view of the following problems:
Defining an ordering of the vectors of bounded weight in $\mathbb{F}_q^n$ that correspond to linear combinations of the columns of the parity-check matrix H of a linear code C. In the chosen order, a new linear combination is generated by just adding a vector to a previous linear combination. Furthermore, each vector in the ordering is generated efficiently.
Presenting ranking and unranking algorithms for the given ordering. A ranking function assigns an integer to a given combinatorial object in a chosen ordering. The inverse function, which returns the combinatorial object in a specific ordering for a given valid integer (rank), is called an unranking function. Such algorithms are typically defined for basic combinatorial objects.
Efficiently partitioning the vectors in the given order into subsets such that the vectors in each subset follow the same order and all subsets have approximately the same cardinality.
Efficient enumeration of the vectors in $\mathbb{F}_q^{n-k}$ that correspond to the syndromes of the code.
A key conceptual contribution of this work is the introduction of explicit ranking and unranking functions for the combinatorial objects underlying the covering radius computation. Unlike classical combinatorial generation methods, which rely on sequential or backtracking-based enumeration, ranking and unranking provide a direct bijection between integers and linear combinations of columns of the parity-check matrix. This representation fundamentally changes the algorithmic structure by allowing random access to the search space and enabling its deterministic partitioning into independent subsets. As a result, the use of ranking and unranking functions is essential for an efficient parallel implementation, since without them a balanced distribution of work among worker processes would not be feasible.
The Message Passing Interface (MPI) was chosen as the parallelization framework for several reasons. First, MPI is a widely adopted standard that provides portability and scalability across a broad range of computing architectures, from multicore systems to large distributed clusters. Second, the explicit data distribution model of MPI allows for an efficient integration of low-level optimizations, such as vectorization, within each process, without interfering with the global parallel structure. Finally, the sequential covering radius algorithm naturally follows a master–worker paradigm, where a central process coordinates the exploration of the syndrome space and multiple workers perform independent computations. This execution model is straightforward to implement in a message-passing environment, while it is considerably more difficult to realize efficiently in shared-memory parallel frameworks.
The remainder of the paper is organized as follows. Section 2 introduces the necessary definitions and preliminaries. An overview of the parallel frameworks considered in this work is provided in Section 3. The sequential algorithms are described in Section 4, followed by their parallelization strategies in Section 5. Experimental results and performance evaluation are presented in Section 6. Finally, Section 7 concludes the paper with a summary.
2. Preliminaries
Let $\mathbb{F}_q$ be a finite field with q elements, and let C be an $[n, k, d]_q$ linear code over $\mathbb{F}_q$. The minimum distance d is the smallest Hamming distance between distinct codewords:
$$d = \min\{d(x, y) : x, y \in C,\ x \neq y\},$$
where $d(x, y)$ denotes the Hamming distance. The minimum distance is one of the most studied parameters of a linear code. It shows how many errors a code can detect or correct. More precisely, a linear $[n, k, d]_q$ code can detect up to $d - 1$ errors and correct up to $\lfloor (d-1)/2 \rfloor$ errors. Another important parameter of the code, which gives additional information, is its covering radius. The covering radius R of a code C is defined as
$$R = \max_{x \in \mathbb{F}_q^n} \min_{c \in C} d(x, c).$$
Equivalently, R is the smallest integer such that the Hamming spheres of radius R centered at the codewords cover the entire space $\mathbb{F}_q^n$.
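The covering radius of a linear code can be computed using its parity-check matrix or its coset leaders. Let $C^\perp$ denote the dual code of C, defined as
$$C^\perp = \{x \in \mathbb{F}_q^n : \langle x, c \rangle = 0 \text{ for all } c \in C\},$$
where $\langle \cdot, \cdot \rangle$ is the standard inner product over $\mathbb{F}_q^n$. The dual code has parameters $[n, n-k]_q$. A generator matrix H of $C^\perp$ is a parity-check matrix of C, so that
$$C = \{x \in \mathbb{F}_q^n : Hx^T = 0\}.$$
If $v \in \mathbb{F}_q^n$, then the set $v + C = \{v + c : c \in C\}$ is called a coset of C represented by v. The vector space $\mathbb{F}_q^n$ is partitioned into disjoint cosets, and two vectors $u$ and $v$ are in the same coset if and only if their difference belongs to C, i.e., $u - v \in C$. A vector of minimum weight within a given coset is called a coset leader.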
For any vector $v \in \mathbb{F}_q^n$, we define its syndrome as
$$s(v) = Hv^T.$$
Syndromes provide an algebraic characterization of cosets: two vectors belong to the same coset of C if and only if they have the same syndrome. Thus, there is a one-to-one correspondence between cosets of C and syndrome vectors in $\mathbb{F}_q^{n-k}$. Hence, the syndrome completely identifies the coset to which a vector belongs.
The following theorem gives the connection between the covering radius, the parity-check matrix, and the coset leaders of a given linear code C.
Theorem 1 ([20]). Let C be a linear code with a parity-check matrix H. Then:
- (i) R is the weight of the coset of largest weight;
- (ii) R is the smallest number s such that every nonzero syndrome is a combination of s or fewer columns of H, and some syndrome requires s columns.
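Part (ii) of the theorem can be turned directly into a small brute-force procedure. The following Python sketch is only an illustration for tiny codes over a prime field, not the optimized algorithm developed in this paper; the function name `covering_radius` and the representation of H as a list of rows are assumptions made for the example.

```python
from itertools import combinations, product

def covering_radius(H, q):
    """Smallest s such that every syndrome is a combination of at most s columns of H.

    H is a list of rows with entries in range(q); q is assumed to be prime and
    H is assumed to have full row rank.
    """
    r, n = len(H), len(H[0])
    columns = [tuple(H[i][j] for i in range(r)) for j in range(n)]
    seen = {tuple([0] * r)}      # the zero syndrome needs no columns
    total = q ** r               # number of syndromes when H has full row rank
    for s in range(1, n + 1):
        for subset in combinations(columns, s):
            # all coefficient choices without zero coefficients
            for coeffs in product(range(1, q), repeat=s):
                syndrome = tuple(
                    sum(a * col[i] for a, col in zip(coeffs, subset)) % q
                    for i in range(r)
                )
                seen.add(syndrome)
        if len(seen) == total:
            return s
    raise ValueError("H does not have full row rank")

# Parity-check matrix of the binary [7,4,3] Hamming code:
# its columns are all nonzero vectors of length 3.
H_hamming = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]
print(covering_radius(H_hamming, 2))
```

For the Hamming code every nonzero syndrome is itself a column of H, so the search stops at s = 1, in agreement with the fact that this code is perfect.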
Several methods are known for determining the covering radius of a linear code. One approach is based directly on the above theorem and requires the identification of all coset leaders and their weights. Since there is a one-to-one correspondence between cosets and syndromes, we can also consider an algorithm that generates all syndromes. The covering radius is then interpreted as the smallest integer R such that all $(n-k)$-dimensional vectors (syndromes) can be represented as linear combinations of no more than R columns of a chosen parity-check matrix H of the code. In other words, we need to generate the full vector space $\mathbb{F}_q^{n-k}$, which is a computationally heavy problem. To shorten the search, we can also use a lower bound for R.
The number of vectors in a q-ary Hamming sphere of radius r in $\mathbb{F}_q^n$ is
$$V_q(n, r) = \sum_{i=0}^{r} \binom{n}{i} (q-1)^i.$$
This counts the vectors within distance r of a given center. The union of the spheres of radius R centered at the codewords must cover all vectors in $\mathbb{F}_q^n$. Thus,
$$q^k \, V_q(n, R) \ge q^n,$$
which can be rearranged to give
$$V_q(n, R) \ge q^{n-k}.$$
Consequently, the covering radius satisfies the combinatorial lower bound
$$R \ge \min\{r : V_q(n, r) \ge q^{n-k}\}.$$
If $q^k \, V_q(n, R) = q^n$, the code is called a perfect code. For the perfect codes, $R = \lfloor (d-1)/2 \rfloor$. This lower bound is called the Sphere Covering Bound and is also given in [8].
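The bound is cheap to evaluate, since it depends only on n, k, and q. A minimal Python sketch follows; the function names `ball_volume` and `sphere_covering_bound` are chosen here for illustration only.

```python
from math import comb

def ball_volume(n, r, q):
    """Number of vectors of F_q^n within Hamming distance r of a fixed center."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(r + 1))

def sphere_covering_bound(n, k, q):
    """Smallest r for which spheres of radius r around q^k codewords can cover F_q^n."""
    r = 0
    while ball_volume(n, r, q) < q ** (n - k):
        r += 1
    return r

print(sphere_covering_bound(7, 4, 2))    # binary Hamming code parameters
print(sphere_covering_bound(23, 12, 2))  # binary Golay code parameters
```

For the parameters of the binary Hamming code [7,4] and the binary Golay code [23,12], the bound evaluates to 1 and 3, respectively, and is met with equality because both codes are perfect.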
3. Parallel Computing Framework and Vectorization
The Message Passing Interface (MPI) is the de facto standard for parallel computing in distributed-memory environments. Unlike shared-memory models, MPI provides explicit communication primitives that allow processes to exchange data through messages. This design enables programs to scale efficiently across clusters, supercomputers, and heterogeneous systems where each processing unit has its own private memory space. More on parallel programming with MPI can be found in [21].
Vectorization, on the other hand, provides mechanisms for additional parallelization in algorithms that include vector operations. The main idea of vectorization is to execute an operation over multiple elements of the vector at the same time. This can be accomplished by utilizing extended vector registers that are available in most modern central processing units (CPUs). Vectorization with such registers can be used in combination with other parallel mechanisms and standards such as MPI.
3.1. MPI: The Master–Worker Paradigm
MPI programs follow the Single Program Multiple Data (SPMD) execution model: all processes execute the same program but operate on different portions of the data. Communication is performed through point-to-point or collective operations, such as MPI_Send, MPI_Recv, MPI_Bcast, and MPI_Reduce. This explicit communication model offers fine-grained control over performance and resource usage, making MPI suitable for complex, large-scale scientific computations.
A widely used pattern in MPI applications is the master–worker paradigm. In this model, one process (the master) coordinates the computation, while the remaining processes (the workers) perform the actual computational tasks. The master is responsible for distributing work units, collecting results, and determining global termination conditions. The workers repeatedly receive tasks from the master, execute them independently, and return results.
This paradigm is especially beneficial for applications where the workload is irregular or dynamically generated. It allows the master to balance the load by assigning tasks adaptively, preventing situations where some processes remain idle while others are overburdened. The flexibility of task granularity enables the programmer to tune the communication-to-computation ratio according to the characteristics of the problem and the underlying hardware. Such a strategy can be used for different problems, such as the classification of combinatorial objects [22].
MPI supports multiple communication strategies within master–worker systems. Tasks and results may be exchanged using blocking (MPI_Send/MPI_Recv) or non-blocking (MPI_Isend/MPI_Irecv) point-to-point communication; the non-blocking variants allow both the master and the workers to overlap computation with communication. Collective operations may be used for global broadcasts (e.g., to distribute shared parameters) or reductions (e.g., to aggregate partial results), although the master–worker pattern often favors point-to-point interactions to minimize synchronization overhead.
3.1.1. Advantages
The master–worker paradigm offers several advantages in distributed-memory systems:
Scalability: Since workers operate independently, the computation can scale to hundreds or thousands of processes with minimal contention.
Dynamic load balancing: The master can adjust the distribution of tasks at runtime, leading to improved utilization of computational resources.
Fault isolation: Errors are often confined to individual workers, and the master can detect failures or timeouts.
Simplicity of design: The structure of the algorithm becomes modular, separating coordination from computation.
3.1.2. Limitations
Despite its advantages, the master–worker model also presents challenges:
Master bottleneck: If the rate of task generation or result collection is high, the master may become a performance bottleneck.
Communication overhead: Fine-grained task distribution may lead to excessive message traffic unless carefully optimized.
Centralized control: The model relies on a single point of coordination, which may limit fault tolerance and scalability in extremely large systems.
3.2. Vectorization with Extended Vector Registers
An essential component of the proposed parallel procedure for computing the covering radius of linear codes over finite prime fields
, with
, is the rapid evaluation of linear combinations of columns of a given parity-check matrix
H. For a column vector
and a partial sum
, the update
is performed repeatedly within the enumeration process. Since all field elements can be represented in a single byte, the operation can be carried out efficiently via byte-wise modular addition within SIMD registers.
In the implementation, each column of
H is packed into a 128-bit register, allowing the simultaneous processing of 16 coordinates. Modular addition in
is realized through a fixed sequence of SIMD operations involving byte-wise addition, comparison against the modulus, construction of a correction mask, and a selective blend to enforce reduction modulo
q. Formally, letting
denote the byte-encoded vectors, the register-level computation implements
where all operations are evaluated componentwise and
is realized via SIMD comparison and mask selection. This avoids lookup-based techniques and ensures that the entire update remains vectorized. Some algorithms for vector operations using vectorization with registers are presented in [
23,
24].
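The sequence of register-level steps described above (byte-wise addition, comparison against the modulus, mask construction, and blend) can be emulated lane by lane in plain Python. This is a scalar sketch of the logic only, not the SSE4.1 code used in the implementation; the function name `add_mod_q_lanes` and the explicit requirement that the byte-wise sum must not overflow are assumptions introduced for the example.

```python
LANES = 16  # a 128-bit register holds 16 byte-sized coordinates

def add_mod_q_lanes(x, y, q):
    """Lane-wise (x + y) mod q for byte-encoded vectors with entries in range(q)."""
    if len(x) != LANES or len(y) != LANES:
        raise ValueError("expected 16 byte lanes")
    if 2 * (q - 1) > 0xFF:
        raise ValueError("byte-wise sum would overflow (assumption of this sketch)")
    total = [(a + b) & 0xFF for a, b in zip(x, y)]     # byte-wise addition
    mask = [0xFF if t >= q else 0x00 for t in total]   # comparison against the modulus
    reduced = [(t - q) & 0xFF for t in total]          # candidate value after subtracting q
    # blend: take the reduced value where the mask is set, the plain sum elsewhere
    return [(r & m) | (t & ~m & 0xFF) for t, r, m in zip(total, reduced, mask)]

x = [i % 7 for i in range(LANES)]
y = [(3 * i + 5) % 7 for i in range(LANES)]
print(add_mod_q_lanes(x, y, 7))
```

Because both inputs are already reduced, the sum is smaller than 2q, so a single conditional subtraction suffices and no division or table lookup is needed.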
The use of 128-bit SIMD instructions markedly reduces the number of scalar iterations required in conventional implementations. Since an entire column vector fits within a single register, memory traffic is minimal, and repeated access to the same columns during the algorithmic search incurs negligible overhead. The resulting instruction sequence is short, uniform across fields with , and well suited to superscalar execution.
Consequently, the SIMD-enhanced implementation achieves a substantial speedup in the evaluation of the linear combinations that dominate the computational complexity. These optimizations preserve exact arithmetic over $\mathbb{F}_q$ while allowing the algorithm to scale to parity-check matrices of nontrivial size and to explore significantly larger search spaces within practical time bounds.
Although wider SIMD extensions such as AVX2 and AVX-512 are available, they were not used in this work due to the interaction between vectorization and process-level parallelism. The proposed algorithm employs a large number of concurrent MPI worker processes, and on many architectures wider SIMD instructions lead to increased instruction latency and reduced core frequency, which can negatively affect overall throughput. Moreover, the computational kernel naturally fits into 128-bit registers, so wider SIMD registers do not significantly reduce the instruction count. For these reasons, SSE4.1 provides a balanced and portable solution, and a comparison with wider SIMD extensions was not pursued.
4. Sequential Algorithm
Let C be a linear code with parity-check matrix $H \in \mathbb{F}_q^{(n-k) \times n}$. Each syndrome $s = Hv^T$, where $v \in \mathbb{F}_q^n$, corresponds to a coset of C in $\mathbb{F}_q^n$. By Theorem 1, the covering radius is the maximal weight of a coset leader, which equals the largest, over all syndromes, of the minimal number of columns of H needed to express that syndrome as a linear combination. Thus, a syndrome is defined by a vector v, which corresponds to a linear combination of the columns of H. This gives us a direct algorithm to calculate the covering radius. Furthermore, from the sphere covering bound, we already know a lower bound on the covering radius, which depends only on n, k, and q. The algorithm generates linear combinations using an increasing number of columns, starting from this known lower bound. Starting from the bound ensures that no smaller values (which are provably insufficient to cover all syndromes) are tested, improving efficiency. The algorithm terminates when the sets of at most R columns generate all syndromes. For a given number of columns l, the number of column subsets is $\binom{n}{l}$, and each subset generates $(q-1)^l$ linear combinations with nonzero coefficients. In the worst case, we generate $\sum_{l=1}^{R} \binom{n}{l} (q-1)^l$ vectors, which is exponential in n.
In order to efficiently implement the algorithm for computing the covering radius, we introduce a syndrome array to keep track of which syndromes have already been generated. This requires a large amount of memory to be allocated and limits the codes for which we can calculate the covering radius using this algorithm. To reduce the memory footprint, we use a reduced syndrome array. Specifically, we use the following equivalence relation: the proportional syndromes $s$ and $\lambda s$ with $\lambda \in \mathbb{F}_q \setminus \{0\}$ are equivalent. Thus, all syndromes proportional to each other belong to the same equivalence class.
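One way to pick a representative of each class of proportional syndromes is to scale the syndrome so that its first nonzero coordinate becomes 1. The Python sketch below assumes a prime field; the helper names `normalize` and `class_count` are hypothetical and do not describe how the reduced syndrome array is indexed in the implementation.

```python
from itertools import product

def normalize(s, q):
    """Representative of the proportionality class of s: first nonzero entry scaled to 1."""
    for a in s:
        if a % q:
            inv = pow(a, -1, q)   # multiplicative inverse in the prime field F_q
            return tuple((inv * x) % q for x in s)
    return tuple(s)               # the zero syndrome forms its own class

def class_count(r, q):
    """Number of classes of nonzero proportional vectors in F_q^r."""
    return (q ** r - 1) // (q - 1)

classes = {normalize(s, 3) for s in product(range(3), repeat=3) if any(s)}
print(len(classes), class_count(3, 3))
```

Storing one flag per class instead of one per syndrome reduces the size of the array by a factor of roughly q - 1.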
From what has been discussed so far, two main subproblems can be distinguished for the efficient calculation of the covering radius of a linear code:
- 1.
Enumeration and efficient generation of linear combinations.
- 2.
Enumeration of all nonproportional syndromes and efficient management of memory resources for the syndrome array.
4.1. Generating Linear Combinations
Let us enumerate the columns of H as follows:
$$H = (h_1, h_2, \ldots, h_n),$$
where $h_i$ denotes the i-th column of H. We consider all linear combinations of at most L columns of H, that is, all vectors of the form
$$v_1 h_1 + v_2 h_2 + \cdots + v_n h_n,$$
where the weight $\mathrm{wt}(v)$ of $v = (v_1, \ldots, v_n) \in \mathbb{F}_q^n$ satisfies $1 \le \mathrm{wt}(v) \le L$. The vector v defines a linear combination. Instead of enumerating all vectors v of weight at most L, we introduce a structured (ordered) subset as follows:
Definition 1.
Define to be the ordered set, consisting of all vectors of , satisfying the following properties:
- 1.
For any , .
- 2.
The first nonzero coordinate of is equal to 1.
- 3.
For every with the last nonzero coordinate at position t, the vector and it precedes the vector , where denotes the standard basis vector with 1 in position t and zeros elsewhere.
The second condition ensures that for every nonzero scalar , the vectors v and do not both appear in . Thus, contains exactly one representative from each one-dimensional subspace generated by vectors of weight at most L. From the third property, we have that every linear combination defined by a vector can be obtained from the linear combination defined by a strictly smaller vector using only a single vector addition.
One such ordered set is generated by the execution of nested loops. Let us consider the following Algorithm 1, where the vector defining a linear combination of
i columns (without zero coefficients) is saved in the array
. This algorithm generates the vectors in
and their proportional.
| Algorithm 1 Generation of all vectors with up to L nonzero coordinates from |
- 1:
; - 2:
for
to
n
do - 3:
for to do - 4:
; - 5:
for to n do - 6:
for to do - 7:
; - 8:
⋮ - 9:
for to n do - 10:
for to do - 11:
;
|
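The nested-loop structure of Algorithm 1 can be imitated with a short recursive generator. The Python sketch below is an illustration under the assumption that field elements are encoded as integers in range(q); it uses recursion for brevity, whereas the paper focuses on a non-recursive emulation of the nested loops, and the name `vectors_up_to_weight` is chosen for the example.

```python
from math import comb

def vectors_up_to_weight(n, L, q):
    """Yield every nonzero vector of length n over range(q) with at most L nonzero coordinates."""
    v = [0] * n

    def extend(start, depth):
        for pos in range(start, n):      # position of the next nonzero coordinate
            for a in range(1, q):        # its nonzero coefficient
                v[pos] = a
                yield tuple(v)
                if depth < L:            # open one more nested loop level
                    yield from extend(pos + 1, depth + 1)
            v[pos] = 0                   # reset before moving to the next position

    if L >= 1:
        yield from extend(0, 1)

vectors = list(vectors_up_to_weight(4, 2, 3))
print(len(vectors), sum(comb(4, i) * 2 ** i for i in range(1, 3)))
```

Each generated vector differs in exactly one coordinate from the vector held at the enclosing loop level, which is the property that allows the corresponding linear combination of columns to be updated with a single vector addition.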
Let us consider the matrix whose rows are the nonzero vectors generated by Algorithm 1 in the same order. Then we prove the following lemma.
Lemma 1.
The vectors generated by Algorithm 1 are all vectors over with at most L nonzero coordinates. Moreover, the matrices satisfy the following recurrence relation: Proof. The vectors generated by the algorithm satisfy Properties 1 and 3 of Definition 1. Property 1 follows immediately from the fact that for each choice of positions the algorithm assigns all possible nonzero coefficients from to these positions.
To verify Property 3, observe that each loop modifies exactly one coordinate. Let
be the position whose value changes first during the iteration, and consider the vector
This vector is obtained from the vector
which is generated in the loop iterating over coordinate
. Subsequent vectors such as
, or vectors in which a later coordinate becomes nonzero, e.g.,
are obtained from
by increasing exactly one coordinate. Thus, Property 3 of Definition 1 is satisfied.
Finally, the ordered subset generated by Algorithm 1 also satisfies the recurrence relation given in Equation (
1). In particular, the rows of the matrix
enumerate the vectors in precisely the same order as produced by the nested-loop structure of the algorithm. □
Proposition 1.
Let denote the number of rows in the matrix , i.e.,Then satisfies the recurrence relationfor all integers and , with boundary conditions Proof. Starting from the definition of
, we have
where we used the binomial identity
.
Consider the two sums separately. For the first one, we obtain
For the second sum, substituting
, we get
Combining (3)–(5), we obtain
which is exactly the recurrence relation (
2). The boundary conditions
and
follow directly from the definition, since there are no nonzero vectors with weight 0 or with length 0. This completes the proof. □
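The counting argument can be checked numerically. The sketch below assumes that the number of rows equals $\sum_{i=1}^{L} \binom{n}{i}(q-1)^i$ and uses the recurrence $N(n, L) = N(n-1, L) + (q-1)\,(N(n-1, L-1) + 1)$ with $N(n, 0) = N(0, L) = 0$, which follows from this sum by Pascal's identity; its form may differ from the recurrence stated in the proposition. The function names are chosen for illustration.

```python
from functools import lru_cache
from math import comb

def count_closed(n, L, q):
    """Closed form: number of nonzero vectors of length n and weight at most L over F_q."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(1, L + 1))

@lru_cache(maxsize=None)
def count_rec(n, L, q):
    """Same count obtained from the recurrence with zero boundary conditions."""
    if n == 0 or L == 0:
        return 0
    # vectors whose first coordinate is zero, plus vectors whose first
    # coordinate is one of the q - 1 nonzero values followed by any
    # suffix of weight at most L - 1 (the zero suffix included)
    return count_rec(n - 1, L, q) + (q - 1) * (count_rec(n - 1, L - 1, q) + 1)

print(count_closed(6, 3, 3), count_rec(6, 3, 3))
```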
For we also have that . Algorithm 1 and Lemma 1 are in reference to a subset of that satisfies properties 1 and 3 of Definition 1. To generate an ordered subset for which all properties hold, we need to consider only the non-proportional vectors. More precisely, we consider only the vectors with the first nonzero coordinate equal to 1. One such subset is generated using the following Lemma 2.
Lemma 2.
Consider the modified version of Algorithm 1, where we add an outer loop that fixes the first nonzero coordinate:
We choose a position $s$ and set $v_s = 1$;
The next loop index starts from $s + 1$ and the remaining nested loops proceed as in Algorithm 1, generating up to $L - 1$ additional nonzero coordinates.
Then, this modified algorithm generates exactly all non-proportional vectors of Hamming weight . Moreover, if we denote by the matrix whose rows are the vectors generated by this modified algorithm, ordered according to the iteration of the loops, then satisfies the recurrence relationwhere is the matrix from (1) whose rows contain all vectors of length m and weight at most r over . Proof. We first show that the modified algorithm generates exactly one representative from each proportionality class of nonzero vectors with .
Each proportionality class contains a unique vector whose first nonzero coordinate is equal to 1. In the modified algorithm we choose a position $s$ and set $v_s = 1$. All coordinates before $s$ are zero, and the subsequent nested loops (with indices starting from $s + 1$) generate at most $L - 1$ additional nonzero coordinates with arbitrary nonzero coefficients from $\mathbb{F}_q$. Therefore:
Each generated vector has a Hamming weight between 1 and L, since it always has the fixed coordinate equal to 1 and at most $L - 1$ further nonzero coordinates;
Its first nonzero coordinate is exactly at position $s$, and its value is 1;
Every choice of $s$, of the subsequent coordinates, and of the coefficients of the remaining nonzero coordinates is realized by some iteration of the loops.
Hence, every nonzero vector of weight at most L is represented by exactly one vector generated by the algorithm, namely its unique normalization with the first nonzero coordinate equal to 1. This proves the first part of the statement.
For the recurrence (
6), if the first nonzero coordinate is at position
s, then the vector has the form
where
has Hamming weight at most
. The suffix
is precisely a row of
, the matrix that contains all vectors of length
and weight at most
over
.
Thus we can partition the rows of into blocks according to the position s of the first nonzero coordinate:
For
, we obtain the block
For
, we obtain the block
In general, for
, we obtain the block
Finally, the vector with a first nonzero coordinate at position n and weight 1 is given by the last row .
Stacking all these blocks in the order induced by the nested loops yields exactly the block matrix representation in (
6). The base cases
and
follow directly from the construction: for
and
there is only one non-proportional vector, namely (
1), while for
and
we obtain all vectors of weight 1 with first nonzero coordinate equal to (
1), realized in the three blocks of
.
This completes the proof. □
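The modified algorithm of Lemma 2 can be sketched in the same style: an outer loop fixes the position of the leading 1, and the remaining levels behave as in Algorithm 1. The Python code below is an illustrative sketch with integer-encoded field elements and a recursive helper; the name `normalized_vectors` is an assumption made for the example.

```python
from math import comb

def normalized_vectors(n, L, q):
    """Yield the nonzero vectors of weight at most L whose first nonzero coordinate is 1."""
    v = [0] * n

    def extend(start, weight):
        if weight == L:                  # no further nonzero coordinates allowed
            return
        for pos in range(start, n):
            for a in range(1, q):
                v[pos] = a
                yield tuple(v)
                yield from extend(pos + 1, weight + 1)
            v[pos] = 0

    if L < 1:
        return
    for s in range(n):                   # position of the leading 1
        v[s] = 1
        yield tuple(v)
        yield from extend(s + 1, 1)
        v[s] = 0

rows = list(normalized_vectors(3, 2, 3))
print(rows)
```

For n = 3, L = 2, and q = 3 the sketch produces 9 vectors, while there are 18 nonzero vectors of weight at most 2 in total, so exactly one vector out of every q - 1 = 2 proportional ones is kept.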
Proposition 2.
Let be the number of the rows in the matrix . Thenand satisfies the recurrence relationfor all integers and , with boundary conditions Proof. Obviously,
Using the binomial identity
, we obtain
For the second sum, set
. Then
Combining (
8) and (
9), we obtain
which is exactly the recurrence (
7).
The boundary conditions follow directly from the definition: for there are no nonzero vectors, so , and for there are no nonzero vectors of positive weight, so . □
Corollary 1.
Let us consider the number of rows in and , denoted and , respectively. Then .
Proof. Follows directly from Lemma 2 and Propositions 1 and 2. □
This structure leads to a highly efficient incremental enumeration method: the entire set of combinations can be generated using at most
L additional column vectors and at most one vector addition per step. Here, one subset
with the desired properties, defined in Definition 1, can be generated by using nested loops. The first loop selects the first nonzero position (and sets it to 1), the second loop selects a second position and cycles over all values in
, and so on. Similarly, the vectors in
can be represented as rows of a matrix
that follows the recurrence relation, defined by Equation (
6).
This recurrence relation gives a natural recursive algorithm for the generation of the vectors in
. The “step back”, presented in every recursive algorithm, represents “return” to the predecessor
for a given
, where
and
t is the last nonzero coordinate of
. We consider the methods presented in [
25] that emulate nested loops. More precisely, we focus on the non-recursive implementation that gives more control over the generation process. In the rest of the paper, we consider the set
that is generated by the recurrence relation, given in Equation (
6). We use this notation to refer to both the ordered set and its matrix representation.
4.2. Ranking and Unranking Algorithms
We can introduce ranking and unranking functions for the ordered set
. Such functions are defined for many combinatorial objects with given order such as permutations, variations, etc.
Table 1 presents the ordered set
for
,
,
and the rank
i of each linear combination represented by vector
for
. The table contains four sets of two columns. The first column in each set gives the rank of the vector given in the following columns.
Proposition 3.
Let us define the rank of a nonzero vector as its row number in . Then Algorithm 2 returns as a result the rank of the vector . Moreover, the mapping , induced by Algorithm 2, is a bijection between and .
The correctness of the ranking algorithm is based on the recurrence relation established in Lemma 2, which characterizes the structural organization of the matrix associated with the set . All vectors under consideration lie in and are assumed to satisfy the condition that their first nonzero coordinate is equal to 1. Because of this constraint, the total number of admissible vectors whose first nonzero coordinate occurs at any prescribed position is known explicitly. This structural information is central to the ranking procedure, even though the implementation itself does not employ recursion; the recurrence serves only as a conceptual description of the underlying combinatorial decomposition.
The algorithm for determining the rank (i.e., the index) of a given vector in the matrix representation of proceeds in two main stages. The first stage computes the number of vectors whose first nonzero coordinate (equal to 1 by definition) appears strictly later than in the vector being ranked. More precisely, if the first nonzero coordinate of the vector occurs at position i, the algorithm sums the known counts of all vectors whose first nonzero coordinate occurs in positions . This determines the initial offset of the vector within the matrix.
The second stage refines this offset by processing the remaining coordinates of the vector in order. At each position, the rank is incrementally updated according to the value of the current coordinate. The process continues coordinate by coordinate until the full vector has been examined.
It is worth noting that, although the recurrence in Lemma 2 conceptually organizes the matrix , the algorithm intentionally avoids recursive evaluation in order to achieve maximal computational efficiency. The number of steps in the ranking procedure is exactly n, yielding a linear-time method whose correctness follows directly from the structural decomposition implied by the lemma.
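The two stages of the ranking procedure can be illustrated with a short Python sketch. It is not Algorithm 2 itself: the ordering used here is an assumption (blocks with a later first nonzero position come first, and within a block the suffix is ordered lexicographically with 0 < 1 < ... < q-1), positions are 0-based, and the names `ball` and `rank_vector` are hypothetical. For simplicity the sketch loops over smaller candidate values at each coordinate, so it takes more than n elementary steps.

```python
from math import comb


def ball(m, w, q):
    """Number of vectors in F_q^m with at most w nonzero coordinates."""
    if w < 0:
        return 0
    return sum(comb(m, i) * (q - 1) ** i for i in range(min(m, w) + 1))


def rank_vector(v, L, q):
    """1-based rank of v under the assumed ordering (see lead-in)."""
    n = len(v)
    s = next((i for i, x in enumerate(v) if x != 0), None)
    if s is None or v[s] != 1 or sum(x != 0 for x in v) > L:
        raise ValueError("expected weight <= L and first nonzero coordinate 1")
    # Stage 1: skip every block whose first nonzero coordinate is later than s.
    rank = sum(ball(n - 1 - p, L - 1, q) for p in range(s + 1, n))
    # Stage 2: refine the offset coordinate by coordinate.
    left = L - 1  # nonzero coordinates still available in the suffix
    for i in range(s + 1, n):
        tail = n - 1 - i
        for c in range(v[i]):  # candidate values smaller than v[i]
            rank += ball(tail, left if c == 0 else left - 1, q)
        if v[i] != 0:
            left -= 1
    return rank + 1
```

For example, with n = 3, L = 2, q = 3 the vector (0, 0, 1) receives rank 1 and (1, 2, 0) receives rank 9, the last of the nine admissible vectors.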
Proposition 4
(Correctness of unranking for non-proportional vectors)
. Let be the matrix obtained by Lemma 2, and let denote the number of vectors in . Let be a valid rank. Then Algorithm 3 returns the unique vector that has row number (rank) in exactly B.
| Algorithm 2 Rank function for |
- 1:
procedure rank_function() - 2:
Input: : nonzero vector of weight at most L whose first nonzero coordinate is 1 n: vector length L: maximum weight q: field size - 3:
Output: : index of in - 4:
, , , - 5:
- 6:
for to n do - 7:
if then - 8:
- 9:
, - 10:
while do ▹ Step 1: position of the first nonzero coordinate - 11:
- 12:
- 13:
- 14:
- 15:
▹ Step 2: counting vector - 16:
- 17:
- 18:
while do ▹ Step 3: coefficients and later nonzero positions - 19:
if then - 20:
- 21:
- 22:
else - 23:
- 24:
- 25:
- 26:
- 27:
- 28:
- 29:
return
|
Proof. We show that each step of Algorithm 3 reconstructs the unique vector corresponding to the .
For any non-proportional vector in , the first nonzero coordinate must be equal to 1 and must occur at some position .
For any position , the vectors whose first nonzero coordinate is at position s are either or of the form , where is a vector of length with at most nonzero coordinates and . The number of possible vectors is exactly .
Algorithm 3 iteratively subtracts these blocks from A until it reaches the unique j for which , and then assigns . Thus, the algorithm finds the correct first nonzero coordinate.
| Algorithm 3 Unrank function for |
- 1:
procedure unrank_function() - 2:
Input: B: rank of the target vector in . q: field size . n: vector length L: maximal number of nonzero coordinates. - 3:
Output: : reconstructed vector. - 4:
- 5:
, , , , - 6:
while do ▹ Step 1: find the first nonzero coordinate - 7:
- 8:
if then - 9:
, , - 10:
else - 11:
, , - 12:
if then - 13:
return - 14:
, - 15:
- 16:
while do ▹ Step 2: determine the remaining coordinates - 17:
- 18:
while do - 19:
if then - 20:
- 21:
if then - 22:
, - 23:
else - 24:
- 25:
if then - 26:
, - 27:
if then - 28:
return - 29:
, , break - 30:
else - 31:
, , break
|
After fixing the first nonzero coordinate, the remaining suffix must contain at most nonzero coordinates. At each position , the algorithm considers all possible values for that position, starting with 1 up to , using variable . We have the following possibilities for the current position i:
The current candidate value is smaller than . Then the number of vectors sharing all earlier coordinates but having this value at position i is precisely . In this case, the current value of the will be greater: the target vector follows all of these preceding vectors, and thus we subtract that value from the rank.
The current candidate is exactly the value of . In this case . The algorithm assigns and reduces the remaining number of nonzero coordinates by one. The counter is updated to the exact internal sub-rank inside the remaining suffix.
A possible value for coordinate i is 0. In this case, we have . We assign , increment the iterator and break the inner while loop.
When the remaining reaches zero, the algorithm has fully reconstructed the vector. □
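The block-subtraction argument of the proof can be mirrored in a compact Python sketch. It is an illustration under stated assumptions rather than a transcription of Algorithm 3: blocks with a later first nonzero position are assumed to come first, each suffix coordinate tries the values 0, 1, ..., q-1 in that order, positions are 0-based, and the helper names are hypothetical.

```python
from math import comb


def ball(m, w, q):
    """Number of vectors in F_q^m with at most w nonzero coordinates."""
    if w < 0:
        return 0
    return sum(comb(m, i) * (q - 1) ** i for i in range(min(m, w) + 1))


def unrank_vector(B, n, L, q):
    """Vector with 1-based rank B under the assumed ordering (see lead-in)."""
    total = sum(ball(n - 1 - p, L - 1, q) for p in range(n))
    if not 1 <= B <= total:
        raise ValueError("rank out of range")
    a = B - 1
    v = [0] * n
    # Step 1: subtract whole blocks until the first nonzero position is found.
    s = n - 1
    while a >= ball(n - 1 - s, L - 1, q):
        a -= ball(n - 1 - s, L - 1, q)
        s -= 1
    v[s] = 1
    # Step 2: fix the remaining coordinates one at a time.
    left = L - 1
    for i in range(s + 1, n):
        tail = n - 1 - i
        for c in range(q):
            block = ball(tail, left if c == 0 else left - 1, q)
            if a < block:  # the target lies in the block of candidate value c
                v[i] = c
                if c != 0:
                    left -= 1
                break
            a -= block  # the target comes after all vectors with this value
    return tuple(v)
```

With n = 3, L = 2, q = 3 the nine valid ranks produce nine distinct vectors, starting with (0, 0, 1) and (0, 1, 0).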
4.3. Generating Ordered Subsets of the Set
The ordered set
can be generated by nested loops when we know the exact values of
n,
L,
q. This, however, is not practical since these values can change dynamically, depending on the problem. Such is the case with computing the covering radius of a linear code, where we would iterate the value of
L. For this purpose, we generate the set
using two auxiliary arrays, as shown in [
25]:
– strictly increasing positions of the nonzero coordinates of ,
– corresponding nonzero coefficients in .
Given such a pair
, the vector
is uniquely reconstructed by
Conversely, any vector
of Hamming weight at most
l can be encoded in this way using at most
l pairs
.
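The correspondence between a vector and its pair of auxiliary arrays can be written down directly. The following Python sketch uses 0-based positions and the hypothetical names `encode` and `decode`; it only illustrates the encoding and is not taken from [25].

```python
def encode(v):
    """Return (positions, coefficients) of the nonzero coordinates of v."""
    pos = [i for i, x in enumerate(v) if x != 0]  # strictly increasing
    coef = [v[i] for i in pos]                     # matching nonzero values
    return pos, coef


def decode(pos, coef, n):
    """Rebuild the length-n vector from its positions and coefficients."""
    v = [0] * n
    for p, c in zip(pos, coef):
        v[p] = c
    return v
```

For instance, the vector (0, 1, 0, 2, 0) is encoded as positions [1, 3] and coefficients [1, 2], and decoding that pair with n = 5 returns the same vector.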
When considering a parallelization of an algorithm, we need to have an efficient way to divide the work evenly among different computational units. Thus, we consider an algorithm that generates a fixed number of vectors in
, starting with vector with
, while also maintaining the order of the set. The goal of the procedure presented in Algorithm 4 is the following: given an initial rank
B, we use the unranking Algorithm 3 to initialize
and then update
in a purely iterative (non-recursive) manner that emulates the nested loops over positions and coefficients. Algorithm 1 serves as an intuitive illustration of the enumeration order. The non-recursive implementation used in Algorithm 4 follows the same logic and is described in detail in [
25]. This produces the subsequent vectors in the same order as the matrices
or
.
Proposition 5 (Correctness of the non-recursive generator).
Let be the matrix obtained by Lemma 2, and let the procedure
unrank_function
implement Algorithm 3. Let B be a valid rank and let be the corresponding vector. If we initialize the arrays and from as in Algorithm 4, then the subsequent updates of inside the outer loop over h and the inner
Repeat
block produce exactly the same sequence of vectors as the original nested-loop algorithm, starting from rank B and continuing with all subsequent vectors in .
Proof. Any vector is uniquely represented by strictly increasing positions and nonzero coefficients , with . Conversely, any such pair determines a unique vector by placing at position and zeros elsewhere. Thus there is a bijection between admissible configurations and vectors in .
The call unrank_function returns with rank B. The subsequent initialization loop extracts all nonzero positions into , and then appends the last nonzero position and its coefficient. The adjustment ensures that the first update in the inner loop restores the original coefficient, so that the first vector visited by the generator is precisely .
| Algorithm 4 Generation of linear combinations starting from rank B |
- 1:
procedure lin_comb_rank_covering_worker() - 2:
Input: n: vector length, L: maximal number of nonzero coordinates q: field size , B: rank of the starting vector - 3:
Global: : current vector in : positions of nonzero coordinates : corresponding coefficients ▹Step 0: obtain the starting vector from its rank - 4:
unrank_function() ▹ now has rank B - 5:
position of the last nonzero coordinate of ▹ Initialize helper arrays up to the last nonzero position - 6:
, , - 7:
for to do - 8:
if then - 9:
, , - 10:
▹ Add the last nonzero position and adjust its coefficient - 11:
▹ Main iterative generator: non-recursive simulation of nested loops - 12:
for to n do - 13:
if then - 14:
, , - 15:
repeat ▹ (1) Increment coefficient or position at depth j - 16:
if then - 17:
- 18:
else - 19:
, ▹ (2) Optionally reconstruct from ▹ (3) Control the depth j (enter/exit virtual nested loops) - 20:
if then - 21:
if then - 22:
, - 23:
if then - 24:
- 25:
else - 26:
if then - 27:
- 28:
until
|
The arrays and serve as a mechanism for emulating the nested loops that appear in Algorithm 1. The array encodes the sequence of active coordinates of the vector generated during the algorithm. By construction, its entries form a strictly increasing sequence ranging from 1 to n, thereby specifying the positions in which nonzero values may occur. In other words, abstracts the control flow of the outer loops, each iteration selecting a new coordinate of to be updated.
The second array, , determines the field value assigned to the coordinate indexed by . For each j, the value ranges from 1 to , where q denotes the cardinality of the underlying finite field . Thus, always encodes a nonzero element of , while identifies the corresponding active coordinate of the vector .
Together, the arrays and reproduce the combinatorial structure of the original nested-loop formulation.
The control logic for j (entering and exiting deeper levels) implements the next outer loops: as long as there is room for more nonzero coordinates (positions strictly less than k and ), a new level is entered by setting , with an appropriate initial coefficient. When reaches n and the coefficient at this level has cycled through all values, one level is exited by decreasing j. This is precisely what happens in a classical nested-loop enumeration when the innermost index reaches its maximum and returns, causing the next outer index to be incremented.
The outer loop over h moves the first nonzero position from its initial value up to n, resetting j and when h changes. Therefore, the combined effect of the outer loop and the inner Repeat loop is to traverse all admissible patterns of positions and coefficients in exactly the same order as the original nested-loop generator. □
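The way two arrays and a depth counter can stand in for a variable number of nested loops is sketched below in Python. This is a simplified illustration, not Algorithm 4: it always starts from the first vector instead of an arbitrary rank B, it uses 0-based positions, it assumes L >= 1, and the name `iter_combinations` is hypothetical. The visiting order is a depth-first order chosen for the sketch.

```python
def iter_combinations(n, L, q):
    """Yield vectors of weight <= L with first nonzero coordinate 1, iteratively."""
    pos = [0] * L   # positions of the active nonzero coordinates
    coef = [0] * L  # coefficients at those positions

    def current(depth):
        v = [0] * n
        for t in range(depth + 1):
            v[pos[t]] = coef[t]
        return tuple(v)

    for h in range(n):  # outer loop: first nonzero position, coefficient 1
        j = 0
        pos[0], coef[0] = h, 1
        yield current(j)
        while True:
            if j + 1 < L and pos[j] + 1 < n:
                # Enter a deeper virtual loop at the next free position.
                j += 1
                pos[j], coef[j] = pos[j - 1] + 1, 1
            else:
                # Advance at the current depth, exiting levels that are exhausted.
                while j > 0:
                    if coef[j] < q - 1:
                        coef[j] += 1
                        break
                    if pos[j] + 1 < n:
                        pos[j] += 1
                        coef[j] = 1
                        break
                    j -= 1
                if j == 0:
                    break  # depth 0 is controlled by the outer loop over h
            yield current(j)
```

For n = 3, L = 3, q = 2 the generator visits 100, 110, 111, 101, 010, 011, 001, that is, all seven nonzero binary vectors of length 3.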
4.4. Syndrome Enumeration
Let us now consider the second main aspect of the implementation of the sequential algorithm—the syndrome array representation. We can minimize both the computational and memory costs by considering only nonproportional linear combinations and, therefore, nonproportional syndromes. For a linear
code we have that their number s is
. To compute the covering radius we need to generate all
syndromes and save the newly obtained ones from each step. If we consider an ordering of the syndromes, then we can again use an enumeration (ranking) to represent a syndrome as a single integer. We can use the ordering given in Definition 1, where
. Another more natural enumeration arises from the lexicographical ordering of the nonproportional vectors in
. These vectors can be regarded as the column vectors of the generator matrix of the
simplex code
. There is a recurrence relation for the generator matrix
of a simplex code of dimension
m as shown in (
11):
where
and
. The columns of the generator matrix of the simplex code also represent all points of the projective geometry
. This recurrence relation gives a simple enumeration, including ranking and unranking functions for the
-dimensional vectors.
Proposition 6. Let be a column of the generator matrix of the simplex code , constructed using Equation (11), and let us enumerate the columns of , starting with 1. Then, with every vector v we associate a unique integer , that corresponds to its column number in and , where j is the first nonzero coordinate of v. Moreover, Algorithm 5 returns the value of for a given vector v.
Proof. The proof is similar to the proof of Proposition 3. First, we find the first nonzero coordinate, as in Algorithm 5. Let j be the position of the first nonzero coordinate, so that the vector is of the form . Thus, we know the number of columns of whose coordinates from 1 to are all equal to zero, and that number is exactly . This follows from the recursive construction of and the number is the dimension of the simplex code . We add this value to the value for the rank . For vector we have possibilities. These vectors are ordered lexicographically as columns of . Thus, we can calculate the rank of using the standard integer encoding in base q with the appropriate exponent for q (here the enumeration of the coordinates starts with 0 from left to right). Thus, the position in this subset of columns is exactly and we need to add this number to the value of . Finally, we add 1, since the enumeration of the columns of starts with 1.
Algorithm 5 implements the given counting steps. In the general case, has first nonzero coordinate . Thus, it is not a column of and we need to normalize it using the inverse element. We use a precomputed table that contains the inverse element for all elements of the field. □
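The counting steps of the proof can be sketched in Python as follows. The sketch assumes a prime field (so that inverses can be computed with the built-in `pow(x, -1, q)`, available from Python 3.8, instead of the precomputed table used in the paper), 0-based positions, and the hypothetical name `point_to_int`. The offset (q^(m-j) - 1)/(q - 1) for a leading 1 at 1-based position j is reconstructed from the recursive structure of the simplex generator matrix.

```python
def point_to_int(v, q):
    """1-based rank of the projective point spanned by v; 0 for the zero vector.

    Assumption: q is prime and the entries of v are integers modulo q.
    """
    m = len(v)
    j = next((i for i, x in enumerate(v) if x % q != 0), None)
    if j is None:
        return 0
    inv = pow(v[j], -1, q)  # normalize so that the leading coordinate is 1
    tail = [(x * inv) % q for x in v[j + 1:]]
    # Columns whose leading 1 occurs later than position j come first.
    offset = (q ** (m - 1 - j) - 1) // (q - 1)
    value = 0
    for x in tail:  # base-q value of the coordinates after the leading 1
        value = value * q + x
    return offset + value + 1
```

Proportional vectors receive the same rank: over F_3, both (1, 2) and (2, 1) = 2 * (1, 2) are mapped to 4.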
| Algorithm 5 Ranking algorithm for the points in |
- 1:
procedure pointToInt() - 2:
Input: : length of vector, for linear code q: finite field : input vector, where - 3:
Output: r: rank of vector v - 4:
- 5:
- 6:
for to do - 7:
- 8:
- 9:
while do - 10:
- 11:
if then - 12:
return 0 - 13:
if then - 14:
▹ Normalize the vector - 15:
for to do - 16:
- 17:
- 18:
- 19:
for to do - 20:
- 21:
if then - 22:
- 23:
- 24:
return a
|
For the unranking algorithm we follow similar logic as in Proposition 4. We iteratively subtract from the rank a the number of vectors where the first nonzero coordinate is at position . When , for some i, then we have found the first nonzero coordinate and set the corresponding position of v to 1. The remaining coordinates are the digits of the base-q representation of the integer a. Algorithm 6 implements a procedure to calculate the corresponding vector to a given rank a.
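A minimal Python sketch of this unranking idea is given below. It assumes that columns with a later leading 1 come first, so the block with leading 1 at 0-based position j contains q^(m-1-j) columns, and it uses the hypothetical name `int_to_point`; it is an illustration and not a transcription of Algorithm 6.

```python
def int_to_point(a, m, q):
    """Normalized vector of length m with 1-based rank a (see assumptions above)."""
    if not 1 <= a <= (q ** m - 1) // (q - 1):
        raise ValueError("rank out of range")
    a -= 1
    # Find the position of the leading 1 by subtracting whole blocks.
    j = m - 1
    size = 1
    while a >= size:
        a -= size
        size *= q
        j -= 1
    v = [0] * m
    v[j] = 1
    # Decode what is left of a in base q, least significant digit last.
    for i in range(m - 1, j, -1):
        v[i] = a % q
        a //= q
    return v
```

For q = 3 and m = 2 the ranks 1, 2, 3, 4 give (0, 1), (1, 0), (1, 1), (1, 2).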
We can save the newly obtained syndromes as integers in an array once we have an enumeration method. A syndrome can be generated as a linear combination of different columns of the parity-check matrix H. However, we only need to keep track of whether a syndrome has already been generated. Thus, a single bit suffices to record whether the corresponding syndrome has been generated. We can use a dynamic structure such as a bitset, or an array of unsigned integers with appropriate indexing, where each bit of an array element corresponds to one syndrome. If we consider an array of 64-bit integers, then the bit that corresponds to the syndrome with rank r will be at element of the array and at bit , using the standard enumeration of bits from right to left, starting with 0.
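The bit-level bookkeeping can be sketched as follows. The class name `SyndromeBitset` is hypothetical, and the indexing is an assumption: the 1-based rank r is shifted to the 0-based index r - 1, which is then split into a word index (quotient by 64) and a bit index (remainder modulo 64). The exact indexing used in the original implementation may differ.

```python
class SyndromeBitset:
    """One bit per syndrome, stored in a list of 64-bit words."""

    def __init__(self, count):
        self.count = count                       # total number of syndromes
        self.words = [0] * ((count + 63) // 64)  # enough words for all bits
        self.marked = 0                          # distinct syndromes seen so far

    def mark(self, r):
        """Mark the syndrome with 1-based rank r; return True if it is new."""
        word, bit = divmod(r - 1, 64)  # assumed indexing, see lead-in
        mask = 1 << bit
        if self.words[word] & mask:
            return False
        self.words[word] |= mask
        self.marked += 1
        return True

    def complete(self):
        """True once every syndrome has been marked."""
        return self.marked == self.count
```

Keeping a counter of newly set bits lets the completion check run in constant time instead of scanning the whole array.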
| Algorithm 6 Unranking algorithm for the points in |
- 1:
procedure intToPoint(a,len,q) - 2:
Input: a: rank of vector : length of vector, for linear code q: finite field - 3:
Output: : vector, that corresponds to column a of - 4:
for to do - 5:
- 6:
▹ Find position of the leading 1 - 7:
while do - 8:
- 9:
- 10:
- 11:
- 12:
▹ Decode remaining a in base q - 13:
while do - 14:
- 15:
- 16:
- 17:
return v
|
5. Parallel Implementation
The main challenges in parallelizing the algorithm for computing the covering radius of a linear code are synchronization and data access. In a parallel implementation, the computations are divided among different working units that execute calculations at the same time. In our case, the sequential algorithm iterates through the values of l starting with , where l is the number of columns of H that are included in the linear combination. After the generation for a given l is completed, we need to check whether all syndromes have been generated. Thus, in a parallel implementation we need to synchronize all computational units. Furthermore, each computational unit must have access to the syndrome array. For this purpose, we consider a master–worker strategy with MPI.
In the proposed approach, the master process iterates through the values of l, sends it to the workers, receives the newly generated syndromes from the workers, marks them in the syndrome array and ends calculations if all syndromes are generated. The basic outline of the master process is given in Algorithm 7.
The workers compute the rank of the starting linear combination and the number of linear combinations that the process will generate
. For the generation of linear combinations, we also include vectorization using the SSE4.1 instruction set. An implementation of vector addition over prime fields with up to 127 elements using extended registers is presented in [
23]. For the generation, the worker processes use Algorithm 4. Since each worker generates approximately the same number of linear combinations, the workers send only the newly obtained syndromes, in smaller chunks. In this way, the master does not receive all syndromes at the same time, which minimizes the idle time for both master and worker processes. Algorithm 8 gives an outline of the workflow of the worker processes. Here, to simplify the presentation, we use the function
GenerateCombinationByIndex (start, R). In practice, we use an implementation of Algorithm 4 with appropriate modifications.
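The overall loop that the master and the workers carry out together can be imitated in a single process. The Python sketch below is a brute-force stand-in under several assumptions: q is prime, H has full rank, no lower bound is used, and the combinations are enumerated with `itertools` instead of Algorithm 4. The names `normalize` and `covering_radius` are hypothetical, and no MPI communication is modeled.

```python
from itertools import combinations, product


def normalize(s, q):
    """Scale s so that its first nonzero entry is 1; None for the zero vector."""
    lead = next((x for x in s if x), None)
    if lead is None:
        return None
    inv = pow(lead, -1, q)  # requires prime q and Python 3.8+
    return tuple((x * inv) % q for x in s)


def covering_radius(H, q):
    """Smallest l such that combinations of at most l columns of H give all syndromes."""
    r, n = len(H), len(H[0])
    cols = [tuple(H[i][j] for i in range(r)) for j in range(n)]
    total = (q ** r - 1) // (q - 1)  # nonproportional nonzero syndromes
    seen = set()
    for l in range(1, n + 1):  # the master iterates over l
        for idx in combinations(range(n), l):
            for rest in product(range(1, q), repeat=l - 1):
                cs = (1,) + rest  # first coefficient fixed to 1
                s = tuple(sum(c * cols[j][i] for c, j in zip(cs, idx)) % q
                          for i in range(r))
                key = normalize(s, q)
                if key is not None:
                    seen.add(key)  # mark the syndrome as generated
        if len(seen) == total:  # all syndromes found: l is the covering radius
            return l
    raise ValueError("H does not have full rank")
```

In the parallel version, the two inner loops are the part split among the workers, while the update of `seen` and the completion check stay with the master.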
| Algorithm 7 Master Process |
- 1:
Input: H: Parity-check matrix, : lower bound - 2:
Output: R: Covering radius - 3:
- 4:
while true do - 5:
Initialize - 6:
Broadcast NEW_R to all workers - 7:
- 8:
while do - 9:
Receive message m from any worker - 10:
if SYNDROME_LIST then - 11:
for all in do - 12:
- 13:
else if DONE then - 14:
- 15:
if all entries in are true then - 16:
Broadcast TERMINATE to all workers - 17:
return R - 18:
|
| Algorithm 8 Worker Process |
- 1:
while true do - 2:
Receive message m from Master - 3:
if TERMINATE then - 4:
return - 5:
else if NEW_R then - 6:
- 7:
- 8:
- 9:
- 10:
- 11:
- 12:
- 13:
- 14:
- 15:
for to do - 16:
▹ is a parity-check matrix - 17:
- 18:
Append to - 19:
if then - 20:
Send SYNDROME_LIST(buffer) to Master - 21:
- 22:
- 23:
if then - 24:
Send SYNDROME_LIST(buffer) to Master - 25:
Send DONE to Master
|
5.1. Computational Aspects
In the proposed approach, each worker process independently determines the subset of linear combinations it must process, based on its rank and the total number of workers. The master process is responsible solely for broadcasting the current radius R to all workers and for maintaining a global array of discovered syndrome indices.
Let
denote the total number of nonzero linear combinations of
l columns of the parity-check matrix
H, and let
P be the total number of worker processes. Worker
computes its processing interval as:
Within this interval, the worker enumerates each linear combination vector, computes its corresponding syndrome
s, maps
s to its rank, and accumulates these ranks in a local buffer. Whenever the buffer reaches a predefined capacity, it is transmitted to the master, thereby reducing communication overhead. Upon completing its interval, the worker sends any remaining indices and a completion signal to the master. This design has several key benefits:
Minimal communication overhead: The master no longer distributes intervals, reducing the number of messages.
Simplicity and scalability: Workers compute their intervals independently, making the algorithm easily scalable to large clusters.
Immediate utilization: Workers can begin computation as soon as they receive R, without waiting for further master instructions.
Memory efficiency: Only the master maintains the global syndrome array; workers require minimal local storage.
In this approach, each worker handles approximately combinations. Assuming uniform cost per linear combination, the parallel time complexity is , neglecting communication overhead. Since workers send only newly obtained syndromes, after the first step the buffer arrays are not sent at the same time. Thus, the generation of linear combinations and the marking of newly obtained syndromes are executed simultaneously, which further reduces the running time.
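One common way for a worker to derive its own interval from its index and the number of workers is a balanced block split, sketched below in Python. This is an assumption about the partition rule, not the formula used in the paper; the function name `worker_interval`, the 0-based worker index w, and the half-open 0-based ranges are all illustrative choices.

```python
def worker_interval(N, P, w):
    """Half-open range [start, end) of 0-based ranks assigned to worker w.

    N: number of linear combinations, P: number of workers, w in 0..P-1.
    The first N % P workers receive one extra item each.
    """
    base, extra = divmod(N, P)
    start = w * base + min(w, extra)
    end = start + base + (1 if w < extra else 0)
    return start, end
```

For N = 10 and P = 3 the three workers obtain the ranges [0, 4), [4, 7) and [7, 10), so the interval sizes differ by at most one and no message from the master is needed to assign them.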
Let us consider the communication complexity. Each worker sends messages, where B is the buffer size for batching syndrome indices. The total communication overhead is therefore messages per radius l, which is small compared to the computational cost when B is chosen appropriately (e.g., 1000 indices per message).
With this parallel approach we also minimize the memory complexity. Each worker maintains only a local buffer of size B and at most l linear combination vectors, leading to memory per worker. The master maintains a global array of size , which dominates memory usage but resides centrally.
5.2. Comparison with Parallelization in Shared Memory Systems
The proposed MPI-based master–worker design, in which each worker computes its processing interval independently, offers several advantages over a traditional shared-memory implementation with OpenMP for computing the covering radius of linear codes. In OpenMP, parallel threads share the same memory space. Updating a large global array of syndrome indices in parallel may lead to false sharing and contention, particularly if atomic operations or locks are used to synchronize writes. In the MPI approach, the master process centrally manages the global array, while workers accumulate indices locally and send them in batches. This minimizes synchronization overhead and prevents performance degradation caused by concurrent writes to shared memory. While OpenMP threads communicate through shared memory, excessive synchronization (e.g., for marking large arrays) can become a bottleneck. MPI allows explicit control over communication, with workers sending syndrome indices in batched messages. By tuning the buffer size, communication overhead can be significantly reduced, which is particularly advantageous for computations with large global arrays. Furthermore, this strategy can operate across multiple nodes in a cluster, allowing for massively parallel computations. Each worker can reside on a different physical machine, with independent memory, thereby enabling computations that exceed the memory limits of a single node.
6. Experimental Results
We compare the execution time of the vectorized implementation of the sequential algorithm with the covering radius function of the MAGMA computational algebra system V.2.29-4 and with that of GUAVA Version 3.20, a specialized package for error-correcting codes in GAP, the system for computational discrete algebra. In the evaluation of the parallel implementation, we consider different numbers of worker processes. The computations were executed on an Intel Core i9 12900K processor (Santa Clara, CA, USA) with a 3.2 GHz base clock frequency.
Firstly, let us consider the vectorized implementation using the sequential algorithm and SSE 4.1 instruction set that generates linear combinations iteratively. We have chosen this instruction set since it is widely available. In most of the presented cases a column of the parity-check matrix is written in not more than 16 bytes, thus fitting in a single 128-bit register. Furthermore, the use of larger registers can result in lowering the working clock frequency in multithreaded programs [
26].
Table 2 presents the execution times of this implementation and gives the speedup in comparison to the MAGMA and GUAVA functions. All presented execution times are in seconds. The first three columns give the code parameters
respectively. Afterward, we give the covering radius for the used codes. The next column gives the execution time of the MAGMA function using the online calculator. In the sixth column we give the execution time for the vectorized implementation with SSE4.1 instruction set. The following column gives the obtained speed-up, calculated by the formula
, where
is the given execution time of MAGMA and
is the execution time using SSE4.1 instructions. The last two columns give the execution time using the integrated function of the GUAVA package and the speed-up compared to our implementation. The speed-up is calculated analogously, using the formula
, where
is the execution time with GUAVA.
The vectorized implementation achieves a substantial performance improvement, with speedups ranging from 5.2 to 19.6 compared to the MAGMA function. When compared to the GUAVA package, the observed speedup ranges from 12 to 60. The improvement is due to a combination of implementation-level and algorithmic optimizations. The dominant factor is SIMD vectorization, which reduces the number of scalar operations in the generation of linear combinations. In addition, we use a lower bound on the covering radius to start the search from and to avoid a subset of redundant checks. Furthermore, the computational time depends not only on the number of syndromes, but also on how “close” the covering radius is to (how many iterations are executed). Finally, the algorithm does not generate the full set : we maintain a global counter of the number of distinct syndromes produced, and terminate as soon as this counter reaches the total number of syndromes, . This stopping criterion prevents unnecessary generation of combinations once the syndrome space has been exhausted and further reduces the overall runtime. This explains the difference in the obtained speed-up compared to both packages.
Let us now consider the implementation with MPI and master–worker strategy.
Table 3 shows the execution times with different number of worker processes and buffer size 1,000,000 elements. We also use SSE extended registers in the implementation. All times are given in seconds. The first four columns show the parameters
q,
n,
k and the calculated
R of the target linear codes. In the following columns, we give the execution times with 1, 2, 4, 8 and 16 workers, respectively. The experimental results show good scalability in most cases when the number of workers increases up to 8. In the cases with 2 and 4 workers, the speedup relative to the case with 1 worker is 1.9 and 3.7, respectively. In the case with 8 workers, the observed speedup is between 5.1 and 6.6. Several factors affect the speedup, including the hardware (only 8 performance cores), the number of iterations needed to compute the covering radius (which depends on the code itself), communication overhead, etc. It is important to note that the presented master–worker algorithms differ from the traditional method. As can be seen in Algorithm 7, the master process also executes some work, namely, keeping track of the generated syndromes. Thus, increasing the number of worker processes can result in communication bottlenecks, which decrease the speedup as the number of worker processes grows. One approach to address the communication bottleneck is to use multiple master processes. Each master process tracks a subset of the syndromes, depending on its process id. The covering radius is obtained when all master processes have received their full subset of syndromes. On the worker side of the computations, we have multiple buffer arrays, one for each master process. A new syndrome is written in one of these arrays depending on its rank. The final result is obtained from the processes using collective communication with
MPI_Reduce.
Table 4 presents the computational times for an implementation with multiple master processes. Here the computations were executed on system Fujitsu Primergy RX 2540 M4 (Fujitsu Limited, Kawasaki, Japan) with 128 GB RAM, CPU 2x Intel Xeon Gold 5118 2.30 GHz 24 cores. The first four columns give the parameters of the codes for which the computations were executed, namely,
q,
n,
k,
R. The next column gives the number of master processes, followed by the columns with execution times in seconds with 1, 2, 4, 8, 16 and 24 worker processes. As can be seen from the table, with one master we observe good scalability up to 12 workers. Increasing the number of master processes results in slower execution in the case with one worker process. However, the scalability improves compared to the case with one master. This can also be seen in
Figure 1, which shows the speed-up for the experimental results, given in
Table 4. The speedup is calculated using the formula
, where
is the execution time using one worker and
is the time with
W workers
W = 2, 4, 8, 12, 16, 24.
Figure 1 shows that when using hyperthreading (the total number of processes exceeds the number of physical cores, e.g.,
), the speed-up grows more slowly.