Some New Constructions of q-ary Codes for Correcting a Burst of at Most t Deletions

Wentu Song; Kui Cai; Tony Q. S. Quek

doi:10.3390/e27010085


Science, Mathematics and Technology (SMT) Cluster, Singapore University of Technology and Design, Singapore 487372, Singapore

^* Author to whom correspondence should be addressed.

^† Part of the results of this article has been presented at the 2024 IEEE International Symposium on Information Theory, Athens, Greece, 7–12 July 2024.

Entropy 2025, 27(1), 85; https://doi.org/10.3390/e27010085

This article belongs to the Special Issue Coding Theory and Its Applications


Abstract

In this paper, we construct q-ary codes for correcting a burst of at most t deletions, where

t, q \geq 2

are arbitrarily fixed positive integers. We consider two scenarios of error correction: classical error correcting codes, which recover each codeword from one read (channel output), and reconstruction codes, which allow each codeword to be recovered from multiple channel reads. For the first scenario, our construction has redundancy

\log n + 8 \log \log n + o (\log \log n)

bits, encoding complexity

O (q^{7 t} n {(\log n)}^{3})

and decoding complexity

O (n \log n)

. For the reconstruction scenario, our construction can recover the codewords with two reads and has redundancy

8 \log \log n + o (\log \log n)

bits. The encoding complexity of this construction is

O (q^{7 t} n {(\log n)}^{3})

, and decoding complexity is

O (q^{9 t} {(n \log n)}^{3})

. Both of our constructions have lower redundancy than the best known existing works. We also give explicit encoding functions for both constructions that are simpler than previous works.

Keywords:

deletion correcting codes; sequence reconstruction; reconstruction codes; burst-deletion

1. Introduction

The study of deletion/insertion correcting codes, which originated in the 1960s, has made great progress in recent years, encouraged by their application to DNA-based data storage. One of the basic problems in this area is the construction of codes with low redundancy and low encoding/decoding complexity, where the redundancy of a q-ary

(q \geq 2)

code

C

of length n is defined as

n - \log_{q} | C |

, measured in q-ary symbols, or

(n - \log_{q} | C |) \log q

, measured in bits (in this paper, for simplicity, for any real

x > 0

, we write

\log_{2} x = \log x

). The optimal redundancy of a q-ary t-deletion correcting code of length n was proved to be asymptotically between

t \log n + t \log q + o (\log q n)

and

2 t \log n + t \log q + o (\log q n)

[1]. In general, codes with a redundancy matching the upper bound can be constructed by the graph-coloring method. However, the encoding complexity of such a construction is exponential in n. In practice, the construction of codes with polynomial encoding complexity (also called explicit construction) and low redundancy is an interesting research problem.

The famous VT codes were proved to be a family of single-deletion correcting binary codes, and are asymptotically optimal in redundancy [2]. The VT construction was generalized to nonbinary single-deletion correcting codes in [3] and, recently, a different version of nonbinary VT codes was proposed in [4] using differential vector, with asymptotically optimal redundancy and efficient encoding/decoding. Works on binary and nonbinary codes for correcting multiple deletions can be found in [5,6,7,8,9,10,11,12,13] and the references therein.

Burst deletions and insertions, i.e., deletions and insertions that occur at consecutive positions in a string, are a class of errors found in many applications, such as DNA-based data storage and file synchronization. For the binary case, the maximal cardinality of a t-burst-deletion correcting code (i.e., a code that can correct a burst of exactly t deletions) is proved to be asymptotically upper bounded by

2^{n - t + 1} / n

[14], so its redundancy is asymptotically lower bounded by

\log n + t - 1

. Several constructions of binary codes correcting a burst of exactly t deletions were reported in [15,16], where the construction in [16] achieves an optimal redundancy of

\log n + (t - 1) \log \log n + t - \log t

. A more general class, i.e., codes correcting a burst of at most t deletions, was also constructed in the same paper [16], and this construction was improved in [17] to achieve a redundancy of

⌈ \log t ⌉ \log n + (t (t + 1) / 2 - 1) \log \log n + c_{t}

for some constant

c_{t}

that only depends on t. In [18], by using VT constraint and shifted VT constraint in the so-called

(p, δ)

-dense strings, binary codes correcting a burst of at most t deletions were constructed, with an optimal redundancy of

\log n + t (t + 1) / 2 \log \log n + c_{t}^{'}

, where

c_{t}^{'}

is a constant depending only on t.

In recent parallel works [12,19], q-ary codes correcting a burst of at most t deletions were constructed for even integer

q > 2

, with redundancy

\log n + O (\log q \log \log n)

or, more specifically,

\log n + (8 \log q + 9) \log \log n + γ_{t}^{'} + o (\log \log n)

bits for some constant

γ_{t}^{'}

that only depends on t. The basic techniques in [12,19] are to represent each q-ary string as a binary matrix whose columns are the binary representation of the entries of the corresponding q-ary string. Strings of length n are constructed such that the first row of their matrix representation is

(p, δ)

-dense. Then, the first row of their matrix representation is protected by a binary burst deletion correcting code of length n and the other rows are protected by binary burst deletion correcting codes of length not greater than

2 δ

, which results in a construction with

\log n + O (\log q \log \log n)

bits of redundancy. A different construction of q-ary codes correcting a burst of at most t deletions was reported in a more recent work [4], which has redundancy

\log n + O (t^{2} \log \log n) + O (t \log q)

.

A relaxed model of error correction, called sequence reconstruction, has also received great attention from researchers (e.g., see [20,21,22,23,24,25,26,27,28,29,30,31,32]). Unlike classical error correcting codes, in the sequence reconstruction model, the receiver is allowed to reconstruct the original transmitted sequence from multiple noisy reads (channel outputs). This model is suitable for DNA data storage because current synthesis and sequencing technologies can generate many (possibly erroneous) reads for each DNA strand, and so each stored DNA strand can be recovered from its many erroneous copies. Sequence reconstruction for deletion, insertion, transposition, and substitution was first studied in [20,21], where the minimum number of reads for exact reconstruction of an uncoded sequence was computed. Coded sequence reconstruction for the deletion channel was considered recently in [23], where it was assumed that a codeword of a single-deletion-correcting code is transmitted over the t-deletion channel, and the minimum number of distinct reads required to uniquely reconstruct the transmitted sequence was computed. The more general problem, i.e., the minimum number of reads for reconstruction of a codeword of an

(ℓ - 1)

-deletion-correcting code of length n transmitted over the t-deletion channels for some

1 \leq ℓ \leq t < n

, was solved in [27]. The dual problem, i.e., designing codes (called reconstruction codes) that allow a sequence to be reconstructed from a fixed number of reads over the deletion channel, has also been considered in recent years. A construction of binary reconstruction codes for two reads and with

\log \log n + O (1)

bits of redundancy under single-deletion channel was presented in [24], and this construction was generalized in [28] to q-ary single-edit channel for

q \geq 2

. Binary reconstruction codes under 2-deletion channel were constructed in [29,30]. It was shown in [30] that

3 \log n + o (\log n)

bits of redundancy is sufficient for two reads, and

\log n + o (\log n)

bits of redundancy is sufficient for five reads. Reconstruction codes under single-burst-insertion/deletion/edit channel were considered in [31], where for the channel suffering from a single burst of at most t deletions, a family of q-ary codes for two reads with

t (t + 1) / 2 \log \log n + O (1)

bits of redundancy were constructed.

In this paper, we propose some new constructions of q-ary codes for correcting a burst of at most t deletions for any fixed positive integers t and

q \geq 2

. We consider both the classical error correcting codes and the reconstruction codes. In our constructions, we consider q-ary

(p, δ)

-dense strings (sequences), which are a generalization of the binary

(p, δ)

-dense strings defined in [18], and give an efficient algorithm for encoding and decoding of q-ary

(p, δ)

-dense strings. For the classical burst-deletion correcting codes, a VT-like function is used to locate the deletions within an interval of length not greater than

3 δ

, which results in

\log n

bits of redundancy. In addition, two functions are used to recover the substring destroyed by deletions, which results in

8 \log \log n + o (\log \log n)

bits of redundancy (in this paper, the term

o (\log \log n)

may depend on q and t. However, since q and t are assumed to be fixed positive integers, this dependence is omitted). Thus, the total redundancy of our construction is

\log n + 8 \log \log n + o (\log \log n)

bits. Compared to previous works, the redundancy of our new construction is independent of q and t in the second term. An explicit encoding function is given, which is simpler than previous works and has complexity

O (q^{7 t} n {(\log n)}^{3})

. The decoding complexity is

O (n \log n)

.

We also construct reconstruction codes for correcting a burst of at most t deletions from two reads and with redundancy

8 \log \log n + o (\log \log n)

bits, which is lower than the construction in [31]. We give an explicit encoding function for such codes with encoding complexity

O (q^{7 t} n {(\log n)}^{3})

and decoding complexity

O (q^{9 t} {(n \log n)}^{3})

.

In Section 2, we introduce related notations and concepts, and present some basic constructions that will be used in our new constructions. In Section 3, we study

(p, δ)

-dense q-ary strings. A new construction of classical q-ary burst-deletion correcting codes is given in Section 4, and q-ary codes for correcting burst deletions from two reads are given in Section 5. Finally, the paper is concluded in Section 6.

2. Preliminaries

Let

[m, n] = {m, m + 1, \dots, n}

for any two integers m and n, such that

m \leq n

, and call

[m, n]

an interval. If

m > n

, then let

[m, n] = Ø

. For simplicity, we denote

[n] = [1, n]

for any positive integer n. The size of a set S is denoted by

| S |

.

Given any integer

q \geq 2

, let

Σ_{q} = {0, 1, 2, \dots, q - 1}

. For any sequence (also called a string or a vector)

x \in Σ_{q}^{n}

, n is called the length of

x

, and denote

| x | = n

. We will denote

x = (x_{1}, x_{2}, \dots, x_{n})

or

x = x_{1} x_{2} \dots x_{n}

. For any set

I = {i_{1}, i_{2}, \dots, i_{m}} \subseteq [n]

such that

i_{1} < i_{2} < \dots < i_{m}

, denote

x_{I} = x_{i_{1}} x_{i_{2}} \dots x_{i_{m}}

, and call

x_{I}

a subsequence of

x

. If

I = [i, j]

for some

i, j \in [1, n]

such that

i \leq j

, then

x_{I} = x_{[i, j]} = x_{i} x_{i + 1} \dots x_{j}

is called a substring of

x

. We say that

x

contains

p

(or

p

is contained in

x

)

if

p

is a substring of

x

. For any

x \in Σ_{q}^{n}

and

y \in Σ_{q}^{n^{'}}

, we use

x y

to denote their concatenation, i.e.,

x y = x_{1} x_{2} \dots x_{n} y_{1} y_{2} \dots y_{n^{'}}

. We also use notations such as

x_{0}, x_{1}, \dots, x_{k}

to denote substrings of a sequence

x

. For example, the notation

x = x_{1} x_{2} \dots x_{k}

means that the sequence

x

consists of k substrings

x_{1}, x_{2}, \dots, x_{k}

.

Let

t \leq n

be a nonnegative integer. For any

x \in Σ_{q}^{n}

, let

D_{t} (x)

denote the set of subsequences of

x

of length

n - t

, and let

B_{t} (x)

denote the set of subsequences

y

of

x

that can be obtained from

x

by a burst of t deletions, that is,

y = x_{[n] ∖ D}

for some interval

D \subseteq [n]

of length

t

(i.e.,

D = [i, i + t - 1]

for some

i \in [n - t + 1])

. Moreover, let

B_{\leq t} (x) = ⋃_{t^{'} = 0}^{t} B_{t^{'}} (x)

, i.e.,

B_{\leq t} (x)

is the set of subsequences of

x

that can be obtained from

x

by a burst of at most t deletions. Clearly,

B_{1} (x) = D_{1} (x)

and

B_{t} (x) \subseteq D_{t} (x)

for

t \geq 2

.
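The sets defined above can be enumerated directly for short sequences. The following is a minimal sketch, not part of the constructions in this paper; the function names `deletion_ball`, `burst_ball` and `burst_ball_at_most` are illustrative choices, and sequences are represented as Python tuples with 0-based indexing.

```python
from itertools import combinations

def deletion_ball(x, t):
    """D_t(x): all subsequences of x of length len(x) - t."""
    n = len(x)
    return {tuple(x[i] for i in keep) for keep in combinations(range(n), n - t)}

def burst_ball(x, t):
    """B_t(x): results of deleting one run of t consecutive symbols from x."""
    n = len(x)
    return {tuple(x[:i]) + tuple(x[i + t:]) for i in range(n - t + 1)}

def burst_ball_at_most(x, t):
    """B_{<=t}(x): union of B_{t'}(x) for t' = 0, 1, ..., t."""
    out = set()
    for tp in range(t + 1):
        out |= burst_ball(x, tp)
    return out

x = (0, 1, 1, 2, 0)
# (1, 1, 0) is a subsequence of x (keep positions 2, 3, 5) but needs two
# non-adjacent deletions, so it lies in D_2(x) and not in B_2(x).
print((1, 1, 0) in deletion_ball(x, 2), (1, 1, 0) in burst_ball(x, 2))
```

For this example, the inclusion of the burst-deletion set in the general deletion set is strict when t = 2, while the two sets coincide for t = 1.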

A code

C \subseteq Σ_{q}^{n}

is said to be a t-deletion correcting code if, for any codeword

x \in C

, given any

y \in D_{t} (x)

,

x

can be uniquely recovered from

y

; the code

C \subseteq Σ_{q}^{n}

is said to be capable of correcting a burst of at most t deletions if, for any

x \in C

, given any

y \in B_{\leq t} (x)

,

x

can be uniquely recovered from

y

. More generally, let

B \in {D_{t}, B_{\leq t}}

and N be a positive integer. A code

C \subseteq Σ_{q}^{n}

is said to be an

(n, N, B)

-reconstruction code if, for any codeword

x \in C

,

x

can be uniquely recovered from any given N distinct sequences in

B (x)

. In this case, we also say that

x

can be uniquely recovered from N reads in

B (x)

. If

N = 1

, then an

(n, N, B)

-reconstruction code degenerates to the classical error correcting code for the error pattern

B

.

For any code

C \subseteq Σ_{q}^{n}

, the redundancy of

C

is defined as

n - \log_{q} | C |

measured in q-ary symbols or

\log q (n - \log_{q} | C |)

measured in bits. Clearly, if there is an encoding function that maps each length-k sequence (message) to a length-n codeword in

C

, then the redundancy of

C

is

n - k

. In this paper, we will always assume that q and t are fixed (i.e., q and t are constant with respect to

n)

.

A convenient way to construct deletion correcting codes is to construct sketches such that, for sufficiently many sequences, each sequence can be recovered from its (known) sketches and any one of its subsequences obtained by (a burst of) at most t deletions. The VT syndrome is a sketch for correcting a single deletion, which is defined as follows. For each

c = (c_{1}, c_{2}, \dots, c_{n}) \in Σ_{2}^{n}

, the VT syndrome of

c

is defined as

VT (c) = \sum_{i = 1}^{n} i c_{i} mod (n + 1) .

It was proved in [2] that for any

c \in Σ_{2}^{n}

, given

VT (c)

and any

y \in D_{1} (c)

, one can uniquely recover

c

.
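As an illustration of how the VT syndrome acts as a sketch, the following minimal example recovers a binary sequence from one of its single-deletion subsequences by exhaustive search over insertions. It is a brute-force check under the definitions above, not the efficient decoder from the literature; the names `vt_syndrome`, `single_deletions` and `vt_decode` are hypothetical.

```python
from itertools import product

def vt_syndrome(c):
    """VT(c) = sum_{i=1}^{n} i * c_i mod (n + 1), with 1-based positions."""
    n = len(c)
    return sum(i * ci for i, ci in enumerate(c, start=1)) % (n + 1)

def single_deletions(c):
    """D_1(c): all sequences obtained from c by deleting one symbol."""
    return {c[:i] + c[i + 1:] for i in range(len(c))}

def vt_decode(y, syndrome):
    """Brute force: try every single insertion into y and keep the
    candidates whose VT syndrome equals the given one."""
    candidates = set()
    for i in range(len(y) + 1):
        for bit in (0, 1):
            c = y[:i] + (bit,) + y[i:]
            if vt_syndrome(c) == syndrome:
                candidates.add(c)
    return candidates

# Exhaustive check for length 6: the candidate set is always the singleton {c}.
for c in product((0, 1), repeat=6):
    for y in single_deletions(c):
        assert vt_decode(y, vt_syndrome(c)) == {c}
```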

Suppose

q > 2

. For each

x = (x_{1}, x_{2}, \dots, x_{n}) \in Σ_{q}^{n}

, let

ϕ (x) = (ϕ {(x)}_{1}, ϕ {(x)}_{2}, \dots, ϕ {(x)}_{n}) \in Σ_{2}^{n}

be such that, for each

i \in [2, n]

,

ϕ {(x)}_{i} = 1

if

x_{i} \geq x_{i - 1}

and

ϕ {(x)}_{i} = 0

if

x_{i} < x_{i - 1}

. Moreover, let

ϕ {(x)}_{1} = 0

for all

x \in Σ_{q}^{n}

(one can also let

ϕ {(x)}_{1} = 1

for all

x \in Σ_{q}^{n}

). Using this mapping, q-ary codes for correcting a single deletion were constructed in [3].

Lemma 1.

[3] For any

x \in Σ_{q}^{n}

, given

VT (ϕ {(x)}_{[2, n]})

,

Sum (x)

and any

y \in D_{1} (x)

, one can uniquely recover

x

, where

ϕ {(x)}_{[2, n]} = (ϕ {(x)}_{2}, \dots, ϕ {(x)}_{n})

and

Sum (x) = \sum_{i = 1}^{n} x_{i} mod q .
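Lemma 1 can be checked exhaustively for small parameters. The sketch below is only a sanity check under the definitions above, with hypothetical function names; it uses brute-force search rather than the decoding procedure of [3].

```python
from itertools import product

def phi(x):
    """phi(x)_1 = 0 and, for i >= 2, phi(x)_i = 1 iff x_i >= x_{i-1}."""
    return (0,) + tuple(1 if x[i] >= x[i - 1] else 0 for i in range(1, len(x)))

def vt_syndrome(c):
    """VT(c) = sum_i i * c_i mod (len(c) + 1), with 1-based positions."""
    return sum(i * ci for i, ci in enumerate(c, start=1)) % (len(c) + 1)

def sketch(x, q):
    """The pair (VT(phi(x)_{[2,n]}), Sum(x)) used in Lemma 1."""
    return (vt_syndrome(phi(x)[1:]), sum(x) % q)

def recover(y, sk, q):
    """Brute force: all x with y in D_1(x) whose sketch equals sk."""
    out = set()
    for i in range(len(y) + 1):
        for a in range(q):
            x = y[:i] + (a,) + y[i:]
            if sketch(x, q) == sk:
                out.add(x)
    return out

# Exhaustive check for q = 3 and n = 5: recovery is always unique.
q, n = 3, 5
for x in product(range(q), repeat=n):
    for i in range(n):
        y = x[:i] + x[i + 1:]
        assert recover(y, sketch(x, q), q) == {x}
```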

2.1. A Construction of q-ary Burst-Deletion Codes

For codes correcting burst deletions, the following lemma extends the construction in [33] to q-ary codes

(q > 2)

, and will be used in our new construction.

Lemma 2.

For any positive integer m, there exists a function

h_{syn} : Σ_{q}^{m} \to Σ_{q}^{4 \log_{q} m + o (\log_{q} m)},

computable in time

O (q^{t} m^{3})

, such that

h_{syn} (x) \neq h_{syn} (x^{'})

for any distinct

x, x^{'} \in Σ_{q}^{m}

with

B_{\leq t} (x) \cap B_{\leq t} (x^{'}) \neq Ø

.

Proof.

The function

h_{syn}

can be constructed by the syndrome compression technique developed in [33].

For each

x \in Σ_{q}^{m}

, let

N_{t} (x)

be the set of all

x^{'} \in Σ_{q}^{m} ∖ {x}

, such that

B_{\leq t} (x) \cap B_{\leq t} (x^{'}) \neq Ø

. By simple counting, we have

\begin{matrix} | N_{t} (x) | \leq t m^{2} q^{t} . \end{matrix}

(1)

We first construct a function

\bar{h} : Σ_{q}^{m} \to [0, 2^{\bar{R}} - 1]

satisfying: 1)

\bar{R} = \frac{t (t + 1)}{2} (\log m + \log q)

; and 2)

\bar{h} (x) \neq \bar{h} (x^{'})

for any

x \in Σ_{q}^{m}

and any

x^{'} \in N_{t} (x)

. Specifically,

\bar{h}

is constructed as follows: For each

t^{'} \in [t]

and

j \in [t^{'}]

, let

{\bar{h}}_{t^{'}, j} (x) = (VT (ϕ {(x_{I_{t^{'}, j}})}_{[2, m_{t^{'}, j}]}), Sum (x_{I_{t^{'}, j}})),

where

I_{t^{'}, j} = {ℓ \in [m] : ℓ \equiv j mod t^{'}}

and

m_{t^{'}, j} = | I_{t^{'}, j} |

. Then, let

\bar{h} = ({\bar{h}}_{1, 1}, {\bar{h}}_{2, 1}, {\bar{h}}_{2, 2}, \dots, {\bar{h}}_{t, 1}, {\bar{h}}_{t, 2}, \dots, {\bar{h}}_{t, t}) .

Clearly,

| I_{t^{'}, j} | \leq ⌈ \frac{m}{t^{'}} ⌉

and so, when represented as a binary sequence, the length

| \bar{h} (x) |

of

\bar{h} (x)

satisfies the following (throughout this paper, for any given

q \geq 2

, if needed, we will view a positive integer m as a q-ary sequence which is the q-base representation of m, and conversely, we also view a q-ary sequence

z

as a positive integer whose q-base representation is

z

):

\begin{aligned} | \bar{h} (x) | & = \sum_{t^{'} = 1}^{t} \sum_{j = 1}^{t^{'}} | {\bar{h}}_{t^{'}, j} (x) | \\ & \leq \sum_{t^{'} = 1}^{t} \sum_{j = 1}^{t^{'}} (\log ⌈ \frac{m}{t^{'}} ⌉ + \log q) \\ & \leq \frac{t (t + 1)}{2} (\log m + \log q) \\ & = \bar{R} . \end{aligned}

Hence, viewed as a positive integer, we have

\bar{h} (x) \in [0, 2^{\bar{R}} - 1]

for any

x \in Σ_{q}^{m}

. Moreover, for each

t^{'} \in [t]

, if

y \in B_{t^{'}} (x)

, then we have

y_{I_{t^{'}, j}^{'}} \in D_{1} (x_{I_{t^{'}, j}})

for each

j \in [t^{'}]

, where

I_{t^{'}, j}^{'} = {ℓ \in [m - t^{'}] : ℓ \equiv j (\mod t^{'})}

. By Lemma 1,

x_{I_{t^{'}, j}}

can be recovered from

{\bar{h}}_{t^{'}, j} (x)

and

y_{I_{t^{'}, j}^{'}}

, and so

x

can be recovered from

y

and

\bar{h} (x)

. Equivalently, for any

x^{'} \in N_{t} (x)

, we have

\bar{h} (x) \neq \bar{h} (x^{'})

.

For each

x \in Σ_{q}^{m}

, let

P (x)

be the set of all positive integers j such that j is a divisor of

| \bar{h} (x) - \bar{h} (x^{'}) |

for some

x^{'} \in N_{t} (x)

. By the same discussions as in the proof of [33] (Lemma 4), we can obtain

| P (x) | \leq 2^{\log | N_{t} (x) | + o (\log m)} \leq O (q^{t} m^{3})

. (Note that q and t are assumed to be fixed integers and, by (1),

| N_{t} (x) | \leq t m^{2} q^{t}

.) So, by brute force search, one can find, in time

2^{\log | N_{t} (x) | + o (\log m)} \leq O (q^{t} m^{3})

, a positive integer

α (x) \leq 2^{\log | N_{t} (x) | + o (\log m)}

such that

α (x) \notin P (x)

. Let

h_{syn} (x) = (α (x), \bar{h} (x) mod α (x))

. Then, we have

h_{syn} (x) \neq h_{syn} (x^{'})

for all

x^{'} \in N_{t} (x)

. Equivalently,

h_{syn} (x) \neq h_{syn} (x^{'})

for any distinct

x, x^{'} \in Σ_{q}^{m}

with

B_{\leq t} (x) \cap B_{\leq t} (x^{'}) \neq Ø

.

Finally, since

α (x) \leq 2^{\log | N_{t} (x) | + o (\log m)}

is a positive integer, and by (1),

| N_{t} (x) | \leq t m^{2} q^{t}

, so viewed as a q-ary sequence, we have

h_{syn} (x) \in Σ_{q}^{4 \log_{q} m + o (\log_{q} m)}

. □
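The length count in the last step of the proof can be unpacked as follows. This is a short derivation under the stated assumption that q and t are fixed, so that the constant terms are absorbed into the lower-order term; the notation bits(·) for the number of bits needed to store an integer is introduced here only for illustration.

```latex
\begin{align*}
\mathrm{bits}\big(\alpha(x)\big)
  &\le \log |N_t(x)| + o(\log m)
   \le \log\!\big(t\, m^{2} q^{t}\big) + o(\log m)
   = 2\log m + o(\log m), \\
\mathrm{bits}\big(\bar{h}(x) \bmod \alpha(x)\big)
  &\le \mathrm{bits}\big(\alpha(x)\big)
   \le 2\log m + o(\log m), \\
|h_{syn}(x)|
  &\le \frac{4\log m + o(\log m)}{\log q}
   = 4\log_q m + o(\log_q m) \quad \text{($q$-ary symbols)}.
\end{align*}
```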

Clearly, for any

x \in Σ_{q}^{m}

, given

h_{syn} (x)

and any

y \in B_{\leq t} (x)

, one can uniquely recover

x

. This is because for any

x \neq x^{'} \in Σ_{q}^{m}

such that

y \in B_{\leq t} (x^{'})

, we have

y \in B_{\leq t} (x) \cap B_{\leq t} (x^{'})

, so

B_{\leq t} (x) \cap B_{\leq t} (x^{'}) \neq Ø

and by Lemma 2, we have

h_{syn} (x) \neq h_{syn} (x^{'})

. Thus,

x

is uniquely determined by

h_{syn} (x)

and

y

.
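The compression step in the proof of Lemma 2 can be illustrated on a toy instance. In the sketch below, the function to_int (the integer whose base-q representation is x) is an assumed stand-in for the sketch function from the proof: it is trivially injective, which is all the compression step needs. The names `compress` and `to_int` are hypothetical, and the exhaustive neighbor search is only feasible for very small m.

```python
from itertools import product

def burst_ball_at_most(x, t):
    """B_{<=t}(x) for a tuple x."""
    n = len(x)
    return {x[:i] + x[i + tp:] for tp in range(t + 1) for i in range(n - tp + 1)}

def to_int(x, q):
    """Integer whose base-q representation is x (stand-in for h-bar)."""
    v = 0
    for s in x:
        v = v * q + s
    return v

def compress(m, q, t):
    """Return ({x: (alpha(x), hbar(x) mod alpha(x))}, balls) over Sigma_q^m."""
    words = list(product(range(q), repeat=m))
    balls = {x: burst_ball_at_most(x, t) for x in words}
    table = {}
    for x in words:
        # Differences |hbar(x) - hbar(z)| over the neighbors z in N_t(x).
        diffs = [abs(to_int(x, q) - to_int(z, q))
                 for z in words if z != x and balls[x] & balls[z]]
        alpha = 1
        # Smallest positive integer that divides none of the differences.
        while any(d % alpha == 0 for d in diffs):
            alpha += 1
        table[x] = (alpha, to_int(x, q) % alpha)
    return table, balls

table, balls = compress(6, 2, 2)
print(max(alpha for alpha, _ in table.values()))  # largest modulus used
```

In this toy setting the compressed pair separates every two distinct sequences with intersecting burst-deletion balls, because a collision would force the chosen modulus to divide one of the excluded differences.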

2.2. Bounded Burst-Deletion Correction

We give a construction for correcting a single burst-deletion given the knowledge that the locations of the deleted symbols are within an interval of length

ρ

.

Given a positive integer

ρ

, we define a collection of intervals

L_{ρ} = {L_{j} : j = 1, 2, \dots, ⌈n / ρ⌉ - 1}

such that

L_{j} = \begin{cases} [(j - 1) ρ + 1, (j + 1) ρ], & \text{for } j \in \{1, \dots, ⌈n / ρ⌉ - 2\}, \\ [(j - 1) ρ + 1, n], & \text{for } j = ⌈n / ρ⌉ - 1 . \end{cases}

(2)

The following remark is easy to see.

Remark 1.

The intervals in

L_{ρ}

satisfy the following:

1): For any interval $L \subseteq [n]$ of length $| L | \leq ρ$ , there is a (not necessarily unique) $j_{0} \in [⌈n / ρ⌉ - 1] = {1, 2, \dots, ⌈n / ρ⌉ - 1}$ , such that $L \subseteq L_{j_{0}}$ .
2): $L_{j} \cap L_{j^{'}} = Ø$ for all $j, j^{'} \in [1, ⌈n / ρ⌉ - 1]$ , such that $| j - j^{'} | \geq 2$ .
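The two properties in Remark 1 can be verified numerically. The following sketch builds the intervals of (2) as 1-based inclusive pairs and searches for a covering interval; the function names are illustrative, and the code assumes n > ρ so that at least one interval exists.

```python
def intervals(n, rho):
    """Intervals L_1, ..., L_K of (2), with K = ceil(n / rho) - 1, as
    1-based inclusive pairs (lo, hi). Assumes n > rho so that K >= 1."""
    K = -(-n // rho) - 1  # integer form of ceil(n / rho) - 1
    out = []
    for j in range(1, K + 1):
        lo = (j - 1) * rho + 1
        hi = (j + 1) * rho if j < K else n  # the last interval ends at n
        out.append((lo, hi))
    return out

def covering_index(L, n, rho):
    """Some j0 (1-based) with L contained in L_{j0}, or None if there is none."""
    a, b = L
    for j, (lo, hi) in enumerate(intervals(n, rho), start=1):
        if lo <= a and b <= hi:
            return j
    return None

print(intervals(23, 4))  # [(1, 8), (5, 12), (9, 16), (13, 20), (17, 23)]
```

Consecutive intervals overlap in ρ positions, which is why every interval of length at most ρ fits inside one of them, while intervals whose indices differ by at least 2 are disjoint.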

Construction 1: Let

h : Σ_{q}^{m} \to Σ_{q}^{R_{m}}

be a function for any positive integer m, where

R_{m}

is a positive integer depending on

m, q

and t, such that

h (z) \neq h (z^{'})

for any distinct

z, z^{'} \in Σ_{q}^{m}

with

B_{\leq t} (z) \cap B_{\leq t} (z^{'}) \neq Ø

. Let

L_{ρ} = {L_{j} : j = 1, 2, \dots, ⌈n / ρ⌉ - 1}

, such that each

L_{j}

is defined by (2). For each

x \in Σ_{q}^{n}

and each

ℓ \in {0, 1}

, let

{\tilde{h}}_{ρ}^{(ℓ)} (x) = \sum_{\substack{j \in [1, ⌈n / ρ⌉ - 1] : \\ j \equiv ℓ \mod 2}} h (x_{L_{j}}) (\mod q^{R_{2 ρ}}) .

(3)

The modular operation

(\mod q^{R_{2 ρ}})

in (3) is performed on the result of the summation, but not on each

h (x_{L_{j}})

.

Lemma 3.

Suppose

x \neq x^{'} \in Σ_{q}^{n}

. Suppose there exists an interval

L \subseteq [n]

of length

| L | \leq ρ

and two intervals

D, D^{'} \subseteq L

of size

| D | = | D^{'} | \leq t

, such that

x_{[n] ∖ D} = x_{[n] ∖ D^{'}}^{'}

. Then, we have

{\tilde{h}}_{ρ}^{(ℓ)} (x) \neq {\tilde{h}}_{ρ}^{(ℓ)} (x^{'})

for some

ℓ \in {0, 1}

.

Proof.

By (2), for each

j \in {1, 2, \dots, ⌈n / ρ⌉ - 1}

, the length

| L_{j} |

of

L_{j}

satisfies

| L_{j} | \leq 2 ρ

. So, for any

x \in Σ_{q}^{n}

, the length

| h (x_{L_{j}}) |

of

h (x_{L_{j}})

(as a q-ary sequence) satisfies

| h (x_{L_{j}}) | = R_{| L_{j} |} \leq R_{2 ρ}

, which implies that (as an integer),

h (x_{L_{j}}) < q^{R_{2 ρ}} .

Since

| L | \leq ρ

, by 1) of Remark 1, there is a

j_{0} \in [⌈n / ρ⌉ - 1] = {1, 2, \dots, ⌈n / ρ⌉ - 1}

, such that

L \subseteq L_{j_{0}}

. Let

ℓ \in {0, 1}

be such that

ℓ \equiv j_{0} \mod 2

, and let

Λ_{ℓ} = {j \in [1, ⌈n / ρ⌉ - 1] : j \equiv j_{0} \mod 2} .

Then, by 2) of Remark 1,

L_{j} \cap L_{j_{0}} = Ø

for all

j \in Λ_{ℓ} ∖ {j_{0}}

. Further, by assumption of

D, D^{'}

and L, we have

x_{L_{j_{0}} ∖ D} = x_{L_{j_{0}} ∖ D^{'}}^{'} \in B_{\leq t} (x_{L_{j_{0}}}) \cap B_{\leq t} (x_{L_{j_{0}}}^{'})

and

x_{L_{j}} = x_{L_{j}}^{'}

for all

j \in Λ_{ℓ} ∖ {j_{0}}

. Therefore, we have

h (x_{L_{j_{0}}}^{'}) \neq h (x_{L_{j_{0}}})

and

h (x_{L_{j}}^{'}) = h (x_{L_{j}}), \forall j \in Λ_{ℓ} ∖ {j_{0}} .

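By the above discussions, and by Construction 1, the two sums in (3) for this ℓ differ by exactly

h (x_{L_{j_{0}}}) - h (x_{L_{j_{0}}}^{'})

, which is a nonzero integer of absolute value less than

q^{R_{2 ρ}}

and hence is not divisible by

q^{R_{2 ρ}}

. Therefore, we can obtain that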

{\tilde{h}}_{ρ}^{(ℓ)} (x) \neq {\tilde{h}}_{ρ}^{(ℓ)} (x^{'})

, which completes the proof. □
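Construction 1 and Lemma 3 can be exercised on a small instance. In the sketch below, h_toy (the integer with base-q digits z) is an assumed stand-in for h; it is injective on sequences of equal length, so it satisfies the requirement on h with R_m = m, but it has none of the redundancy savings of the function from Lemma 2. All function names are illustrative.

```python
from itertools import product

def intervals(n, rho):
    """Intervals L_1, ..., L_K of (2) as 1-based inclusive pairs (assumes n > rho)."""
    K = -(-n // rho) - 1  # ceil(n / rho) - 1
    return [((j - 1) * rho + 1, (j + 1) * rho if j < K else n)
            for j in range(1, K + 1)]

def h_toy(z, q):
    """Toy stand-in for h: the integer with base-q digits z, so R_m = m."""
    v = 0
    for s in z:
        v = v * q + s
    return v

def h_tilde(x, q, rho, ell):
    """Equation (3) with h = h_toy: sum over j = ell (mod 2), reduced once."""
    total = 0
    for j, (lo, hi) in enumerate(intervals(len(x), rho), start=1):
        if j % 2 == ell:
            total += h_toy(x[lo - 1:hi], q)
    return total % (q ** (2 * rho))  # R_{2 rho} = 2 rho for h_toy

x = (0, 1, 1, 0, 1, 0, 0, 1, 1)
print(h_tilde(x, 2, 2, 0), h_tilde(x, 2, 2, 1))
```

Only two values are stored per sequence, one for even and one for odd interval indices, which works because same-parity intervals are pairwise disjoint and a localized burst therefore changes exactly one summand of one of the two sums.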

Remark 2.

Let h be the function

h_{syn}

constructed in Lemma 2. Then, we have

R_{2 ρ} = 4 \log_{q} (2 ρ) + o (\log_{q} (2 ρ))

. Moreover, by Lemma 2, h is computable in time

O (q^{t} {(2 ρ)}^{3})

.

3. Pattern Dense Sequences

The concept of

(p, δ)

-dense sequences was introduced in [18], and was used to construct binary codes with redundancy

\log n + \frac{t (t + 1)}{2} \log \log n + c_{t}

for correcting a burst of at most t deletions, where n is the message length and

c_{t}

is a constant only depending on t. In this section, we generalize the

(p, δ)

-density to q-ary sequences and derive some important properties for these sequences that will be used in our new construction in the next section.

The q-ary

(p, δ)

-dense sequences can be defined similarly to the binary

(p, δ)

-dense sequences as follows.

Definition 1.

Let

d \leq δ \leq n

be three positive integers, and let

p \in Σ_{q}^{d}

be a sequence called a pattern. A sequence

x \in Σ_{q}^{n}

is said to be

(p, δ)

-dense if each substring of

x

of length δ contains at least one

p

. The indicator vector of

x

with respect to

p

is a vector

1_{p} (x) = (1_{p} {(x)}_{1}, 1_{p} {(x)}_{2}, \dots, 1_{p} {(x)}_{n}) \in Σ_{2}^{n}

such that, for each

i \in [n]

,

1_{p} {(x)}_{i} = 1

if

x_{[i, i + d - 1]} = p

, and

1_{p} {(x)}_{i} = 0

otherwise.

In this section, we will always let

p = 0^{t} 1^{t}

(so that

d = 2 t

)

and view

p = 0^{t} 1^{t} \in Σ_{q}^{2 t}

for any

q \geq 2

. Moreover, from Definition 1, we have the following simple remark.

Remark 3.

Each sequence

x \in Σ_{q}^{n}

can be written as the form

x = x_{0} p x_{1} p x_{2} p \dots x_{m - 1} p x_{m}

for some

m \geq 0

, such that each

x_{i}

,

i \in [0, m]

, is a (possibly empty) string that does not contain

p

. Moreover,

x

is

(p, δ)

-dense if and only if it satisfies the following: (1) the lengths of

x_{0}

and

x_{m}

are not greater than

δ - 2 t

; and (2) the length of each

x_{i}

,

i \in [1, m - 1]

, is not greater than

δ + 1 - 4 t

.
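The characterization in Remark 3 can be compared against Definition 1 by exhaustive search on short sequences. The sketch below is a sanity check only; the function names are hypothetical, indices are 0-based in the code, and it assumes 2t ≤ δ ≤ n.

```python
from itertools import product

def indicator(x, p):
    """1_p(x): entry i (0-based here) is 1 iff x[i:i+d] == p."""
    d = len(p)
    return tuple(1 if x[i:i + d] == p else 0 for i in range(len(x)))

def is_dense(x, p, delta):
    """Definition 1: every length-delta substring of x contains p."""
    d = len(p)
    ind = indicator(x, p)
    # A window starting at s contains p iff some occurrence starts in [s, s + delta - d].
    return all(any(ind[s:s + delta - d + 1]) for s in range(len(x) - delta + 1))

def gaps(x, p):
    """Lengths |x_0|, ..., |x_m| in the factorization x = x_0 p x_1 p ... p x_m."""
    d, ind = len(p), indicator(x, p)
    starts = [i for i, b in enumerate(ind) if b]
    bounds = [-d] + starts + [len(x)]  # sentinel occurrence ending just before x
    return [bounds[k + 1] - (bounds[k] + d) for k in range(len(bounds) - 1)]

def dense_by_remark3(x, t, delta):
    """Conditions (1) and (2) of Remark 3 for p = 0^t 1^t."""
    g = gaps(x, (0,) * t + (1,) * t)
    ends_ok = g[0] <= delta - 2 * t and g[-1] <= delta - 2 * t
    inner_ok = all(gi <= delta + 1 - 4 * t for gi in g[1:-1])
    return ends_ok and inner_ok

x = (0, 1, 0, 0, 1, 1, 0, 1)
print(indicator(x, (0, 1)), gaps(x, (0, 1)))
```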

In [18], the VT syndrome of

a_{p} (x)

was used to bound the location of deletions for

(p, δ)

-dense

x

, where

a_{p} (x)

is a vector of length

n_{p} (x) + 1

, whose i-th entry is the distance between positions of the i-th and

(i + 1)

-st 1 in the string

(1, 1_{p} (x), 1)

, and

n_{p} (x)

is the number of 1s in

1_{p} (x)

. In this paper, we prove that the VT syndrome of

1_{p} (x)

plays the same role. Specifically, for each

x \in Σ_{q}^{n}

, let

\begin{matrix} a_{0} (x) = \sum_{i = 1}^{n} 1_{p} {(x)}_{i} \end{matrix}

(4)

and

\begin{matrix} a_{1} (x) = \sum_{i = 1}^{n} i \cdot 1_{p} {(x)}_{i} \end{matrix}

(5)

where

1_{p} (x)

is the indicator vector of

x

with respect to

p

, as defined in Definition 1. Then, we have the following lemma.

Lemma 4.

Suppose

x \in Σ_{q}^{n}

is

(p, δ)

-dense. For any

t^{'} \in [t]

and any

y \in B_{t^{'}} (x)

, given

a_{0} (x) (\mod 4)

and

a_{1} (x) (\mod 2 n)

, one can find, in time

O (n)

, an interval

L \subseteq [n]

of length

| L | \leq 3 δ

, such that

y = x_{[n] ∖ D}

for some interval

D \subseteq L

of size

| D | = t^{'} = | x | - | y |

(in fact, we can require that the length of L is at most δ. However, the proof for

| L | \leq δ

needs more careful discussions).

Proof.

Let

a_{0} (x) = m

and

a_{0} (y) = m^{'}

. Then, by Remark 3,

x

and

y

can be written as the following form:

x = x_{0} 0^{t} 1^{t} x_{1} 0^{t} 1^{t} x_{2} \dots 0^{t} 1^{t} x_{m - 1} 0^{t} 1^{t} x_{m}

and

y = y_{0} 0^{t} 1^{t} y_{1} 0^{t} 1^{t} y_{2} \dots 0^{t} 1^{t} y_{m^{'} - 1} 0^{t} 1^{t} y_{m^{'}}

where

x_{i}

and

y_{j}

do not contain

p = 0^{t} 1^{t}

for each

i \in [0, m]

and

j \in [0, m^{'}]

. We denote

u_{i} = | y_{0} 0^{t} 1^{t} y_{1} 0^{t} 1^{t} \dots y_{i - 1} 0^{t} 1^{t} |, \forall i \in [1, m^{'}]

and

v_{i} = | y_{0} 0^{t} 1^{t} y_{1} 0^{t} 1^{t} \dots y_{i} |, \forall i \in [0, m^{'}] .

Additionally, let

u_{0} = 0

. Clearly, for each

i \in [0, m^{'}]

, we have

u_{i} \leq v_{i}

and

y_{i} = y_{[u_{i} + 1, v_{i}]}

. Moreover, for each

i \in [0, m^{'} - 1]

, each

j_{i} \in [u_{i}, v_{i}]

and

j_{i + 1} \in [u_{i + 1}, v_{i + 1}]

, we have

\begin{matrix} j_{i + 1} - j_{i} \geq u_{i + 1} - v_{i} \geq 2 t . \end{matrix}

(6)

Note that a burst of

t^{'} \leq t

deletions may destroy at most two occurrences of the pattern or create at most one new occurrence of

p

, so

Δ_{0} ≜ m - m^{'} \in {- 1, 0, 1, 2}

and

Δ_{0}

can be computed from

a_{0} (x) - a_{0} (y) (\mod 4)

, because these four values are pairwise distinct modulo 4. We need to consider the following four cases according to

Δ_{0}

.
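The range of the pattern-count difference can be confirmed by exhaustive search for small parameters. The sketch below is an illustrative check with hypothetical function names; it enumerates all sequences of a given length (not only dense ones) and all bursts of at most t deletions.

```python
from itertools import product

def count_pattern(x, p):
    """a_0(x): number of occurrences of p in x."""
    d = len(p)
    return sum(1 for i in range(len(x) - d + 1) if x[i:i + d] == p)

def delta0_values(n, q, t):
    """All values of a_0(x) - a_0(y) over x in Sigma_q^n and all y obtained
    from x by a burst of t' deletions, t' in [1, t], with p = 0^t 1^t."""
    p = (0,) * t + (1,) * t
    seen = set()
    for x in product(range(q), repeat=n):
        m = count_pattern(x, p)
        for tp in range(1, t + 1):
            for i in range(n - tp + 1):
                y = x[:i] + x[i + tp:]
                seen.add(m - count_pattern(y, p))
    return seen

vals = delta0_values(10, 2, 2)
print(sorted(vals))
```

Since every observed difference lies in {-1, 0, 1, 2}, the residue modulo 4 identifies it, which is why storing the pattern count modulo 4 is enough.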

Case 1:

Δ_{0} = 2

. Then,

m^{'} = m - 2

and there is an

i_{d} \in [0, m^{'}]

such that

| x_{i_{d} + 1} | \leq t^{'} - 2

and

y

can be obtained from

x

by deleting a substring

1^{t_{1}} x_{i_{d} + 1} 0^{t_{0}}

for some

t_{0}, t_{1} > 0

, such that

| x_{i_{d} + 1} | + t_{0} + t_{1} = t^{'}

. More specifically,

y_{i_{d}} = x_{i_{d}} 0^{t} 1^{t - t_{1}} 0^{t - t_{0}} 1^{t} x_{i_{d} + 2}

. Clearly, we have

2 \leq t^{'} \leq t

and

x_{[u_{i_{d}} + 1, v_{i_{d}} + t^{'}]} = x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1} 0^{t} 1^{t} x_{i_{d} + 2}

. It is sufficient to let

L = [u_{i_{d}} + 1, v_{i_{d}} + t^{'}]

, but we still need to find

i_{d}

.

Consider

1_{p} (x)

and

1_{p} (y)

. By Definition 1,

1_{p} (x)

can be obtained from

1_{p} (y)

by

t^{'}

insertions and two substitutions in the substring

1_{p} {(y)}_{[u_{i_{d}} + 1, v_{i_{d}}]}

: inserting

t^{'}

0s and substituting two 0s by two 1s. Then, by (5), we can obtain

\begin{matrix} a_{1} (x) = a_{1} (y) + λ_{1} (i_{d}) + λ_{2} (i_{d}) + (m^{'} - i_{d}) t^{'} \end{matrix}

(7)

where

λ_{1} (i_{d}), λ_{2} (i_{d}) \in [u_{i_{d}} + 1, v_{i_{d}} + t^{'}]

are the locations of the two substitutions. To find

i_{d}

, we define a function

ξ_{2}

as follows: for every

i \in [0, m^{'}]

, let

ξ_{2} (i) = a_{1} (y) + 2 (u_{i} + 1) + (m^{'} - i) t^{'} .

Then, for each

i \in [0, m^{'} - 1]

, we can obtain

ξ_{2} (i + 1) - ξ_{2} (i) = 2 (u_{i + 1} - u_{i}) - t^{'} \geq 4 t - t^{'} > 0

, where the first inequality comes from (6). So, for each

i \in [0, m^{'} - 1]

, we have

\begin{matrix} a_{1} (y) < ξ_{2} (i) < ξ_{2} (i + 1) \leq ξ_{2} (m^{'}) < a_{1} (y) + 2 n, \end{matrix}

(8)

where the last inequality comes from the simple observation that

ξ_{2} (m^{'}) = a_{1} (y) + 2 (u_{m^{'}} + 1) < a_{1} (y) + 2 n

.

By definition of

ξ_{2}

and

a_{1}

, we can obtain

\begin{aligned} ξ_{2} (i_{d} + 1) - a_{1} (x) & = 2 (u_{i_{d} + 1} + 1) - λ_{1} (i_{d}) - λ_{2} (i_{d}) - t^{'} \\ & \overset{(i)}{\geq} 2 u_{i_{d} + 1} + 2 - 2 (v_{i_{d}} + t^{'}) - t^{'} \\ & \overset{(ii)}{\geq} 4 t + 2 - 3 t^{'} \\ & > 0 \end{aligned}

where (i) holds because

λ_{1} (i_{d}), λ_{2} (i_{d}) \in [u_{i_{d}} + 1, v_{i_{d}} + t^{'}]

, and (ii) is obtained from (6). On the other hand, by (7),

a_{1} (x) - ξ_{2} (i_{d}) = λ_{1} (i_{d}) + λ_{2} (i_{d}) - 2 (u_{i_{d}} + 1) \geq 0

(noticing that

λ_{1} (i_{d}), λ_{2} (i_{d}) \in [u_{i_{d}} + 1, v_{i_{d}} + t^{'}])

. Hence, we can obtain

\begin{matrix} ξ_{2} (i_{d}) \leq a_{1} (x) < ξ_{2} (i_{d} + 1) . \end{matrix}

(9)

By (8) and (9),

i_{d}

and L can be found as follows: Compute

μ ≜ a_{1} (x) (\mod 2 n) - a_{1} (y) (\mod 2 n)

and

μ_{i} ≜ ξ_{2} (i) (\mod 2 n) - a_{1} (y) (\mod 2 n)

for i from 0 to

m^{'}

. Then, we can find an

i_{d} \in [0, m^{'}]

such that

μ_{i_{d}} \leq μ < μ_{i_{d} + 1}

, where

μ_{m^{'} + 1} = 2 n

. Let

L = [u_{i_{d}} + 1, v_{i_{d}} + t^{'}]

. Note that

x_{[u_{i_{d}} + 1, v_{i_{d}} + t^{'}]} = x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1} 0^{t} 1^{t} x_{i_{d} + 2}

and

x

is

(p, δ)

-dense, so by Remark 3, the length of L satisfies

| L | = | x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1} 0^{t} 1^{t} x_{i_{d} + 2} | \leq 3 (δ + 1 - 4 t) + 4 t \leq 3 δ

, where the last inequality holds because

2 \leq t^{'} \leq t

.

Case 2:

Δ_{0} = 1

. Then,

m^{'} = m - 1

and, similarly to Case 1, there is an

i_{d} \in [0, m^{'}]

such that

y_{i_{d}}

can be obtained from

x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1}

by deleting

t^{'}

symbols and the pattern

0^{t} 1^{t}

is destroyed. Clearly,

x_{[u_{i_{d}} + 1, v_{i_{d}} + t^{'}]} = x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1}

, and it is sufficient to let

L = [u_{i_{d}} + 1, v_{i_{d}} + t^{'}]

. To find

i_{d}

, consider

1_{p} (y)

and

1_{p} (x)

. By Definition 1,

1_{p} (x)

can be obtained from

1_{p} (y)

by

t^{'}

insertions and one substitution in the substring

1_{p} {(y)}_{[u_{i_{d}} + 1, v_{i_{d}}]}

: inserting

t^{'}

0s and substituting a 0 by a 1. By (5), we can obtain

\begin{matrix} a_{1} (x) = a_{1} (y) + λ (i_{d}) + (m^{'} - i_{d}) t^{'} \end{matrix}

(10)

where

λ (i_{d}) \in [u_{i_{d}} + 1, v_{i_{d}} + t^{'}]

is the location of the substitution. For every

i \in [0, m^{'}]

, let

ξ_{1} (i) = a_{1} (y) + (u_{i} + 1) + (m^{'} - i) t^{'} .

Then, for each

i \in [0, m^{'} - 1]

, we have

ξ_{1} (i + 1) - ξ_{1} (i) = u_{i + 1} - u_{i} - t^{'} \geq 2 t - t^{'} > 0

, and so we can further obtain

\begin{matrix} a_{1} (y) < ξ_{1} (i) < ξ_{1} (i + 1) \leq ξ_{1} (m^{'}) \leq a_{1} (y) + n . \end{matrix}

(11)

By definition of

ξ_{1}

and

a_{1}

, we can obtain

ξ_{1} (i_{d} + 1) - a_{1} (x) = u_{i_{d} + 1} + 1 - λ (i_{d}) - t^{'} > u_{i_{d} + 1} + 1 - (v_{i_{d}} + t^{'}) - t^{'} \geq 2 t + 1 - 2 t^{'} > 0

. On the other hand, by (10),

a_{1} (x) - ξ_{1} (i_{d}) = λ (i_{d}) - (u_{i_{d}} + 1) \geq 0

. Hence, we can obtain

\begin{matrix} ξ_{1} (i_{d}) \leq a_{1} (x) < ξ_{1} (i_{d} + 1) . \end{matrix}

(12)

By (11) and (12), L can be found as follows: Compute

μ ≜ a_{1} (x) (\mod 2 n) - a_{1} (y) (\mod 2 n)

and

μ_{i} ≜ ξ_{1} (i) (\mod 2 n) - a_{1} (y) (\mod 2 n)

for i from 0 to

m^{'}

. Let

i_{d} \in [0, m^{'}]

be such that

μ_{i_{d}} \leq μ < μ_{i_{d} + 1}

, where

μ_{m^{'} + 1} = 2 n

. Then, let

L = [u_{i_{d}} + 1, v_{i_{d}} + t^{'}]

. Note that

x_{[u_{i_{d}} + 1, v_{i_{d}} + t^{'}]} = x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1}

and

x

is

(p, δ)

-dense, so by Remark 3,

| L | = | x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1} | \leq 2 (δ + 1 - 4 t) + 2 t < 2 δ

.
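The search for the index in Cases 1 and 2 amounts to locating μ in an increasing list of thresholds. The following Python sketch illustrates this under stated assumptions: the name `locate_block` and its argument layout are hypothetical, the thresholds ξ(i) are passed in precomputed, and the differences are reduced modulo 2n (an assumption made here so that residues compare correctly; by (11) the true offsets lie in (0, n]).

```python
from bisect import bisect_right

def locate_block(a1_x_mod, a1_y, xi, n):
    """Return i_d with xi[i_d] <= a1(x) < xi[i_d + 1], given only a1(x) mod 2n.

    Assumes xi is strictly increasing and a1_y < xi[i] <= a1_y + n, as in (11).
    The sentinel mu_{m'+1} = 2n is implicit: values above the last threshold
    map to the last index.
    """
    mod = 2 * n
    mu = (a1_x_mod - a1_y) % mod
    mus = [(v - a1_y) % mod for v in xi]
    i_d = bisect_right(mus, mu) - 1
    if i_d < 0:
        raise ValueError("mu lies below every threshold")
    return i_d
```

For instance, with n = 20, a_1(y) = 35, thresholds 38, 43, 50 and a_1(x) = 45, the call `locate_block(45 % 40, 35, [38, 43, 50], 20)` selects index 1, since 43 ≤ 45 < 50.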

Case 3:

Δ_{0} = 0

. Then,

m^{'} = m

. For every

i \in [0, m]

, let

ξ_{0} (i) = a_{1} (y) + (m - i) t^{'} .

Note that

x

contains m copies of

0^{t} 1^{t}

, so we have

n \geq 2 t m > m t^{'}

. Therefore, for each

i \in [0, m - 1]

, we can obtain

\begin{matrix} a_{1} (y) + n > a_{1} (y) + m t^{'} \geq ξ_{0} (i) > ξ_{0} (i + 1) \geq a_{1} (y) . \end{matrix}

(13)

As

Δ_{0} = 0

, there are two ways to obtain

y

from

x

:

1): There is an $i_{d} \in [0, m]$ such that $y_{i_{d}}$ can be obtained from $x_{i_{d}}$ by a burst of $t^{'}$ deletions. Correspondingly, by Definition 1, $1_{p} (x)$ can be obtained from $1_{p} (y)$ by inserting $t^{'}$ 0s into $1_{p} {(y)}_{[u_{i_{d}} + 1, v_{i_{d}}]}$ . Therefore, we have

$\begin{matrix} a_{1} (x) = a_{1} (y) + (m - i_{d}) t^{'} = ξ_{0} (i_{d}) . \end{matrix}$

(14)
2): There is an $i_{d} \in [0, m - 1]$ such that $x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1} = y_{i_{d}} 0^{t + t_{0}} 1^{t + t_{1}} y_{i_{d} + 1}$ for some $t_{0}, t_{1} \in [1, t^{'} - 1]$ with $t_{0} + t_{1} = t^{'}$ , and $y_{i_{d}} 0^{t} 1^{t} y_{i_{d} + 1}$ is obtained from $x_{i_{d}} 0^{t} 1^{t} x_{i_{d} + 1}$ by deleting the substring $0^{t_{0}} 1^{t_{1}}$ . By Definition 1, $1_{p} (x)$ can be obtained from $1_{p} (y)$ by inserting $t_{0}$ 0s in $1_{p} {(y)}_{[u_{i_{d}} + 1, v_{i_{d}}]}$ and $t_{1}$ 0s in $1_{p} {(y)}_{[v_{i_{d}} + 2, v_{i_{d}} + 2 t]}$ . Therefore, we have

$a_{1} (x) = a_{1} (y) + t_{0} + (m - i_{d} - 1) t^{'} .$

By definition of $ξ_{0}$ , we have $ξ_{0} (i_{d}) - a_{1} (x) = t^{'} - t_{0} > 0$ and $a_{1} (x) - ξ_{0} (i_{d} + 1) = t_{0} > 0$ . So, we can obtain

$\begin{matrix} ξ_{0} (i_{d}) > a_{1} (x) > ξ_{0} (i_{d} + 1) \end{matrix}$

(15)

For both cases, if

i_{d} \in [0, m - 1]

, then we can let

L = [u_{i_{d}} + 1, v_{i_{d}} + 2 t + t^{'}]

; if

i_{d} = m

, then we can let

L = [u_{m} + 1, n]

. Note that

x_{[u_{i_{d}} + 1, v_{i_{d}} + 2 t + t^{'}]} = x_{i_{d}} 0^{t} 1^{t}

and

x_{[u_{m} + 1, n]} = x_{m}

, and since

x

is

(p, δ)

-dense, by Remark 3 we have

| L | = | x_{i_{d}} 0^{t} 1^{t} | \leq 2 δ

or

| L | = | x_{m} | \leq 2 δ

. Moreover, by (13), (14), and (15),

i_{d}

(and so

L)

can be found as follows: Compute

μ ≜ a_{1} (x) (\mod 2 n) - a_{1} (y) (\mod 2 n)

and

μ_{i} ≜ ξ_{0} (i) (\mod 2 n) - a_{1} (y) (\mod 2 n)

for i from 0 to m. Then, we can always find an

i_{d} \in [0, m]

, such that

μ_{i_{d}} \geq μ > μ_{i_{d} + 1}

, which is what we want.

Case 4:

Δ_{0} = - 1

. Then,

m^{'} = m + 1

and there is an

i_{d} \in [0, m^{'} - 1]

such that

x_{i_{d}} = y_{i_{d}} 0^{t_{0}} s 0^{t - t_{0}} 1^{t} y_{i_{d} + 1}

or

x_{i_{d}} = y_{i_{d}} 0^{t} 1^{t_{1}} s 1^{t - t_{1}} y_{i_{d} + 1}

, where

t_{0} \in [1, t]

,

t_{1} \in [1, t - 1]

and

s \in Σ_{q}^{t^{'}}

, and

y

can be obtained from

x

by deleting

s

. In this case, we can let

L = [v_{i_{d}} + 1, v_{i_{d}} + 2 t + t^{'}]

, and can obtain

| L | = 2 t + t^{'} < δ

. To find

i_{d}

, we consider

1_{p} (x)

and

1_{p} (y)

. By Definition 1,

1_{p} (x)

can be obtained from

1_{p} (y)

by inserting

t^{'}

0s into

1_{p} {(y)}_{[v_{i_{d}} + 1, v_{i_{d}} + 2 t]}

and substituting

1_{p} {(y)}_{v_{i_{d}} + 1} = 1

by a 0. Therefore, we have

\begin{matrix} a_{1} (x) = a_{1} (y) - (v_{i_{d}} + 1) + (m^{'} - 1 - i_{d}) t^{'} . \end{matrix}

(16)

For every

i \in [0, m^{'} - 1]

, let

ξ_{- 1} (i) = a_{1} (y) - (v_{i} + 1) + (m^{'} - 1 - i) t^{'} .

Then, for each

i \in [0, m^{'} - 2]

, we have

ξ_{- 1} (i) - ξ_{- 1} (i + 1) = v_{i + 1} - v_{i} - t^{'} > 0

, where the inequality is obtained from (6). Moreover, we have

ξ_{- 1} (0) = a_{1} (y) - 1 + (m^{'} - 1) t^{'} < a_{1} (y) + 2 t m^{'} < a_{1} (y) + n

and

ξ_{- 1} (m^{'} - 1) = a_{1} (y) - (v_{m^{'} - 1} + 1) > a_{1} (y) - n

. So, for each

i \in [0, m^{'} - 2]

, we can obtain

\begin{matrix} a_{1} (y) + n > ξ_{- 1} (i) > ξ_{- 1} (i + 1) > a_{1} (y) - n . \end{matrix}

(17)

By (16), and by the definition of

ξ_{- 1}

, we have

a_{1} (x) = ξ_{- 1} (i_{d})

. So, by (17),

i_{d}

(and so

L)

can be found by the following process: For i from 0 to

m^{'} - 1

, compute

ξ_{- 1} (i)

. Then, we can always find an

i_{d} \in [0, m^{'} - 1]

such that

ξ_{- 1} (i_{d}) (\mod 2 n) = a_{1} (x) (\mod 2 n)

, which is what we want.

Thus, one can always find the expected interval

L \subseteq [n]

. From the above discussions, it is easy to see that the time complexity for finding such L is

O (n)

. □

In the rest of this section, we will use the so-called sequence replacement (SR) technique to construct q-ary

(p, δ)

-dense strings with only one symbol of redundancy for

δ = 2 t q^{2 t} ⌈ \log n ⌉

. The SR technique, which has been widely used in the literature (e.g., see [19,34,35,36]), is an efficient method for constructing strings with or without some constraints on their substrings. In this paper, to apply the SR technique to construct

(p, δ)

-dense strings, each length-

δ

string that does not contain

p

needs to be compressed to a shorter sequence, which can be realized by the following lemma.

Lemma 5.

Let

δ = 2 t q^{2 t} ⌈ \log n ⌉

and

S \subseteq Σ_{q}^{δ}

be the set of all sequences of length δ that do not contain

p = 0^{t} 1^{t}

. For

n \geq q^{\frac{6 t + 3 - \log_{q} e}{0.4}}

, there exists an invertible function

g : S \to Σ_{q}^{δ - ⌈ \log_{q} n ⌉ - 6 t - 2}

such that g and

g^{- 1}

are computable in time

O (δ)

.

Proof.

The proof follows the same idea as Proposition 1 of [19] and is reproduced here for completeness.

As each

s \in S

has length

δ = 2 t q^{2 t} ⌈ \log n ⌉

and does not contain

p

,

S

can be viewed as a subset of

{(Σ_{q}^{2 t} ∖ {p})}^{q^{2 t} ⌈ \log n ⌉}

, and we have

\begin{matrix} \log_{q} | S | & \leq \log_{q} {(q^{2 t} - 1)}^{q^{2 t} ⌈ \log n ⌉} \\ & = (2 t) q^{2 t} ⌈ \log n ⌉ + ⌈ \log n ⌉ \log_{q} {(1 - \frac{1}{q^{2 t}})}^{q^{2 t}} \\ & \overset{(i)}{\leq} (2 t) q^{2 t} ⌈ \log n ⌉ + (\log n + 1) \log_{q} (\frac{1}{e}) \\ & = δ - \log_{q} n \cdot \log e - \log_{q} e \\ & \leq δ - 1.4 \log_{q} n - \log_{q} e \\ & \overset{(ii)}{\leq} δ - ⌈ \log_{q} n ⌉ - 6 t - 2, \end{matrix}

where (i) comes from the fact that

{(1 - \frac{1}{x})}^{x} < \frac{1}{e}

for

x \geq 1

, and (ii) holds when

0.4 \log_{q} n + \log_{q} e \geq 6 t + 3

, i.e.,

n \geq q^{\frac{6 t + 3 - \log_{q} e}{0.4}}

. Thus, each sequence in

S

can be represented by a q-ary sequence of length

δ - ⌈ \log_{q} n ⌉ - 6 t - 2

, which gives an invertible function

g : S \to Σ_{q}^{δ - ⌈ \log_{q} n ⌉ - 6 t - 2}

.

The computation of g and

g^{- 1}

involves the conversion of integers in

[0, {(q^{2 t} - 1)}^{q^{2 t} ⌈ \log n ⌉} - 1]

between

(q^{2 t} - 1)

-base representation and q-base representation, so it has time complexity

O (2 t q^{2 t} ⌈ \log n ⌉) = O (δ)

. □
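The base-conversion idea behind g can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `compress` and `decompress` are hypothetical, the input length is assumed to be a multiple of 2t, and the output length `out_len` is any length with q^out_len ≥ (q^{2t} − 1)^K for K blocks. Python's arbitrary-precision integers are used for brevity, so the sketch says nothing about the O(δ) running time.

```python
def digits_to_int(digits, base):
    """Interpret a list of digits (most significant first) in the given base."""
    value = 0
    for d in digits:
        value = value * base + d
    return value

def int_to_digits(value, base, length):
    """Write value with exactly `length` digits in the given base."""
    out = []
    for _ in range(length):
        value, d = divmod(value, base)
        out.append(d)
    if value:
        raise ValueError("length too small for this value")
    return out[::-1]

def compress(s, q, t, out_len):
    """Map a q-ary string whose aligned 2t-blocks all differ from p = 0^t 1^t
    to a q-ary string of length out_len."""
    w = 2 * t
    p_val = digits_to_int([0] * t + [1] * t, q)
    ranks = []
    for k in range(0, len(s), w):
        v = digits_to_int(s[k:k + w], q)
        if v == p_val:
            raise ValueError("an aligned block equals p")
        ranks.append(v if v < p_val else v - 1)  # skip the value of p
    return int_to_digits(digits_to_int(ranks, q ** w - 1), q, out_len)

def decompress(c, q, t, num_blocks):
    """Inverse of compress for an input that had num_blocks blocks."""
    w = 2 * t
    p_val = digits_to_int([0] * t + [1] * t, q)
    ranks = int_to_digits(digits_to_int(c, q), q ** w - 1, num_blocks)
    s = []
    for r in ranks:
        s.extend(int_to_digits(r if r < p_val else r + 1, q, w))
    return s
```

For q = 2 and t = 1, four blocks take at most 3^4 = 81 values, so seven bits suffice for an eight-bit input; the saving grows with the number of blocks, which is what the bound on |S| quantifies.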

In the rest of this section, we will always let

δ = 2 t q^{2 t} ⌈ \log n ⌉ .

As we are interested in large n, we will always assume that

n \geq q^{\frac{6 t + 3 - \log_{q} e}{0.4}}

. The following lemma gives a function for encoding q-ary strings to

(p, δ)

-dense strings.

Lemma 6.

There exists an invertible function, denoted by

EncDen : Σ_{q}^{n - 1} \to Σ_{q}^{n}

, such that for every

u \in Σ_{q}^{n - 1}

,

x = EncDen (u)

is

(p, δ)

-dense. Both

EncDen

and its inverse, denoted by

DecDen

, are computable in

O (n \log n)

time.

Proof.

Let g be the function constructed in Lemma 5. The functions

EncDen

and

DecDen

are described by Algorithms 1 and 2, respectively, where each integer

i \in [n]

is also viewed as a q-ary string of length

⌈ \log_{q} n ⌉

which is the q-base representation of i.

The correctness of Algorithm 1 can be proved as follows:

1): In the initialization step, if $\tilde{u} = u_{[n - δ + 2 t, n - 1]}$ contains $p$ then, clearly, $x$ has length n. If $\tilde{u} = u_{[n - δ + 2 t, n - 1]}$ does not contain $p$ , then

$x = (u_{[1, n^{'}]}, p, p, g ((\tilde{u}, 0^{2 t})), 0^{⌈ \log_{q} n ⌉ + 3})$

and so $| x | = n^{'} + 4 t + | g ((\tilde{u}, 0^{2 t})) | + ⌈ \log_{q} n ⌉ + 3 = n$ , where $n^{'} = n - δ + 2 t - 1$ , and by Lemma 5, $| g ((\tilde{u}, 0^{2 t})) | = δ - ⌈ \log_{q} n ⌉ - 6 t - 2$ . So, at the end of the initialization step, $x$ has length n. Moreover, $x_{[n^{'} + 1, n^{'} + 2 t]} = p$ , and the substring $x_{[n^{'} + 2 t + 1, n]}$ has length $\leq δ - 4 t + 1$ .
2): In each round of the replacement step, if $\tilde{x} ≜ x_{[i, i + δ - 1]}$ does not contain $p$ for some $i \in [1, n^{'} - δ + 1]$ , then by Lemma 5, $| (p, p, i, g (\tilde{x}), 0, 1^{2 t}, 0) | = δ = | x_{[i, i + δ - 1]} |$ , so the length of the appended string equals the length of the deleted substring, and hence the length of $x$ remains unchanged.
3): At the beginning of each round of the replacement step, we have $x_{[n^{'} + 1, n^{'} + 2 t]} = p$ , so for $i \in [n^{'} + 2 t - δ + 1, n^{'}]$ , the substring $x_{[i, i + δ - 1]}$ contains $p$ . Equivalently, if $\tilde{x} ≜ x_{[i, i + δ - 1]}$ does not contain $p$ for some $i \in [n^{'} - δ + 2, n^{'}]$ , then it must be that $i \in [n^{'} - δ + 2, n^{'} + 2 t - δ]$ . In this case, $| (p, p, i, g ((x_{[i, n^{'}]}, 0^{ℓ})), 0, 1^{2 t - ℓ}, 0) | = δ - ℓ = | x_{[i, n^{'}]} |$ , so the length of the appended string equals the length of the deleted substring, and hence the length of $x$ remains unchanged.
4): By 1), 2) and 3), the substring $x_{[n^{'} + 1, n - δ + 1]}$ is always of the form $p u p p v \dots p p w$ , where all substrings $u, v, \dots, w$ have a length not greater than $δ + 1 - 4 t$ . So, by Remark 3, for each $i \in [n^{'} + 1, n - δ + 1]$ , the substring $x_{[i, i + δ - 1]}$ contains $p$ .
5): At the end of each round of the replacement step, the value of $n^{'}$ strictly decreases, so the While loop ends after at most n rounds, and at that point, for each $i \in [1, n^{'}]$ , the substring $x_{[i, i + δ - 1]}$ contains $p$ , which, combined with 4), implies that $x$ is $(p, δ)$ -dense.
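The length bookkeeping in steps 1) to 3) can be checked numerically. The sketch below is only an illustration: the helper names are hypothetical, log is assumed to be base 2, and the index i is assumed to occupy ⌈log_q n⌉ symbols, as the length identities require.

```python
def ceil_log(n, base):
    """Smallest k with base**k >= n, computed with exact integer arithmetic."""
    k, power = 0, 1
    while power < n:
        power *= base
        k += 1
    return k

def check_lengths(n, q, t):
    """Verify the length identities of steps 1)-3) for given n, q, t."""
    delta = 2 * t * q ** (2 * t) * ceil_log(n, 2)
    idx_len = ceil_log(n, q)               # length of the index i as a q-ary string
    g_len = delta - idx_len - 6 * t - 2    # output length of g (Lemma 5)
    n_prime = n - delta + 2 * t - 1
    # Step 1): prefix, (p, p), g-image and trailing zeros add up to n.
    assert n_prime + 4 * t + g_len + idx_len + 3 == n
    # Steps 2) and 3): the appended string (p, p, i, g(.), 0, 1^{2t - ell}, 0)
    # has length delta - ell; ell = 0 corresponds to step 2).
    for ell in range(0, 2 * t):
        appended = 4 * t + idx_len + g_len + 1 + (2 * t - ell) + 1
        assert appended == delta - ell
    return delta, g_len
```

For n = 10^6, q = 2 and t = 2 this gives δ = 1280 and a compressed length of 1246 symbols.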

The correctness of Algorithm 2 can be easily seen from Algorithm 1, so

DecDen

is the inverse of

EncDen

.

Note that Algorithms 1 and 2 have at most n rounds of replacement, and in each round,

g

(resp.

g^{- 1})

needs to be computed, which has time complexity

O (δ) = O (\log n)

by Lemma 5, so the total time complexity of Algorithms 1 and 2 is

O (n \log n)

. □
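The property established in step 5), namely that every window of length δ contains p, can be tested directly. The following Python sketch is a brute-force checker for illustration only (the function names are hypothetical, and it runs in quadratic time rather than the linear time one would want in practice).

```python
def contains(window, p):
    """True if the list p occurs as a contiguous substring of window."""
    m = len(p)
    return any(window[k:k + m] == p for k in range(len(window) - m + 1))

def is_dense(x, p, delta):
    """True if every length-delta window of x contains p.

    If len(x) < delta there are no windows and the result is vacuously True.
    """
    return all(contains(x[i:i + delta], p) for i in range(len(x) - delta + 1))
```

Such a checker is convenient for validating an implementation of EncDen on small parameters before trusting it on long inputs.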

Algorithm 1: The function

EncDen

for encoding to

(p, δ)

-dense sequence

Input: $u \in Σ_{q}^{n - 1}$
Output: $x = EncDen (u) \in Σ_{q}^{n}$ such that $x$ is $(p, δ)$ -dense
(Initialization Step:)
Let $\tilde{u} = u_{[n - δ + 2 t, n - 1]}$ .
if $\tilde{u}$ contains $p$ then
    let $n^{'}$ be the smallest $i \in [n - δ + 2 t - 1, n - 2]$ such that $u_{[i + 1, i + 2 t]} = p$ , and let $x = (u, 1)$ ;
else
    let $n^{'} = n - δ + 2 t - 1$ and $x = (u_{[1, n^{'}]}, p, p, g ((\tilde{u}, 0^{2 t})), 0^{⌈ \log_{q} n ⌉ + 3})$ .
end if
(Replacement Step:)
while there exists an $i \in [1, n^{'}]$ such that $\tilde{x} ≜ x_{[i, i + δ - 1]}$ does not contain $p$ do
    if $i \in [1, n^{'} - δ + 1]$ then
        delete $x_{[i, i + δ - 1]}$ from $x$ and append $(p, p, i, g (\tilde{x}), 0, 1^{2 t}, 0)$ to $x$ ; let $n^{'} = n^{'} - δ$ .
    end if
    if $i \in [n^{'} - δ + 2, n^{'}]$ then
        delete $x_{[i, n^{'}]}$ from $x$ and append $(p, p, i, g ((x_{[i, n^{'}]}, 0^{ℓ})), 0, 1^{2 t - ℓ}, 0)$ to $x$ , where $ℓ ≜ δ - | x_{[i, n^{'}]} |$ satisfies $1 \leq ℓ \leq 2 t - 1$ ; let $n^{'} = i - 1$ .
    end if
end while
Return $x = EncDen (u)$ .

Algorithm 2: The function

DecDen

for decoding of

(p, δ)

-dense sequence

Input: $x = EncDen (u) \in Σ_{q}^{n}$
Output: $u \in Σ_{q}^{n - 1}$
while $x_{[n - ℓ^{'} - 1, n]} = 01^{ℓ^{'}} 0$ for some $ℓ^{'} \in [1, 2 t]$ do
    let $\tilde{u}$ be obtained from $g^{- 1} (x_{[n - δ + 6 t - ℓ^{'} + 1 + ⌈ \log_{q} n ⌉, n - ℓ^{'} - 2]})$ by deleting the last $2 t - ℓ^{'}$ symbols; delete the last $δ + ℓ^{'} - 2 t$ symbols of $x$ and insert $\tilde{u}$ at the position i of $x$ such that $i = x_{[n - δ + 6 t - ℓ^{'} + 1, n - δ + 6 t - ℓ^{'} + ⌈ \log_{q} n ⌉]}$ .
end while
if $x_{n} = x_{n - 1} = 0$ then
    let $\tilde{u}$ be obtained from $g^{- 1} (x_{[n - δ + 6 t, n - ⌈ \log_{q} n ⌉ - 3]})$ by deleting the last $2 t$ 0s and let $u = (x_{[1, n - δ + 2 t - 1]}, \tilde{u})$ .
end if
if $x_{n} = 1$ then
    let $u = x_{[1, n - 1]}$ .
end if
Return $u = DecDen (x)$ .

Algorithm 1 is obtained by modifying Algorithm 2 of [19], which is designed for binary sequences; for our purpose, we need to apply it to q-ary sequences.

4. Burst-Deletion Correcting q-ary Codes

In this section, we still let

δ = 2 t q^{2 t} ⌈ \log n ⌉ .

Based on

(p, δ)

-dense sequences, for any fixed positive integers t and

q \geq 2

, we will construct a family of q-ary codes that can correct a burst of at most t deletions. The basic idea of our construction is as follows: For each

(p, δ)

-dense sequence

x

, use the sketches

a_{0} (x)

and

a_{1} (x)

(defined by (4) and (5), respectively) to locate the deletions within an interval of length at most

3 δ

. Then, use the functions

{\tilde{h}}_{ρ}^{(0)} (x), {\tilde{h}}_{ρ}^{(1)} (x)

constructed in Construction 1 to uniquely recover

x

.

Let

h : Σ_{q}^{m} \to Σ_{q}^{R_{m}}

be a sketch function for any positive integer m, where

R_{m}

is a positive integer depending on

m, q

and t, such that

h (z) \neq h (z^{'})

for any distinct

z, z^{'} \in Σ_{q}^{m}

with

B_{\leq t} (z) \cap B_{\leq t} (z^{'}) \neq Ø

. For each

x \in Σ_{q}^{n}

, let

\begin{matrix} f (x) = (a_{0} (x) (\mod 4), a_{1} (x) (\mod 2 n), {\tilde{h}}_{ρ}^{(0)} (x), {\tilde{h}}_{ρ}^{(1)} (x)) \end{matrix}

(18)

where

a_{0} (x)

is defined by (4),

a_{1} (x)

is defined by (5),

{\tilde{h}}_{ρ}^{(0)} (x)

and

{\tilde{h}}_{ρ}^{(1)} (x)

are obtained by Construction 1 with

ρ = 3 δ = 6 t q^{2 t} ⌈ \log n ⌉

.

Lemma 7.

Let f be the function given by (18). If

x \in Σ_{q}^{n}

is

(p, δ)

-dense, then given

f (x)

and any

y \in B_{\leq t} (x)

, one can uniquely recover

x

.

Proof.

First, since

x

is

(p, δ)

-dense, by Lemma 4, given

a_{0} (x) (\mod 4)

and

a_{1} (x) (\mod 2 n)

, one can find an interval L of length

| L | \leq 3 δ = ρ

, such that

y = x_{[n] ∖ D}

for some interval

D \subseteq L

of size

t^{'} = n - | y |

. Moreover, suppose

x^{'} \in Σ_{q}^{n} ∖ {x}

is such that

y = x_{[n] ∖ D^{'}}^{'}

for some interval

D^{'} \subseteq L

of size

t^{'} = n - | y |

. Then, by Lemma 3, we have

{\tilde{h}}_{ρ}^{(ℓ)} (x) \neq {\tilde{h}}_{ρ}^{(ℓ)} (x^{'})

for some

ℓ \in {0, 1}

. Thus, given

f (x) = (a_{0} (x) (\mod 4), a_{1} (x) (\mod 2 n), {\tilde{h}}_{ρ}^{(0)} (x), {\tilde{h}}_{ρ}^{(1)} (x))

and any

y \in B_{\leq t} (x)

, one can uniquely recover

x

. □

Using the function f, we can give an encoding function of a q-ary code that can correct a burst of at most t deletions.

Lemma 8.

Let f be the function given by (18) and

f_{q} (x)

be the q-ary representation of

f (x)

. Let

\begin{matrix} E : Σ_{q}^{n - 1} & \to Σ_{q}^{n + r} \\ u & \mapsto (x, 0^{t} 1, f_{q} (x)) \end{matrix}

(19)

where

x = EncDen (u)

and

r = t + 1 + | f_{q} (x) |

. Then, for each

z = E (u)

, given any

y \in B_{\leq t} (z)

, one can uniquely recover

x

(and so

z)

.

Proof.

Let

t^{'} = | z | - | y |

. Suppose

D = [i_{d}, i_{d} + t^{'} - 1] \subseteq [1, n + r]

is an interval such that

y = z_{[n + r] ∖ D}

. If there is more than one interval D such that

y = z_{[n + r] ∖ D}

, then we consider the D with the smallest

i_{d}

. Clearly, we have

i_{d} \in [1, n + r - t^{'} + 1]

. If

i_{d} \in [1, n + t + 1 - t^{'}]

, then

y_{n + t + 1 - t^{'}} = z_{n + t + 1} = 1

; if

i_{d} \in [n + t + 2 - t^{'}, n + r - t^{'} + 1]

, then

y_{n + t + 1 - t^{'}} = z_{n + t + 1 - t^{'}} = 0

. So, we have the following two cases.

Case 1:

y_{n + t + 1 - t^{'}} = 1

. Then,

i_{d} \in [1, n + t - t^{'} + 1]

. We further need to consider the following three subcases.

Case 1.1:

y_{[n + 1 - t^{'}, n + 1 + t - t^{'}]} = 0^{t} 1

. In this case, it must be that

D \subseteq [1, n]

. Therefore, we have

y_{[1, n - t^{'}]} \in B_{t^{'}} (x)

and

y_{[n + t + 2 - t^{'}, n + r - t^{'}]} = f_{q} (x)

. By Lemma 7,

x

can be recovered from

y_{[1, n - t^{'}]}

and

y_{[n + t + 2 - t^{'}, n + r - t^{'}]}

correctly.

Case 1.2: There is a

t^{''} \in [1, t^{'} - 1]

such that

y_{[n + 1 - t^{'} + t^{''}, n + 1 + t - t^{'}]} = 0^{t - t^{''}} 1

and

y_{n - t^{'} + t^{''}} \neq 0

. In this case, it must be that

D = [n + 1 - t^{'} + t^{''}, n + t^{''}]

. Therefore,

y_{[1, n + 1 - t^{'} + t^{''}]} = x_{[1, n + 1 - t^{'} + t^{''}]} \in B_{t^{'} - t^{''}} (x)

and

y_{[n + t + 2 - t^{'}, n + r - t^{'}]} = f_{q} (x)

. By Lemma 7,

x

can be recovered from

y_{[1, n + 1 - t^{'} + t^{''}]}

and

y_{[n + t + 2 - t^{'}, n + r - t^{'}]}

correctly.

Case 1.3:

y_{[n + 1, n + 1 + t - t^{'}]} = 0^{t - t^{'}} 1

and

y_{n} \neq 0

. In this case, it must be that

D \subseteq [n + 1, n + t]

. Therefore,

y_{[1, n]} = x

.

Case 2:

y_{n + t + 1 - t^{'}} = 0

. Then, we have

i_{d} \in [n + t + 2 - t^{'}, n + r - t^{'} + 1]

and

x = y_{[1, n]}

.

Thus, one can always uniquely recover

x

from

y

. □
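The case split in the proof rests on the fact that the symbol of y at position n + t + 1 − t' reveals whether the burst started before or after the marker 0^t 1. A brute-force Python sketch of this check follows; the function names are hypothetical, and the tail passed in the test is an arbitrary stand-in for f_q(x), not a real sketch value.

```python
def burst_deletions(z, t):
    """Yield (i_d, t_prime, y) for every burst of 1..t deletions in z.

    i_d is the 1-based start of the deleted interval [i_d, i_d + t_prime - 1].
    """
    for t_prime in range(1, t + 1):
        for i_d in range(1, len(z) - t_prime + 2):
            yield i_d, t_prime, z[:i_d - 1] + z[i_d - 1 + t_prime:]

def marker_symbol(y, n, t, total_len):
    """Return the symbol of y at (1-based) position n + t + 1 - t'."""
    t_prime = total_len - len(y)
    return y[n + t - t_prime]  # 0-based index of position n + t + 1 - t'
```

Enumerating all bursts for a short codeword confirms that the symbol is 1 exactly when i_d ≤ n + t + 1 − t', which is the dichotomy between Case 1 and Case 2.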

Theorem 1.

There exists a q-ary code of length n capable of correcting a burst of at most t deletions, which has redundancy

\log n + 8 \log \log n + o (\log \log n)

bits, encoding complexity

O (q^{7 t} n {(\log n)}^{3})

and decoding complexity

O (n \log n)

.

Proof.

In Construction 1, let h be the function

h_{syn}

given by Lemma 2 and let

C_{syn}

be the code with the encoding function

E

constructed in Lemma 8. Then,

C_{syn} \subseteq Σ_{q}^{n}

is a code capable of correcting a burst of at most t deletions.

The redundancy of

C_{syn}

is

1 + r = 1 + t + 1 + | f_{q} (x) | = t + 2 + | f_{q} (x) |

in q-ary symbols. We now evaluate

| f_{q} (x) |

. By Lemma 2, we have

R_{2 ρ} = 4 \log_{q} (2 ρ) + o (\log_{q} (2 ρ))

. Moreover, since

ρ = 6 t q^{2 t} ⌈ \log n ⌉

, then for

ℓ \in {0, 1}

, by (3) and Remark 2, the length

| {\tilde{h}}_{ρ}^{(ℓ)} (x) |

of

{\tilde{h}}_{ρ}^{(ℓ)} (x)

(as a q-ary string) satisfies

\begin{matrix} | {\tilde{h}}_{ρ}^{(ℓ)} (x) | & < R_{2 ρ} \\ & = 4 \log_{q} (12 t q^{2 t} ⌈ \log n ⌉) + o (\log_{q} (12 t q^{2 t} ⌈ \log n ⌉)) \\ & = 4 \log_{q} \log n + o (\log_{q} \log n) . \end{matrix}

So, by (4) and (5), the length

| f_{q} (x) |

of

f_{q} (x)

(as a q-ary string) satisfies

\begin{matrix} | f_{q} (x) | & = \log_{q} 4 + \log_{q} (2 n) + | {\tilde{h}}_{ρ}^{(0)} (x) | + | {\tilde{h}}_{ρ}^{(1)} (x) | \\ & \leq \log_{q} n + 8 \log_{q} \log n + o (\log_{q} \log n) . \end{matrix}

Thus, the redundancy of

C_{syn}

, measured in bits, is

(t + 2 + | f_{q} (x) |) \log q \leq \log n + 8 \log \log n + o (\log \log n)

.

Consider the encoding complexity of

C_{syn}

. Note that, by (3) and Remark 2,

{\tilde{h}}_{ρ}^{(0)} (x)

and

{\tilde{h}}_{ρ}^{(1)} (x)

are computable in time

O (n q^{t} {(2 ρ)}^{3}) = O (n q^{t} {(12 t q^{2 t} ⌈ \log n ⌉)}^{3}) = O (q^{7 t} n {(\log n)}^{3})

. Then, by (4), (5), and (18),

f (x)

is computable in time

O (q^{7 t} n {(\log n)}^{3})

. Moreover, by Lemma 6, the mapping

EncDen

is computable in

O (n \log n)

time, so by (19), the encoding complexity of

C_{syn}

is

O (q^{7 t} n {(\log n)}^{3})

.

For the decoding complexity of

C_{syn}

, by Lemma 6, the inverse of

EncDen

is computable in

O (n \log n)

time, so by the proof of Lemma 8, we only need to consider the complexity of recovering

x

from

f (x)

and any given

y \in B_{\leq t} (x)

. By Lemma 4 and 1) of Remark 1, one can locate the deletions within an interval

L_{j_{0}}

in time

O (n)

using

a_{0} (x) (\mod 4)

and

a_{1} (x) (\mod 2 n)

. Then,

x

can be recovered by brute force searching in time

O ((2 ρ q^{t}) (q^{t} {(2 ρ)}^{3})) = O (q^{10 t} {(\log n)}^{4})

. In fact, there are

| L_{j_{0}} | \cdot q^{t}

candidate sequences for

x

, and for each candidate sequence

x^{'}

, we need to verify whether

h_{syn} (x_{L_{j_{0}}}^{'}) = h_{syn} (x_{L_{j_{0}}})

, which takes time

O (q^{t} | L_{j_{0}} |^{3})

by Lemma 2. Hence, the time of brute force searching is

O ((| L_{j_{0}} | \cdot q^{t}) (q^{t} | L_{j_{0}} |^{3})) \leq O ((2 ρ q^{t}) (q^{t} {(2 ρ)}^{3})),

where the inequality holds because

| L_{j_{0}} | \leq 2 ρ

. Thus, the decoding complexity of

C_{syn}

is

O (n \log n)

. □

5. Correcting Burst-Deletion with Two Reads

In this section, we construct a family of codes correcting a burst of at most t deletions with two reads. Our construction improves on the redundancy of the construction in [31]. We first recall the concept of a period of a sequence.

Suppose

T \leq m

are two positive integers and

x \in Σ_{q}^{m}

. We say that T is a period of

x

if

x_{i} = x_{i + T}

for any

i \in [1, m - T]

.

The following two simple observations are easy to see and will be used in our construction.

Observation 1: Let

x, x^{'} \in Σ_{q}^{n}

and

t^{'} \in [t]

. Suppose

D = [i_{d}, i_{d} + t^{'} - 1], D^{'} = [i_{d}^{'}, i_{d}^{'} + t^{'} - 1] \subseteq [n]

such that

i_{d} \leq i_{d}^{'}

. Then,

x_{[n] ∖ D} = x_{[n] ∖ D^{'}}^{'}

if and only if the following holds:

\begin{matrix} x_{i}^{'} = \begin{cases} x_{i}, & \text{for } i \in [1, i_{d} - 1], \\ x_{i + t^{'}}, & \text{for } i \in [i_{d}, i_{d}^{'} - 1], \\ x_{i}, & \text{for } i \in [i_{d}^{'} + t^{'}, n] . \end{cases} \end{matrix}

Observation 2: Let

x \in Σ_{q}^{n}

and

t^{'} \in [t]

. Suppose

J \subseteq [n]

is an interval such that

| J | \geq t^{'}

and the substring

x_{J}

of

x

has period

t^{'}

. Then, for any

D = [i_{d}, i_{d} + t^{'} - 1] \subseteq J

and

D^{'} = [i_{d}^{'}, i_{d}^{'} + t^{'} - 1] \subseteq J

, we have

x_{[n] ∖ D} = x_{[n] ∖ D^{'}}

.
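The definition of a period and Observation 2 can be illustrated with a few lines of Python. This is a minimal sketch with hypothetical helper names; positions are 1-based in `delete_burst` to match the notation of the text.

```python
def has_period(x, T):
    """True if x[i] == x[i + T] for every valid i (T is a period of x)."""
    return all(x[i] == x[i + T] for i in range(len(x) - T))

def delete_burst(x, start, length):
    """Delete the symbols at 1-based positions start .. start + length - 1."""
    return x[:start - 1] + x[start - 1 + length:]
```

If the substring at positions 2 to 9 of a length-10 string has period 3, then deleting any window of three consecutive positions inside that interval yields the same result, which is exactly the statement of Observation 2 with t' = 3.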

Our construction will use

(p, δ)

-dense sequences for

p = 0^{t + 1} 1^{t + 1}

. To apply Lemma 6 to construct

(p, δ)

-dense sequences, we need to let

δ = 2 (t + 1) q^{2 (t + 1)} ⌈ \log n ⌉

. Then, by Lemma 6, there exists an invertible function

EncDen : Σ_{q}^{n - 1} \to Σ_{q}^{n}

, such that

x = EncDen (u)

is

(p, δ)

-dense for every

u \in Σ_{q}^{n - 1}

. Moreover, both

EncDen

and its inverse

DecDen

are computable in

O (n \log n)

time. Thus, in this section, we always assume that

p = 0^{t + 1} 1^{t + 1} and δ = 2 (t + 1) q^{2 (t + 1)} ⌈ \log n ⌉ .

Suppose

x \in Σ_{q}^{n}

is

(p, δ)

-dense and

J \subseteq [n]

is an interval of length

| J | \geq δ

. By Definition 1, there exists an

i_{0} \in J

such that

J_{0} ≜ [i_{0}, i_{0} + 2 (t + 1) - 1] \subseteq J

and

x_{J_{0}} = p = 0^{t + 1} 1^{t + 1}

. For any positive integer

T \leq 2 t

, let

i_{0}^{'} = i_{0} + t + 1 - T

if

T \leq t

, and let

i_{0}^{'} = i_{0}

if

T \geq t + 1

. Then,

[i_{0}^{'}, i_{0}^{'} + T] \subseteq J_{0} \subseteq J

and

x_{i_{0}^{'}} = 0 \neq 1 = x_{i_{0}^{'} + T}

. Thus, T is not a period of

x_{J}

. In other words, we have the following remark.

Remark 4.

Suppose

x \in Σ_{q}^{n}

is

(p, δ)

-dense, where

p = 0^{t + 1} 1^{t + 1}

and

δ = 2 (t + 1) q^{2 (t + 1)} ⌈ \log n ⌉

. Then, the length of any substring of

x

of period

T \leq 2 t

is at most δ.
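The choice of the witness position in the argument preceding Remark 4 can be verified mechanically. In the Python sketch below, the function name is hypothetical and the offset is 0-based within p, so it corresponds to i_0' − i_0 in the text.

```python
def period_witness(t, T):
    """Offset k within p = 0^{t+1} 1^{t+1} such that p[k] = 0 and p[k + T] = 1.

    Valid for 1 <= T <= 2t; mirrors the two-case choice of i_0' in the text.
    """
    return t + 1 - T if T <= t else 0
```

Since every such T has a witness inside p, no interval containing a copy of p can have a period T ≤ 2t.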

Lemma 9.

Suppose

x \neq x^{'} \in Σ_{q}^{n}

are both

(p, δ)

-dense. If

| B_{\leq t} (x) \cap B_{\leq t} (x^{'}) | \geq 2

, then there exists an interval

J \subseteq [n]

of length

| J | \leq δ + t

and two intervals

D, D^{'} \subseteq J

of size

| D | = | D^{'} | \leq t

, such that

x_{[n] ∖ D} = x_{[n] ∖ D^{'}}^{'}

.

Proof.

Suppose

y, y^{'} \in B_{\leq t} (x) \cap B_{\leq t} (x^{'})

and

y \neq y^{'}

. Then,

y = x_{[n] ∖ D_{1}} = x_{[n] ∖ D_{1}^{'}}^{'}

for some intervals

D_{1}, D_{1}^{'} \subseteq [n]

of size

t_{1} ≜ | D_{1} | = | D_{1}^{'} | \leq t

, and

y^{'} = x_{[n] ∖ D_{2}} = x_{[n] ∖ D_{2}^{'}}^{'}

for some intervals

D_{2}, D_{2}^{'} \subseteq [n]

of size

t_{2} ≜ | D_{2} | = | D_{2}^{'} | \leq t

. For convenience in the discussions, we denote

D_{j} = [i_{j}, i_{j} + t_{j} - 1] and D_{j}^{'} = [i_{j}^{'}, i_{j}^{'} + t_{j} - 1], j = 1, 2 .

If

| i_{j} - i_{j}^{'} | \leq δ

for some

j \in {1, 2}

, then let

D = D_{j}

,

D^{'} = D_{j}^{'}

and

J = [λ, λ^{'} + t_{j} - 1]

such that

λ = min {i_{j}, i_{j}^{'}}

and

λ^{'} = max {i_{j}, i_{j}^{'}}

. Clearly, J, D, and

D^{'}

satisfy the desired conditions. So, in the following, we assume that

| i_{j} - i_{j}^{'} | > δ

for each

j \in {1, 2}

.

By the symmetry of

x

and

x^{'} (y

and

y^{'})

, we can assume

i_{1} < i_{1}^{'} and t_{2} \leq t_{1} .

Since

y = x_{[n] ∖ D_{1}} = x_{[n] ∖ D_{1}^{'}}^{'}

, where

D_{1} = [i_{1}, i_{1} + t_{1} - 1]

,

D_{1}^{'} = [i_{1}^{'}, i_{1}^{'} + t_{1} - 1]

and

i_{1} < i_{1}^{'}

, we can obtain from Observation 1 that

\begin{matrix} x_{i}^{'} = \begin{cases} y_{i} = x_{i}, & \text{for } i \in [1, i_{1} - 1], \\ y_{i} = x_{i + t_{1}}, & \text{for } i \in [i_{1}, i_{1}^{'} - 1], \\ y_{i - t_{1}} = x_{i}, & \text{for } i \in [i_{1}^{'} + t_{1}, n] . \end{cases} \end{matrix}

(20)

Similarly, since

y^{'} = x_{[n] ∖ D_{2}} = x_{[n] ∖ D_{2}^{'}}^{'}

, where

D_{2} = [i_{2}, i_{2} + t_{2} - 1]

and

D_{2}^{'} = [i_{2}^{'}, i_{2}^{'} + t_{2} - 1]

, we have

If $i_{2} < i_{2}^{'}$ , then according to Observation 1,

$\begin{matrix} x_{i}^{'} = \begin{cases} y_{i}^{'} = x_{i}, & \text{for } i \in [1, i_{2} - 1], \\ y_{i}^{'} = x_{i + t_{2}}, & \text{for } i \in [i_{2}, i_{2}^{'} - 1], \\ y_{i - t_{2}}^{'} = x_{i}, & \text{for } i \in [i_{2}^{'} + t_{2}, n] . \end{cases} \end{matrix}$

(21)
If $i_{2} > i_{2}^{'}$ , then according to Observation 1,

$\begin{matrix} x_{i}^{'} = \begin{cases} y_{i}^{'} = x_{i}, & \text{for } i \in [1, i_{2}^{'} - 1], \\ y_{i - t_{2}}^{'} = x_{i - t_{2}}, & \text{for } i \in [i_{2}^{'} + t_{2}, i_{2} + t_{2} - 1], \\ y_{i - t_{2}}^{'} = x_{i}, & \text{for } i \in [i_{2} + t_{2}, n] . \end{cases} \end{matrix}$

(22)

Let

I_{1} = [i_{1}, i_{1}^{'} + t_{1} - 1] and I_{2} = [{\bar{i}}_{2}, {\bar{i}}_{2}^{'} + t_{2} - 1],

where

{\bar{i}}_{2} = min {i_{2}, i_{2}^{'}}

and

{\bar{i}}_{2}^{'} = max {i_{2}, i_{2}^{'}}

. Then, by (20), (21), and (22), we can easily obtain the following claim.

Claim 1: For all

i \in [n] ∖ (I_{1} \cap I_{2})

, we have

x_{i} = x_{i}^{'}

.

Since

x \neq x^{'}

, by Claim 1, we have

I_{1} \cap I_{2} \neq Ø

. If

| I_{1} \cap I_{2} | \leq t

then, clearly,

J = D = D^{'} = I_{1} \cap I_{2}

satisfies the desired conditions. So, in the following, we only need to consider

| I_{1} \cap I_{2} | > t

. We have the following two cases.

Case 1:

i_{2} < i_{2}^{'}

. Then,

min {i_{2}, i_{2}^{'}} = i_{2}

and

max {i_{2}, i_{2}^{'}} = i_{2}^{'}

, and so

I_{2} = [i_{2}, i_{2}^{'} + t_{2} - 1]

and

I_{1} \cap I_{2} = [λ, λ^{'}]

, where

λ = max {i_{1}, i_{2}}

and

λ^{'} = min {i_{1}^{'} + t_{1} - 1, i_{2}^{'} + t_{2} - 1}

. We need to further divide this case into the following two subcases.

Case 1.1:

t_{2} < t_{1}

. Let

J = I_{1} \cap I_{2}

,

D = [λ, λ + t_{2} - 1]

and

D^{'} = [λ^{'} - t_{2} + 1, λ^{'}]

. Then,

D, D^{'} \subseteq J

and

| D | = | D^{'} | \leq t

. Note that by Claim 1,

x_{i}^{'} = x_{i}

for every

i \in [n] ∖ (I_{1} \cap I_{2}) = [n] ∖ [λ, λ^{'}]

, and by (21),

x_{i}^{'} = x_{i + t_{2}}

for every

i \in [λ, λ^{'} - t_{2}]

. So, according to Observation 1, we can obtain

x_{[n] ∖ D} = x_{[n] ∖ D^{'}}^{'} .

Moreover, by (20) and (21),

x_{i} = x_{i - t_{2}}^{'} = x_{i - t_{2} + t_{1}}

for every

i \in [λ + t_{2}, λ^{'} - t_{1} + t_{2}]

. So,

x_{[λ + t_{2}, λ^{'}]}

has period

t_{1} - t_{2}

, and by Remark 4,

| [λ + t_{2}, λ^{'}] | \leq δ

. Hence, we have

| J | = | I_{1} \cap I_{2} | = | [λ, λ^{'}] | \leq δ + t_{2} \leq δ + t .

Case 1.2:

t_{2} = t_{1}

. We will prove that this case is impossible by contradiction. Without loss of generality, assume

i_{1} \leq i_{2}

. Since

| I_{1} \cap I_{2} | > t

, then

I_{1} \cap I_{2} = [i_{2}, {\tilde{i}}_{1}^{'}]

and

i_{2} \leq {\tilde{i}}_{1}^{'} - t

, where

{\tilde{i}}_{1}^{'} = min {i_{1}^{'}, i_{2}^{'}}

. By (20) and (21), we have

x_{i} = x_{i}^{'} = x_{i + t_{1}}

for every

i \in [i_{1}, i_{2} - 1]

, i.e.,

x_{[i_{1}, i_{2} + t_{1} - 1]}

has period

t_{1}

. So, we can obtain

y = x_{[n] ∖ [i_{1}, i_{1} + t_{1} - 1]} = x_{[n] ∖ [i_{2}, i_{2} + t_{1} - 1]} = y^{'}

(according to Observation 2), which contradicts the assumption that

y \neq y^{'}

.

Case 2:

i_{2} > i_{2}^{'}

. In this case,

I_{2} = [i_{2}^{'}, i_{2} + t_{2} - 1]

and we have

I_{1} \cap I_{2} = [λ, λ^{'}]

, where

λ = max {i_{1}, i_{2}^{'}}

and

λ^{'} = min {i_{1}^{'} + t_{1} - 1, i_{2} + t_{2} - 1}

. Let

J = I_{1} \cap I_{2}

,

D = [λ, λ + t_{1} - 1]

and

D^{'} = [λ^{'} - t_{1} + 1, λ^{'}]

. Then,

D, D^{'} \subseteq J

and

| D | = | D^{'} | \leq t

. Note that by Claim 1,

x_{i}^{'} = x_{i}

for every

i \in [n] ∖ (I_{1} \cap I_{2}) = [n] ∖ [λ, λ^{'}]

, and by (20),

x_{i}^{'} = x_{i + t_{1}}

for every

i \in [λ, λ^{'} - t_{1}]

. So, according to Observation 1, we can obtain

x_{[n] ∖ D} = x_{[n] ∖ D^{'}}^{'} .

Moreover, by (20) and (22), we can obtain

x_{i} = x_{i + t_{2}}^{'} = x_{i + t_{2} + t_{1}}

for every

i \in [λ, λ^{'} - t_{1} - t_{2}]

. Hence,

x_{[λ, λ^{'}]}

is a substring of

x

of period

t_{1} + t_{2}

. By Remark 4,

| [λ, λ^{'}] | \leq δ

, so

| J | = | I_{1} \cap I_{2} | = | [λ, λ^{'}] | \leq δ \leq δ + t

. □

For

n = 30

, we give three examples of

y

and

y^{'}

, satisfying the conditions of the different cases in the proof of Lemma 9.

Example 1.

Suppose

y = x_{[n] ∖ D_{1}} = x_{[n] ∖ D_{1}^{'}}^{'}

and

y^{'} = x_{[n] ∖ D_{2}} = x_{[n] ∖ D_{2}^{'}}^{'}

, where

D_{1} = {4, 5, 6, 7}

,

D_{1}^{'} = {27, 28, 29, 30}

,

D_{2} = {1, 2}

and

D_{2}^{'} = {23, 24}

. Here, we have

i_{1} = 4, i_{1}^{'} = 27

and

t_{1} = 4

;

i_{2} = 1, i_{2}^{'} = 23

and

t_{2} = 2

. Then, the conditions of Case 1.1 are satisfied, and

J = I_{1} \cap I_{2} = [4, 24]

. Figure 1a is an illustration of this case. We can easily find that

x_{i}^{'} = x_{i + 2}

for every

i \in [4, 22]

and

x_{i}^{'} = x_{i}

for every

i \in [n] ∖ J

. So, by Observation 1,

x_{[n] ∖ {4, 5}} = x_{[n] ∖ {23, 24}}^{'}

. Moreover,

x_{i} = x_{i - 2}^{'} = x_{i + 2}

for every

i \in [6, 22]

, so the substring

x_{[6, 24]}

of

x

has period 2.
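A concrete pair of sequences realizing Example 1 can be built and checked in Python. The symbol values below are an arbitrary choice made for illustration (any values meeting the stated equalities would do), and the helper `without` is hypothetical; positions are 1-based as in the text.

```python
def without(seq, positions):
    """Remove the symbols at the given 1-based positions."""
    return [s for pos, s in enumerate(seq, start=1) if pos not in positions]

# Build x of length 30 (illustrative values over the alphabet {0, 1, 2, 3}).
x = [0, 1, 0, 1, 0]                                       # x_1 .. x_5
x += [2 if pos % 2 == 0 else 3 for pos in range(6, 27)]   # x_6 .. x_26, period 2
x += [1, 0]                                               # x_27, x_28 (free)
x += [x[24], x[25]]                                       # x_29 = x_25, x_30 = x_26

# x' agrees with x outside J = [4, 24] and is shifted inside it.
x_prime = x[:3] + x[7:30] + x[26:30]

y = without(x, {4, 5, 6, 7})        # also equals x' without {27, ..., 30}
y_prime = without(x, {1, 2})        # also equals x' without {23, 24}
```

Both reads y and y' are shared by x and x', the two sequences differ, and the single burst pair D = {4, 5}, D' = {23, 24} inside J already relates them, as the example states.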

Figure 1. Illustration examples: each dot in the upper row represents a symbol of

x

, and each dot in the lower row represents a symbol of

x^{'}

, where two symbols with equal value are connected by a (solid or dashed) line segment. We can find that: (1) in (a), the substring

x_{[6, 24]}

of

x

has period 2; (2) in (b), the substring

x_{[1, 12]}

of

x

has period 4, so

x_{[n] ∖ {1, 2, 3, 4}} = x_{[n] ∖ {9, 10, 11, 12}}

; and (3) in (c), the substring

x_{[7, 26]}

of

x

has period 7.

Example 2.

Suppose

y = x_{[n] ∖ D_{1}} = x_{[n] ∖ D_{1}^{'}}^{'}

and

y^{'} = x_{[n] ∖ D_{2}} = x_{[n] ∖ D_{2}^{'}}^{'}

, where

D_{1} = {1, 2, 3, 4}

,

D_{1}^{'} = {17, 18, 19, 20}

,

D_{2} = {9, 10, 11, 12}

and

D_{2}^{'} = {27, 28, 29, 30}

. Here, we have

i_{1} = 1, i_{1}^{'} = 17

,

i_{2} = 9, i_{2}^{'} = 27

and

t_{1} = t_{2} = 4

. Then, the conditions of Case 1.2 are satisfied. Figure 1b is an illustration of this case. We can find that

x_{i} = x_{i}^{'} = x_{i + 4}

for every

i \in [1, 8]

, i.e.,

x_{[1, 12]}

has period 4. So, by Observation 2, we have

y = x_{[n] ∖ D_{1}} = x_{[n] ∖ D_{2}} = y^{'}

.

Example 3.

Suppose

y = x_{[n] ∖ D_{1}} = x_{[n] ∖ D_{1}^{'}}^{'}

and

y^{'} = x_{[n] ∖ D_{2}} = x_{[n] ∖ D_{2}^{'}}^{'}

, where

D_{1} = {7, 8, 9, 10}

,

D_{1}^{'} = {27, 28, 29, 30}

,

D_{2} = {24, 25, 26}

and

D_{2}^{'} = {1, 2, 3}

. Here, we have

i_{1} = 7, i_{1}^{'} = 27

and

t_{1} = 4

;

i_{2} = 24, i_{2}^{'} = 1

and

t_{2} = 3

. Then, the conditions of Case 2 are satisfied and

J = I_{1} \cap I_{2} = [7, 26]

. Figure 1c is an illustration of this case. We can easily find that

x_{i}^{'} = x_{i + 4}

for every

i \in [7, 22]

, and

x_{i}^{'} = x_{i}

for every

i \in [n] ∖ J

. So by Observation 1,

x_{[n] ∖ {7, 8, 9, 10}} = x_{[n] ∖ {23, 24, 25, 26}}^{'}

. Moreover,

x_{i} = x_{i + 3}^{'} = x_{i + 7}

for every

i \in [7, 19]

, i.e.,

x_{[7, 26]}

has period 7.
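The periodicity claims in Examples 1 to 3 follow from the deletion identities alone, which can be checked mechanically. The sketch below is one possible way to do so: it treats every symbol of x and x' as a node, merges nodes that the identities force to be equal (a union-find structure), and labels each equality class with a distinct integer, so the resulting pair is the most general one compatible with the given deletion sets. The function names and the integer labelling are assumptions of this sketch; the alphabet size it produces is whatever the number of classes happens to be.

```python
def forced_pair(n, constraints):
    """Most general pair (x, x') of length n such that, for every (D, Dp) in
    constraints, x with positions D deleted equals x' with positions Dp deleted.
    Positions are 1-indexed; symbols are integer labels of equality classes."""
    parent = list(range(2 * n))      # nodes 0..n-1 stand for x, n..2n-1 for x'

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for D, Dp in constraints:
        keep_x = [i for i in range(n) if i + 1 not in D]
        keep_xp = [i for i in range(n) if i + 1 not in Dp]
        for a, b in zip(keep_x, keep_xp):    # aligned surviving symbols are equal
            parent[find(a)] = find(n + b)

    labels = {}
    word = [labels.setdefault(find(v), len(labels)) for v in range(2 * n)]
    return word[:n], word[n:]


def delete(s, D):
    """Return s with the 1-indexed positions in D removed."""
    return [c for i, c in enumerate(s, start=1) if i not in D]


def has_period(s, lo, hi, p):
    """True if s_i == s_{i+p} for every i in [lo, hi - p] (1-indexed)."""
    return all(s[i - 1] == s[i + p - 1] for i in range(lo, hi - p + 1))


n = 30
examples = {
    1: [({4, 5, 6, 7}, {27, 28, 29, 30}), ({1, 2}, {23, 24})],
    2: [({1, 2, 3, 4}, {17, 18, 19, 20}), ({9, 10, 11, 12}, {27, 28, 29, 30})],
    3: [({7, 8, 9, 10}, {27, 28, 29, 30}), ({24, 25, 26}, {1, 2, 3})],
}
claimed_period = {1: (6, 24, 2), 2: (1, 12, 4), 3: (7, 26, 7)}   # (lo, hi, period)

results = {}
for k, cons in examples.items():
    x, xp = forced_pair(n, cons)
    reads_ok = all(delete(x, D) == delete(xp, Dp) for D, Dp in cons)
    results[k] = (reads_ok, has_period(x, *claimed_period[k]))
print(results)
```

Because the pair is built only from the forced equalities, a period reported here holds for every pair (x, x') consistent with the corresponding example.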

In the following, we give a construction of q-ary codes for correcting a burst of at most t deletions with two reads. Let

ρ = δ + t = 2 (t + 1) q^{2 (t + 1)} ⌈ \log n ⌉ + t

and

L_{ρ} = {L_{j} : j = 1, 2, \dots, ⌈n / ρ⌉ - 1}

be defined by (2). For each

ℓ \in {0, 1}

, let the function

{\tilde{h}}_{ρ}^{(ℓ)}

be obtained from Construction 1 by letting h be the function

h_{syn}

given by Lemma 2. Then, let

\begin{matrix} \tilde{f} (x) = ({\tilde{h}}_{ρ}^{(0)} (x), {\tilde{h}}_{ρ}^{(1)} (x)) . \end{matrix}

(23)
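To get a feeling for the size of the parameters, the following sketch evaluates ρ and the number ⌈n/ρ⌉ − 1 of intervals L_j for given n, q and t. It assumes that log denotes the base-2 logarithm and that n ≥ 2; the function name `window_parameters` is introduced here only for illustration.

```python
def window_parameters(n, q, t):
    """Return (rho, number of intervals L_j) for the two-read construction.

    Assumes log is base 2 and n >= 2 (assumptions of this sketch).
    """
    ceil_log_n = (n - 1).bit_length()                  # exact ceil(log2 n) for n >= 2
    delta = 2 * (t + 1) * q ** (2 * (t + 1)) * ceil_log_n
    rho = delta + t                                    # rho = delta + t
    num_intervals = -(-n // rho) - 1                   # ceil(n / rho) - 1
    return rho, num_intervals


rho, count = window_parameters(n=2**20, q=4, t=2)
print(rho, count)
```

For these sample values ρ is of the same order as n, which is consistent with the guarantees being asymptotic in n for fixed q and t.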

Lemma 10.

For each

x \in Σ_{q}^{n}

, the function

\tilde{f} (x)

is computable in time

O (q^{7 t} n {(\log n)}^{3})

, and when viewed as a binary string, the length

| \tilde{f} (x) |

of

\tilde{f} (x)

satisfies

| \tilde{f} (x) | \leq 8 \log \log n + o (\log \log n) .

Moreover, if

x

is

(p, δ)

-dense, then given

\tilde{f} (x)

and any two distinct

y, y^{'} \in B_{\leq t} (x)

, one can uniquely recover

x

.

Proof.

Note that

| L_{j} | \leq 2 ρ = 4 (t + 1) q^{2 (t + 1)} ⌈ \log n ⌉ + 2 t

for each j. Then, by (3) and by Remark 2, the functions

{\tilde{h}}_{ρ}^{(0)} (x)

and

{\tilde{h}}_{ρ}^{(1)} (x)

are computable in time

O (n q^{t} {(2 ρ)}^{3}) = O (q^{7 t} n {(\log n)}^{3}) .

Hence, by (23),

\tilde{f} (x)

is computable in time

O (q^{7 t} n {(\log n)}^{3})

.

Again by (3) and by Remark 2, the length of

\tilde{f} (x)

(viewed as a binary string) satisfies

\begin{matrix} | \tilde{f} (x) | & = | {\tilde{h}}_{ρ}^{(0)} (x) | + | {\tilde{h}}_{ρ}^{(1)} (x) | \\ & = 2 \log q^{R_{2 ρ}} \\ & = 2 \log q^{4 \log_{q} (2 ρ) + o (\log_{q} (2 ρ))} \\ & = 8 \log \log n + o (\log \log n), \end{matrix}

where the last equality comes from the assumption that

ρ = δ + t = 2 (t + 1) q^{2 (t + 1)} ⌈ \log n ⌉ + t

.
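The last equality can be unpacked as follows, under the standing assumption that q and t are fixed constants, so that the factor multiplying ⌈log n⌉ in 2ρ is a constant:

```latex
\begin{aligned}
2 \log q^{R_{2\rho}} &= 2 R_{2\rho} \log q = 8 \log (2\rho) + o(\log (2\rho)), \\
\log (2\rho) &= \log \bigl( 4 (t + 1) q^{2 (t + 1)} \lceil \log n \rceil + 2 t \bigr) = \log \log n + O(1).
\end{aligned}
```

Substituting the second line into the first gives 8 \log \log n + o(\log \log n), since the O(1) term is absorbed into o(\log \log n).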

Finally, we prove that if

x \in Σ_{q}^{n}

is

(p, δ)

-dense, then given

\tilde{f} (x)

and any two distinct

y, y^{'} \in B_{\leq t} (x)

, one can uniquely recover

x

. It suffices to prove that for any

(p, δ)

-dense

x, x^{'} \in Σ_{q}^{n}

, if

| B_{\leq t} (x) \cap B_{\leq t} (x^{'}) | \geq 2

, then

\tilde{f} (x) \neq \tilde{f} (x^{'})

. This can be proved as follows. By Lemma 9, there exists an interval

J \subseteq [n]

of length

| J | \leq δ + t = ρ

and two intervals

D, D^{'} \subseteq J

of size

| D | = | D^{'} | \leq t

, such that

x_{[n] ∖ D} = x_{[n] ∖ D^{'}}^{'}

. Then, by Lemma 3, we have

{\tilde{h}}_{ρ}^{(ℓ)} (x) \neq {\tilde{h}}_{ρ}^{(ℓ)} (x^{'})

for some

ℓ \in {0, 1}

. Therefore, by (23), we have

\tilde{f} (x) \neq \tilde{f} (x^{'})

. □

Theorem 2.

Let

{\tilde{C}}_{syn}

be the code with the encoding function

\begin{matrix} \tilde{E} : Σ_{q}^{n - 1} & \to Σ_{q}^{n + r} \\ u & \mapsto (x, w) \end{matrix}

(24)

where

x = EncDen (u)

,

w = (0^{t} 1, x_{[n - t + 1, n]}, {\tilde{f}}_{q} (x))

, and

r = | w | = 2 t + 1 + | {\tilde{f}}_{q} (x) |

. Then,

{\tilde{C}}_{syn}

is an

(n, 2, B_{\leq t})

-reconstruction code with redundancy

8 \log \log n + o (\log \log n)

bits, encoding complexity

O (q^{7 t} n {(\log n)}^{3})

and decoding complexity

O (q^{9 t} {(n \log n)}^{3})

.

Proof.

We first prove that

{\tilde{C}}_{syn}

is an

(n, 2, B_{\leq t})

-reconstruction code. We need to prove that for each codeword

z = \tilde{E} (u) = (x, 0^{t} 1, x_{[n - t + 1, n]}, {\tilde{f}}_{q} (x)) \in {\tilde{C}}_{syn}

, given any distinct

y, y^{'} \in B_{\leq t} (z)

, one can uniquely recover

x

.

Let

y = z_{[n + r] ∖ D}

, where

D \subseteq [n + r]

is an interval of size

t^{'} = | D | = | z | - | y |

. By the same argument as in the proof of Lemma 8, exactly one of the following three cases holds:

Case 1:

D \subseteq [n + 1, n + r]

. In this case, we have

x = y_{[1, n]}

.

Case 2: There exists a

t^{''} \in [t^{'} - 1]

such that

D \subseteq [n - t^{'} + t^{''} + 1, n + t^{''}]

. In this case,

y_{[1, n - t^{'} + t^{''}]} = x_{[1, n - t^{'} + t^{''}]}

and

y_{[n + t + 2 - t^{'}, n + 2 t + 1 - t^{'}]} = z_{[n + t + 2, n + 2 t + 1]} = x_{[n - t + 1, n]}

. So,

x = (y_{[1, n - t]}, y_{[n + t + 2 - t^{'}, n + 2 t + 1 - t^{'}]})

.

Case 3:

D \subseteq [n]

. In this case, we have

y_{[n - t^{'}]} \in B_{\leq t} (x)

and

y_{[n + 2 t + 2 - t^{'}, n + r - t^{'}]} = {\tilde{f}}_{q} (x)

.
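The three cases can be checked numerically on the layout z = (x, 0^t 1, x_{[n − t + 1, n]}, f̃_q(x)). The sketch below is only an illustration of the index bookkeeping: the list `f` is an arbitrary placeholder standing in for f̃_q(x), the burst position is assumed to be known (a real decoder does not know it), and all slice boundaries are derived directly from the layout of z. The function names are assumptions of this sketch.

```python
import random


def build_codeword(x, t, f):
    """Lay out z = (x, 0^t 1, last t symbols of x, f); requires len(x) >= t."""
    return x + [0] * t + [1] + x[-t:] + f


def delete_burst(z, start, length):
    """Delete `length` consecutive symbols of z beginning at 1-indexed `start`."""
    return z[:start - 1] + z[start - 1 + length:]


def check_cases(x, t, f):
    """Try every burst of length 1..t on z and verify the identity of its case."""
    n, z = len(x), build_codeword(x, t, f)
    r = len(z) - n
    seen = set()
    for tp in range(1, t + 1):                        # tp plays the role of t'
        for start in range(1, n + r - tp + 2):
            end = start + tp - 1
            y = delete_burst(z, start, tp)
            if start >= n + 1:                        # Case 1: burst inside w
                assert y[:n] == x
                seen.add(1)
            elif end >= n + 1:                        # Case 2: burst straddles x | w
                tail = y[n + t + 1 - tp: n + 2 * t + 1 - tp]   # shifted copy of x's tail
                assert y[:n - t] + tail == x
                seen.add(2)
            else:                                     # Case 3: burst inside x
                assert y[:n - tp] == delete_burst(x, start, tp)
                assert y[n + 2 * t + 1 - tp:] == f    # placeholder syndrome survives
                seen.add(3)
    return seen


rng = random.Random(0)
x = [rng.randrange(4) for _ in range(12)]
cases_seen = check_cases(x, t=3, f=[2, 0, 3, 1])
print(sorted(cases_seen))
```

The check exercises every burst position, so each of the three cases is reached at least once for t ≥ 2.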

Similarly, for

y^{'} \in B_{\leq t} (z)

, either

x

can be recovered from

y^{'}

or

y^{'} = z_{[n + r] ∖ D^{'}}

for some interval

D^{'} \subseteq [n]

of size

| D^{'} | = | z | - | y^{'} | \leq t

. Suppose

y = z_{[n + r] ∖ D}

and

y^{'} = z_{[n + r] ∖ D^{'}}

for some intervals

D, D^{'} \subseteq [n]

. Then,

y_{[n - | D |]}, y_{[n - | D^{'} |]}^{'} \in B_{\leq t} (x)

and

y_{[n + 2 t + 2 - | D^{'} |, n + r - | D^{'} |]}^{'} = {\tilde{f}}_{q} (x)

. Moreover, we must have

y_{[n - | D |]} \neq y_{[n - | D^{'} |]}^{'}

, because otherwise, from (24), we would obtain

y = (y_{[n - | D |]}, w) = (y_{[n - | D^{'} |]}^{'}, w) = y^{'}

, which contradicts the assumption that

y \neq y^{'}

. By Lemma 10,

x

can be uniquely recovered from

y

and

y^{'}

.

Thus, we have proved that, given any distinct

y, y^{'} \in B_{\leq t} (z)

, one can uniquely recover

x

(and so

z

), which implies that

{\tilde{C}}_{syn}

is an

(n, 2, B_{\leq t})

-reconstruction code.

By (24), the redundancy of

{\tilde{C}}_{syn}

is

r + 1 = 2 t + 2 + | {\tilde{f}}_{q} (x) |

in q-ary symbols, which is at most

8 \log \log n + o (\log \log n)

bits by Lemma 10.

The encoding complexity of

{\tilde{C}}_{syn}

comes from the complexity of computing

{\tilde{f}}_{q} (x)

, which is

O (q^{7 t} n {(\log n)}^{3})

by Lemma 10. The decoding complexity of

{\tilde{C}}_{syn}

by brute force searching is at most

O ({(n q^{t})}^{2} q^{7 t} n {(\log n)}^{3}) = O (q^{9 t} {(n \log n)}^{3})

. □
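The proof leaves the brute-force search unspecified. One way it might be organised, sketched below under stated assumptions, is to enumerate every length-n string from which each read can arise by a single burst deletion, intersect the two candidate sets, and then keep only the candidates whose syndrome matches the received one. The sketch works on the x-part of the two reads only; the `syndrome` argument is a hypothetical stand-in for the function f̃, and the check that a candidate is (p, δ)-dense is omitted. All function names are assumptions of this sketch.

```python
from itertools import product


def insertion_ball(y, n, q):
    """All length-n q-ary strings from which y arises by deleting one burst
    of n - len(y) consecutive symbols."""
    gap = n - len(y)
    if gap == 0:
        return {tuple(y)}
    out = set()
    for pos in range(len(y) + 1):                      # where the burst was
        for burst in product(range(q), repeat=gap):    # what the burst contained
            out.add(tuple(y[:pos]) + burst + tuple(y[pos:]))
    return out


def two_read_candidates(y, yp, n, q, syndrome=None, target=None):
    """Strings consistent with both reads and, optionally, with a syndrome."""
    cands = insertion_ball(y, n, q) & insertion_ball(yp, n, q)
    if syndrome is not None:                           # hypothetical stand-in for f~
        cands = {c for c in cands if syndrome(c) == target}
    return cands


x = (0, 1, 2, 0, 1, 1, 2, 0)            # toy ternary string, n = 8
y = list(x[:1] + x[3:])                 # burst of length 2 at positions 2-3
yp = list(x[:5] + x[6:])                # burst of length 1 at position 6
cands = two_read_candidates(y, yp, n=8, q=3)
print(x in cands, len(cands) <= len(insertion_ball(y, 8, 3)))
```

Each candidate set has at most n q^t elements when the burst has length at most t, which is the source of the (n q^t)^2 factor when pairs of candidates are compared.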

6. Conclusions and Discussions

We proposed new constructions of q-ary codes correcting a burst of at most t deletions, for both classical error correcting codes and reconstruction codes. By using q-ary

(p, δ)

-dense strings and bounded burst-deletion correcting codes, our constructions reduce the code redundancy in the

\log \log n

term and have simpler encoding functions, compared to existing works.

A more general problem is to construct q-ary

(n, N, D_{t})

-reconstruction codes, i.e., q-ary reconstruction codes of length n for the t-deletion channel with N reads. Note that the best known explicit construction (in terms of redundancy) of classical t-deletion correcting binary codes of length n has redundancy

(4 t - 1) \log n + o (\log n)

[11]. So, we are interested in constructing q-ary

(n, N, D_{t})

-reconstruction codes, for any fixed q and t, of length n and with redundancy

k \log n + o (\log n)

bits for some positive integer k such that

1 \leq k < 4 t - 1

and N depends only on k and t (and is independent of n).

Author Contributions

Methodology and writing, W.S.; Supervision and project administration, K.C. and T.Q.S.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Singapore Ministry of Education Academic Research Fund Tier 2 T2EP50221-0036.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Levenshtein, V.I. Bounds for Deletion/Insertion Correcting Codes. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Lausanne, Switzerland, 30 June–5 July 2002.
2. Levenshtein, V.I. Binary codes capable of correcting deletions, insertions and reversals. Dokl. Akad. Nauk SSSR 1965, 163, 845–848. (In Russian)
3. Tenengolts, G.M. Nonbinary codes, correcting single deletion or insertion. IEEE Trans. Inform. Theory 1984, 30, 766–769.
4. Nguyen, T.T.; Cai, K.; Siegel, P.H. A new version of q-ary Varshamov-Tenengolts codes with more efficient encoders: The differential VT codes and the differential shifted VT codes. IEEE Trans. Inform. Theory 2024, 70, 6989–7004.
5. Brakensiek, J.; Guruswami, V.; Zbarsky, S. Efficient low-redundancy codes for correcting multiple deletions. IEEE Trans. Inform. Theory 2018, 64, 3403–3410.
6. Sima, J.; Bruck, J. On optimal k-deletion correcting codes. IEEE Trans. Inform. Theory 2021, 67, 3360–3375.
7. Sima, J.; Gabrys, R.; Bruck, J. Optimal systematic t-deletion correcting codes. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020.
8. Guruswami, V.; Hastad, J. Explicit two-deletion codes with redundancy matching the existential bound. IEEE Trans. Inform. Theory 2021, 67, 6384–6394.
9. Sima, J.; Gabrys, R.; Bruck, J. Optimal codes for the q-ary deletion channel. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020.
10. Bitar, R.; Hanna, S.K.; Polyanskii, N.; Vorobyev, I. Optimal codes correcting localized deletions. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 12–20 July 2021.
11. Song, W.; Polyanskii, N.; Cai, K.; He, X. Systematic codes correcting multiple-deletion and multiple-substitution errors. IEEE Trans. Inform. Theory 2022, 68, 6402–6416.
12. Song, W.; Cai, K. Non-binary two-deletion correcting codes and burst-deletion correcting codes. IEEE Trans. Inform. Theory 2023, 69, 6470–6484.
13. Liu, S.; Tjuawinata, I.; Xing, C. Explicit construction of q-ary 2-deletion correcting codes with low redundancy. arXiv 2023, arXiv:2306.02868.
14. Levenshtein, V. Asymptotically optimum binary code with correction for losses of one or two adjacent bits. Probl. Kibern. 1967, 19, 293–298.
15. Cheng, L.; Swart, T.G.; Ferreira, H.C.; Abdel-Ghaffar, K.A.S. Codes for correcting three or more adjacent deletions or insertions. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014.
16. Schoeny, C.; Wachter-Zeh, A.; Gabrys, R.; Yaakobi, E. Codes correcting a burst of deletions or insertions. IEEE Trans. Inform. Theory 2017, 63, 1971–1985.
17. Gabrys, R.; Yaakobi, E.; Milenkovic, O. Codes in the damerau distance for deletion and adjacent transposition correction. IEEE Trans. Inform. Theory 2017, 64, 2550–2570.
18. Lenz, A.; Polyanskii, N. Optimal codes correcting a burst of deletions of variable length. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020.
19. Wang, S.; Tang, Y.; Sima, J.; Gabrys, R.; Farnoud, F. Nonbinary codes for correcting a burst of at most t deletions. IEEE Trans. Inform. Theory 2024, 70, 964–979.
20. Levenshtein, V.I. Efficient reconstruction of sequences. IEEE Trans. Inform. Theory 2001, 47, 2–22.
21. Levenshtein, V.I. Efficient reconstruction of sequences from their subsequences or supersequences. J. Combin. Theory Ser. A 2001, 93, 310–332.
22. Dudik, M.; Schulman, L.J. Reconstruction from subsequences. J. Combin. Theory Ser. A 2003, 103, 337–348.
23. Gabrys, R.; Yaakobi, E. Sequence reconstruction over the deletion channel. IEEE Trans. Inform. Theory 2018, 64, 2924–2931.
24. Chee, Y.M.; Kiah, H.M.; Vardy, A.; Vu, V.K.; Yaakobi, E. Coding for racetrack memories. IEEE Trans. Inform. Theory 2018, 64, 7094–7112.
25. Gabrys, R.; Milenkovic, O. Unique reconstruction of coded strings from multiset substring spectra. IEEE Trans. Inform. Theory 2019, 65, 7682–7696.
26. Chrisnata, J.; Kiah, H.M.; Yaakobi, E. Optimal reconstruction codes for deletion channels. In Proceedings of the 2020 International Symposium on Information Theory and Its Applications (ISITA), Kapolei, HI, USA, 24–27 October 2020.
27. Pham, V.L.P.; Goyal, K.; Kiah, H.M. Sequence reconstruction problem for deletion channels: A complete asymptotic solution. In Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, 26 June–1 July 2022.
28. Cai, K.; Kiah, H.M.; Nguyen, T.T.; Yaakobi, E. Coding for sequence reconstruction for single edits. IEEE Trans. Inform. Theory 2022, 68, 66–79.
29. Chrisnata, J.; Kiah, H.M.; Yaakobi, E. Correcting deletions with multiple reads. IEEE Trans. Inform. Theory 2022, 68, 7141–7158.
30. Sun, Y.; Ge, G. Correcting two-deletion with a constant number of reads. IEEE Trans. Inform. Theory 2023, 69, 66–79.
31. Sun, Y.; Xi, Y.; Ge, G. Sequence reconstruction under single-burst-insertion/deletion/edit Channel. IEEE Trans. Inform. Theory 2023, 69, 4466–4483.
32. Abu-Sini, M.; Yaakobi, E. On the Intersection of Multiple Insertion (or Deletion) Balls and Its Application to List Decoding Under the Reconstruction Model. IEEE Trans. Inform. Theory 2024, 70, 3262–3297.
33. Sima, J.; Gabrys, R.; Bruck, J. Syndrome compression for optimal redundancy codes. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020.
34. Wijngaarden, A.V.; Immink, K.A.S. Construction of maximum run-length limited codes using sequence replacement techniques. IEEE J. Sel. Areas Commun. 2010, 28, 200–207.
35. Nguyen, T.T.; Cai, K.; Immink, K.A.S.; Kiah, H.M. Capacity-approaching constrained codes with error correction for DNA-based data storage. IEEE Trans. Inform. Theory 2021, 67, 5602–5613.
36. Sima, J.; Bruck, J. Correcting k deletions and insertions in racetrack memory. IEEE Trans. Inform. Theory 2023, 69, 5619–5639.



© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
