1. Introduction
The pioneering experiments conducted by Church et al. [1] demonstrated the feasibility of storing data in synthetic deoxyribonucleic acid (DNA), promising a huge data capacity, negligible dissipation during storage, and very long-term stability. Natural DNA consists of four types of nucleotides: adenine (‘A’), cytosine (‘C’), guanine (‘G’), and thymine (‘T’). Codes are used for translating user data into sequences of digits in the quaternary alphabet {A, C, G, T} that are suitable for the synthesis of DNA strands. Prior studies have focused on error-correcting codes for repairing various kinds of defects in DNA [2,3,4] or constrained codes that avoid the generation of error-prone DNA sequences; see, for example, Refs. [5,6,7,8].
The synthesis of DNA strands is a relatively expensive part of the storage chain. In array-based synthesis, multiple DNA strands are synthesized in parallel [9] by adding in each cycle a single nucleotide to a subset of the DNA strands. Lenz et al. [10], Makarychev et al. [11], Elishco and Huleihel [12], Immink et al. [13], and Nguyen et al. [14] presented and analyzed coding techniques for efficiently synthesizing multiple parallel strands so that the overall synthesis time can be shortened. Of specific interest in minimizing the synthesis time are sets (codes) of words of low weight, which are dealt with in the next subsection.
  1.1. Low-Weight Codes
Although our main interest is in the quaternary DNA case, we will consider q-ary sequences for generality. For clerical convenience, we assume that the alphabet used is $\mathcal{Q} = \{0, 1, \ldots, q-1\}$, where $q$ is a positive integer. For the DNA case, we represent the quaternary alphabet {A, C, G, T} by $\mathcal{Q} = \{0, 1, 2, 3\}$. Let $\mathbf{a} = (a_1, a_2, \ldots, a_n)$, $a_i \in \mathcal{Q}$, be a sequence of n symbols, called a word of length n. The symbol sum
        $w(\mathbf{a}) = \sum_{i=1}^{n} a_i$        (1)
is termed the weight of the word $\mathbf{a}$. Clearly, $0 \le w(\mathbf{a}) \le (q-1)n$. A constant-weight code of length n, denoted by $S_{n,w}$, consists of all words of weight w, that is,
        $S_{n,w} = \{\mathbf{a} \in \mathcal{Q}^n : w(\mathbf{a}) = w\}.$        (2)
The size of $S_{n,w}$, denoted by $N(n, w) = |S_{n,w}|$, is found as the coefficient of $x^w$ of the generating function [15]
        $\left(1 + x + x^2 + \cdots + x^{q-1}\right)^n.$        (3)
For synthesizing multiple words into physical sequences in parallel, we assume the sequences are generated by adding symbols in cycles. In each cycle of the synthesis process, one particular symbol from $\mathcal{Q}$ is added to the sequences of the words waiting for that symbol. Throughout this paper, we assume that the symbols in subsequent cycles are added in the cyclic order $0, 1, \ldots, q-1, 0, 1, \ldots$, which has been shown to be optimal; see [10,12]. In order to allow any word from $\mathcal{Q}^n$ to be synthesized, $qn$ cycles are needed. By restricting the set of words used for representing data, the number of required synthesis cycles can be reduced, as explained next.
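To make the cycle count concrete, here is a minimal sketch (our illustration, not code from the paper) that counts the cycles needed to synthesize a given word under the fixed cyclic order $0, 1, \ldots, q-1, 0, 1, \ldots$:

```python
# Sketch (ours): cycle count of a q-ary word when the machine offers symbols in
# the fixed cyclic order 0, 1, ..., q-1, 0, 1, ...; cycle c offers (c - 1) mod q.

def synthesis_time(x: list[int], q: int) -> int:
    """Number of cycles until the last symbol of x has been appended."""
    cycle, prev = 0, -1                    # virtual symbol -1 precedes the word
    for s in x:
        cycle += (s - prev) % q or q       # wait for the next cycle offering s
        prev = s
    return cycle

print(synthesis_time([3, 3, 3], q=4))  # run of identical symbols: q*n = 12 cycles
print(synthesis_time([0, 1, 2], q=4))  # best case: n = 3 cycles
```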
Let the low-weight code be the union of the sets of words of weight at most $t$, where the integer $t$, $0 \le t \le (q-1)n$, denotes the maximum weight of the codewords. As explained in [10,13], low-weight codewords $\mathbf{y}$ can be bijectively mapped to words $\mathbf{x} = (x_1, \ldots, x_n)$, $x_i \in \mathcal{Q}$, by
        $x_i = (x_{i-1} + y_i + 1) \bmod q, \quad 1 \le i \le n,$        (4)
with $x_0 = -1$, such that the words $\mathbf{x}$ have a synthesis time of at most $n + t$ cycles. Let the low-weight code be denoted as $W_{n,t} = \bigcup_{w=0}^{t} S_{n,w}$, where $0 \le t \le (q-1)n$, and the associated set of words $\mathbf{x}$ as $X_{n,t}$. From the synthesis perspective, we are interested in properties of the codes $X_{n,t}$, but because of the bijective mapping, we can also study the low-weight codes $W_{n,t}$.
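The bijection and the resulting synthesis time can be checked numerically. The sketch below (ours, with $x_0 = -1$ as above) implements the mapping (4) and its inverse, and verifies that the word $\mathbf{x}$ obtained from $\mathbf{y}$ costs exactly $n + w(\mathbf{y})$ cycles:

```python
# Sketch (ours): the bijection (4) between a low-weight word y and a synthesis
# word x (x_0 = -1, as in the text), and a check that the synthesis time of x
# equals n + w(y).

def y_to_x(y: list[int], q: int) -> list[int]:
    x, prev = [], -1
    for yi in y:
        prev = (prev + yi + 1) % q         # each symbol costs y_i + 1 cycles
        x.append(prev)
    return x

def x_to_y(x: list[int], q: int) -> list[int]:
    y, prev = [], -1
    for xi in x:
        y.append((xi - prev - 1) % q)
        prev = xi
    return y

def synthesis_time(x: list[int], q: int) -> int:
    cycle, prev = 0, -1
    for s in x:
        cycle += (s - prev) % q or q
        prev = s
    return cycle

y = [0, 2, 1, 0, 3]                        # n = 5, weight w(y) = 6
x = y_to_x(y, q=4)
assert x_to_y(x, q=4) == y                 # the mapping is bijective
assert synthesis_time(x, q=4) == len(y) + sum(y)   # n + w(y) = 11 cycles
```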
  1.2. Redundancy and Information Rate
The redundancy (in bits per symbol) of a low-weight code $W_{n,t}$ is defined by
        $r(n, t) = \log_2 q - \frac{1}{n} \log_2 |W_{n,t}|,$        (5)
where
        $|W_{n,t}| = \sum_{w=0}^{t} N(n, w).$        (6)
Lenz et al. [10] also introduced the information rate (in bits per cycle) of a low-weight code $W_{n,t}$ as
        $R(n, t) = \frac{\log_2 |W_{n,t}|}{n + t}.$        (7)
Of course, $r(n, t)$ and $R(n, t)$ are also the redundancy and information rate, respectively, of $X_{n,t}$. Note that $R(n, t)$ is a measure of the synthesis efficiency of the codewords of $X_{n,t}$.
Using (3), we can straightforwardly compute $r(n, t)$ and $R(n, t)$ versus $t/n$. Figure 1 and Figure 2 show the results for various values of $n$, including $n = 32$, for the DNA case $q = 4$. The curves suggest that $r(n, t)$ and $R(n, t)$ have a lower bound and upper bound, respectively, for asymptotically large $n$. A major goal of this paper is to determine these bounds.
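Curves of the kind shown in Figure 1 and Figure 2 can be reproduced along the following lines; the sketch below (ours) evaluates (5)–(7) exactly by summing coefficients of the generating function (3):

```python
# Sketch (ours): exact finite-length redundancy r(n, t) and information rate
# R(n, t) of W_{n,t}, per Equations (5)-(7), from the coefficients of (3).

from math import log2

def weight_counts(n: int, q: int) -> list[int]:
    """Coefficients of (1 + x + ... + x^(q-1))^n, i.e., N(n, w) for w = 0, 1, ..."""
    poly = [1]
    for _ in range(n):
        new = [0] * (len(poly) + q - 1)
        for deg, coef in enumerate(poly):
            for i in range(q):
                new[deg + i] += coef
        poly = new
    return poly

def r_and_R(n: int, t: int, q: int) -> tuple[float, float]:
    size = sum(weight_counts(n, q)[: t + 1])          # |W_{n,t}|, Equation (6)
    return log2(q) - log2(size) / n, log2(size) / (n + t)

for n in (16, 32, 64):                                # beta = t/n = 1, q = 4
    print(n, r_and_R(n, n, 4))
```

Running the loop for increasing $n$ at fixed $t/n$ illustrates the convergence suggested by the figures.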
  1.3. Contributions and Overview of the Paper
Besides introducing the framework as just described, Lenz et al. [10] also conducted a brief performance analysis of DNA synthesis codes, mainly based on tools from the theory of cost-constrained channels. Constructions of efficient DNA synthesis codes were further explored in [13]. In this paper, Section 2 deals with an extensive asymptotic analysis, with a focus on the trade-off between redundancy and information rate. The results are derived using Jaynes’ maximum entropy principle. In Section 3, we compare the obtained theoretical optima with the performance of practical nibble replacement codes. Finally, we extend the analysis to codes with a runlength constraint in Section 4 and conclude the paper in Section 5.
  2. Asymptotic Analysis of Low-Weight Codes
In order to evaluate the sizes of large low-weight codes, we use the following approach. Let $\mathcal{P}(n, w)$ be the set of compositions $(n_0, n_1, \ldots, n_{q-1})$ of $n$, where the $n_i$ are nonnegative integers such that $\sum_{i=0}^{q-1} n_i = n$, subject to the constraint $\sum_{i=0}^{q-1} i\, n_i = w$. The number of $q$-ary words of length $n$ with $n_i$ symbols equal to $i$, $0 \le i \le q-1$, denoted by $M(n_0, \ldots, n_{q-1})$, equals
        $M(n_0, \ldots, n_{q-1}) = \frac{n!}{n_0!\, n_1! \cdots n_{q-1}!}.$        (8)
The constant-weight code size, $N(n, w)$, is found by summing $M(n_0, \ldots, n_{q-1})$ for all possible compositions in $\mathcal{P}(n, w)$, so that
        $N(n, w) = \sum_{(n_0, \ldots, n_{q-1}) \in \mathcal{P}(n, w)} \frac{n!}{n_0!\, n_1! \cdots n_{q-1}!}.$        (9)
  2.1. Asymptotic Analysis of $N(n, w)$
We are specifically interested in $N(n, w)$ for asymptotically large $n$. So, let $n \to \infty$ and $n_i \to \infty$ for all $i$, while keeping $p_i = n_i/n$, $0 \le i \le q-1$, the distribution of the symbol values, fixed. It then follows (see the Wallis argument in Section 11.4 of [16]), using Stirling’s approximation, that
        $\lim_{n \to \infty} \frac{1}{n} \log_2 M(n_0, \ldots, n_{q-1}) = h(\mathbf{p}),$        (10)
and thus
        $M(n_0, \ldots, n_{q-1}) \approx 2^{n h(\mathbf{p})},$        (11)
where
        $h(\mathbf{p}) = -\sum_{i=0}^{q-1} p_i \log_2 p_i.$        (12)
In a similar vein, since the number of compositions grows only polynomially in $n$, the sum (9) is dominated by its largest term, and we find
        $\lim_{n \to \infty} \frac{1}{n} \log_2 N(n, w) = \max_{\mathbf{p}} h(\mathbf{p}),$        (13)
where the maximization is over the normalized compositions in $\mathcal{P}(n, w)$.
Since $N(n, w)$ is monotonically increasing with $w$ for $0 \le w \le (q-1)n/2$, with a maximum $N(n, (q-1)n/2)$ at $w = (q-1)n/2$, we infer, if $t \le (q-1)n/2$, that
        $\lim_{n \to \infty} \frac{1}{n} \log_2 |W_{n,t}| = \lim_{n \to \infty} \frac{1}{n} \log_2 N(n, t),$        (14)
since $|W_{n,t}| = \sum_{w=0}^{t} N(n, w) \le (t+1) N(n, t)$ and the polynomial factor $t+1$ vanishes on the exponential scale.
The problem of determining $\lim_{n \to \infty} \frac{1}{n} \log_2 N(n, w)$, and thus the asymptotic redundancy $r_\infty(\beta)$ and the asymptotic information rate $R_\infty(\beta)$, is now a matter of finding, for asymptotically large $n$, a composition in $\mathcal{P}(n, w)$ that maximizes $h(\mathbf{p})$. The compositions $(n_0, \ldots, n_{q-1})$ in $\mathcal{P}(n, w)$ are characterized by
        $\sum_{i=0}^{q-1} n_i = n \quad \text{and} \quad \sum_{i=0}^{q-1} i\, n_i = w,$        (15)
which can be conveniently rewritten as
        $\sum_{i=0}^{q-1} p_i = 1 \quad \text{and} \quad \sum_{i=0}^{q-1} i\, p_i = \beta,$        (16)
where $\beta = w/n$ denotes the normalized weight. In the next subsection, we maximize $h(\mathbf{p})$ by a judicious choice of the distribution of the symbol values, $\mathbf{p}$, under these conditions.
  2.2. Principle of Maximum Entropy
We change the above setting of finite-length codewords and now assume a stationary information source that transmits symbols of (integer) magnitude $i$, $0 \le i \le q-1$, with probability distribution $\mathbf{p} = (p_0, \ldots, p_{q-1})$, where $p_i \ge 0$ and $\sum_{i=0}^{q-1} p_i = 1$. The information content per symbol sent, or entropy, denoted by $H$, defined by Shannon [17], is
        $H = -\sum_{i=0}^{q-1} p_i \log_2 p_i.$        (17)
Although the variable $h(\mathbf{p})$ in (12) and Shannon’s entropy $H$ share the same expression in $\mathbf{p}$, the background of the expressions is different [16]. Note that in (12), the $p_i$’s are rational numbers, while in (17) the $p_i$’s are assumed to be real numbers.
We are interested in maximizing the entropy $H$. Define
        $H_{\max}(\beta) = \max_{\mathbf{p}} H,$        (18)
where the maximization over the $p_i$ is under the conditions (16). Jaynes [18] concluded that the entropy, $H$, is maximized subject to these constraints by the maxentropic probability distribution
        $p_i = 2^{-\lambda_0 - \lambda_1 i}, \quad 0 \le i \le q-1,$        (19)
where the parameters $\lambda_0$ and $\lambda_1$ satisfy the conditions
        $\sum_{i=0}^{q-1} 2^{-\lambda_0 - \lambda_1 i} = 1$        (20)
and
        $\sum_{i=0}^{q-1} i\, 2^{-\lambda_0 - \lambda_1 i} = \beta.$        (21)
After substituting (19) to (21) into (17), we find
        $H_{\max}(\beta) = \lambda_0 + \lambda_1 \beta.$        (22)
For the case $q = 2$, we may easily find that $\lambda_0 = -\log_2(1-\beta)$ and $\lambda_1 = \log_2 \frac{1-\beta}{\beta}$, so that $p_0 = 1 - \beta$, $p_1 = \beta$, and
        $H_{\max}(\beta) = -\beta \log_2 \beta - (1-\beta) \log_2 (1-\beta).$        (23)
For $q > 2$, no simple closed-form expression could be found, and we use numerical methods for solving (20) and (21).
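For example, the following sketch (ours) solves (20) and (21) by bisection: writing $u = 2^{-\lambda_1}$, the distribution (19) becomes $p_i = u^i / \sum_j u^j$, whose mean is monotonically increasing in $u$, so a one-dimensional search on $u$ suffices.

```python
# Sketch (ours): numerical solution of (20) and (21) for q > 2. With
# u = 2^(-lambda_1), the maxentropic distribution (19) reads p_i = u^i / Z(u),
# and its mean is monotonically increasing in u, so bisection on u suffices.

from math import log2

def maxent(q: int, beta: float, iters: int = 100):
    lo, hi = 1e-12, 1e12
    for _ in range(iters):
        u = (lo * hi) ** 0.5                # geometric bisection on u > 0
        Z = sum(u**i for i in range(q))
        mean = sum(i * u**i for i in range(q)) / Z
        if mean < beta:
            lo = u
        else:
            hi = u
    p = [u**i / Z for i in range(q)]
    H = -sum(pi * log2(pi) for pi in p)     # H_max(beta)
    return p, H

p, H = maxent(q=4, beta=1.0)
print(H, log2(4) - H, H / (1 + 1.0))        # H_max, r_inf, R_inf at beta = 1
```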
  2.3. Asymptotic Redundancy and Information Rate
As a result, the asymptotic redundancy is
        $r_\infty(\beta) = \log_2 q - H_{\max}(\beta).$        (24)
Figure 3 depicts, for $q = 2, 4$, and 6, the relationship between the asymptotic redundancy $r_\infty(\beta)$ and $\beta$.
 The asymptotic information rate, $R_\infty(\beta)$, equals
        $R_\infty(\beta) = \frac{H_{\max}(\beta)}{1 + \beta}.$        (25)
Figure 4 shows $R_\infty(\beta)$ versus $\beta$ for $q = 2, 4$, and 6.
 The maximum asymptotic information rate, denoted by
        $R_\infty^{\max} = \max_{0 \le \beta \le q-1} R_\infty(\beta),$        (26)
can be found after an analysis of (22). We write (22), using (20) and (21), as a function of $\beta$; setting the derivative with respect to $\beta$ equal to zero yields $\lambda_0 = \lambda_1$, and we conclude that the largest (real) root of
        $\sum_{i=1}^{q} z^{-i} = 1,$        (27)
denoted by $z_{\max}$, maximizes $R_\infty(\beta)$. We obtain, see (20), $\lambda_0 = \lambda_1 = \log_2 z_{\max}$ and hence, see (22), we infer that
        $R_\infty^{\max} = \log_2 z_{\max},$        (28)
where the maximizing normalized weight equals
        $\beta^* = \sum_{i=0}^{q-1} i\, z_{\max}^{-(i+1)}$        (29)
and the corresponding maxentropic distribution is
        $p_i = z_{\max}^{-(i+1)}, \quad 0 \le i \le q-1.$        (30)
Note that Equation (27) is equivalent to the characteristic equation
        $z^{q+1} - 2z^{q} + 1 = 0$
of a binary source under the constraint that the maximum runlength is $q$ [19]. The capacity of binary sequences with a maximum runlength constraint of $q$ equals $\log_2 z_{\max}$, where $z_{\max}$ is the largest (real) root of the characteristic equation [17]. Hence, the maximum asymptotic information rate $R_\infty^{\max}$ of $q$-ary low-weight codes is equal to this capacity. Numerical values of the latter have been listed for selected values of $q$ in [19]. Since the capacity approaches unity for increasing values of $q$, the same holds for the information rate $R_\infty^{\max}$, which is achieved for $\beta = \beta^*$. In other words, for large values of $n$ and $q$, the maximum information rate is achieved by setting the maximum weight of the low-weight code equal to (roughly) $\beta^* n$. The corresponding redundancy is $r_\infty(\beta^*) = \log_2 q - (1 + \beta^*) \log_2 z_{\max}$. For any $q$, the asymptotic redundancy can be lowered from $r_\infty(\beta^*)$ to zero by increasing $\beta$ from $\beta^*$ to $(q-1)/2$, which implies that the asymptotic information rate decreases from $R_\infty^{\max}$ to $\frac{2 \log_2 q}{q+1}$. This trade-off between redundancy and information rate is further explored for the case $q = 4$ in the next subsection.
  2.4. Case Study for $q = 4$
In this subsection, we consider the case $q = 4$, which is of particular interest since it is the alphabet size for DNA synthesis codes. For $q = 4$, we find using numerical methods that $z_{\max} = 1.9276$, $\beta^* = 0.766$, and $R_\infty^{\max} = 0.9468$. The probability distribution at maximum entropy is $\mathbf{p} =$ (0.519, 0.269, 0.140, 0.072).
Figure 5 shows the parametric representation of $R_\infty(\beta)$ versus $r_\infty(\beta)$ with $\beta$ as a parameter for the case $q = 4$. The curve is a typical price/performance curve, where we may observe that a higher $R_\infty$ comes with a higher penalty in redundancy $r_\infty$.
It is the difficult task of a system designer to trade the costs and benefits of the conflicting parameters. Note that in the range $0.5 \le R_\infty \le 0.8$ we have $r_\infty = 0$, a zero-redundant system, while in the range $0.8 < R_\infty \le 0.9468$ we may achieve the same information rate $R_\infty$ with a smaller redundancy, since each rate in this range is attained by two values of $\beta$, and the larger of the two comes with the smaller redundancy. For example, we may notice that we can achieve $R_\infty = 0.8$ for zero redundancy cost or for a redundancy of roughly 0.9. In practice, we prefer the smaller redundancy alternative so that in this range of practical interest, we have $0.766 \le \beta \le 1.5$ and $0.8 \le R_\infty \le 0.9468$.
Figure 6 displays $R_\infty$ versus $r_\infty$ in the range of practical interest $0 \le r_\infty \le 0.33$.
  3. Comparison with Implemented Codes
In this section, we compare the performance of implemented codes with that of maxentropic low-weight codes. In [13], various code implementations have been assessed. Here, we focus on the nibble replacement (NR) algorithm [13,20], which is an efficient method for encoding and decoding with small complexity and redundancy.
In the NR format, an $n$-symbol strand is divided into $L$ subwords of length $m$, so that $n = mL$. Let $T_m$ be the maximum allowed cycle count of an $m$-symbol $q$-ary word; then the maximum cycle count of the $n$-symbol $q$-ary word is $T = L T_m$. Let
        $N_m = \sum_{w=0}^{T_m - m} N(m, w)$        (31)
denote the number of low-weight $m$-symbol codewords. Define $k = \lfloor \log_2 N_m \rfloor$ and
        $K = \lfloor L \log_2 N_m \rfloor.$        (32)
The NR algorithm translates $K$ source bits into $L$ $(k+1)$-bit words. Each $(k+1)$-bit word is translated, using a look-up table, into a $q$-ary $m$-symbol word that satisfies the $T_m$-cycle count constraint. The NR encoding method requires data storage of $L$ $(k+1)$-bit words, the execution of the encoding algorithm [13], and a look-up table for translating a $(k+1)$-bit wide word into a word of $m$ $q$-ary symbols, so that very large, $n$-symbol wide, look-up tables are avoided. The overall redundancy per symbol, $r_{\mathrm{NR}}$, and information rate, $R_{\mathrm{NR}}$, of the $n$-symbol word are
        $r_{\mathrm{NR}} = \log_2 q - \frac{K}{n}$        (33)
and
        $R_{\mathrm{NR}} = \frac{K}{T}.$        (34)
Table 1 shows numerical results selected from Table I in [13].
 The scattered points (black circles) in Figure 6 are found by plotting the redundancy, $r_{\mathrm{NR}}$, and information rate, $R_{\mathrm{NR}}$, of the NR codes shown in Table 1.
  4. Runlength Limitation
It is known that homopolymer runs, i.e., adjacent repetitions of the same nucleotide, make DNA-based data storage more error prone [12]. Therefore, it could be advantageous to use strands in which long runs are avoided. Of course, this comes at the expense of an increased redundancy. In this section, we perform an asymptotic analysis of codes aiming at (i) small redundancy, (ii) high information rate, and (iii) small maximum runlength. These are conflicting goals, resulting in trade-off considerations. Again, we start by investigating $q$-ary codes and then focus on the $q = 4$ case.
We say that a code is $r$-RLL (runlength limited) if within any codeword any run of identical symbols is of length at most $r$, where $1 \le r \le \infty$. When $r = \infty$, there is actually no constraint with respect to the runlength. Here, we focus on the other extreme, $r = 1$; i.e., we consider codewords in which any two adjacent symbols are different. We investigate the asymptotic redundancy and information rate of $q$-ary 1-RLL codes. The same notation as before is used, where we indicate with a tilde that the $r = 1$ constraint is in place.
Let $\tilde{X}_{n,t}$ denote the $q$-ary code consisting of all 1-RLL sequences that can be synthesized in at most $n + t$ cycles. The codewords $\mathbf{y}$ of the associated low-weight code $\tilde{W}_{n,t}$ are obtained from the codewords $\mathbf{x}$ of $\tilde{X}_{n,t}$ by the bijective mapping
        $y_i = (x_i - x_{i-1} - 1) \bmod q, \quad 1 \le i \le n,$        (35)
with $x_0 = -1$. Note that due to the 1-RLL property of $\mathbf{x}$, it holds that $x_i \ne x_{i-1}$ and thus $y_i \le q - 2$ for all $1 \le i \le n$. Hence, $\tilde{W}_{n,t} = \bigcup_{w=0}^{t} \tilde{S}_{n,w}$, where
        $\tilde{S}_{n,w} = \{\mathbf{y} \in \{0, 1, \ldots, q-2\}^n : w(\mathbf{y}) = w\},$        (36)
and the range for $\beta = t/n$ is in this case $0 \le \beta \le q-2$, since the maximum number of cycles is $(q-1)n$ rather than $qn$ due to the runlength constraint.
Similarly to what we did before, we next evaluate
        $\lim_{n \to \infty} \frac{1}{n} \log_2 |\tilde{W}_{n, \beta n}|.$        (37)
Since the symbol distribution $\mathbf{p}$ satisfies, for any codeword in the low-weight code,
        $p_{q-1} = 0,$        (38)
as $n \to \infty$, we can conclude that the value of $\tilde{H}_{\max}(\beta)$ in the $q$-ary case is equal to the value of $H_{\max}(\beta)$ in the $(q-1)$-ary case. Hence, it easily follows that
- The asymptotic redundancy $\tilde{r}_\infty(\beta)$ in the $q$-ary 1-RLL case equals $\log_2 \frac{q}{q-1}$ plus the asymptotic redundancy $r_\infty(\beta)$ in the $(q-1)$-ary case without runlength restriction; 
- The asymptotic information rate $\tilde{R}_\infty(\beta)$ in the $q$-ary 1-RLL case equals the asymptotic information rate $R_\infty(\beta)$ in the $(q-1)$-ary case without runlength restriction. 
As an illustration, we consider the case $q = 4$. By applying the results from (25) and (26) for alphabet size $q - 1 = 3$, we find $\tilde{R}_\infty^{\max} = 0.8791$ and $\tilde{\beta}^* = 0.617$ for $q = 4$. These 1-RLL results are compared to the corresponding results without runlength limitation from Section 2 in Figure 7, Figure 8 and Figure 9. Results for $r$-RLL codes, $2 \le r < \infty$, will be in between the lower and the upper curves in these figures. Various trade-off possibilities can be considered. Note that, for small values of $\beta$, imposing the runlength limitation comes at hardly any price, but that for larger values of $\beta$ we considerably pay in terms of redundancy and information rate. Fixing the asymptotic redundancy at, e.g., $r_\infty = 0.5$, it follows from Figure 9 that the asymptotic information rate drops from about 0.93 (∞-RLL, i.e., no runlength limitation) to about 0.87 (1-RLL).
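The two correspondences above, and the quoted rate drop, can be verified numerically with the maxentropic machinery of Section 2. The sketch below (ours) grid-searches $\beta$ for the rate achievable at a fixed redundancy, once with the full quaternary alphabet and once with the ternary alphabet that models the 1-RLL constraint:

```python
# Sketch (ours): information rate achievable at a fixed redundancy, for the
# unconstrained quaternary case and for the 1-RLL case modeled by the ternary
# alphabet {0, 1, 2}; redundancy is measured against log2(4) in both cases.

from math import log2

def maxent_entropy(q: int, beta: float, iters: int = 100) -> float:
    """H_max(beta) for the maxentropic distribution p_i ~ u^i on {0,...,q-1}."""
    lo, hi = 1e-12, 1e12
    for _ in range(iters):
        u = (lo * hi) ** 0.5
        Z = sum(u**i for i in range(q))
        if sum(i * u**i for i in range(q)) / Z < beta:
            lo = u
        else:
            hi = u
    return -sum((u**i / Z) * log2(u**i / Z) for i in range(q))

def rate_at_redundancy(q_eff: int, q_alph: int, r: float, steps: int = 4000) -> float:
    """Best R over beta such that log2(q_alph) - H_max(beta) is (about) r."""
    best = 0.0
    for k in range(1, steps):
        beta = (q_eff - 1) * k / steps
        H = maxent_entropy(q_eff, beta)
        if abs(log2(q_alph) - H - r) < 5e-3:
            best = max(best, H / (1 + beta))
    return best

print(rate_at_redundancy(4, 4, 0.5))  # no runlength limit: about 0.93
print(rate_at_redundancy(3, 4, 0.5))  # 1-RLL via the ternary alphabet: about 0.87
```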