2. Correcting Single Errors and Adjacent Transpositions
We first recall a theorem on the detection of single errors with a check digit [5, 6].
Theorem 1. Let be an identification number satisfying the check equation . Then, a single error a is detected if and only if for all a and b in .
We propose the following theorem for identification numbers with a single error. The approach in affixing a second check digit is similar to the one in [4].
Theorem 2. Suppose a single error in the information digits is detected through the check digit equations with check digits and : Suppose the error occurred in the position. Then, , where e is the error in Equation (1) and f the error in Equation (2).
Proof. As a single error was detected in , we obtain in (1) . Thus, one of the digits in the string is e too big; i.e., for some and where c is the correct digit.
Now, the single error applied to (2) yields . Suppose the error occurred in the position. Then, . Thus, . I.e., . We obtain, . Therefore, . □
We now propose a theorem for the correction of an adjacent transposition error.
Theorem 3. Suppose an adjacent transposition error in the information digits is detected through the check Equations (3) and (4), with check digits and , where n is odd. Suppose the error occurred in the and positions. Then, , where e is the error in Equation (3) and f the error in Equation (4).
Proof. Suppose the transposition error occurred at the and positions switched one to the other. Then, in Equation (3), for an odd number i, we obtain . Note that . So we get .
Now, in Equation (4), we obtain . As , we obtain and that . Substituting by its value from (5), we obtain . □
We provide an example for hexadecimal numbers similar to the systems in [3, 5, 6]. In this case, the symbols in the check equation are the elements of represented as hexadecimal numbers mapped to quadruples over . In this system, each of the hexadecimal numbers {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is represented as one such quadruple.
As an example, let be information digits for a hexadecimal check digit system. In this scheme, we use the companion matrix P used in [7, 9], generated by the irreducible polynomial . Equations (1) and (2) of Theorem 2 verify that the two check digits are and . The identification number is thus .
As an illustration of locating a single error, assume that the information number was transcribed instead of the identification number in the previous paragraph. We see that this produces an error in Equation (1) and an error in Equation (2). We find that the error location formula in Theorem 2 yields . This is true for , confirming that the error occurred in the third position.
To illustrate the correction of an adjacent transposition error, we use an identification number of length 10 with the information digits . First, note that, with the check digits and , the identification number satisfies both check equations in Theorem 3, verifying that is the correct number.
We now locate an adjacent transposition to illustrate the formula in Theorem 3. Suppose the identification number was typed instead of the correct one displayed above. We show that positions 4 and 5 were transposed.
Verifying in Equation (3) of Theorem 3, we obtain an error . Similarly, verifying in Equation (4) of the same theorem yields an error . Substituting in the formula, we obtain ; i.e., . We see that this holds for , and conclude that the adjacent positions 4 and 5 were transposed.
The reader can use any computing means at their disposal to perform the matrix operations in the examples above, particularly matrix powers of the form $P^n$ for an integer n.
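The matrix operations mentioned above can be carried out with a few lines of code. The following is a minimal Python sketch, not the authors' implementation: it builds a companion matrix over GF(2) and raises it to an integer power by repeated squaring. The polynomial x^4 + x + 1 is an assumption chosen for illustration (the polynomial used in the example above may differ), and the names companion_matrix, mat_mul, mat_pow, and mat_vec are hypothetical helpers.

```python
def companion_matrix(coeffs, p=2):
    """Companion matrix of the monic polynomial
    x^k + c_{k-1} x^{k-1} + ... + c_1 x + c_0 over Z_p,
    with coeffs = [c_0, c_1, ..., c_{k-1}]."""
    k = len(coeffs)
    m = [[0] * k for _ in range(k)]
    for i in range(1, k):
        m[i][i - 1] = 1                 # ones on the sub-diagonal
    for i in range(k):
        m[i][k - 1] = (-coeffs[i]) % p  # last column holds -c_i mod p
    return m


def mat_mul(a, b, p=2):
    """Product of two square matrices with entries reduced mod p."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) % p
             for j in range(n)] for i in range(n)]


def mat_pow(m, n, p=2):
    """m raised to the non-negative integer power n, mod p,
    computed by repeated squaring."""
    k = len(m)
    result = [[int(i == j) for j in range(k)] for i in range(k)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base, p)
        base = mat_mul(base, base, p)
        n >>= 1
    return result


def mat_vec(m, v, p=2):
    """Matrix times column vector, mod p."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) % p
            for i in range(len(m))]


# Assumed polynomial x^4 + x + 1, i.e. c_0 = 1, c_1 = 1, c_2 = 0, c_3 = 0.
P = companion_matrix([1, 1, 0, 0])
IDENTITY = [[int(i == j) for j in range(4)] for i in range(4)]
```

Because x^4 + x + 1 is primitive over GF(2), the powers of this particular matrix repeat with period 15, so checking that mat_pow(P, 15) equals the identity is a quick sanity test of the arithmetic.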
3. Application to DNA Sequences
We apply our previously developed companion matrix error-detection system and corresponding error-correction formulas to DNA sequences. As explained in [10], a DNA (deoxyribonucleic acid) sequence is a string over four letters representing the bases adenine (A), thymine (T), guanine (G), and cytosine (C). In genetic coding, these bases are grouped into codons, which are sequences of three consecutive nucleotides [11]. Given that each codon consists of three bases, there are 64 possible codons. Each codon specifies an amino acid, and amino acids in turn form proteins. We note that we use T in place of U in this paper; U (uracil) appears in the RNA produced by transcription [12].
For the purpose of encoding a DNA sequence, we use the check digit system with entries represented as elements of the Abelian group and where the k-tuple is taken as the base p representation of the coefficient [5, 6, 7]. We take and represent the elements of as base-64 characters and map them to the 6-tuples of the group G. The full list of these representations is in Table 1 before the conclusion of this paper.
We map the nucleotides as follows: , , , and . Similar mappings are used in [13, 14]. We then represent binary strings of length 6 by points with six coordinates.
We encode a DNA sequence of length n via the equation where P is the companion matrix of a primitive polynomial of the Galois field , and symbols through are called the information digits, and is the check digit to be affixed to the information digits. Here, each corresponds to a codon which is then mapped to a binary 6-tuple.
As an example, we will encode the DNA subsequence aggatctagc agcagcagaa gcggagcttt obtained from [15]. Taken one codon at a time successively, we note that the above DNA sequence corresponds to the binary representation:
.
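The codon-by-codon conversion described here can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's exact encoding: the two-bit assignment in NUCLEOTIDE_BITS (a to 00, c to 01, g to 10, t to 11) is hypothetical and may differ from the mapping used in this paper, and the function names split_codons and codon_to_tuple are likewise invented for the example.

```python
# Hypothetical two-bit assignment per nucleotide; the paper's own
# assignment may differ.
NUCLEOTIDE_BITS = {"a": (0, 0), "c": (0, 1), "g": (1, 0), "t": (1, 1)}


def split_codons(sequence):
    """Strip whitespace and split a DNA string into codons of three bases."""
    bases = "".join(sequence.lower().split())
    if len(bases) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [bases[i:i + 3] for i in range(0, len(bases), 3)]


def codon_to_tuple(codon):
    """Concatenate the two-bit codes of the three bases into a 6-tuple."""
    bits = ()
    for base in codon:
        bits += NUCLEOTIDE_BITS[base]
    return bits


sequence = "aggatctagc agcagcagaa gcggagcttt"
codon_list = split_codons(sequence)
six_tuples = [codon_to_tuple(c) for c in codon_list]
```

The 30-base subsequence splits into ten codons, so the information part of the identification number has ten symbols before any check digit is affixed.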
Using the alphanumeric base-64 representation of these binary sequences, the DNA sequence is encoded as . The check digits are found with the use of the check equation where is the companion matrix of the primitive polynomial of GF(64).
The check digit scheme covered in this work has real-life applications. As pointed out in [3], examples of check digit schemes based on hexadecimal numbers include the International Standard Audiovisual Number (ISAN) and the Mobile Equipment Identifier (MEID). It is also pointed out in [7, 9] that this design over groups of prime-power order outperforms ISAN and MEID, as the system based on p-groups is able to detect errors of types 1–5 as well as t-jump transpositions and t-jump twin errors with a detection radius of 14 and 16. A novelty in the design built in this work is that it has two check digits and can correct single errors and adjacent transposition errors. Another novelty is its application to DNA sequences by affixing an extra codon to the sequence as a check digit. It is important to note that the extra codon is only used for error detection and correction purposes. Once the identification number is decoded back into a DNA sequence with the A, C, G, T alphabet, one has to delete the extra codon.
To end this section, we give an example for correcting a single error and a transposition error involving the aforementioned DNA sequence aggatctagc agcagcagaa gcggagcttt. First, recall that, using the mapping in Table 1, the sequence can be written as .
Before transmitting the sequence, we first use Equations (1) and (2) of Theorem 2 to find the check digits and . We thus obtain the identification number .
We now locate a single error in the DNA sequence. Suppose the transmitted sequence was received as . This would yield an error through Equation (1) and an error in Equation (2) of Theorem 2. The error locator equation is true for , verifying that the error occurred in position 3 of the identification number, with a capital letter O received or stored instead of the lower-case letter o.
To demonstrate the location of a transposition error, suppose the sequence is to be transmitted. First, we encode it with Theorem 3 by finding and . Computations with the same theorem yield and , which we represent by H and K, respectively. Thus, we obtain as the identification number to be used.
Suppose the digits were transmitted. First, note that Q and 2 were transposed. Verifying this identification number in Equation (3), we obtain the error . From Equation (4) of Theorem 3, we obtain the error . We also see that satisfies the equation . This confirms that 2 and Q (positions 7 and 8) were transposed. As a side note, it may be necessary to use scalar multiplication to avoid decimal answers.