# DNA Sequence and Structure under the Prism of Group Theory and Algebraic Surfaces

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. DNA Conformations

#### 2.2. Finitely Generated Groups, Free Groups and Their Conjugacy Classes, and Aperiodicity of Sequences

#### 2.2.1. Groups ${f}_{p}$ Close to Free Groups and Aperiodicity of Sequences

A substitution rule $\rho$ on a finite alphabet of $r$ letters is an endomorphism of the corresponding free group ${F}_{r}$ (Definition 4.1 in [13]). The endomorphism property means that the two relations $\rho \left(uv\right)=\rho \left(u\right)\rho \left(v\right)$ and $\rho \left({u}^{-1}\right)=\rho {\left(u\right)}^{-1}$ hold for any $u,v\in {F}_{r}$.
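The two relations can be checked mechanically on reduced words. The following is a minimal sketch, not the authors' implementation: the encoding of words as lists of `(letter, exponent)` pairs, the helper names `word`, `invert`, `reduce_word` and `apply_rule`, and the choice of the Fibonacci rule $a\to ab$, $b\to a$ as the test substitution are all assumptions made for illustration.

```python
def word(text):
    """Parse a string into a word: lowercase = generator, uppercase = its inverse."""
    return [(ch.lower(), 1 if ch.islower() else -1) for ch in text]

def invert(w):
    """Inverse of a word: reverse the order and flip every exponent."""
    return [(letter, -exp) for letter, exp in reversed(w)]

def reduce_word(w):
    """Freely reduce a word by cancelling adjacent pairs x x^-1."""
    out = []
    for letter, exp in w:
        if out and out[-1] == (letter, -exp):
            out.pop()
        else:
            out.append((letter, exp))
    return out

def apply_rule(rule, w):
    """Extend a substitution on letters to an endomorphism of the free group."""
    image = []
    for letter, exp in w:
        image.extend(rule[letter] if exp == 1 else invert(rule[letter]))
    return reduce_word(image)

# Hypothetical test rule (Fibonacci substitution): a -> ab, b -> a.
rule = {"a": word("ab"), "b": word("a")}
u, v = word("abA"), word("aB")

# rho(uv) = rho(u) rho(v)
lhs = apply_rule(rule, reduce_word(u + v))
rhs = reduce_word(apply_rule(rule, u) + apply_rule(rule, v))
print(lhs == rhs)

# rho(u^-1) = rho(u)^-1
print(apply_rule(rule, invert(u)) == invert(apply_rule(rule, u)))
```

Because reduced words are unique representatives of free group elements, comparing reduced lists is enough to compare group elements.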

#### 2.2.2. Aperiodicity of Substitutions

#### 2.2.3. A Four-Letter Sequence for the Transcription Factor of the Fos Gene

## 3. Discussion

#### 3.1. $SL(2,\mathbb{C})$ Character Varieties and Algebraic Surfaces

#### 3.2. The Hopf Link

#### 3.3. Beyond the Hopf Link

#### 3.4. The Fricke–Klein Seventh Variable Polynomial

## 4. Results

#### 4.1. Group Structure and Topology of Transcription Factors

#### The Character Variety for the Transcription Factor of the DBX Gene

#### 4.2. Group Structure and Topology of DNA Telomeric Sequences

**Table 4.** Group analysis of the telomere sequence found in some eukaryotes. The first column is for the telomere repeat, the second column is the organism under investigation, the third column is for the PDB code, the fourth column is for the card seq of the group $\pi $ or that of the corresponding group that is identified, the fifth column is for the Perron–Frobenius eigenvalue when the sequence is found to be aperiodic, the sixth column identifies the presence of the Hopf link (in two-base sequences) or the DNA conformation (in three-base sequences) and the seventh column is a relevant reference. The notation G-quadr is for the G-quadruplex; see Figure 4. The card seq for ${\pi}_{1}^{\prime\prime}$ is $[1,3,2,16,16,69,118,719,1877,8949,\cdots ]$. The Hecke group ${H}_{4}$ is defined in (Table 2 in [4]).

Seq | Organism | PDB | Card Seq | ${\mathit{\lambda}}_{\mathbf{PF}}$ | Link/DNA Conf | Ref |
---|---|---|---|---|---|---|
G4T4G4 | Oxytricha | 1D59 | ${\pi}_{1}^{\prime\prime}$ | $(\sqrt{5}+1)/2$ | HL | [37] |
TG4T | universal | 244D_1 | ${H}_{4}$ | . | no | [38] |
T2G4 | Tetrahymena | 230D | ${H}_{4}$ | . | no | [29] |
T2AG3 | Vertebrates | 2HY9 | ${F}_{2}$ | 2.5468 | G-quadr | [30] |
TAG3 | Giardia | 2KOW | ${F}_{2}$ | 2.2055 | G-quadr | [31] |
T2AG2 | Bombyx mori | unknown | ${F}_{2}$ | no | G-quadr | [32] |
T4AG3 | Green algae | unknown | ${F}_{2}$ | 3.07959 | unknown | [33] |
G2T2AG | Human | unknown | ${F}_{2}$ | 2.5468 | G-quadr | [34] |
TAG3T2AG3 | Human | 2HRI | ${F}_{2}$ | 3.3923 | G-quadr | [35] |
G3T2AG3T2AG3T | Human | unknown | ${F}_{2}$ | 4.3186 | G-quadr | [36] |
(GGGTTA)3G3T | Human | unknown | ${\pi}_{2}$ | no | basket | [36] |
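The Perron–Frobenius eigenvalue in the fifth column is the leading eigenvalue of a substitution matrix. The sketch below shows one way such a value can be computed; it is an illustration under stated assumptions, not the procedure used to fill the table. The Fibonacci rule $a\to ab$, $b\to a$ is chosen only because its eigenvalue $(\sqrt{5}+1)/2$ is well known; it is not claimed to be the substitution derived from any telomere repeat. The function names and the matrix convention (entry $(i,j)$ counts letter $i$ in the image of letter $j$) are likewise assumptions.

```python
def substitution_matrix(rule, alphabet):
    """Entry (i, j) counts occurrences of letter i in the image of letter j."""
    return [[rule[col].count(row) for col in alphabet] for row in alphabet]

def mat_mult(a, b):
    """Product of two square integer matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_primitive(matrix):
    """A non-negative matrix is primitive if some power is strictly positive.

    By Wielandt's bound it suffices to test powers up to (n - 1)^2 + 1.
    """
    n = len(matrix)
    power = matrix
    for _ in range((n - 1) ** 2 + 1):
        if all(entry > 0 for row in power for entry in row):
            return True
        power = mat_mult(power, matrix)
    return False

def perron_frobenius(matrix, iterations=200):
    """Estimate the leading eigenvalue of a primitive matrix by power iteration."""
    n = len(matrix)
    v = [1.0 / n] * n  # positive start vector with entries summing to 1
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w)              # growth factor, since sum(v) == 1
        v = [x / lam for x in w]  # renormalise
    return lam

# Hypothetical example rule (Fibonacci substitution).
fibonacci = {"a": "ab", "b": "a"}
M = substitution_matrix(fibonacci, "ab")
print(M, is_primitive(M), perron_frobenius(M))
```

Power iteration is used here to keep the example free of external dependencies; a numerical linear algebra library would serve equally well.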

#### 4.3. Group Structure and Topology of the DNA Decamer Sequence $d\left(CCnnn{N}_{6}{N}_{7}{N}_{8}GG\right)$ [10]

## 5. Conclusions

**Figure 2.** (**Left**): the Hopf link. (**Right**): the link $L=A\cup B$ is attached to the plane ${R}^{2}$ in the half-space ${R}_{+}^{3}$. It is not splittable. This can be proved by checking that the fundamental group $\pi ={\pi}_{2}\left(L\right)$ is not free (see [18] and p. 90 in [19]). One gets ${\pi}_{2}=\langle x,y,z \mid (x,(y,z))=z\rangle$, where $(.,.)$ means the group theoretical commutator. The cardinality sequence of conjugacy classes (cc) of subgroups of ${\pi}_{2}$ is $[1,3,10,51,164,1365,9422,81594,721305,\cdots ]$ (Figure 3 in [4]).

**Figure 3.** (**Left**): a three-dimensional picture of the $S{L}_{2}\left(\mathbb{C}\right)$ character variety ${\mathrm{\Sigma}}_{H}$ for the Hopf link complement H. (**Right**): a modified character variety of defining equation ${f}_{\tilde{H}}(x,y,z)$ with similar singularities.

**Figure 4.** Human telomere DNA quadruplex structure in K$^{+}$ solution hybrid-1 form, PDB 2HY9 [30].

**Figure 5.** (**Left**) The four-strand Holliday junction J: PDB 1ZF2. (**Right**) A complete turn of A-DNA: PDB 2D47. It is associated with the DNA dodecamer sequence $d\left(CCCCCGCGGGGG\right)$, whose $SL(2,\mathbb{C})$ character variety contains the factor ${f}_{H}=xyz-{x}^{2}-{y}^{2}-{z}^{2}+4$ (the Cayley cubic) and the factor ${f}_{\tilde{H}}={z}^{4}-2xyz+2{x}^{2}+2{y}^{2}-3{z}^{2}-4$.
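The polynomial ${f}_{H}$ can be explored numerically. The sketch below is illustrative and relies on two standard facts that are assumptions relative to this text: the fundamental group of the Hopf link complement is abelian, so its two generators map to commuting matrices $A$, $B$; and the Fricke trace identity gives ${f}_{H}(x,y,z)=2-\mathrm{tr}(AB{A}^{-1}{B}^{-1})$ with $x=\mathrm{tr}\,A$, $y=\mathrm{tr}\,B$, $z=\mathrm{tr}\,AB$. The helper names and the sample matrices are hypothetical.

```python
from fractions import Fraction

def f_H(x, y, z):
    """Cayley cubic in the sign convention used in the text."""
    return x * y * z - x**2 - y**2 - z**2 + 4

def grad_f_H(x, y, z):
    """Partial derivatives of f_H with respect to x, y and z."""
    return (y * z - 2 * x, x * z - 2 * y, x * y - 2 * z)

def mul(a, b):
    """Product of two 2x2 matrices stored as ((a11, a12), (a21, a22))."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def inv(a):
    """Inverse of a 2x2 matrix of determinant 1."""
    return ((a[1][1], -a[0][1]), (-a[1][0], a[0][0]))

def tr(a):
    return a[0][0] + a[1][1]

def characters(a, b):
    """Trace coordinates (x, y, z) = (tr A, tr B, tr AB)."""
    return tr(a), tr(b), tr(mul(a, b))

def diag(t):
    """Diagonal matrix diag(t, 1/t), which has determinant 1."""
    t = Fraction(t)
    return ((t, Fraction(0)), (Fraction(0), 1 / t))

# A commuting pair: its trace coordinates lie on the surface f_H = 0.
commuting = characters(diag(2), diag(3))
print(commuting, f_H(*commuting))

# A non-commuting pair: f_H equals 2 - tr(A B A^-1 B^-1), here nonzero.
A = ((1, 1), (0, 1))
B = ((1, 0), (1, 1))
commutator = mul(mul(A, B), mul(inv(A), inv(B)))
print(f_H(*characters(A, B)), 2 - tr(commutator))

# The four singular points: f_H and its gradient both vanish there.
nodes = [(2, 2, 2), (2, -2, -2), (-2, 2, -2), (-2, -2, 2)]
print([(f_H(*p), grad_f_H(*p)) for p in nodes])
```

Exact rational arithmetic is used so that the vanishing of ${f}_{H}$ on commuting pairs is checked without rounding error.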

r | Card Seq | Sequence Code |
---|---|---|
1 | $[1,1,1,1,1,1,1,1,1,\cdots ]$ | A000012 |
2 | $[1,3,7,26,97,624,4163,34470,314493,\cdots ]$ | A057005 |
3 | $[1,7,41,604,13753,504243,24824785,1598346352,\cdots ]$ | A057006 |
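The first entries of these cardinality sequences can be reproduced by brute force. The sketch below assumes the standard correspondence between conjugacy classes of index-$n$ subgroups of ${F}_{r}$ and transitive $r$-tuples of permutations of $n$ points, counted up to simultaneous conjugation. It is meant only as a small-index check with illustrative function names, not as the method used to generate the table, and it becomes slow quickly as $n$ grows.

```python
from itertools import permutations, product

def compose(p, q):
    """Permutation p after q (apply q first, then p)."""
    return tuple(p[q[i]] for i in range(len(p)))

def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def is_transitive(gens, n):
    """True if the permutations in gens generate a transitive group on n points."""
    seen, stack = {0}, [0]
    while stack:
        point = stack.pop()
        for g in gens:
            image = g[point]
            if image not in seen:
                seen.add(image)
                stack.append(image)
    return len(seen) == n

def card_seq_free_group(r, max_index):
    """Number of conjugacy classes of subgroups of index 1..max_index in F_r."""
    counts = []
    for n in range(1, max_index + 1):
        sym = list(permutations(range(n)))
        seen = set()
        classes = 0
        for gens in product(sym, repeat=r):
            if gens in seen or not is_transitive(gens, n):
                continue
            classes += 1
            # Mark the whole orbit under simultaneous conjugation as visited.
            for s in sym:
                s_inv = inverse(s)
                seen.add(tuple(compose(compose(s, g), s_inv) for g in gens))
        counts.append(classes)
    return counts

print(card_seq_free_group(1, 5))
print(card_seq_free_group(2, 4))
print(card_seq_free_group(3, 3))
```

The printed lists can be compared with the leading entries of the rows $r=1,2,3$ above.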

**Table 2.** Group structure of motifs for a few two-letter transcription factors. The card seq for the modular group ${H}_{3}$ is $[1,1,2,3,2,8,7,10,18,28,\cdots ]$. The Baumslag–Solitar group $BS(-1,1)$ is the fundamental group of the Klein bottle. The card seq for $BS(-1,1)$ is $[1,3,2,5,2,7,2,8,3,8,2,13,2,9,4,\cdots ]$. The card seq for ${\pi}_{1}$ is $[1,4,1,2,4,2,1,7,2,2,4,2,2,8,1,2,7,2,3,\cdots ]$; for ${\pi}_{1}^{\prime}$, it is $[1,1,1,2,1,3,3,1,2,2,1,1,9,2,14,2,1,\cdots ]$. The symbol HL means that the Cayley cubic is part of the Groebner basis for the ideal ring of the corresponding $SL(2,\mathbb{C})$ character variety. For three-letter transcription factors, the ideal ring of the corresponding $SL(2,\mathbb{C})$ character variety contains the Fricke–Klein seventh variable polynomial (4), which is a feature of the four-punctured sphere topology. The group structure of three-letter transcription factors not leading to free groups is shown in (Table 5 in [4]).

Gene | Motif | Card Seq | Link | Type | Literature |
---|---|---|---|---|---|
DBX | TTTATTA | ${F}_{1}$ | HL | ${K}_{3}$ | [23], MA0174.1 |
SPT15 | TATATATAT | . | . | . | ., MA0386.1 |
PHOX2A | TAATTTAATTA | ≈${F}_{1}$ | . | . | ., MA0713.1 |
FOXA | TGTTTGTTT | ${F}_{1}$ | . | . | [24,25] |
FOXG | TTTGTTTTT | . | . | . | [24] |
NKX6-2 | TAATTAA | ${H}_{3}$ | no | ${K}_{3}$ | [23], [MA0675.1, MA0675.2] |
FOXG | TGTTTG | $BS(-1,1)$ | no | ${K}_{3}$ | [23,26], MA1865.1 |
HoxA1, HoxA2 | TAATTA | ${\pi}_{1}$ | no | ${K}_{3}$ | [23], [MA1495.1, MA0900.1] |
POU6F1, Vax | | | | | ., [MA0628.1, MA0722.1] |
RUNX1 | TGTGGT | . | no | . | ., MA0511.1 |
RUNX1 | TGTGGTT | ${\pi}_{1}^{\prime}$ | no | ${K}_{3}$ | [23], MA0002.2 |
EHF | CCTTCCTC | . | HL | | ., MA0598.1 |

**Table 3.** A short account of the function or dysfunction (through mutations or isoforms) of genes associated with transcription factors and sections in Table 2.

Gene | Type | Function | Dysfunction |
---|---|---|---|
DBX | Drosophila segmentation | | |
SPT15 | TATA-box binding protein | gene expression, regulation in Saccharomyces cerevisiae | |
PHOX2A | homeodomain | differentiation, maintenance of noradrenergic phenotype | fibrosis of extraocular muscles |
FOX proteins | forkhead box | growth, differentiation, longevity | |
FOXA2 | . | insulin secretion | diabetes |
NKX6-2 | homeobox | central nervous system, pancreas | spastic ataxia |
FOXG | forkhead box | notochord (neural tube) | chordoma |
HoxA1 | homeobox | embryonic development of face and hear | autism |
HoxA2 | . | . | cleft palate |
Pou6F1 | . | neuroendocrine system | clear cell adenocarcinoma |
Vax | . | forebrain development | craniofacial malformation |
RunX1 | Runt-related | cell differentiation, pain neurons | myeloid leukemia |
EHF | homeobox | epithelial expression | carcinogenesis, asthma |

**Table 5.** Group analysis of the sequence $d\left(CCnnn{N}_{6}{N}_{7}{N}_{8}GG\right)$, where ${N}_{6}$, ${N}_{7}$ and ${N}_{8}$ are taken in the two nucleotides G and C and $nnn$ is specified in order to maintain the self-complementarity of the sequence [10]. The first column is for the selected triplet ${N}_{6}{N}_{7}{N}_{8}$, the second column is for the code in the Protein Data Bank, the third column is for the DNA conformation when known (see Table 1 in [10]), the fourth column is for the cardinality structure of subgroups of $\pi $ and the fifth column checks the occurrence of a surface corresponding to the Hopf link in the factorization of the $SL(2,\mathbb{C})$ character variety of $\pi $. The symbols A, B and J are for A-DNA, B-DNA and a four-stranded Holliday junction; lowercase is used when the conformation is not confirmed in [10].

Triplet | PDB | Conformation | Card Seq ($\mathit{\pi}$) | Knot |
---|---|---|---|---|
CCC | 1ZF1 | A | $[1,1,1,1,7,1,1,2,9,6,\cdots ]$ | HL |
CCC | 1ZF2 | J | idem | HL |
CCG | 1ZEX | A | idem | HL |
CGG | 1ZEY | A | $[1,1,1,2,6,3,1,4,2,6,\cdots ]$ | HL like |
CGC | none | unknown | $[1,1,2,1,6,3,2,1,3,6,\cdots ]$ | no |
GGG | 1ZF9 | A | $[1,1,1,1,10,25,25,9,2,1798,\cdots ]$ | no |
GCC | none | b/J | $[1,1,1,1,6,1,2,1,1,6,\cdots ]$ | HL |
GCG | none | unknown | $[1,1,2,2,7,5,1,4,5,9,\cdots ]$ | no |
GGC | none | B/a | $[1,1,1,1,6,11,9,5,2,208,\cdots ]$ | no |
 | | | (card seq of Hecke group ${H}_{5}$) | |

**Table 6.** Group analysis of the sequence $d\left(CCnnn{N}_{6}{N}_{7}{N}_{8}GG\right)$, where ${N}_{6}$, ${N}_{7}$ and ${N}_{8}$ are taken in the two nucleotides A, T [10]. Groups ${\pi}_{3}$ and ${\pi}_{3}^{\prime}$ are as in (Table 5 in [4]). The card seq for ${\pi}_{3}^{\prime}$ is $[1,7,50,867,15906,570528,\cdots ]$; for ${\pi}_{3}^{\prime\prime}$, it is $[1,7,50,739,15234,548439,\cdots ]$; for ${\pi}_{3}^{\left(4\right)}$, it is $[1,7,59,1258,24787,\cdots ]$. Groups ${\pi}_{3}^{\prime\prime}$ and ${\pi}_{3}^{\prime}$ may be simplified to a group whose card seq is that of ${\pi}_{2}$, the fundamental group of the link $L=A\cup B$ described in Figure 2 (right).

Triplet | PDB | Conformation | $\mathit{\pi}$ |
---|---|---|---|
TTA | 1ZFH | B | ${\pi}_{3}^{\prime\prime}\to {\pi}_{2}$ |
TAA | none | B | ${\pi}_{3}^{\prime\prime}\to {\pi}_{2}$ |
AAT | none | b | ${\pi}_{3}^{\prime\prime}\to {\pi}_{2}$ |
ATT | none | unknown | ${\pi}_{3}^{\prime\prime}\to {\pi}_{2}$ |
AAA | none | b | ${\pi}_{3}^{\prime}\to {\pi}_{2}$ |
TTT | none | unknown | ${\pi}_{3}^{\prime}\to {\pi}_{2}$ |
ATA | none | unknown | ${\pi}_{3}^{\left(4\right)}$ |
TAT | none | unknown | ${\pi}_{3}^{\left(4\right)}$ |

**Table 7.** Group analysis of the sequence $d\left(CCnnn{N}_{6}{N}_{7}{N}_{8}GG\right)$ [10], where ${N}_{6}$, ${N}_{7}$ and ${N}_{8}$ are taken in the two nucleotides A, G (left) and A, C (right). Groups ${\pi}_{3}$ and ${\pi}_{3}^{\prime}$ are as in (Table 5 in [4]). The card seq for ${\pi}_{3}^{\left(3\right)}$ is $[1,7,41,668,14969,\cdots ]$ and, for ${\pi}_{3}^{\left(5\right)}$, it is $[1,7,41,604,28153,\cdots ]$.

Triplet | PDB | Conformation | $\mathit{\pi}$ | Triplet | PDB | Conformation | $\mathit{\pi}$ |
---|---|---|---|---|---|---|---|
AGA | 1ZEW | B | ${F}_{3}$ | ACA | none | unknown | ${\pi}_{3}^{\left(3\right)}\to {\pi}_{2}$ |
AGG | none | unknown | ${\pi}_{3}^{\left(5\right)}$ | ACC | none | J | ${F}_{3}$ |
GGA | 1ZFA | A | ${F}_{3}$ | CCA | none | unknown | ${F}_{3}$ |
AAG | none | unknown | ${F}_{3}$ | AAC | 1ZF0 | B | ${\pi}_{3}^{\prime\prime}\to {\pi}_{2}$ |
TGT | none | unknown | ${F}_{3}$ | TCT | none | b | ${\pi}_{3}^{\left(3\right)}\to {\pi}_{2}$ |
TGG | 1ZF6 | A | ${F}_{3}$ | TCC | none | unknown | ${F}_{3}$ |
GGT | 1ZF8 | A | ${F}_{3}$ | CCT | none | b | ${F}_{3}$ |
TTG | none | unknown | ${\pi}_{3}^{\prime}$ | TTC | none | B | ${\pi}_{3}^{\prime\prime}\to {\pi}_{2}$ |

**Table 8.** Group analysis of the sequence $d\left(CCnnn{N}_{6}{N}_{7}{N}_{8}GG\right)$ [10], where ${N}_{6}$, ${N}_{7}$ and ${N}_{8}$ are taken in the three nucleotides A, G, C (left) and A, T, C (right). The card seq for ${\pi}_{3}^{\left(6\right)}$ is $[1,7,59,874,20371,748320,\cdots ]$.

Triplet | PDB | Conformation | $\mathit{\pi}$ | Triplet | PDB | Conformation | $\mathit{\pi}$ |
---|---|---|---|---|---|---|---|
AGC | 1ZFM | B | ${F}_{3}$ | ATC | 1ZFC/1ZF3 | B/J | ${\pi}_{3}^{\left(6\right)}$ |
ACG | none | unknown | ${F}_{3}$ | ACT | none | B | ${\pi}_{3}^{\left(3\right)}\to {\pi}_{2}$ |
GCA | 1ZFE | B | ${F}_{3}$ | TCA | none | unknown | ${\pi}_{3}^{\left(3\right)}\to {\pi}_{2}$ |
GAC | 1ZF7 | B | ${F}_{3}$ | TAC | none | unknown | ${\pi}_{3}^{\left(6\right)}$ |
CAG | none | unknown | ${F}_{3}$ | CAT | none | unknown | ${F}_{3}$ |
CGA | none | unknown | ${F}_{3}$ | CTA | none | unknown | ${F}_{3}$ |

