1. Introduction
We found in a previous work that the approach of quantum computation based on magic states [
1,
2,
3] may also be used to explore the symmetries and the structure of the genetic code [
4,
5,
6]. Given an appropriate finite group
G with
d conjugacy classes, one takes an irreducible character
and a corresponding
r-dimensional representation in the conjugacy class. For the application to the genetic code, one takes the finite group
(with
or 7 and
the binary octahedral group) [
4,
5]. For such a group, the dimension
r may be 1, 2, 3, 4, or 6 and the relevant conjugacy classes may be mapped to the amino acids of degeneracy
r in their relation to codons. Then one defines
one-dimensional projectors
, where the
are the
states obtained from the action of a
d-dimensional Pauli group
on the character
. When the rank of the Gram matrix
with elements
is
, the character
corresponds to a minimal informationally complete quantum measurement (or MIC), see, e.g., ([
4], Section 3).
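The rank criterion above can be illustrated on the smallest case. The following is a minimal sketch for a single qubit (d = 2), not for the group characters used in this work: the fiducial state, the helper names, and the use of the four Pauli matrices as the Pauli group modulo phases are assumptions made for the illustration. A fiducial state whose Bloch vector points along (1,1,1) yields a Gram matrix of rank 4, that is, a MIC, whereas the basis state |0> only reaches rank 2.

```python
import cmath
import math

# The four Pauli matrices stand for the qubit Pauli group modulo phases.
PAULIS = {
    "I": [[1, 0], [0, 1]],
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}

def apply(matrix, state):
    """Matrix-vector product on plain Python lists."""
    return [sum(matrix[r][c] * state[c] for c in range(len(state)))
            for r in range(len(matrix))]

def overlap_squared(a, b):
    """tr(Pi_a Pi_b) = |<a|b>|^2 for projectors onto unit vectors a and b."""
    return abs(sum(x.conjugate() * y for x, y in zip(a, b))) ** 2

def gram_matrix(fiducial):
    """Gram matrix of the projectors obtained from the Pauli orbit of the fiducial."""
    states = [apply(p, fiducial) for p in PAULIS.values()]
    return [[overlap_squared(a, b) for b in states] for a in states]

def rank(matrix, tol=1e-9):
    """Rank of a real matrix by Gaussian elimination with partial pivoting."""
    rows = [[float(x) for x in row] for row in matrix]
    found = 0
    for col in range(len(rows[0])):
        pivot = max(range(found, len(rows)),
                    key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) < tol:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[found])]
        found += 1
    return found

def magic_state():
    """Hypothetical fiducial: qubit state with Bloch vector (1,1,1)/sqrt(3)."""
    theta = math.acos(1 / math.sqrt(3))
    return [complex(math.cos(theta / 2)),
            cmath.exp(1j * math.pi / 4) * math.sin(theta / 2)]

print(rank(gram_matrix(magic_state())))   # rank d^2 = 4 signals a MIC
print(rank(gram_matrix([1 + 0j, 0j])))    # a basis state is not informationally complete
```

The elimination routine is kept in pure Python so that the sketch has no dependencies; for larger dimensions a numerical library would be the natural choice.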
The second step of our work deals with the (secondary) genetic code found in the protein structure.
Proteins are long polymeric linear chains encoded with the 20 amino acid residues arranged in a biologically functional way. Today the Protein Data Bank (PDB) contains about
entries [
7]. Proteins may perform a large variety of functions in living cells and organisms including molecular recognition, catalyzing metabolic reactions, DNA replication and structural support for molecules. The sequence of amino acids leads to many different three-dimensional foldings that happen to be more conserved during evolution than the sequences themselves. The structure of proteins determines their biological function [
8].
A coarse-grained representation of the backbone structure of the linear chain in a protein, a secondary code, contains three main elements: α helices and β pleated sheets, which are due to the interactions between atoms and backbones, and random coils, which indicate the absence of a regular structure. The ordered structures are held in shape by hydrogen bonds, which form between the carbonyl group of one amino acid and the amino group of another. In an α helix, there is a pattern of bonds that puts the polypeptide chain into a helical structure with each turn of the helix containing 3.6 amino acids [9]. In a β pleated sheet, two or more segments of a polypeptide chain line up next to each other, forming a sheet-like structure held together by hydrogen bonds [10]. The three main elements of a protein linear chain are usually denoted H (if the segments form an α helix), E (if the segments form a β pleated sheet), and C (if the segments form a coil) and constitute what is called the secondary structure of the protein.
The protein secondary structure is an algebraic notation that is useful when working with X-ray diffraction and NMR structures from the PDB. In vivo, however, proteins encounter a wide variety of effects: solvent effects, anionic and cationic concentration effects, van der Waals forces, and binding to other proteins and nucleic acids, to name a few. The scheme below does lend itself to defining algebraic operations of transformations or projections that could be performed to account for some of these effects.
In this paper, we are interested in the universality of the two- or three-letter secondary code found in proteins. The letters are segments of the protein that correspond to a helix H, a pleated sheet E, or a random coil C. Our view of the connection between proteins, seen as words with two letters (or three letters), and free group theory is as follows. One defines the two-letter group $G = \langle H, C \mid \mathrm{rel}(H,C) \rangle$ or the three-letter group $G = \langle H, E, C \mid \mathrm{rel}(H,E,C) \rangle$, where rel(H,C) or rel(H,E,C) is the model of the protein secondary structure. For example, a hypothetical secondary code, such as , would correspond to the group which is called the modular group. Sometimes the group G corresponds (or is close in its structure) to the fundamental group $\pi_1(M)$ of a three-dimensional manifold $M$, so that we take $M$ as a candidate manifold of the protein foldings. For the aforementioned example, the candidate manifold would be the trefoil knot complement.
We find, from several protein examples belonging to highly symmetric complexes, that the secondary code has to obey some structural algebraic constraints relying on free group theory. Our first investigation points out the possible role of two algebraic building blocks. The first one is the hyperbolic (unoriented) 3-manifold of smallest volume known as the Gieseking manifold [
11], when the secondary code only consists of two letters
H and
C. The second one is the oriented hypercartographic group
[
12,
13,
14] (alias the two-generator free group), when the secondary code needs the three letters
H,
E, and
C. The consistency of the (primary) genetic code and the secondary code is studied in the light of the Kummer surface that we already assumed to play a role in the quaternary structure of protein complexes [
5].
In
Section 2, we provide a few elements about free group theory, finitely generated subgroups of a free group and the fundamental group of a 3-manifold. We single out the mathematical objects that will be useful for our approach to the secondary structures of proteins.
In
Section 3, we feature a protein example, the histone H3 of Drosophila melanogaster, with a short sequence of 136 amino acids (136 aa) only comprising
H and
C segments in the secondary pattern. We compare the results obtained from four different prediction programs and assess how well they fit the cardinality sequence of subgroups of a few candidate 3-manifolds. The Gieseking manifold
is a good candidate (obtained from one model) not only in terms of the cardinality sequence but also in terms of the structure of the corresponding subgroups.
In
Section 4, we pass to more examples of proteins comprising
H,
E, and
C patterns. In
Section 4.1, we look at the secondary pattern of myelin P2 in Homo sapiens with 133 aa. In
Section 4.2, we look at the case of the gamma-carbonic anhydrase (247 aa long) within its 3-fold symmetric complex. Then, in
Section 4.3, we study the Hfq protein with 74 aa in each arm of the Hfq 6-fold symmetric complex. In both cases, a theory close to the observed patterns is based on the oriented hypercartographic group
, a straightforward generalization of the cartographic group
introduced by A. Grothendieck in his essay [
12]. In the latter case, the subgroup sequence of
perfectly fits the secondary pattern of Hfq protein predicted by one particular model. In
Section 4.4, we study the secondary patterns obtained for proteins belonging to 5-fold and 7-fold symmetric complexes. In particular, we provide the comparison of models for the H2A-H2B complex in nucleoplasmin and the acetylcholine receptor (with
) and the Lsm 1-7 complex (with
). In addition, one proposes a local mapping of the amino acids to a protein secondary structure with pseudo-helices, sheets and coils based on the characters of the group
.
In
Section 5, we investigate the nucleosome complex which is 8-fold symmetric. Following our previous work in [
4,
5], we find that the nucleosome complex allows us to define another group theoretical model of the genetic code based on the characters of the group
. In addition, one can map the DNA double helix scaffold of the nucleosome complex to the 16 singular points of a Kummer surface.
In
Section 6, we briefly comment on the absolute Galois group over the rationals
as an object worth using in the context of protein sequences.
2. Algebraic Geometrical Models of Secondary Structures
Let be the free group on l generators.
It is known that every group is a quotient of some free group. One constructs a finitely presented group
as the quotient of a free group
G by the normal subgroup defined by a set of relations rels between the generators
One also needs to define subgroups of finite index in a group. A subgroup of the finitely presented group is generated by the words specified by a generator list that may contain words or subgroups. In the following, we are interested in the cardinality sequence that counts the number of subgroups of finite index d up to some maximal index. This sequence allows us to identify a group (potentially as the fundamental group of a 3-manifold).
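For small indices, such a cardinality sequence can be obtained by brute force. The sketch below rests on two assumptions that should be checked against the tables: subgroups are counted up to conjugation, and a conjugacy class of subgroups of index d is identified with a transitive action of the group on d points up to relabeling. The function names and the encoding of a relator as a list of (generator, exponent) pairs are choices made for the illustration.

```python
from itertools import permutations, product

def compose(p, q):
    """Apply p first, then q (permutations stored as tuples of images)."""
    return tuple(q[i] for i in p)

def inverse(p):
    inv = [0] * len(p)
    for point, image in enumerate(p):
        inv[image] = point
    return tuple(inv)

def evaluate(word, images, degree):
    """Evaluate a word [(generator, exponent), ...] on the chosen permutations."""
    result = tuple(range(degree))
    for generator, exponent in word:
        step = images[generator] if exponent > 0 else inverse(images[generator])
        for _ in range(abs(exponent)):
            result = compose(result, step)
    return result

def is_transitive(perms, degree):
    """Check that the orbit of point 0 under the permutations is everything."""
    seen, stack = {0}, [0]
    while stack:
        point = stack.pop()
        for p in perms:
            if p[point] not in seen:
                seen.add(p[point])
                stack.append(p[point])
    return len(seen) == degree

def count_subgroup_classes(generators, relators, degree):
    """Count transitive actions on `degree` points up to simultaneous conjugation."""
    symmetric_group = list(permutations(range(degree)))
    identity = tuple(range(degree))
    classes = set()
    for choice in product(symmetric_group, repeat=len(generators)):
        if not is_transitive(choice, degree):
            continue
        images = dict(zip(generators, choice))
        if any(evaluate(r, images, degree) != identity for r in relators):
            continue
        # Smallest conjugate serves as the canonical representative of the class.
        classes.add(min(
            tuple(compose(compose(inverse(g), p), g) for p in choice)
            for g in symmetric_group
        ))
    return len(classes)

def cardinality_sequence(generators, relators, max_index):
    return [count_subgroup_classes(generators, relators, d)
            for d in range(1, max_index + 1)]

# Free group on the two letters H and C (no relator).
print(cardinality_sequence(["H", "C"], [], 4))
# Toy relator HCH^-1C^-1, which turns the free group into Z x Z.
commutator = [("H", 1), ("C", 1), ("H", -1), ("C", -1)]
print(cardinality_sequence(["H", "C"], [commutator], 4))
```

The cost grows like a power of d!, so this approach is only practical for the first few indices; computer algebra systems such as Magma or GAP provide dedicated low-index subgroup algorithms for the higher indices.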
Then, to a pair
corresponds the permutation group
P that organizes the cosets. With the Todd-Coxeter procedure, one can obtain a permutation representation
P of the pair from the action of
on the coset space. In many cases, the finite group
P has a geometrical meaning in the sense that it corresponds to a finite geometry [
15].
Finally, the group theoretical approach may be related to the theory of 3-manifolds. According to the Poincaré conjecture (now a theorem) every simply connected closed 3-manifold is homeomorphic to the 3-sphere $S^3$, alias the house of qubits [
16]. However, one can dress
as a 3-manifold
that loses the homeomorphism to
following the work of W. Thurston [
17]. For instance, the three-dimensional space surrounding the tubular neighborhood of a knot—the knot complement
—is a 3-manifold. Among the invariants characterizing a 3-manifold, there is the fundamental group
which accounts for the first homotopy of
. Finding a 3-manifold
whose
is the current
is a way to identify the nature of the object under study.
Below we introduce two algebraic geometric objects playing a role in our description of protein secondary structures. The first object is the hyperbolic 3-manifold of the smallest volume [
11,
18]. The second one is the group of oriented hypermaps, a generalization of Grothendieck’s cartographic group [
12,
14].
2.1. The Gieseking Manifold
This 3-manifold was described by Gieseking in his 1912 thesis. One takes an ideal regular tetrahedron in the 3-dimensional hyperbolic space, that is, a tetrahedron with all four vertices on the sphere at infinity and all dihedral angles equal to $\pi/3$. Then, one identifies adjacent faces so that the orientations on the edges match ([
11], Figure 1). The resulting hyperbolic manifold has minimal volume among non-compact hyperbolic manifolds. This volume is Gieseking’s constant $V \approx 1.0149$. Remarkably, this constant also equals
, which is the Dedekind zeta function at 2 for the field
[
18,
19].
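As a numerical check of the volume, Gieseking's constant can be evaluated in two ways. The sketch below assumes the standard identification of the constant with the Clausen function value $\mathrm{Cl}_2(\pi/3) = \sum_{n \ge 1} \sin(n\pi/3)/n^2$ and uses the identity $\mathrm{Cl}_2(\pi/3) = \frac{3\sqrt{3}}{4} L(2, \chi_{-3})$, where $\chi_{-3}$ is the non-trivial Dirichlet character modulo 3 and $L(2,\chi_{-3}) = \zeta_K(2)/\zeta(2)$ for $K = \mathbb{Q}(\sqrt{-3})$. The function names and the truncation order are choices made for the illustration.

```python
import math

def clausen_pi_over_3(terms=60000):
    """Partial sum of Cl_2(pi/3) = sum_{n>=1} sin(n*pi/3) / n^2."""
    return math.fsum(math.sin(n * math.pi / 3) / n**2
                     for n in range(1, terms + 1))

def dirichlet_l2_mod3(terms=60000):
    """Partial sum of L(2, chi) for the non-trivial character chi modulo 3."""
    chi = (0, 1, -1)  # chi(n) indexed by n mod 3
    return math.fsum(chi[n % 3] / n**2 for n in range(1, terms + 1))

volume_from_clausen = clausen_pi_over_3()
volume_from_l_series = 3 * math.sqrt(3) / 4 * dirichlet_l2_mod3()

print(f"{volume_from_clausen:.8f}")
print(f"{volume_from_l_series:.8f}")
```

Both partial sums have bounded oscillating numerators, so their truncation error decays like the inverse square of the number of terms, which is why a modest truncation already stabilizes the first eight decimals.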
The fundamental group for the Gieseking manifold is denoted
in the SnapPy software [
20]. The fundamental group is
The cardinality sequence
of subgroups of index
of
is given in Table 2. The permutation groups organizing the cosets of subgroups of
up to index 10 are in
Table 1. The identification of sub-manifolds follows from SnapPy.
In the next section, we find that a model of the secondary structure in histone H3 (PDB 6PWE_1) (obtained with the software PORTER) is the group
It is shown in
Table 1 and
Table 2 that this model perfectly fits the Gieseking fundamental group at the first 7 places of the cardinality sequence and approximately at the subsequent 3 places. Up to index 7, the permutation groups
P are the same. At index 8, all
P’s related to subgroups of
are also those related to subgroups of
G, but
and
which are related to subgroups of
G are not in subgroups related to
. There are also a few differences between subgroups of
and
G at indices 9 and 10.
2.2. The Hypercartographic Group
The cartographic group is defined as
The terminology comes from Grothendieck’s Esquisse d’un programme [
12,
13]. It was motivated by the fact that conjugacy classes of transitive subgroups of the oriented subgroup
of index 2 of the unoriented group
can be identified with topological maps on connected, oriented surfaces without boundary, while more generally, conjugacy classes of
can be identified with maps on connected surfaces which may or may not be orientable or have a boundary. The group
was investigated by the first author in relation to quantum contextuality in quantum information [
15].
Here, we are concerned with a slight generalization of the cartographic group
. To interpret our results we need the oriented hypercartographic group
whose definition is
This group is intimately related to Belyi’s theorem. The latter states that a complex algebraic curve is defined over the field
of algebraic numbers if and only if it may be uniformized by a subgroup of finite index in a triangle group. See [
14] and the conclusion of the present paper for additional details.
In the section below, the group defined from the PORTER model of the secondary structure in protein Hfq (PDB 1HK9) is as follows
It is shown in
Table 3 that this group perfectly fits the hypercartographic group
in terms of the cardinality of subgroups up to index 7, the highest index that could be calculated. In addition, the corresponding permutation groups organizing the cosets of subgroups in both the cases of
and
G fit as well.
2.3. Fundamental Groups of 3-Manifolds
Hyperbolic 3-manifolds that can be decomposed into regular ideal tetrahedra (up to 25 for the orientable case and up to 21 for the non-orientable case) have been investigated in [
21]. Details can be found in SnapPy [
20]. In
Table 2 and
Table 3, we collected a few 3-manifolds whose number of subgroups
of index
d of their fundamental group
is close to that of the group arising from the secondary structure of the protein in question. For example, the figure-of-eight knot
, which is the subgroup of index 2 in
, corresponds to the manifold ooct_00001 in SnapPy (see
Table 1 and
Table 2) and
is the 0-surgery on
[
22].
3. Secondary Structure with Helices: Drosophila Melanogaster Histone H3 (PDB 6PWE_1)
Now we show how the theory of the former section may be applied to concrete secondary structures of proteins. One starts with a simple example with two generators (α helices H and coils C). In the next section, we will study a simple example with three generators (α helices H, sheets E, and coils C). Both examples are generic and lend credit to our models based on the unoriented hyperbolic manifold and the oriented hypercartographic group.
A review of the state of the art in the modeling of secondary structure is given in [
8]. It is acknowledged that there is a limit imposed on the secondary structure prediction due to the somewhat arbitrary definition of three states
H,
E, and
C. It is true that there exist other fine structures in the secondary protein pattern such as a
helix, a
helix and other structures belonging to DSSP (the Dictionary of Protein Secondary Structure). As a result, the assignment inconsistency would limit the highest accuracy based on three states to about 90%. In practice, the best programs achieve an accuracy of about 80%.
We used the software packages PSIPRED
[
23], PORTER 4.0 [
24], PHYRE2 [
25], and RAPTORX [
26]. We do not enter into the details of the theory behind these programs. Below, we find that PORTER
is often well adapted to our goal of identifying an algebraic secondary structure. PORTER
uses two cascaded bidirectional recurrent neural networks: one for prediction and one for filtering. The method has been trained and benchmarked by cross-validation on a set of many non-redundant proteins.
3.1. The Primary (Linear) Structure
The mRNA sequence for histone H3 of Drosophila melanogaster may be found in [
27] with the reference NM_
. It contains 529 base pairs (529 bp). A convenient way to pass from the NCBI format (with line feeds, numbers and blank spaces) to the bare linear sequence is to make use of a program such as Massager [
28]. Then, a translation tool such as Expasy [
29] allows one to extract the candidate proteins in each reading frame.
The Frame 1 for sequence NM_ is as follows:
IVFSNVK–T-TLVKPKSE
MARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRP
GTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVM
ALQEASEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA
-ADTALTCR-SASVLYNRSFS
The partial sequence beginning at the start codon M and ending at the stop codon ‘-’ (the three middle lines above) is the histone protein H3 with the NCBI reference NP_
. It can also be found in the Protein Data Bank (PDB) [
7] with reference 6PWE_1. The sequence consists of 136 amino acids (136 aa).
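The passage from a nucleotide sequence to a candidate protein in a given reading frame can be sketched as follows. The sketch assumes the standard genetic code and writes the stop codon as ‘-’, as in the frame displayed above; the function names and the short test sequence are hypothetical and are not taken from the NCBI record.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY--CC-W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(sequence, frame=1):
    """Translate a DNA or mRNA string in forward reading frame 1, 2 or 3."""
    dna = sequence.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    # Codons with unexpected characters are rendered as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def first_open_reading_frame(protein):
    """Return the stretch from the first M up to (not including) the next stop."""
    start = protein.find("M")
    if start < 0:
        return ""
    stop = protein.find("-", start)
    return protein[start:] if stop < 0 else protein[start:stop]

# Hypothetical fragment chosen to encode the first ten residues of histone H3.
fragment = "ATGGCTCGTACTAAGCAAACTGCTCGTAAATAA"
print(translate(fragment))
print(first_open_reading_frame(translate(fragment)))
```

Applied to a full frame such as the one above, `first_open_reading_frame` would return the stretch running from the start codon to the next stop, which is how the 136 aa sequence is delimited.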
3.2. The Secondary Structure
According to most models, the secondary structure of histone protein H3 only consists of subsections with an α helix H or a coil C.
The predicted secondary structures obtained from the four software packages for the histone H3 protein are as follows:
CCCCCCCCCCCCCCCCCHHHHCHHHHCCCCCCCCCCCCCCCCCCCCHHHHHHHCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHCCC
CCCCCCCCCCCCCCCCCCCCHHHHHCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHCC
HHHHHCCCCHHHHHHHHHHHCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHCHHHH
CCHHHCCCHHHHHHHHHHHHCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHC
HHHHHHHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHC
HHHHHHHHHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHC
CCCCCCHHHHHHHHHHCCCCC
CCCCCCHHHHHHHHHHCCCCC
CCCCCCHHHHHHHHHHCCCCC
CCCCCCHHHHHHHHHHHCCCC
The predictions are displayed in three consecutive blocks of four lines. In each block, the first line is from PSIPRED, the second one is from PORTER, the third one is from PHYRE2, and the last one is from RAPTORX. One can visually check how close the predictions are.
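The comparison of predictions can be made quantitative with a few lines of code. The sketch below compresses a per-residue string into a word in the letters H and C and measures the fraction of positions on which two predictions agree. Reading the compressed word as the relator rel(H,C) of the model is an assumption made for the illustration; the helper names are hypothetical, and only the last block of the PORTER and RAPTORX predictions is used.

```python
from itertools import groupby

def run_lengths(structure):
    """Compress 'CCHHH' into [('C', 2), ('H', 3)]."""
    return [(letter, len(list(run))) for letter, run in groupby(structure)]

def as_word(structure):
    """Write the per-residue string as a word with exponents, e.g. 'C^2 H^3'."""
    return " ".join(f"{letter}^{count}" if count > 1 else letter
                    for letter, count in run_lengths(structure))

def agreement(first, second):
    """Fraction of positions on which two predictions of equal length agree."""
    if len(first) != len(second):
        raise ValueError("predictions must have the same length")
    return sum(a == b for a, b in zip(first, second)) / len(first)

# Last block of the predictions displayed above.
porter_tail = "CCCCCCHHHHHHHHHHCCCCC"
raptorx_tail = "CCCCCCHHHHHHHHHHHCCCC"

print(as_word(porter_tail))
print(as_word(raptorx_tail))
print(agreement(porter_tail, raptorx_tail))
```

Concatenating the three blocks of one predictor before calling `as_word` gives the word for the whole protein under the same assumption.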
Figure 1 is a sketch of the secondary structure of histone H3. In
Table 2, it is found that the best model happens to come from the fundamental group
of the Gieseking manifold
described in
Section 2.1.
5. The 8-Fold Symmetric Histone Complex of the Nucleosome: 3WKJ in the Protein Data Bank
Strong DNA packaging is found in the nucleosome of eukaryotes. The nucleosome complex consists of a double helix wrapped around a set of eight histone proteins comprising two copies of H2A, H2B, H3, and H4. The nucleosome is the fundamental sub-unit of chromatin. Eukaryotic chromatin is further compacted by being folded into more complex structures eventually forming a chromosome. Nucleosomes are considered to be the support of epigenetic information. The nucleosome core particle contains approximately 146 base pairs (bp) of DNA wrapped in
left-handed superhelical turns around the histone octamer as shown in
Figure 5a.
We already met histone H3 of a different species (Drosophila melanogaster) in Section 3 as the preliminary example of a protein only containing α helices and random coils. In the histone complex 3WKJ of the nucleosome, the secondary structure of histone H3 is also found to be made of segments with α helices and coils but with a different organization according to our group theoretical approach. This is also true for the other histones H4, H2A, and H2B of the histone octamer.
In this section, we do not enter into the secondary structure of histones. We rather focus on the 8-fold symmetry of the core particle in the histone complex. What interests us about the double helix is the fact that its projection is a set of 16 double points as shown by the arrows in
Figure 5a. The reader may be familiar with our previous paper [
5] in which 16 double points occur in a beautiful algebraic object called a Kummer surface. Such a Kummer surface was constructed from the character table of the group
in the context of the spliceosome complex that we investigated in
Section 4.4. Below, we pursue the same line of ideas and build another model of the genetic code based on the group
and a corresponding Kummer surface.
The character table for the group
is in
Table 5. As before for the group
,
Table 5 contains a good assignment to the 20 amino acids and some details about the character fields through the entries
. For dimensions 2 and 4, the assignments correspond to characters that are informationally complete. However, it is not the case for the assignments of amino acids in dimensions 1, 3, and 6.
All 8 characters having
and
in their entries are informationally complete and are at the origin of the Kummer surface. We now show an important characteristic of such characters. As an example, let us write the character number 16 as obtained from Magma [
33]
where # denotes the algebraic conjugation, that is
indicates replacing the root of unity
w by
.
One defines a genus 2 hyper-elliptic curve
defined over the group
from the equation
with
,
and
. Explicitly,
leading to the polynomial definition of the Kummer surface
as
The de-singularization of the Kummer surface is obtained in a simple way by restricting the product to the first five factors.
As usual for elliptic and hyper-elliptic curves of genus
g,
is embedded in a weighted projective plane, with weights 1,
$g+1$, and 1, respectively, on coordinates
x,
y, and
z. Therefore, point triples are such that
,
in the field of definition, and the points at infinity take the form
. Below, the software Magma is used for the calculation of points of
[
33]. For the points of
, there is a parameter called ‘bound’ that loosely follows the heights of the
x-coordinates found by the search algorithm.
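The role of the weights can be made concrete on a toy example. The sketch below uses a hypothetical genus 2 curve, $y^2 = x^5 z + z^6$ (the homogenized form of $y^2 = x^5 + 1$), which is not the curve of this section, and assumes the weights 1, g+1, 1 on the coordinates x, y, z. It checks that a point stays on the curve under the weighted rescaling $(x, y, z) \mapsto (\lambda x, \lambda^{g+1} y, \lambda z)$ and shows how a point triple is normalized.

```python
from fractions import Fraction

GENUS = 2  # weights on (x, y, z) are then (1, GENUS + 1, 1)

def on_curve(x, y, z):
    """Hypothetical genus 2 curve y^2 = x^5 z + z^6 in weighted homogeneous form."""
    return y**2 == x**5 * z + z**6

def scale(point, lam):
    """Weighted rescaling: both sides of the equation pick up lam^(2g+2)."""
    x, y, z = point
    return (lam * x, lam ** (GENUS + 1) * y, lam * z)

def normalise(point):
    """Rescale so that z = 1, or x = 1 for a point at infinity (z = 0)."""
    x, y, z = (Fraction(c) for c in point)
    pivot = z if z != 0 else x
    if pivot == 0:
        raise ValueError("expected x or z to be nonzero")
    return (x / pivot, y / pivot ** (GENUS + 1), z / pivot)

affine_point = (0, 1, 1)
print(on_curve(*affine_point), on_curve(*scale(affine_point, 2)))
print(normalise(scale(affine_point, 2)) == affine_point)
print(on_curve(1, 0, 0))  # the point at infinity of this toy curve
```

Exact rational arithmetic is used so that the equality tests are not affected by rounding.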
It is found that the corresponding Jacobian of has points as follows:
* the 6 points bounded by the modulus 1:
, , , and .
* the 10 points of modulus :
, , , , , , , , and .
The 16 points organize as a commutative group isomorphic to the maximally abelian group
as shown in the following Jacobian addition
Table 6.
Where the blocks are given explicitly as
To conclude this section, we can define a model of the secondary structure of the nucleosome complex based on the character table of
as we did for the spliceosome complex with the character table of
. The amino acids that are mapped to characters containing
should belong to a pseudo-helix
of the secondary structure. The other amino acids either correspond to a constant entry in the character table and belong to a pseudo-coil
or to a non-constant entry (which is either
,
, or
) and belong to a pseudo-sheet
. In
Table 3, the cardinality structure of subgroups of finite index of
obtained with this model is compared to that of the other models PSIPRED, PHYRE2, PORTER, and RAPTORX. One, again, observes that the cardinality sequence either fits, at the first few places, the hypercartographic group
or that of a 3-manifold.
6. Discussion
The (primary) genetic code maps the 64 three-letter words (codons) over the 4 bases of DNA to the 20 proteinogenic amino acids, a feature that we could model by using concepts of quantum information theory associated with finite group representations. The (mostly informationally complete) characters of finite groups
of signature
(
the binary octahedral group) are able to account for the degeneracies and many properties of the code (see [
4] when
, see [
5] when
and
Section 5 of this paper when
).
The secondary ‘genetic code’ lacks the universality of the primary code. In the standard models of the secondary structure of proteins, the mapping from the 20 amino acids to segments of α helices H, β sheet strands E, and coils C is not pointwise. The present generation of software relies on evolutionary information derived from the alignment of multiple homologous sequences, and the highest reported accuracy is obtained with neural networks for the optimal comparison of the sequences [
8].
We could identify algebraic structures in the secondary code of proteins by employing the theory of infinite groups with generators
H,
E, and
C and the protein relation induced by the chosen model. Some hyperbolic 3-manifolds have been found as possible models of such a secondary structure. There exists a correspondence between the 3-sphere and the Bloch sphere of qubits so that a 3-manifold may be seen as a ‘dressing’ of qubits ([
16], Section 1.1). In this view, quantum information controls the secondary structure. Notice that topological dynamics and negative-curvature manifolds have been proposed for modeling the brain in Reference [
34].
It was unexpected that the oriented hypercartographic group seems to play a major role in the secondary structure. Why are we interested in this feature?
We are interested in geometric physical codes or languages in action [
35] and their connection to the concept of emergence. Group representations arise here as a formal way to describe those geometrical codes. Back to the secondary structure of proteins, we already mentioned in the introduction that oriented hypermaps on surfaces are organized as the oriented hypercartographic group
. Another important aspect is that
is related to the so-called absolute Galois group
, the group of field-automorphisms of the field extension
of the rational field
. In the
Esquisse d’un programme [
12,
13,
36], Grothendieck emphasizes the interest of looking at the action of
on topological, geometric and even combinatorial structures. The highest level is the so-called ‘Teichmüller tower’. The simplest level concerns bipartite (hyper)maps called ‘dessins d’enfants’. To any dessin
corresponds a (so-called) Belyi function
, where
is a rational function of the complex variable
x whose structure reflects the critical points and the topology of
. The remarkable result is that
acts faithfully on
, that is, each non-identity element of
sends two non-isomorphic dessins to two inequivalent Belyi functions
, so that none of the structure of
is lost by proceeding in this way. In passing, it is good to mention that the theory of ‘dessins d’enfants’ can be used to account for geometric contextuality, the counterpart of quantum contextuality [
15,
37].
Let us go back to the secondary structure of protein Hfq in
Section 4.3 that builds one of the 7 arms of the Lsm 1-7 complex in
Figure 3b. According to our theory, there is a group structure of the protein that intimately reflects that of
. Every subgroup of index
d of
can be seen as a permutation group on
d elements; it can be drawn as a dessin
and there is a faithful action of
on all dessins and permutation groups. In other words, the protein Hfq contains in its structure the topology and algebra of
. The biological meaning of this algebraic geometric structure needs further work. We leave it open at this stage. It may be that the constraint of approximating the secondary structure with three-letter segments
H,
E, and
C implies that every protein has to obey the
rules. We believe that this rule may be seen as a support of the connection of biology to quantum gravity. In [
38], it is shown how a theory of quantum gravity may connect to
. We already proposed a connection of our approach of the genetic code (see [
5] and
Section 5 of this paper) to the Kummer surfaces that are
K3 surfaces and play a role in some models of quantum gravity [
39].