Evolutionary Invariant of the Structure of DNA Double Helix in RNAP II Core Promoters

Eukaryotic and archaeal RNA polymerase II (POL II) machinery is highly conserved, regardless of the extreme changes in promoter sequences in different organisms. The goal of our work is to find the cause of this conservatism. Representative sets of aligned promoter sequences of fifteen organisms belonging to different evolutionary stages were studied. Their textual profiles, as well as profiles of the indexes that characterize the secondary structure and the mechanical and physicochemical properties, were analyzed. An evolutionarily stable, extremely heterogeneous special secondary structure of POL II core promoters was revealed, which includes two singular regions: the hexanucleotide “INR” around the TSS and the octanucleotide “TATA element” about 28 bp upstream. Such a structure may have developed at some stage of evolution. It turned out to be so well matched to pre-initiation complex formation and the subsequent initiation of transcription by the POL II machinery that, in the course of evolution, only those nucleotide sequences that were able to reproduce these structural properties were selected. The individual features of the specific sequences representing the singular regions of the promoter of each gene can affect the kinetics of DNA-protein complex formation and facilitate strand separation in double-stranded DNA at the TSS position.


Introduction
The heterogeneity of the three-dimensional structure of the double-stranded DNA plays an important role in the regulation of genetic processes. This heterogeneity is modulated by the nucleotide sequence. Proteins may recognize the shape of DNA ("indirect readout") or the unique chemical signatures of the DNA bases ("direct readout") [1]. As a rule, DNA-binding proteins combine both readout mechanisms to achieve DNA-binding specificity [2].
RNA polymerase II (Pol II) in eukaryotes is responsible for the transcription of messenger RNA and some non-protein-coding small nuclear RNAs. Pol II core promoters are fragments of genomic DNA, about 100 bp long, surrounding the transcription start site (TSS). Transcription initiation occurs when TATA-binding protein (TBP) binds to the eight base-pair TATA elements of Pol II core promoter, coordinating accretion of class II initiation factors and Pol II into a functional preinitiation complex (PIC). This process is a slow stage of transcription; it leads to the formation of a long-lived protein-DNA complex [3].
The mechanical, thermodynamic, and structural properties of Pol II promoter regions have long attracted the attention of researchers [4][5][6][7][8][9]. Regardless of the length of the analyzed promoter fragment and analysis methods, all studies come to the same conclusion. In the vicinity of the TSS, all structural properties of DNA noticeably deviate from the average level, and core promoter regions are exceptionally heterogeneous.
Int. J. Mol. Sci. 2022, 23, 10873
The nucleotide sequences of the core promoters are usually represented by the DNA coding strand (namely, the strand with the 5′→3′ vector directed to the TSS from the upstream region; hereinafter, we will call it the upper strand). The TSS position is taken as coordinates −1, +1 (there is no nucleotide with zero coordinates). In all organisms, positions −2 to +4 are occupied by the initiator element (INR). In this region, the complementary strands of the double helix diverge, and Pol II recognizes the template strand. The TATA element in the promoters of most organisms is located at a distance of about −28 bp from the TSS.
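The coordinate convention above (no position 0, with +1 directly following −1) is easy to get wrong when slicing sequences. The following is a minimal sketch of one way to map such coordinates to string indices. The function names and the default of 50 upstream nucleotides are illustrative assumptions, not part of the original analysis.

```python
def position_to_index(pos, upstream=50):
    """Map a promoter coordinate (no zero) to a 0-based string index.

    Assumes the sequence starts at coordinate -upstream and that +1
    directly follows -1, as in the convention described in the text.
    """
    if pos == 0:
        raise ValueError("there is no position 0 in this convention")
    return pos + upstream if pos < 0 else pos + upstream - 1


def index_to_position(idx, upstream=50):
    """Inverse mapping: 0-based string index back to a promoter coordinate."""
    pos = idx - upstream
    return pos if pos < 0 else pos + 1


def inr_window(seq, upstream=50):
    """Return the hexanucleotide occupying coordinates -2..+4."""
    start = position_to_index(-2, upstream)
    end = position_to_index(4, upstream)
    return seq[start:end + 1]
```

With 50 upstream nucleotides, coordinate −1 maps to index 49 and +1 to index 50, so the window −2 to +4 spans exactly six nucleotides.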
The common regularities of the core promoter architecture in each species may be revealed after the superposition of signals from a large number of the species' promoter sequences properly aligned at the TSS. A well-annotated database of promoter sequences is an essential basis for identifying general patterns in the promoter structure. To analyze structural features of DNA that determine RNA polymerase II core promoter [10], we previously used the EPD New database [11]. The profiles of the averaged textual, structural, mechanical, and physicochemical characteristics in each position of the sets of 60 bp core promoter sequences (positions from −50 to +10) in the eight organisms available at that time from the EPD New database [11] (H. sapiens, M. musculus, D. melanogaster, D. rerio, C. elegans, A. thaliana, S. cerevisiae, S. pombe) were constructed. The analysis of these profiles allowed us to reveal the common scheme of the animal and plant core promoter architecture. The promoters of the unicellular fungus S. pombe were found to correspond to the same structural scheme, but the structure of the core promoter of another unicellular fungus, S. cerevisiae, turned out to be different [10].
To date, the number of organisms available for analysis in the EPD New database [12] has increased markedly. In addition to representatives of the Metazoa (vertebrates and invertebrates), plants, and unicellular fungi (S. cerevisiae, S. pombe), a representative of the Protozoa appeared, namely the parasite P. falciparum, whose genome is 80% AT-pairs. Moreover, the total number of promoters in the samples of those organisms that were previously represented in this database also increased noticeably. Therefore, it became possible to check the generality of the conclusions obtained by us earlier and to analyze the degree of influence of the percentage of AT pairs in the genomes of different organisms on the structural features of their promoters.

Results
The sets of promoters of fifteen evolutionarily different organisms were retrieved from the EPD New section of the Eukaryotic Promoter Database (EPD) (http://epd.vital-it.ch (accessed on 24 July 2022)) [12]. This resource provides access to a collection of databases of experimentally validated promoters of several model organisms, for which TSS mapping was the result of high-throughput experiments such as CAGE and Oligo-capping, resulting in high precision and high coverage. We used the promoter sets of ten animals (vertebrates and invertebrates, including insects), namely H. sapiens, M. mulatta, M. musculus, R. norvegicus, C. familiaris, G. gallus, D. rerio, C. elegans, D. melanogaster, and A. mellifera; two plants, namely A. thaliana and Z. mays; two unicellular fungi, namely S. cerevisiae and S. pombe; and one protozoan, namely P. falciparum. The profiles of the averaged textual, structural, mechanical, and physicochemical properties of 80 bp core promoter sequences (positions from −50 to +30) were constructed.

Comparative Statistical Characteristics of the Nucleotide Sequences in the Core Promoters of Metazoans, Plants, Unicellular Fungi, and Protozoan
First, we compared the percentages of the A, T, G, and C nucleotides in the core promoter sequences of different organisms. For simplicity, according to IUPAC nomenclature, we will use the terms W (for nucleotides A and T) and S (for nucleotides G and C). The frequencies of occurrence of mononucleotides at each position along the coding strand are shown in Figure 1A-D.
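The positional frequency profiles described above can be sketched as follows. This is a minimal illustration that assumes the promoters are already aligned at the TSS and of equal length; the function names are hypothetical and the handling of ambiguous symbols (ignored here) is an assumption.

```python
from collections import Counter


def mononucleotide_profile(seqs):
    """Per-position percentages of A, C, G, T in aligned, equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned and of equal length")
    profile = []
    for column in zip(*seqs):
        counts = Counter(base.upper() for base in column)
        # Symbols other than A, C, G, T (e.g. N) are ignored in the totals.
        total = sum(counts[b] for b in "ACGT")
        profile.append(
            {b: (100.0 * counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return profile


def ws_profile(profile):
    """Collapse a mononucleotide profile into W (A+T) and S (G+C) percentages."""
    return [{"W": p["A"] + p["T"], "S": p["G"] + p["C"]} for p in profile]
```

Each element of the returned list corresponds to one alignment column, so element 0 describes position −50 for an 80 bp window starting at −50.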
The frequencies of occurrence of dinucleotides in the core promoter sequences of all fifteen species are shown in Figure S1. The frequencies of occurrence of tetranucleotides TATA and AAAA in the core promoter sequences of all fifteen species are shown in Figure S2. The logo-representation of the promoter sequences with an information content of 1.0 bits is shown in Figure 2, while that with an information content of 0.4 bits is shown in Figure S3. We present two options for scaling the logo image to best reveal the features of different fragments of core promoters because the frequencies of occurrence of nucleotides differ sharply in different regions. Logos were made at http://weblogo.threeplusone.com (accessed on 24 July 2022).
Figure 1. (A) Profiles of core promoter sequences as the frequencies of occurrence of mononucleotides (in percentages) at each position along the strand complementary to the template, for the data sets of H. sapiens, M. mulatta, M. musculus, R. norvegicus, C. familiaris, and G. gallus. (B) The same for the data sets of D. melanogaster, A. mellifera, D. rerio, and C. elegans. (C) The same for the data sets of A. thaliana and Z. mays. (D) The same for the data sets of S. cerevisiae, S. pombe, and P. falciparum.
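The bit scale of the sequence logos mentioned above can be illustrated with the standard Shannon-based information content of an alignment column. The sketch below assumes the usual definition for DNA (2 bits minus the column entropy) and applies no small-sample correction, so its values may differ slightly from those drawn by a logo tool; the function name is hypothetical.

```python
import math


def information_content(column_freqs):
    """Information content (bits) of one DNA alignment column.

    column_freqs maps A, C, G, T to frequencies that sum to 1.
    Computed as log2(4) minus the Shannon entropy of the column;
    no small-sample correction is applied (an assumption of this sketch).
    """
    entropy = -sum(f * math.log2(f) for f in column_freqs.values() if f > 0)
    return math.log2(4) - entropy
```

A column with four equally frequent nucleotides carries 0 bits and a fully conserved column carries 2 bits, which is why weakly conserved regions are easier to inspect on a 0.4-bit scale than on a 1.0-bit scale.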
For all of the considered mammalian promoters (H. sapiens, M. mulatta, M. musculus, R. norvegicus, C. familiaris), as well as for promoters of G. gallus, the percentage of S exceeds that of W in all positions, with the exception of the TATA element, where the percentage of W is almost equal to that of S (Figure 1A). On the other hand, the promoters of A. mellifera, as well as the promoters of A. thaliana, the unicellular fungi S. cerevisiae and S. pombe, and the protozoan P. falciparum have the highest percentage of W nucleotides at all positions (Figure 1B-D). The promoters of another insect, D. melanogaster, as well as the promoters of C. elegans and D. rerio, are composed of roughly equal amounts of W and S nucleotides, while the TATA element is also enriched in W nucleotides. The promoters of another plant, Z. mays, have a noticeable asymmetry in the distribution of G and C nucleotides between the coding and non-coding strands. In the coding strand, the content of cytidines is ~15% higher than the content of guanines. This determines both the highest frequency of occurrence of the CC dinucleotide before and after the TSS (Figure S1) and the extremely low frequencies of occurrence of the TATA and AAAA tetranucleotides (Figure S2). Another distinguishing feature of Z. mays promoters is the presence of a well-defined motif in the vicinity of the +25 position in Figures 1C, 2 and S3. This was also noted earlier [13], where cap analysis of gene expression (CAGE) was used to identify genome-wide TSSs in root and stem tissues of two maize (Z. mays) inbred lines (B73 and Mo17). The authors hypothesized that the region around +25 harbors an element other than the GC-rich motif that correlates with the presence of the TATA consensus. The profiles of all of the species except for S. cerevisiae have two regions where the frequencies of occurrence of dinucleotides deviate from the mean values (Figure S1). These two regions are located at the TATA-box position and at the region around the TSS.
Logo representation (Figure 2) provides detailed information about the characteristic features of the TATA elements and the INR elements in the promoters of each organism. In the position of the TATA element of all mammals, as well as of G. gallus, all four nucleotides (G, C, A, and T) occur with approximately equal frequency. In the other considered organisms (with the exception of S. cerevisiae), the frequencies of nucleotides A and T in the TATA element are higher than those of G and C. However, the degree of the excess differs quite noticeably between organisms in this group. In both insects (D. melanogaster and A. mellifera), it is minimal, and it is most pronounced in D. rerio, A. thaliana, and S. pombe. The logo image of P. falciparum differs sharply from those of all other organisms since the frequencies of occurrence of the A and T nucleotides are significantly higher.
The occurrences of various octanucleotides in the position of the TATA element of all organisms under consideration are shown in Table 1, while Table S1 also includes the absolute number of each of the octanucleotides in that position for every organism and presents the frequencies of occurrence of various octanucleotides in positions −10 to −3 in the promoters of S. cerevisiae.
We have chosen the TATA-box position in the promoters of each organism based on the positions of the minimum in the profiles of the physical parameter "Stacking energy" and of the maximum in the profiles of the physical parameter "Mobility to bend towards major groove", which we present in Figures 3-6 (panels a and f). A perceptible shift in the position of the TATA box for A. thaliana promoters coincides with the data obtained earlier [14].
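Counting the octanucleotides that occupy a fixed window, as in the tabulation described above, can be sketched as follows. The window start is passed in by the caller (for example, a coordinate chosen from the stacking energy minimum); the function name, the default of 50 upstream nucleotides, and the restriction to upstream coordinates are assumptions made for this illustration.

```python
from collections import Counter


def octamer_frequencies(seqs, start_pos, upstream=50, k=8):
    """Percent occurrence of each k-mer starting at an upstream coordinate.

    start_pos must be a negative promoter coordinate; the sequences are
    assumed to begin at coordinate -upstream and to be aligned at the TSS.
    """
    if start_pos >= 0:
        raise ValueError("this sketch handles upstream (negative) coordinates only")
    start = start_pos + upstream
    counts = Counter(s[start:start + k].upper() for s in seqs)
    total = sum(counts.values())
    # most_common() orders the k-mers from most to least frequent.
    return [(kmer, 100.0 * n / total) for kmer, n in counts.most_common()]
```

For example, calling the function with a start coordinate of −32 tabulates the octanucleotides occupying −32 to −25 across the whole promoter set.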
From Table 1, one can see that the frequencies of occurrence of the different octanucleotides representing the TATA box are rather close. The leading position in this list for all of the analyzed mammals, as well as for D. melanogaster and C. elegans, is occupied by the TATAAAAG sequence; however, other octanucleotides occur with very close frequencies. So, the term consensus only conditionally reflects the real situation. Analysis of the TBP-TATA box minor groove interface, based on the crystallographic structures of their complexes refined to better than 2 Å [15], has shown that van der Waals interactions between nonpolar atoms and between nonpolar and polar atoms are factors for complex formation. Moreover, kinetic probing showed that TBP has less than a 10³-fold preference for binding the TATAWAAR sequence compared to binding nonspecific yeast genomic DNA [16]. These results allow us to suggest that hydrogen bonding does not play any role in TBP-TATA box complex formation. Therefore, those octanucleotides that are selected on the basis of low energy costs for bending towards the major groove can be TATA elements.
In contrast, the INR element of all of the organisms is highly selective for the nucleotide sequence. The details can be seen in the logo representation (Figures 2 and S3) and Tables 2-4. From Table 2, one can see that all of the organisms show a preference for PyPu in positions −1 and +1.
However, it should be noted that the occurrence of PuPu and PyPy in mammals, G. gallus, and D. rerio, as well as in the plant Z. mays, is also fairly high, noticeably higher than in both insects (D. melanogaster and A. mellifera), in the plant A. thaliana, in the invertebrate C. elegans, and in the unicellular organisms (S. cerevisiae, S. pombe, and P. falciparum). We find it interesting that the promoters of pure lines of the plant Z. mays are somewhat different from the promoters of wild-type A. thaliana.
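The purine/pyrimidine classification of the dinucleotide at positions (−1, +1) discussed above can be sketched as follows. The function names and the default of 50 upstream nucleotides are illustrative assumptions; ambiguous symbols are labeled "N" rather than discarded.

```python
from collections import Counter

PURINES = set("AG")
PYRIMIDINES = set("CT")


def base_class(base):
    """Return 'Pu', 'Py', or 'N' for an unrecognized symbol."""
    if base in PURINES:
        return "Pu"
    if base in PYRIMIDINES:
        return "Py"
    return "N"


def step_class(dinuc):
    """Classify a dinucleotide as PyPu, PuPu, PuPy, or PyPy."""
    first, second = dinuc.upper()
    return base_class(first) + base_class(second)


def tss_step_classes(seqs, upstream=50):
    """Percentage of each class for the dinucleotide at coordinates (-1, +1)."""
    counts = Counter(step_class(s[upstream - 1:upstream + 1]) for s in seqs)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.items()}
```

Because there is no position 0, the dinucleotide at (−1, +1) occupies string indices 49 and 50 when the sequence starts at −50.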
All of the organisms, with the exception of S. cerevisiae, S. pombe, and P. falciparum, display CA in this position as preferable. In both unicellular fungi (S. cerevisiae and S. pombe), […]
What properties of PyPu dinucleotides, and especially of the CA dinucleotide, determine their preference in position (−1, +1)? This position is responsible for the double helix divergence, so the dinucleotide step that occupies it must have unique properties. It is known that the deformability of dinucleotide steps decreases in the order PyPu > PuPu > PuPy. This was shown with the help of a spin probe in a study of the effects of nucleotide sequence on DNA duplex dynamics [17]. The special mobility of PyPu steps is explained by the greater intensity of the S↔N dynamics in the furanose rings of 5′-terminal pyrimidines compared to 5′-terminal purines; it reaches its maximum after a 5′ Cyt [18]. The advantage of the CpA step over CpG in positions −1 and +1 can be explained by the presence of only two hydrogen bonds in the A·T pair at position +1, which must be broken at the initial stage of strand divergence. This explanation is supported by reactivity with the conformation-sensitive reagent chloroacetaldehyde, which reacts with unpaired adenines and cytosines: this reactivity was confined strictly to adenosine in the d(CA/TG) repeat [19]. In this regard, it is interesting to note that during the formation of nucleosomes, two conformationally flexible pyrimidine-purine steps can act as strong positioning signals. These are the CA/TG step, which is unique among the 10 possible dinucleotide steps in being located preferentially at both inward- and outward-facing minor grooves but not in between, and the TA step, which is located at inward-facing minor grooves [20].
The occurrence of tetranucleotides in positions −2 to +2, specific for each of the 15 species, is shown in Table 4. It can be assumed that the greater the percentage of less deformable dinucleotides (PuPu or PuPy) at the TSS position in the promoter samples of a particular organism, the more variable the strength of different promoters in this organism will be.

Physical and Structural Anisotropy of the Naked DNA in the Core Promoters
The heterogeneity of any DNA fragment is the result of the variation of the physical and structural characteristics of individual base-pair steps. Bending anisotropy, for example, is sequence-dependent and, to a first approximation, reflects both the geometry and stability of the individual base steps [20]. We have built profiles of the base step characteristics for the sets of the core promoters of all 15 organisms using the indexes of numerical parameterization for the ten double-stranded dinucleotide duplexes, which are collected in the database DiProDB, http://diprodb.fli-leibniz.de (accessed on 24 July 2022) [21]. Among the large number of different properties of the ten double-stranded duplexes held in the database, we chose the six parameters most suitable for evaluating the anisotropy of nucleotide sequences for DNA axis bending. They are the stacking energy, Roll and Slide, the stiffness of the structure to Roll alteration and to Slide alteration, as well as the stiffness of the structure to bend towards the major groove, which includes alteration of all of the base-pair step parameters. The database contains several versions of the parameters of the same name, and earlier [10], we verified that the profiles built from different versions of the parameters are in qualitative agreement with each other. Profiles of physical and structural parameters are presented in Figures 3a-f, 4a-f, 5a-f, 6a-f and 7a-f.
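Building an averaged profile of a dinucleotide property over a set of aligned promoters can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the property values must be supplied by the caller (for example, copied from a DiProDB entry), and the expansion from ten to sixteen steps applies only to strand-symmetric properties. A strand-specific property would need all sixteen values directly.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def expand_to_16(ten_step_values):
    """Expand values given for the ten unique steps to all 16 dinucleotides.

    Valid only for strand-symmetric properties, where a step and its
    reverse complement (e.g. CA and TG) share one value.
    """
    table = {}
    for a, b in product("ACGT", repeat=2):
        step = a + b
        rc = reverse_complement(step)
        if step in ten_step_values:
            table[step] = ten_step_values[step]
        elif rc in ten_step_values:
            table[step] = ten_step_values[rc]
        else:
            raise KeyError(f"no value for step {step} or its complement {rc}")
    return table


def mean_step_profile(seqs, table):
    """Average a dinucleotide property at each step over aligned sequences."""
    n_steps = len(seqs[0]) - 1
    sums = [0.0] * n_steps
    for s in seqs:
        s = s.upper()
        for i in range(n_steps):
            sums[i] += table[s[i:i + 2]]
    return [total / len(seqs) for total in sums]
```

An 80 bp window yields 79 step values per property, so each profile point lies between two neighboring nucleotide positions.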
We present the profiles of the variations in the stacking energy (Figures 3a, 4a, 5a, 6a and 7a) and in the base-pair step parameters Roll and Slide (Figures 3b-d, 4b-d, 5b-d, 6b-d and 7b-d) in the parametrization of Perez et al. [22], and the profiles of the variation in the stiffness of the DNA double helix to Roll and Slide changes (Figures 3c-e, 4c-e, 5c-e, 6c-e and 7c-e) in the parametrization of Goni et al. [23]. These five parameters describe DNA at the base-pair step resolution. To evaluate the stiffness of the structure to bend towards the major groove, we used the parametrization of Gartenberg and Crothers [24]. Their parameter "Mobility to bend towards major groove" was resolved for all 16 dinucleotides and related to each of the complementary strands; its profiles are shown in Figures 3f, 4f, 5f and 6f. Figure 7 presents the profiles of the characteristics of two non-promoter regions in H. sapiens genomic sequences, (−500 to −420) and (−300 to −220), and the profiles of 80 bp computer-simulated random nucleotide sequences. They are presented together with the profiles of the H. sapiens core promoters.
Stacking energy is a part of the enthalpy of DNA formation and defines its stabilizing forces. Its value in the core promoter sequences of all of the mammals and G. gallus is about −16.5 ± 0.2 kcal/mol (Figures 3a and 4a), while in invertebrates and unicellular fungi, the stacking energy is somewhat lower (Figure 4a). In plants, the value of the stacking energy is intermediate (Figure 5a). The lowest level of stacking energy is in the promoter sequences of P. falciparum (Figure 6a). It can be assumed that in this protozoan, this compensates for the low DNA stability caused by the absence of a third hydrogen bond in AT-rich sequences. A shallow global minimum in the stacking energy profiles in the region around −28 to −34 bp relative to the TSS (depending on the organism) is present in the profiles of all organisms, with the exception of C. elegans, A. mellifera, and S. cerevisiae. In P. falciparum, its depth is the smallest.
The good base stacking in the TATA box region is the property of the majority of the specially selected sequences of naked DNA. This is confirmed by the absence of local minima in the stacking energy profiles of the non-promoter regions, as well as in the profiles of the random sequences (Figure 7a). It is interesting that the average level of the stacking energy in the non-promoter regions of the human genome is practically the same as in the promoter regions, while in the set of random sequences, it is somewhat lower. We assume that this is due to the percentage of the AT pairs in the sequences: in the human genome, the percentage of AT pairs is less than the percentage of GC, while in the random sequences, the AT and GC content is approximately the same.
Base-pair step parameter Roll defines an angle between the average planes of two neighboring base pairs. The positive value of this angle corresponds to its opening towards the minor groove. Among the three rotational parameters (helical Twist, Roll, and Tilt), Roll is the most important for understanding the bending of DNA [23,25].
Base-pair step parameter Slide defines the mutual displacement of the neighboring base pairs in the direction perpendicular to the minor and major grooves. Positive Slide values are a distinguishing feature of B-DNA, while in the A-form of DNA, the values of the Slide are always negative. Thus, the sign of the Slide is an important indicator that allows us to discriminate between the B- and A-DNA forms [26,27].
The values of these two parameters show that the structure of the naked DNA double helix in the promoter regions of mammals, invertebrates, plants, and unicellular fungi (with the exception of their INR element) belongs to the B family. In fact, the structural parameters of Roll and Slide in the core promoter regions of the mammals and G. gallus vary between 1.35°-1.7° and 0.25-0.48 Å, respectively. In the core promoters of A. thaliana, the values of Roll and Slide are somewhat lower than in mammals, especially Slide (~0.2 Å), but in the promoters of another plant, Z. mays, the values of these parameters are as in mammals. In the promoter sequences of unicellular fungi, the values of Roll and Slide are also close to those of mammals. The exception is P. falciparum. In this protozoan, double-stranded DNA, at least in the core promoter region that we have analyzed, may represent an intermediate form with a negative value of Slide, which corresponds to some structure on the B↔A transition path [26,28].
Our profiles show that the values of Roll and Slide, as well as their stiffness, in the TATA-box position of all the species (except for S. cerevisiae) differ from the average level. The extent of the difference depends on the organism. It is most pronounced in plants, S. pombe, and most mammals. The invertebrates show the greatest diversity in the TATA-box position. For example, the profiles of Slide and its stiffness in C. elegans have no peculiarities in the TATA-box position, but those of Roll and its stiffness do. It is important to note that while the values of both structural parameters, Roll and Slide, are somewhat below the average level, the rigidity of the Roll drops noticeably, while the rigidity of the Slide either remains at an average level or increases. Hence, it can be concluded that binding to TBP is accompanied by an increase in the opening of the angle between adjacent base pairs towards the minor groove. This is what happens when the helical axis is bent towards the major groove. The profiles of the parameter "Mobility to bend towards major groove" in the core promoters of all the organisms (Figures 3f, 4f, 5f and 6f), with the exception of S. cerevisiae, clearly reflect this predisposition for octanucleotides in the TATA-box regions. It should be noted that in the core-promoter sequences of A. mellifera, the increase in the values of the "Mobility to bend towards the major groove" parameter is noticeably smaller than in other invertebrates. Moreover, in the profiles of S. cerevisiae, the maximum falls at position −8 bp.

Variations of Ultrasonic Cleavage and DNase I Cleavage Intensities in Core Promoter Sequences
The intensities of the sequence-specific ultrasonic cleavage of double-stranded DNA provide information on the intensity of the intramolecular conformational movements in each strand [18,29,30], and the DNase I enzymatic cleavage of double-stranded DNA provides information on the width of its grooves [31-34]. Therefore, the variation in the local structure of the DNA double helix can also be assessed using the data from these new independent methods.
The relative intensities of the cleavage of the central phosphodiester bond in the 16 dinucleotides and 256 tetranucleotides were determined by multivariate statistical analysis [18]. The experimental details are also given in [29,30]. It was shown that the cleavage rates for all pairs of complementary dinucleotides are significantly different, and the sequence-dependent ultrasonic cleavage rates are consistent with the intensity of N↔S interconversion at the 5′-sugar ring [18]. Therefore, cleavage rates may be useful for characterizing the functional regions of the genome as a measure of local conformational dynamics. We use several indexes to describe the intensity of ultrasonic cleavage [10]: R is the relative cleavage intensity at the central position of each of the 16 dinucleotides; T is the relative cleavage intensity at the central position of each of the 256 tetranucleotides; S is the combination of indexes R and T (S = T − R). The S index provides information on the effect of the nearest context on the intensity of ultrasonic cleavage in the dinucleotide, i.e., if S < 0, the first and the fourth nucleotides of a tetranucleotide reduce the intensity of the cleavage in the central step; otherwise they increase it.
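The relation S = T − R can be illustrated with a minimal Python sketch. The function name `s_index_profile` and the two small lookup tables are hypothetical; the numbers in them are placeholders chosen for illustration, not the measured intensities from [18].

```python
def s_index_profile(seq: str, r_table: dict[str, float],
                    t_table: dict[str, float]) -> list[float]:
    """Return S = T - R for the central step of every tetranucleotide in seq.

    r_table maps dinucleotides to the R index, t_table maps tetranucleotides
    to the T index (both tables are assumed to be supplied by the user).
    """
    seq = seq.upper()
    profile = []
    for i in range(len(seq) - 3):
        tetra = seq[i:i + 4]      # tetranucleotide window
        central = tetra[1:3]      # its central dinucleotide step
        profile.append(t_table[tetra] - r_table[central])
    return profile


# Placeholder values for illustration only (NOT experimental data).
r_demo = {"CG": 1.25, "GA": 1.0}
t_demo = {"ACGA": 1.5, "CGAT": 0.75}

s_demo = s_index_profile("ACGAT", r_demo, t_demo)
print(s_demo)  # a positive S means the flanking nucleotides raise the cleavage
```

In this toy case the first window gives S > 0 (the context enhances cleavage of the central step) and the second gives S < 0 (the context suppresses it).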
The cutting rates of bovine pancreatic deoxyribonuclease I (DNase I) vary along a given DNA sequence, indicating that the enzyme recognizes sequence-dependent structural changes in the DNA double helix. The high-resolution crystal structures of the two DNase I-DNA complexes showed that the enzyme binds tightly in the minor groove and to the sugar-phosphate backbones of both strands, thereby inducing widening of the minor groove and bending towards the major groove [31,32]. The context near the dinucleotide step strongly affects its cleavage efficiency. This can be rationalized by the fact that six base pairs are in contact with the enzyme. The intrinsic rate of the cleavage by DNase I closely tracks the width of the minor groove [33]. We have used the intensity indices of DNase I cleavage at the hexanucleotide level (D), which were obtained in [34]. The profiles of the ultrasonic indexes R, T, and S and the DNase I cleavage index D are depicted in blue for the upper strand and in red for the lower (template) strand.
The lowest value of the ultrasonic cleavage for the H. sapiens core promoters was detected in the region from −32 to −24 bp relative to TSS (Figure 8, indexes R and T). The same region of the promoter has the highest DNase I cleavage (Figure 8, index D). This indicates a decrease in the conformational motion in this region and minor groove widening. The minimum ultrasonic cleavage of the upper (coding) strand falls at position −26, but that of the lower (template) strand falls at position −29. This means that there is some shift in the intensity of the conformational movement between the complementary strands. The profiles of the differences in the S-indexes between the strands revealed a periodic alternation of the conformational motion intensity in the complementary strands up to position −3 bp. The observed behavior of the core promoter fragment structure is in good agreement with the results of the MD calculations in [35], which confirmed an important role of the indirect readout mechanism in TATA-box recognition and revealed regular oscillations between several alternate structures in the process of TBP binding.
Figures S4-S14, respectively.
It is significant that the cleavage intensities of the TATA element, as well as those of the INR, have singular properties in the profiles of all but one species. Ultrasonic cleavage is diminished in the TATA element, while DNase I cleavage is enhanced. The exception is the TATA region in the core promoters of S. cerevisiae. Both methods show a messy pattern of cleavage around the TSS in all species.

Discussion
Previously, we found a special structural organization in the nucleotide sequences of double-stranded DNA of minimal core promoters of POL II in metazoans and Schizosaccharomyces pombe. They have singular mechanical and structural properties at the positions of the TATA-box and around TSS [10].
This work was undertaken because new data have appeared that significantly expand both the range of organisms available for analysis and the number of promoter nucleotide sequences available. As a result, the characteristics of the mechanical and structural properties of the core promoters of POL II in fifteen organisms from different steps of the evolutionary ladder were obtained. These are ten representatives of the animal kingdom (mammals, other vertebrates, and invertebrates), namely, H. sapiens, M. mulatta, M. musculus, R. norvegicus, C. familiaris, G. gallus, D. rerio, C. elegans, D. melanogaster, and A. mellifera; two representatives of the plant kingdom (A. thaliana and Z. mays); two representatives of the kingdom of unicellular fungi (S. cerevisiae and S. pombe); and a representative of Protozoa (P. falciparum). The AT and GC contents of the genomes of these organisms are different. Some of them have a GC-rich genome, the genomes of others contain nearly equivalent amounts of AT and GC or a slight excess of AT, and 80% of the P. falciparum genomic sequence consists of AT. The aim of the present work was to assess the generality of the characteristics of the core promoters obtained earlier, based on the analysis of a much wider range of organisms that differ significantly in evolutionary development and in the percentage of AT pairs in the genomic DNA.
Here we have shown that the core promoters of POL II in organisms representing the kingdoms of animals, plants, fungi, and protozoa have a special structural organization. The fragments of 80 bp (positions from −50 to +30), regardless of the AT content in the genomic DNA, have two singular regions: a hexanucleotide at positions −2 to +4 (INR) surrounding the transcription start site (TSS) and an octanucleotide located about 28-35 bp upstream of the TSS (depending on the organism). In the TSS position (−1, +1), the occurrence of the PyPu/PyPu steps is exceptionally high, with a noticeable predominance of the d(CA/TG) dinucleotide. The conformational features of this dinucleotide remarkably favor the formation of an open complex (PIC). The TATA-box region of all but one organism lies about 28-35 bp upstream and has unique mechanical and structural properties. Its mobility to bend towards the major groove is increased, and the stacking energy is reduced; the minor groove expands significantly, and the conformational dynamics are reduced. These local properties of the TATA region contribute to its indirect readout by TBP and the subsequent PIC formation.
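The occurrence of pyrimidine-purine (PyPu) steps at the TSS position (−1, +1) can be tallied with a short Python sketch. It assumes aligned promoters in which the first nucleotide corresponds to position −50 (so position −1 is index 49 and +1 is index 50); the function names and the toy sequences are hypothetical and serve only as an illustration.

```python
from collections import Counter

PYRIMIDINES = frozenset("CT")
PURINES = frozenset("AG")


def tss_step_counts(promoters: list[str], upstream: int = 50) -> Counter:
    """Count the dinucleotides occupying the (-1, +1) step.

    Assumes every sequence starts at position -upstream and has no
    position 0, so the step is seq[upstream - 1 : upstream + 1].
    """
    return Counter(seq[upstream - 1:upstream + 1].upper() for seq in promoters)


def pypu_fraction(counts: Counter) -> float:
    """Fraction of counted steps that are pyrimidine followed by purine."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    pypu = sum(n for step, n in counts.items()
               if step[0] in PYRIMIDINES and step[1] in PURINES)
    return pypu / total


# Toy 4-nt "promoters" with upstream=2, for illustration only.
toy = ["GCAT", "TTGA", "AAAA", "GCAC"]
toy_counts = tss_step_counts(toy, upstream=2)
print(toy_counts.most_common(), pypu_fraction(toy_counts))
```

For real data, the default `upstream=50` matches the 80 bp fragments (positions −50 to +30) described above.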
It is important that the profiles of the control fragments of the same length, taken from the human genome in the vicinity of −300 and −500, as well as from a sample of 30,000 random sequences, do not reveal any structural organization.
It should be noted, however, that there is no TATA element at the position around −28 bp in the promoters of S. cerevisiae. Instead, structural features that resemble the TATA box are found in the profiles of S. cerevisiae at positions −3 to −10. We also found three organisms (C. elegans, A. mellifera, and P. falciparum) in which the TATA element at the position around −28 bp is present, but some of its features are less pronounced. Let us consider in more detail the features of the TATA element in these organisms.
C. elegans does not have any peculiarities in the TATA-box position in the profiles of Slide and Slide stiffness, while in the profiles of Roll and Roll stiffness it does. The magnitude of the maximum in the profile of the parameter "Mobility to bend towards the major groove" is lower than in other organisms, and the profiles of ultrasonic cleavage and DNase I cleavage in the TATA region have no peculiarities up to the TSS. We suppose that these features result from the fact that the TBP-like factor CeTLF, rather than TBP, is used to activate Pol II in C. elegans [36,37]. Therefore, the PIC assembly machinery may have its own characteristics.
The profiles of the intensity of the ultrasonic cleavage and DNase I cleavage of A. mellifera do not have any features in the area of the TATA element, and the parameter "Mobility to bend towards the major groove" is noticeably less pronounced than in the profiles of the other invertebrates. A. mellifera is an insect that is characterized by complex social behavior. Its transcription is still insufficiently studied, and there are few data for understanding the details of this process [38].
The extremely high AT content of the P. falciparum genomic sequence (about 80%) does not allow the formation of a completely autonomous structure of the Pol II core promoter, which would not require additional control. In P. falciparum, both ultrasonic and DNase I cleavage remain virtually unchanged throughout the entire region upstream of the TSS. However, in Figure 6f we see a faintly pronounced wide maximum in the P. falciparum profile of "Mobility to bend towards major groove". It seems that this is a marker for TBP binding, but it is too weak. Apparently, additional mechanisms are needed to realize gene expression and identify the TATA element in the promoter of P. falciparum. The role of G-quadruplexes in gene expression is widely discussed [39]. In addition, the presence of G-quadruplex-forming DNA motifs in the P. falciparum genome was shown [40]. This is all the more surprising given that 80% of its genome consists of AT pairs. In any case, P. falciparum evidently requires some additional mechanisms to facilitate the recognition of the TATA element.
Let us consider to what extent the deviations in the profiles of these three organisms can change the idea of an evolutionarily stable structural organization of RNA polymerase II promoters. Despite the absence of some structural features in the region of the TATA element in these three organisms, one of its characteristics is present in all organisms without exception. This characteristic is "Mobility to bend towards the major groove". It reaches its maximum in the TATA region (Figure 4f), and the presence of the motifs in the logo representations (Figures 2 and S3) of C. elegans and A. mellifera is evident. Thus, C. elegans, A. mellifera, and P. falciparum still have a marker of the TATA element. Note that the messy pattern of cleavage around the TSS is present in all organisms.
The only organism whose promoter sequences do not have the structural markers of the TATA element at a position around −28 bp upstream of the TSS is S. cerevisiae. However, we registered the maximum in the profiles of the parameter "Mobility to bend towards major groove" at position −8 bp. We had already obtained this result when processing a smaller sample of its promoters [10]. The distinctive promoter organization of S. cerevisiae may reflect peculiarities of Pol II functioning in this organism, which were discovered by comparison with the S. pombe transcription machinery [41]. The differences in the structural organization of the core promoters of the two yeasts may be associated with the evolutionary distance between S. pombe and S. cerevisiae. Indeed, these organisms diverged about 500 million years ago [42]. The features of Pol II functioning during transcription in S. cerevisiae have recently been studied in detail [43].

Materials and Methods
We analyzed the sets of promoters of fifteen evolutionarily different organisms that were retrieved from the EPD New section of the Eukaryotic Promoter Database (EPD) (http://epd.vital-it.ch, accessed on 24 July 2022) [12]. We used sets of the animal promoters (29,597 promoters for H. sapiens, 9556 promoters for M. mulatta, 25,111 promoters for M. musculus, 12,569 promoters for R. norvegicus, 6126 promoters for G. gallus, 7352 promoters for C. familiaris, 16,972 promoters for D. melanogaster, 6461 promoters for A. mellifera, 10,726 promoters for D. rerio, 7120 promoters for C. elegans); plant promoters (22,702 promoters for A. thaliana, 17,059 promoters for Z. mays); unicellular fungi promoters (5117 promoters for S. cerevisiae and 4802 promoters for S. pombe); and protozoan promoters (5597 promoters for P. falciparum). We checked that all of these sequences are 80 nucleotides long and strictly defined. The profiles of the averaged textual, structural, mechanical, and physicochemical properties of the 80 bp core promoter sequences (positions from −50 to +30) were constructed.
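A check of this kind might look like the following Python sketch. It assumes that "strictly defined" means the sequence contains only the four unambiguous nucleotides A, C, G, and T (no N or other IUPAC ambiguity codes); the function name `is_valid_promoter` is hypothetical.

```python
def is_valid_promoter(seq: str, length: int = 80) -> bool:
    """True if seq has the expected length and only unambiguous bases.

    Assumption: 'strictly defined' = every symbol is one of A, C, G, T.
    """
    return len(seq) == length and set(seq.upper()) <= set("ACGT")


# Illustrative sequences, not taken from EPD.
examples = ["ACGT" * 20, "ACGT" * 19 + "ACGN", "ACGT" * 19]
print([is_valid_promoter(s) for s in examples])
```

Sequences failing either condition would be excluded before the profiles are averaged.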
For the analysis of the structural, mechanical, and physicochemical properties of the core promoter sequences, we use indexes of numerical parameterization for the ten double-stranded duplexes, which were collected from the database DiProDB (http://diprodb.fli-leibniz.de, accessed on 24 July 2022) [21]. For the profiles of the variations in the stacking energy and the base-pair step parameters Roll and Slide, we used the parametrization of Perez et al. [22]; for the profiles of the stiffness of the DNA double helix to Roll and Slide changes, we used the parametrization of Goni et al. [23]; and for the profiles of the stiffness of the structure to bending towards the major groove, we used the parametrization of Gartenberg and Crothers [24].

Profiles Construction
The X-axes of the profiles define the position relative to the TSS, which is denoted as +1 bp; negative and positive numbers denote the upstream and downstream regions, respectively. The programs for profile construction were written in Python 3.10.
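The construction of an averaged profile can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names are hypothetical, the axis omits position 0 because the TSS is denoted +1, and the parameter table is assumed to map each of the 16 dinucleotides to a numerical value (for example, a Roll or Slide table taken from DiProDB).

```python
def position_axis(upstream: int = 50, downstream: int = 30) -> list[int]:
    """Positions -upstream..-1 followed by +1..+downstream (no position 0)."""
    return list(range(-upstream, 0)) + list(range(1, downstream + 1))


def averaged_profile(promoters: list[str],
                     param: dict[str, float]) -> list[float]:
    """Average a dinucleotide parameter over aligned, equal-length sequences.

    Element i of the result is the mean value of the parameter for the
    step formed by nucleotides i and i + 1 of every sequence.
    """
    if not promoters:
        return []
    n_steps = len(promoters[0]) - 1
    sums = [0.0] * n_steps
    for seq in promoters:
        seq = seq.upper()
        for i in range(n_steps):
            sums[i] += param[seq[i:i + 2]]
    return [total / len(promoters) for total in sums]


# Placeholder parameter values for illustration only.
toy_param = {"AC": 1.0, "CG": 2.0, "CT": 4.0}
toy_profile = averaged_profile(["ACG", "ACT"], toy_param)
axis = position_axis()
print(toy_profile, axis[0], axis[49], axis[50], axis[-1])
```

An 80-nucleotide fragment yields 79 dinucleotide steps, so a profile value would typically be plotted against the position of the first nucleotide of its step or between the two positions.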

Conclusions
Eukaryotic organisms, regardless of the level of their evolutionary development and the AT content of genomic sequences, have common structural features of the naked DNA in the RNA polymerase II core promoter region. These features are the exceptional heterogeneity and asymmetry of the 3D structure and the inclusion of two singular regions: the hexanucleotide ("INR") around TSS and the octanucleotide ("TATA element") upstream. The strength of each promoter, to some extent, depends on the nucleotide sequences forming its singular regions. In our opinion, all of the data presented here correspond to the bottom-up conception of evolution [44], starting from the physicochemical properties of nucleic and amino acid polymers.