Multiple Alignment of Promoter Sequences from the Arabidopsis thaliana L. Genome

In this study, we developed a new mathematical method for performing multiple alignment of highly divergent sequences (MAHDS), i.e., sequences that have on average more than 2.5 substitutions per position (x). We generated sets of artificial DNA sequences with x ranging from 0 to 4.4 and applied MAHDS as well as currently used multiple sequence alignment algorithms, including ClustalW, MAFFT, T-Coffee, Kalign, and Muscle to these sets. The results indicated that most of the existing methods could produce statistically significant alignments only for the sets with x < 2.5, whereas MAHDS could operate on sequences with x = 4.4. We also used MAHDS to analyze a set of promoter sequences from the Arabidopsis thaliana genome and discovered many conserved regions upstream of the transcription initiation site (from −499 to +1 bp); a part of the downstream region (from +1 to +70 bp) also significantly contributed to the obtained alignments. The possibilities of applying the newly developed method for the identification of promoter sequences in any genome are discussed. A server for multiple alignment of nucleotide sequences has been created.


Introduction
Multiple sequence alignment (MSA) is one of the central tasks of bioinformatics. Significant efforts have been made to develop tools for MSA [1][2][3], which often include dynamic programming, progressive alignment, iterative methods, hidden Markov models (HMMs), and genetic algorithms [4,5]. The direct use of dynamic programming is hardly possible for the alignment of a large number of sequences (more than 10), since it is an NP-complete problem [6] and the calculations require considerable time. Therefore, some heuristic solutions have been developed, which are based on using the objective function for assessing the power of MSA and an optimization procedure. As a result, the quality of the constructed MSA is improved [2].
The progressive alignment is the most popular MSA algorithm. It includes pairwise comparison of N sequences, calculation of a matrix of distances between the sequences, construction of a matrix-based guide tree, and, finally, progressive MSA. Often, an optimization procedure is applied to the created progressive alignment to improve the final result and eliminate unnecessary deletions and insertions (indels). The sum of the similarity functions of pairwise alignments, which may be used as an objective function, can be obtained by projecting the constructed MSA on two-dimensional planes [7,8], resulting overall in N(N−1)/2 of pairwise alignments. Alternative target functions such as consensus [9], entropy [10], or circular sum [11] are also used. However, in most cases, progressive alignment cannot create the optimal MSA since the errors obtained at any stage of the procedure are accumulated at the final alignment. To minimize such errors, an optimization procedure that utilizes various mathematical algorithms is used. The most

General Description of the MAHDS Algorithm
The main idea behind the MAHDS method is the construction of the optimal image rather than direct calculation of the MSA. Here, we used position-weight matrix (PWM) as such an image. For example, we have a set of sequences sq(1), sq (2), . . . , sq(N), which we combine into one sequence S of length L and then determine the best alignment between S and PWM that was originally calculated based on random sequence alignment and has the dimension of 4 × L/N. Then, sequence S is aligned against PWM using the Needelman-Wunsch algorithm [30] (for an example, see publication [31]) and the similarity between S and PWM is measured based on F(L,L) calculated by two-dimensional dynamic programming at the point (L, L). In the two-dimensional matrix F with the size L × L, sequence S is plotted in the X axis and the PWM, multiplied N times, in the Y axis. As a result, sequence S will be compared by dynamic programming with sequence S 1 , which contains the column numbers of the PWM matrix treated as characters and looks as 1, 2, . . . , L/N, 1, 2, . . . , L/N, . . . . We also created S 2 , which is a randomly shuffled sequence S.
The purpose is to determine PWM with the greatest F(L,L), which then will be referred to as the best PWM. Using the created two-dimensional alignment, we can easily reconstruct the multiple alignment of the initial sequences. Therefore, MAHDS computes MSA by calculating F(L,L), finding the best PWM, and aligning sequence S to PWM.
As it is extremely unlikely to obtain a good approximation of the optimal MSA from the first random PWM, we determined the best PWM using an optimization procedure. For this, we generated a set of random PWMs (Q) (Figure 1, point 1), which contained n 1 random matrices, aligned matrix number i from set Q with sequence S, and calculated the corresponding F(L,L) value in position i of vector V(i) (i = 1, . . . , n 1 ) (Figure 1, point 2). Then, vector V(i) was sorted in the ascending order and the matrices from set Q were arranged in accordance with the position of the corresponding value in V(i).
Genes 2021, 12, x FOR PEER REVIEW 3 of 21 calculated multiple alignment to identify promoter sequences in different genomes is discussed.

General Description of the MAHDS Algorithm
The main idea behind the MAHDS method is the construction of the optimal image rather than direct calculation of the MSA. Here, we used position-weight matrix (PWM) as such an image. For example, we have a set of sequences sq (1), sq (2), ..., sq(N), which we combine into one sequence S of length L and then determine the best alignment between S and PWM that was originally calculated based on random sequence alignment and has the dimension of 4 × L/N. Then, sequence S is aligned against PWM using the Needelman-Wunsch algorithm [30] (for an example, see publication [31]) and the similarity between S and PWM is measured based on F(L,L) calculated by two-dimensional dynamic programming at the point (L, L). In the two-dimensional matrix F with the size L × L, sequence S is plotted in the X axis and the PWM, multiplied N times, in the Y axis. As a result, sequence S will be compared by dynamic programming with sequence S1, which contains the column numbers of the PWM matrix treated as characters and looks as 1, 2, ..., L/N, 1, 2, ..., L/N, .... We also created S2, which is a randomly shuffled sequence S.
The purpose is to determine PWM with the greatest F(L,L), which then will be referred to as the best PWM. Using the created two-dimensional alignment, we can easily reconstruct the multiple alignment of the initial sequences. Therefore, MAHDS computes MSA by calculating F(L,L), finding the best PWM, and aligning sequence S to PWM.
As it is extremely unlikely to obtain a good approximation of the optimal MSA from the first random PWM, we determined the best PWM using an optimization procedure. For this, we generated a set of random PWMs (Q) (Figure 1, point 1), which contained n1 random matrices, aligned matrix number i from set Q with sequence S, and calculated the corresponding F(L,L) value in position i of vector V(i) (i = 1, ..., n1) ( Figure 1, point 2). Then, vector V(i) was sorted in the ascending order and the matrices from set Q were arranged in accordance with the position of the corresponding value in V(i). Next, the genetic algorithm was applied to introduce random mutations into the matrices from Q and create hybrids (descendants) (Figure 1, point 3). The matrix corre-Genes 2021, 12, 135 4 of 20 sponding to value V(1) (the smallest F(L, L)) was excluded from set Q and replaced with the descendant. Then, the values of vector V(i) were recalculated and V(i) re-sorted. The resultant matrix corresponded to the largest F(L, L) contained in V(L) and was denoted as maxF(L, L) ( Figure 1, point 2). If the maxF(L, L) value decreased, then the process of matrix optimization was considered complete, if not-it was repeated. As a result, we obtained maxF(L, L), two-dimensional alignment of sequences S and S 1 , and the best PWM ( Figure 1). The steps of this algorithm are described in detail below.

Creation of the Set of Random PWMs
We used random sequences to obtain a set of random PWMs (Q; Figure 1, point 1), which consisted of L/N columns and 16 rows. For this, we generated a random nucleotide sequence S 2 of length L with equal base probability (0.25), transformed it into a numeric sequence (where bases a, t, c, and g were coded as 1, 2, 3, and 4, respectively), and compared it with sequence S 1 of the same length L containing N repeated sequences (1, 2, . . . , L/N). Then, we filled frequency matrix M(L/N, 16) as: where i ranges from 2 to L. This approach of matrix construction takes into account two effects in MSA [32]: The frequency of nucleotides at each position and the correlation of nucleotides in neighboring positions (i and i − 1), which allows multiple alignment of highly diverged sequences where base frequencies in each MSA position may not differ from those in sequence S, while maintaining the correlation of neighboring nucleotides. An example is shown in Figure 2, where short sequences were used for convenience. However, all conclusions are applicable to sequences of any length. Let us consider the following 4-nt sequences: Atcg, tagc, cgat, gcta, atat, tata, gcgc, cgcg, atta, taat, cgta, and gcat, which were constructed with dinucleotides at, ta, cg, and gc at positions 1-2 and 3-4. As a result, in each position of the MSA shown in Figure 2, the nucleotide frequency is 0.25, but nucleotide pairs 1-2 and 3-4 could be only at, ta, cg, or gc. The probability that the first and second columns contain only at, ta, cg, or gc is 0.0625 × 4 = 0.25 and the number of lines in the alignment is 12. Then, the expected number of each nucleotide pair is: 12 × 0.25 = 3. If a normal approximation for the binomial distribution is used, then the variance is 12 × 0.25 × (1 − 0.25) = 2.25. In total, the first two columns contain 12 pairs (at, ta, cg, or gc). Then, the normal distribution argument for this event is (12 − 3)/ √ 2.25 = 6 and the probability due to purely random factors is less than 10 −6 . Given that this value is the same for columns 3 and 4, the probability of accidentally creating the alignment shown in Figure 2 is less than 10 −12 , which means that the individual columns in Figure 2 appear to be random, but there is a strong correlation between the adjacent bases.
Then, we calculated PWM matrix W(L/N, 16) using M(L/N, 16): Then, matrix W(i,j) was transformed to obtain the given R 2 and K d , calculated using the following formulas: Here, p 1 (i) is the probability of symbols in S 1 , which is N/L for any i; p 2 (k) = p(l)p(m), where p(l) and p(m) are the probabilities of the l or m type nucleotides in S (l,m ∈ {a,t,c,g}); p(l) = q(l)/L, where q(l) is the number of l type nucleotides in S; and L is the length of S. For all the calculations, we used R 0 = 110L/N and K 0 = −1.8 [32]. The matrix transformation procedure is described in detail in [25] and an example for the original matrix of five columns and R 2 = 155 is shown in Table 1 (a small matrix is used to fit in full in Table 1). The transformed matrix has R 2 = 2000 and K d =−1.5.
Here, p1(i) is the probability of symbols in S1, which is N/L for any i; 2 ( ) ( ) ( ) p k p l p m = , where p(l) and p(m) are the probabilities of the l or m type nucleotides in S (l,m ∈ {a,t,c,g}); p(l) = q(l)/L, where q(l) is the number of l type nucleotides in S; and L is the length of S. For all the calculations, we used R0 = 110L/N and K0 = −1.8 [32]. The matrix transformation procedure is described in detail in [25] and an example for the original matrix of five columns and R 2 = 155 is shown in Table 1 (a small matrix is used to fit in full in Table 1). The transformed matrix has R 2 = 2000 and Kd =−1.5. It is necessary to transform matrix W so that the distribution function for F(L, L) is similar for different W matrices. Therefore, the same R 2 and Kd values should be maintained. The optimal cost for an insertion or deletion (del, formula 5) depends on R 2 and the average value of F(L, L) for a random sequence S depends on Kd. The distribution function F(L, L) can be obtained by the shuffling of sequence S into 1000 random sequences denoted as set SR and calculating F(L, L) of matrix W for each sequence of the set. If R 2 and Kd for different transformed W matrices are the same, then F(L, L) distributions will be equal or similar [25], which is the objective of matrix transformation.
After obtaining one transformed PWM (WT) for set Q, we shuffled sequence S and repeated the calculations using Formulas (1) and (2). As a result, about 10 6 PWMs for set Q were selected and then filtered to leave only those that uniformly fill the Euclidean space with dimension 16 L/N. We considered each matrix as a point in this space and calculated the Euclidean distance D between all matrices. We selected the threshold D0 so that set Q contained less than 10 3 matrices. Any two matrices with D < D0 were excluded from the set. We denoted the number of matrices in set Q as n1. It is necessary to transform matrix W so that the distribution function for F(L, L) is similar for different W matrices. Therefore, the same R 2 and K d values should be maintained. The optimal cost for an insertion or deletion (del, formula 5) depends on R 2 and the average value of F(L, L) for a random sequence S depends on K d . The distribution function F(L, L) can be obtained by the shuffling of sequence S into 1000 random sequences denoted as set SR and calculating F(L, L) of matrix W for each sequence of the set. If R 2 and K d for different transformed W matrices are the same, then F(L, L) distributions will be equal or similar [25], which is the objective of matrix transformation. After obtaining one transformed PWM (WT) for set Q, we shuffled sequence S and repeated the calculations using Formulas (1) and (2). As a result, about 10 6 PWMs for set Q were selected and then filtered to leave only those that uniformly fill the Euclidean space with dimension 16 L/N. We considered each matrix as a point in this space and calculated the Euclidean distance D between all matrices. We selected the threshold D 0 so that set Q contained less than 10 3 matrices. Any two matrices with D < D 0 were excluded from the set. We denoted the number of matrices in set Q as n 1 .

Comparison of Set Q with Sequence S Using Dynamic Programming
Then, we aligned sequence S with each of the matrices from set Q using the global alignment algorithm. The F value was calculated as: where n = s(k) + 4(s(j) − 1)), i and j each ranges from 2 to L, and s(j) and s 1 (i) are elements of sequences S and S 1 , respectively. Parameter n reflects the fact that in matrix W, dinucleotides were taken into account. To determine n, the previous position (k), already included in the alignment, should be defined. It was calculated from the created transitions in matrix F, depending on the previous base in S and used to obtain W(s 1 (i),n). If the previous base of sequence S is s(j − t), then k = j − t and n = s(j − t) + (s(j) − 1) × 4. Three cases should be considered. In the first, t = 1 corresponds to the movement along the main diagonal of matrix F and there is no deletion in sequence S in the alignment ( Figure 3A). In the second, t > 1 corresponds to a deletion of t − 1 bases in sequence S (illustrated in Figure 3B for t = 2). Finally, deletions may occur simultaneously in both sequences S and S 1 , which correspond to deletions of columns in matrix W. If the previous symbol in sequence S 1 has the number i − 1, then there is no deletion, but if this number is i − t (t > 1), then there is a deletion of t − 1 bases in sequence S 1 . For these transitions, we did not consider the correlations of adjacent bases and took n = s(j). Rather than matrix W(s 1 (i),n), we used matrix W 1 (s 1 (i),s(j)): In this case, the correlation of adjacent bases is not considered, which is quite acceptable when the number of deletions is relatively small (illustrated in Figure 3C for t = 2).
The zero row and column of matrix F were filled with negative numbers, F(0, j) and F(i, 0) were 0 for i and j ranging from 1 to L, respectively, and F(0, 0), F(1, 0), . . . , F(2, 0) were also equal to 0. Matrix E(x, n) was used to define the first column and row of matrix F. The insertion/deletion penalty value (del = 25.0) was selected based on our earlier work [25]. The reverse transition matrix was filled along with matrix F. Therefore, we aligned sequences S 1 and S using the reverse transition matrix and determined F(L, L). The alignment of S 1 and S was obtained for all matrices from the Q set. As a result, vector V(i) (i = 1, . . . , n 1 ) contained F(L, L) for each matrix.

Application of the Genetic Algorithm to the Q Set
To optimize matrices from the Q set, we used a genetic algorithm described in our previous study [25]. The aim was to change each PWM from the Q set to maximize F(L, L), which was considered an objective function. F(L, L) for each matrix was put into vector V(i) (i = 1, 2, . . . , n 1 ), which was sorted in the ascending order from V(1) (the minimum) to V(n 1 ) (the maximum) and the matrices in the Q set were arranged accordingly. Then, two matrices were randomly selected with the probability of choosing a matrix, which increased with the increase of i from 1 to n 1 , and the two matrices were used to create a "descendant", for which any element of the first matrix was selected with an equal probability. Then, rectangles were randomly selected to the right and left above and below the selected element in the first matrix with the probability of 0.25 and the elements within the rectangle were moved from the first to the second matrix to create a descendant, which replaced the PWM with V(1).
The zero row and column of matrix F were filled with negative numbers, F(0, j) and F(i, 0) were 0 for i and j ranging from 1 to L, respectively, and F(0, 0), F(1, 0), ..., F(2, 0) were also equal to 0. Matrix E(x, n) was used to define the first column and row of matrix F. The insertion/deletion penalty value (del = 25.0) was selected based on our earlier work [25]. The reverse transition matrix was filled along with matrix F. Therefore, we aligned sequences S1 and S using the reverse transition matrix and determined F(L, L). The alignment of S1 and S was obtained for all matrices from the Q set. As a result, vector V(i) (i = 1, …, n1) contained F(L, L) for each matrix.
. In this case, t = 2 and one symbol is deleted from the sequence. Then, we do not take into account the base correlation in sequence S and use matrix W1(s1(i),s(j)) and n = s(j) = 4.
. In this case, t = 2 and one symbol is deleted from the sequence. Then, we do not take into account the base correlation in sequence S and use matrix W 1 (s 1 (i),s(j)) and n = s(j) = 4.
Then, we introduced mutations in 10% of the randomly selected matrices from the Q set. To do this, a randomly selected element of the matrix was changed to a random value in the range from −10.0 to +10.0. Usually, less than 10 4 cycles were required to achieve the moment when V(n 1 ) did not increase, i.e., to reach the maximum designated as maxV(n 1 ). However, in rare cases, more than 10 5 cycles were performed. At the output of the algorithm (Figure 1), we obtained maxV(n 1 ), two-dimensional alignment of sequences S 1 and S, and matrix maxW, which were used to compute the alignment.

Calculation of Statistical Significance for maxV(n 1 )
We used the Monte Carlo method to estimate the statistical significance of maxV(n 1 ). Sequence S was randomly shuffled to obtain 200 random sequences. Then, matrix maxW was included in the Q set described in Section 2.2, which allowed taking into account the effectiveness of maxW alignment with random sequences. Then, each of these sequences were treated as described in Section 2.2, Section 2.3, Section 2.4 and maxW was calculated for each, producing 200 maxV(n 1 ). Then, the mean maxV(n 1 ) and variance D(maxV(n 1 )) were calculated and used to compute Z: where maxV(n 1 ) was calculated for sequences S 1 and S in Section 2.4. Z was obtained for each MSA and the average Z value for random S sequences was estimated according to Formula (7) after each sequence had been subjected to the procedures described in Section 2.2, Section 2.3, Section 2.4 and here. As a result, the mean Z = 1.8, and we can assume that the MSA is non-random at Z > 6.0.

MSA Construction
The MSA was computed using the two-dimensional alignment of sequences S 1 and S. Each position in sequence S 1 corresponded to a column in the MSA. Any insertion in the two-dimensional alignment (opposite to which there was a gap in sequence S 1 ) resulted in an additional column in the MSA.

Comparison of Various MSA Methods
The algorithm shown in Figure 1 can also be applied to determine the statistical significance of MSAs created by other algorithms. Let us denote the MSA as A, the length of each sequence in A as K, and the number of sequences as N. All sequences from A are linked to produce sequence S 3 of length L = KN. Then, the PWM is calculated for A using Formula (2), transformed using Formulas (3) and (4), and applied to create the two-dimensional alignment for sequence S 3 using Formulas (5) and (6) and to calculate F(L, L). The statistical significance of A is then computed according to Formula (7).
However, the columns that have a sum of elements < N/2 should be excluded from A to eliminate redundant deletions in the calculation of F(L, L), whereas those with the sum > N/2 cannot be excluded since it would lead to an excessive number of insertions. Consequently, the number of columns became K ≤ K, resulting in a new alignment A (K is the length of each sequence in A ). To construct the PWM using A , frequency matrix M(K , 16) was first calculated using Formula (1) and then the PWM (designated as W A ') was calculated using Formula (2). Formulas (3) and (4) were applied to transform the resulting matrix and obtain matrix WT A' , which was used to calculate F(L, L) (L = K N) based on A'. For this, the sequence from A' was merged with sequence S 4 with all the spaces preserved. At the same time, sequence S 5 containing column numbers {1, 2, . . . , K } of the WT A' matrix repeated N times was created. Then, we determined the sum of F 1 = F 1 + WT(s 5 (i),n), where n = s 4 (i − 1) + (s 4 (i) − 1) × 4 was calculated for all i from 2 to L = K'N, for which s 4 (i − 1) and s 4 (i) were not gaps, whereas for those i for which s 4 (i − 1) was a gap, the sum was calculated as F 2 = F 2 + E(s 5 (i),s 4 (i)). Matrix E was calculated from the WT A' matrix using Formula (6). We also calculated F 3 = −k 1 del, where k 1 was the number of gaps in alignment A', and del was the insertion/deletion penalty (Formula (5)), as well as F 4 = −k 2 del, where k 2 was the difference in the number of nucleotides between alignments A and A'. Finally, we calculated F (KN', KN') Weight matrix WT A' is the image of alignment A', for which statistical significance can be estimated based on the effectiveness of the alignment between the WT A' matrix and random sequences. If the alignment is random, then matrix WT A' would be random too and F 5 would be close to the value obtained for random sequences (Section 2.2).
Then, sequence S 4 was randomly shuffled to create 200 sequences and matrix WT A' was included in the Q set as described in Section 2.5. Each of the 200 sequences were treated as described in Section 2.2, Section 2.3, Section 2.4. As a result, 200 maxV(n 1 ), each for a different random sequence, were obtained and used to calculate the mean maxV(n 1 ) and variance D(maxV(n 1 )). Then, we calculated Z using Formula (7), where F 5 was used rather than maxV(n 1 ). The MSA constructed by different mathematical methods, including MAHDS, had the same algorithm for calculating Z, which allowed their comparison based on Z values (supplementary material 1).

Algorithm for the Classification of Promoter Sequences from the A. thaliana Genome
The MAHDS algorithm developed in this study was applied to align promoter sequences from the A. thaliana genome (downloaded from https://epd.epfl.ch//index. php [33]). Each promoter had length K (600 nt), which included the region from −499 to +100 bp relative to the first base of the start codon (position +1). There were 22,694 promoter sequences in the analyzed set denoted as PM (supplementary material 1). Since the algorithm shown in Figure 1 requires considerable resources to align all the promoter sequences, we created a sample containing 500 randomly chosen promoters, which were combined into one sequence S with L = 500 × 600 = 30,000 nt. Then, we constructed the MSA as described in Figure 1 and Section 2.1, Section 2.2, Section 2.3, Section 2.4, Section 2.5, Section 2.6 and obtained mV(n 1 ), two-dimensional alignment of sequences S 1 and S, and PWM mW(600, 16).
However, the volume of the PM set was significantly larger than the 500 randomly selected promoters included in sequence S. Furthermore, promoter sequences from the PM set might not show statistically significant alignment with maxW(600, 16). Therefore, we aligned each promoter from the PM set with matrix maxW(600, 16) using Formula (5) and considering the promoter sequence as S with L = 600. As a result, F(L, L) for each promoter from the PM set was calculated and put into the Ves(i) vector (where i is the promoter number).
Then, the promoter sequences with statistically significant Ves(i) were selected from the PM set. To do this, we used PMR(i) sets obtained by random shuffling of the promoter sequence with number i; each PMR(i) set contained 10 3 random sequences of 600 bp. We aligned each sequence from PMR(i) relative to the maxW(600, 16) matrix, calculated F(L, L) denoted as Vesr(j) (j = 1, 2, . . . , 10 3 ), and then determined the mean Ves(j) and variance D(Vesr) and calculated Z for each Ves(i) using Formula (7). If Z > Z 0 , then the promoter was considered to have a statistically significant alignment with the maxW(600, 16) matrix. For Z 0 = 5.0, the probability of random similarity between the promoter and maxW(600, 16) was about 10 −6 . All promoter sequences with Z > 5.0 were assigned to the same class characterized by the maxW(600, 16) matrix.
When we created the first class of the A.thaliana promoter sequences in this way, we removed all the sequences with Z > 5.0 from the PM set and created PM(1) set. The resulting set PM(1) was used to create further classes. The described procedure was repeated for the PM(1) set, from the creation of a new set of 500 randomly selected promoters. As a result, we created a second class of promoters and a PM(2) set. We repeated this procedure for the sets PM(i), i = 1,2, . . . . Each iteration created a new class and the corresponding maxW(600, 16) matrix. If on some iteration, the volume of the PM(i) set became less than 500 sequences, then we chose all the sequences for carrying out the multiple alignment. The multiple alignments generated for each class are shown in Supplementary material 1. The procedure was stopped at the iteration i = i 0 when the size of classes with i > i 0 was less than 100 sequences. We defined the size of classes equal to 100 based on the random sequence analysis. When we performed the procedure on randomly shuffled promoter sequences (total number is 22,694), the volume of the classes ranged from 6 to 27 sequences with an average value of 16 sequences. This means that with using the threshold we kept the type I error rate less than 16%.

Divergence of Dinucleotide Positions in the Created Promoter Classes
We examined the difference in dinucleotide frequencies among the constructed MSAs based on the expected frequencies for each of the 599 positions (from −499 to +100) in the promoter regions. For this, we used the MSA obtained for sequence S in each class (see Section 2.8). Formula (1) was used to fill in the M k (600,16) matrix, where k is the class number from 1 to 25 (see Section 3.2). The frequencies of nucleotides f (j) (j = 1, 2, 3, 4 for a, t, c, and g, respectively) and probabilities p(j) = f (j)/(f (1) + f (2) + f (3) + f (4)) were calculated for S. Then, we calculated the expected dinucleotide frequencies as t(i, j) = Np(i)p(j), where N is the number of promoters in the constructed MSA, base i is observed at position l − 1, and base j is observed at position l in the MSA with class k. Finally, we calculated variance D(i, j) = Np(i)p(j)(1 − p(i)p(j) and then Z k (l, n) = (Mk(l, n) − t(i, j))/D(i, j) for each element of the M k (l,n) matrix, where n ranged from 1 to 16 (n = i + 4(j − 1)) and l-from 2 to 600. Therefore, we could access the difference between the frequency of a base pair (i, j) at position l in the MSA and the expected frequency. Z k (l, n) is the normal approximation for the binomial distribution. The larger Z k (l, n), the greater the difference between the observed and expected frequencies.
To estimate the conservation of position l, we used sum χ k (l) =

16
which follows the χ 2 distribution with 15 degrees of freedom. We transformed χ k (l) into an argument of normal distribution using the normal approximation for chi-squared distribution X k (l) = 2χ k (l) − 2 f − 1, where the number of degrees of freedom (f ) is 15. As a result, function χ k (l) was obtained for multiple alignment k, where l ranges from 2 to 600. The greater χ k (l), the more conserved is position l in the MSA.

Comparison of MSA Methods Using Artificial Sequences
To compare different MSA methods, we generated G(x) sets, where each G(x) contained 100 sequences 600 nt long and x was the average number of substitutions per nucleotide (ranging from 0 to 4.0) for each pair of sequences from the G(x) set. In each G(x) set, we made 25 insertions and 25 deletions in random positions of randomly selected sequences (the number of introduced indels was based on the average number of indels found in the multiple alignment of promoter sequences performed in Section 3.2). We estimated the dependence of statistical significance (Z) on x for the following MSA methods: ClustalW [27] (https://www.genome.jp/tools-bin/clustalw), Clustal Omega [28], T-Coffee [17], Kalign [29], MAFFT [14], and Muscle [18] (https://www.ebi.ac.uk/Tools/ msa/). The alignment was performed with the highest possible gap penalty. Figure 4 shows the Z(x) function for ClustalW and Clustal Omega. The results indicated that both algorithms created statistically insignificant alignments (Z < 5.0) for sets with x > 2.1 and produced a very large number of indels (over 6000) at x > 2.0. The same situation was observed for the other tested algorithms-MAFFT, T-Coffee, Kalign, and Muscle ( Figures 5 and 6), which produced statistically insignificant alignments with x > 1.7, 2.5, 1.6, and 2.2, respectively.
Next, we evaluated the quality of MSAs created by these methods based on the coincidence of the total number of indels in the MSA with those in model sequences. The results showed that at x > 0.5, the number of indels exceeded 50 and at x > 1.0, this number could be several hundred, indicating that the quality of alignments produced by these methods is not high.
The results obtained with the MAHDS algorithm developed in this work are shown in Figure 7. MAHDS could produce statistically significant MSAs at x < 3.7, indicating that our method can build MSA for more diverged sequences than the most successful program T-Coffee (x < 2.5). Furthermore, the mean number of indels was about 48 for all x < 3.7, which is very close to the original number of indels (50) we introduced in each G(x) set.
T-Coffee [17], Kalign [29], MAFFT [14], and Muscle [18] (https://www.ebi.ac.uk/Tools/msa/). The alignment was performed with the highest possible gap penalty. Figure 4 shows the Z(x) function for ClustalW and Clustal Omega. The results indicated that both algorithms created statistically insignificant alignments (Z < 5.0) for sets with x > 2.1 and produced a very large number of indels (over 6000) at x > 2.0. The same situation was observed for the other tested algorithms-MAFFT, T-Coffee, Kalign, and Muscle ( Figures 5 and 6), which produced statistically insignificant alignments with x > 1.7, 2.5, 1.6, and 2.2, respectively.   (https://www.ebi.ac.uk/Tools/msa/). The alignment was performed with the highest possible gap penalty. Figure 4 shows the Z(x) function for ClustalW and Clustal Omega. The results indicated that both algorithms created statistically insignificant alignments (Z < 5.0) for sets with x > 2.1 and produced a very large number of indels (over 6000) at x > 2.0. The same situation was observed for the other tested algorithms-MAFFT, T-Coffee, Kalign, and Muscle ( Figures 5 and 6), which produced statistically insignificant alignments with x > 1.7, 2.5, 1.6, and 2.2, respectively.   Next, we evaluated the quality of MSAs created by these methods based on the coincidence of the total number of indels in the MSA with those in model sequences. The results showed that at x > 0.5, the number of indels exceeded 50 and at x > 1.0, this number could be several hundred, indicating that the quality of alignments produced by these methods is not high.
We also evaluated PRRN [19], MAVID [34], FSA [35], and CHAOS/DIALIGN [36]. However, for all these algorithms, it was not possible to obtain a statistically significant The results obtained with the MAHDS algorithm developed in this work are shown in Figure 7. MAHDS could produce statistically significant MSAs at x < 3.7, indicating that our method can build MSA for more diverged sequences than the most successful program T-Coffee (x < 2.5). Furthermore, the mean number of indels was about 48 for all x < 3.7, which is very close to the original number of indels (50) we introduced in each G(x) set. We also applied MAHDS for G(x) sets containing 500 sequences with 250 indels (this number was based on the results of aligning promoter sequences in Section 3.3). In this case, MAHDS created statistically significant MSAs for x < 4.4 (Figure 7) and the total number of indels in the constructed alignments was around 254 for all x < 4.4, indicating We also applied MAHDS for G(x) sets containing 500 sequences with 250 indels (this number was based on the results of aligning promoter sequences in Section 3.3). In this case, MAHDS created statistically significant MSAs for x < 4.4 ( Figure 7) and the total number of indels in the constructed alignments was around 254 for all x < 4.4, indicating that the increase in the number of analyzed sequences allows creating significant alignment for more divergent sequences (higher x).
The performance of MAHDS was also tested on the set of AluY non-coding transposable repeats abundant in the human genome. The AluY subfamily comprises sequences with an average length of 311 bp and various degrees of similarity [37]. We selected repeats from the AluY subfamily, which had at least 100 Alu elements with same degrees of identity (ID), and created Al(ID) sets. For each of the eight ID intervals (Table 2), there was one sequence, which had the identity with all the others in a given interval and this sequence was not included in the Al set. MSAs were constructed using ClustalW and MAHDS and their statistical significance was calculated following the same procedure as for the model sequences (Figures 4 and 7). The results indicated that statistical significance decreased with ID and that ClustalW could not build statistically significant MSAs for IDs 0.4 ± 0.04 and 0.32 ± 0.04, whereas MAHDS could ( Table 2). For IDs below 0.32 ± 0.04, it was not possible to construct an Al set since there were no repeats with this level of similarity. A rough conversion of ID to x (x = 2 × (1.0 − ID)) is possible only for IDs over 0.45. In the case of lower IDs, we are in the so-called twilling zone, where there is no unambiguous relationship between ID and x and it is not possible to convert ID to x, since pairs of nucleotide sequences can have the same identity in a wide range of substitution numbers per nucleotide. For amino acid and nucleotide sequences, the thresholds are over 25% [38] and 40% [39], respectively. Therefore, for real DNA sequences, it is always possible to calculate ID but not x if ID is below 40%, since the real number of nucleotide substitutions is unknown. Therefore, x can be estimated only for model sequences. We could only conclude that at ID > 0.45 (x ≈ 1.1), MAHDS could produce statistically significant MSAs, but ClustalW could not. The decrease in the x threshold from 2.1 (model sequences, Figure 4) to 1.1 (Alu repeats), which is observed for ClustalW, is due to the decrease in the sequence length from 600 to 311 nt.

Creating classes for Promoter Sequences from the A. thaliana Genome
The iterative classification algorithm for promoter sequences described in Section 2.8 was used to calculate maxW(600, 16) for each class of A. thaliana promoters. The iterative algorithm stopped at i = 25, since the class size for i > 25 was less than 100 sequences. If such a procedure was performed for purely random sequences, the class size ranged from 6 to 27 sequences (16 in average), indicating that the promoter classification was performed with the type I error rate not exceeding 16%. Only the 18th and 24th classes deviated from this pattern, since their sets contained 65 and 47 sequences. We did not stop the classification procedure on these sets, since the size of the subsequent sets was more than 100 sequences.
Consequently, we obtained 25 classes of promoter sequences from the A. thaliana genome. The class distribution of promoter sequences is shown in Figure 8. The largest class contained more than 8888 promoters, followed by classes with 2419, 1071, and 1275 sequences. Classes from the 5th to 8th contained 300-400 promoters, and the smallest (25th) class contained 102 sequences. In total, 25 classes comprised 17,787 sequences constituting over 78% of all promoter sequences (22,703) present in the A. thaliana genome.  The number of classes strongly depended on the threshold value Z0 (Section 2.8). If Z0 increased, the number of classes for the 17,787 promoters was also increased and the number of promoters in each class was consequently decreased. At the same time, the size of the class obtained for random sequences (Section 2.8) was significantly reduced at Z0 > 6.0 (to 1-9 sequences). This classification can be performed for any group of sequences. The resulting MSAs for each class are shown in the appendix.

Conserved Positions in the Created Promoter Classes
Next, we analyzed the conservation of dinucleotides in the A. thaliana promoters based on the constructed MSAs. For this, we calculated χ k (l), where k is the class number (Section 2.9). The graphs for the first five classes (k 1-5) are shown in Figures 9-13. The results for the first class containing 8888 sequences indicated that conserved positions were present almost everywhere along the promoter sequences and that the dinucleotide frequencies were significantly different from the expected frequencies calculated for random sequences (Figure 9). The number of classes strongly depended on the threshold value Z 0 (Section 2.8). If Z 0 increased, the number of classes for the 17,787 promoters was also increased and the number of promoters in each class was consequently decreased. At the same time, the size of the class obtained for random sequences (Section 2.8) was significantly reduced at Z 0 > 6.0 (to 1-9 sequences). This classification can be performed for any group of sequences. The resulting MSAs for each class are shown in the appendix.

Conserved Positions in the Created Promoter Classes
Next, we analyzed the conservation of dinucleotides in the A. thaliana promoters based on the constructed MSAs. For this, we calculated χ k (l), where k is the class number (Section 2.9). The graphs for the first five classes (k 1-5) are shown in Figures 9-13. The results for the first class containing 8888 sequences indicated that conserved positions were present almost everywhere along the promoter sequences and that the dinucleotide fre-quencies were significantly different from the expected frequencies calculated for random sequences (Figure 9).

Conserved Positions in the Created Promoter Classes
Next, we analyzed the conservation of dinucleotides in the A. thaliana promoters based on the constructed MSAs. For this, we calculated χ k (l), where k is the class number (Section 2.9). The graphs for the first five classes (k 1-5) are shown in Figures 9-13. The results for the first class containing 8888 sequences indicated that conserved positions were present almost everywhere along the promoter sequences and that the dinucleotide frequencies were significantly different from the expected frequencies calculated for random sequences (Figure 9).   The region from +1 to +80 (501-580 nt in Figure 9) was the most highly conserved in the first class and a similar phenomenon was also observed in all other classes. It has been previously shown that there is a promoter element (DPE) located 28-33 nt downstream from the start codon [40][41][42], which is widely distributed among promoter sequences and is similar to the TATA box [43,44]. However, only relatively short conserved sequences (about 7 bases) elements are observed in promoter sequences. It can be suggested that the expression of various proteins may depend on the downstream 1-80-bp region, which is necessary for transcription initiation and may also play a role in the other processes. The peak at position 501 corresponded to the first codon (+1) and that at position 470-to the TATA-box (−30) in the first class ( Figure 9). The first codon was also well marked in the third class ( Figure 11) and the TATA box was best identified in the third and fifth classes (Figures 11 and 13), but could also be seen in the other classes (Figures 10 and 12).    Conservation was also found in the region upstream of the start codon (from −499 to -30 bp corresponding to 1-470 bp of the analyzed sequences), suggesting similarities in transcriptional mechanisms for the corresponding genes. The dinucleotide frequencies in this region differed from the expected frequencies calculated for random sequences. It is somewhat surprising that even sufficiently distant positions (the first base in Figure 13) can be conserved.
We also analyzed the statistical significance of multiple alignment of promoter sequences. MSAs for the first four classes with the largest numbers of promoters contained 220-280 indels. The classes were initially created by aligning 500 promoters (Section 2.8).
To estimate the degree of divergence among the promoter sequences, we conducted multiple alignments for 500 artificial sequences containing 250 indels and different numbers of substitutions per nucleotide (Figure 7   The region from +1 to +80 (501-580 nt in Figure 9) was the most highly conserved in the first class and a similar phenomenon was also observed in all other classes. It has been previously shown that there is a promoter element (DPE) located 28-33 nt downstream from the start codon [40][41][42], which is widely distributed among promoter sequences and is similar to the TATA box [43,44]. However, only relatively short conserved sequences (about 7 bases) elements are observed in promoter sequences. It can be suggested that the expression of various proteins may depend on the downstream 1-80-bp region, which is necessary for transcription initiation and may also play a role in the other processes. The peak at position 501 corresponded to the first codon (+1) and that at position 470-to the TATA-box (−30) in the first class ( Figure 9). The first codon was also well marked in the third class ( Figure 11) and the TATA box was best identified in the third and fifth classes (Figures 11 and 13), but could also be seen in the other classes (Figures 10 and 12). The alignments of the promoter sequences from classes 1-4 created using MAFFT, T-Coffee, Kalign, Muscle, ClustalW, and Clustal-Omega programs were not statistically significant (Z < −2.0) and contained a very large number of indels. Since according to our estimations, the promoter sequences of these classes carried approximately 3.7 mutations per nucleotide, the alignment cannot be calculated by these methods.

Discussion
In this work, we developed the MAHDS algorithm for MSA and created a server for its application available at http://victoria.biengi.ac.ru/mahds/main. MAHDS can perform the alignment for nucleotide sequences with a high degree of diversity, which cannot be done with the other method, including ClustalW [27], Clustal-Omega [28], T-Coffee [17], Kalign [29], and MAFFT, which align sequences that accumulated no more than 2.5 substitutions per nucleotide (x < 2.5). In this study, we used default parameters, however, we also tried other settings. Therefore, we examined the statistical significance of aligning model sequences by ClustalW using eight different values of gap open (Go) penalties in the interval from 2 to 30 and gap extension (Ge) penalties from 0.1 to 25% for each Go value. For Z = 5.0, the maximum x was 2.4 and in default conditions, x was 2.1 (Figure 4), which means that variations in Go/Ge penalties can increase x by~15%, which is within the error shown in Figure 4. To estimate the error in calculating Z, we used different G(x) sets (Section 3.1) generated from different initial sequences. It is quite unlikely that a heuristic algorithm ClustalW that progressively builds MSA from a series of pairwise alignments could produce results with Z > 5.0 for x > 2.5. Replacing the PAM matrix with BLOSUM did not significantly affect the results shown in Figure 4 and all changes were within the error shown there. For the other methods used in Section 3.1, the increase in x also did not exceed~15%.
In contrast to the other algorithms, MAHDS calculates statistically significant MSA for x < 3.7; furthermore, if the number of sequences to be aligned increases to 500, then the limit for x increases to < 4.4. We believe that such a capability of MAHDS is very important for MSA of different genes and regulatory sequences and could provide higher precision in their annotation. To align 100 sequences of 600 nt each by the MAHDS method, we used a computer cluster with 64 computing cores (eight Ryzen 7 1700 processors) and the alignment took less than 6 min.
It is usually a challenge to perform the alignment of promoter sequences due to their considerable dissimilarity [45]. Despite the large number of promoters, it has not been possible to obtain a statistically significant multiple alignment of their sequences [46]. By using MAHDS, we could construct a statistically significant multiple alignment for~78% of the known promoter sequences from the A. thaliana genome. The estimated average number of substitutions per nucleotide (x) in the created promoter classes was 3.7, whereas the other methods provided statistically significant MSAs at x < 2.5, which explains why such MSA of promoters has not been obtained previously. Therefore, MAHDS could perform multiple alignment of sequences with a low degree of similarity, i.e., those that have accumulated a considerable number (3.7) of substitutions per nucleotide, which enabled us to classify promoters and reveal some common properties within each class at a statistically significant level. However, the developed MAHDS algorithm, similar to the existing methods for computing MSA (MAFFT, T-Coffee, Kalign, Muscle, ClustalW, and Clustal-Omega), is purely mathematical and, therefore, can only reveal structural similarities among the aligned sequences but not their biological significance. At this point, it is difficult to conclude whether the revealed patterns are inherent to promoters or are also present in other sequences. Similarly, it is unclear whether the observed sequence similarity is associated with a common evolutionary origin of promoter regions or with the general functional role of promoters in gene transcription. These problems should be addressed in special studies, where the promoter classes and MSAs obtained here would be used to identify promoter sequences in various genomes and correlate them with experimental results, which can be done using HMMs [47,48].
The search for potential promoters in various genomes is a major challenge. The currently available predictive algorithms such as TSSW [45], PePPER [49], and G4PromFinder [50] use the existing mathematical methods, which do not provide a statistically significant alignment of diverse sequences. The best algorithms can predict one false positive point in 10 3 -10 4 DNA bases. As a result, it is not possible to distinguish the true promoter from false hits. For the promoter prediction, it will be convenient to use maxW(600, 16) matrices calculated here and the mathematical algorithm described in Section 2.3 as various successively cut genome fragments can be considered as sequence S. We believe that a number of false positive hits will be several hundred or even a few dozen per genome size of 3 × 10 9 bases.
In this work, we used MAHDS to construct the multiple alignment of promoter sequences. The constructed alignment can be used for the subsequent prediction of promoters in genome sequences and to improve the accuracy of these predictions. In this case, to reduce the number of false positives it is important to get MSA with the highest statistical significance (as we wrote above). However, if MAHDS is used for other purposes, it can be important to estimate the biological correctness of the constructed MSA. Methods developed for the evaluation of MSA accuracy and correctness utilize structural information about corresponding proteins [51]. However, here we consider DNA sequences for which, according to our estimate, x~3.7. To our knowledge, there are no biologically correct MSA constructed for the set of all promoter sequences. As well as there are no such MSA for any DNA sequences with x~3.7. The point is that both pairwise and multiple alignments could be constructed only for sequences with x < 2.5, as shown in Figures 4-6. Therefore, in this work, we focused on the statistical significance of the constructed alignments, which is a common statistical technique. We believe that a biologically correct alignment using MAHDS for x > 2.5 can be constructed in the future if one can find amino acids or DNA sequences with a similar structure but with sequences' x~3.5-3.7. Then, it will be possible to improve the accuracy of the MAHDS algorithm in the same way as it was done for other methods [51]. However, even the present form of MAHDS can be used to predict promoter sequences, as we wrote in the paragraph above.
In our future research, we will focus on the development of the software for protein MSA based on the same method. For protein sequences, the number of possible amino acid pairs is 400 and a large number of sequences (over 2000) would be required to fill in the maxW(L/N, 400) matrix, where L is the total sequence length, N is the number of sequences, and L/N is the average length of sequences. Therefore, it will be problematic to use this approach for small samples (less than 100 sequences), when the statistically significant filling of the maxW(L/N, 400) matrix is not possible. We intend to apply a series of optimization procedures to address this issue.

Conclusions
In this study, we applied a new mathematical method (MAHDS) for performing multiple alignment of highly divergent sequences, i.e., sequences that have in average more than 2.5 substitutions per position (x). The results indicated that most of the existing methods could produce statistically significant alignments only for the sets with x < 2.5, whereas MAHDS could operate on sequences with x = 4.4. We have created a web server for multiple alignment of nucleotide sequences located at http://victoria.biengi.ac.ru/mahds/ main. Then we performed multiple alignments of promoter sequences from the A. thaliana genome and created 25 classes of promoter sequences. Each class of promoter sequences has a statistically significant multiple alignment. The obtained multiple alignments can be used to improve methods for searching for promoter sequences in a variety of genomes.