4.1. Algorithm to Calculate Multiple Alignment
The MAHDS method was originally developed for the multiple alignment of promoter sequences and DNA sequences with weak similarity [18,19,20]. It uses mathematical approaches and programs that were previously created to search for tandem repeats in various sequences [33,45,46,47,48,49]. In the present work, the web server at http://victoria.biengi.ac.ru/mahds/main (accessed on 3 June 2020) was used to construct multiple alignments of various amino acid sequences by the MAHDS method.
Let us briefly consider the mathematical algorithm underlying the MAHDS method, which is described in detail in [18]. Suppose we need to calculate a multiple alignment for a set of N sequences denoted as SI. The optimal multiple alignment (MA) for this set is characterized by the maximum of some function Ψ (maxΨ), which is calculated from MA. We then define the image of the multiple alignment (IMA) as a function of MA, IMA = f(MA), and also consider the inverse function f−1, f−1(IMA) = MA, which allows us to construct MA from IMA and the SI sequences. Specifying functions f and f−1 makes it possible to unambiguously convert MA to its image (IMA) and vice versa, i.e., to establish a one-to-one correspondence between MA and IMA.
Several existing algorithms, including progressive alignment [6], HMM [7], and others [3,4], can produce an alignment MA’ that is close to the optimal MA. In these methods, the construction of the multiple alignment is defined by function ζ: MA’ = ζ(SI). The main disadvantage of this approach is that the alignment is based on pairwise comparisons of sequences from set SI, which precludes building a statistically significant MA’ when the number of nucleotide substitutions x is greater than 2.4 [18]. A statistically significant MA’ has a probability P(Ψ > Ψ0) < α, where Ψ0 is the value of function Ψ for MA’. This probability is calculated by aligning a large number of sets SR, each of which comprises randomly shuffled sequences of set SI, and determining the distribution of function Ψ for these alignments. The value of α can be chosen as 0.05 if the alignment is built once.
However, there is another way of constructing MA’, which is not based on pairwise alignments and direct calculation of MA’ = ζ(SI) but instead uses patterns (images) of random multiple alignments (IMAR) [18]. These images can be optimized with a genetic algorithm so that a random image IMAR is turned into an image IMAm that is as close as possible to IMA = f(MA). The degree of closeness between IMAm and IMA cannot be assessed directly, since MA and its image IMA = f(MA) are unknown at x > 2.4 for most protein families. However, we can get as close as possible to IMA by increasing the similarity between each sequence from set SI and IMAm, which can be measured using global two-dimensional alignment. Each such comparison produces a maximum value of the similarity function Fmax(i) (i = 1, 2, ..., N), and the sum

$$m_F = \sum_{i=1}^{N} F_{\max}(i)$$

is used as the measure of similarity between IMAm and IMA. The greater the mF obtained with the same global alignment parameters, the greater the similarity between IMAm and each sequence from set SI and the closer IMAm is to IMA. We also use mF as the objective function when running the genetic algorithm to calculate IMAm, which is then used to define the multiple alignment: MAm = f−1(IMAm).
In this approach, we only need to define functions f and f−1. As function f, we can take the algorithm for creating a position weight matrix (PWM), which serves as the image of the multiple alignment, i.e., IMA = PWM and IMAm = PWMm (described below in Section 4.2). As function f−1, we take the algorithm for the global alignment of PWMm with each sequence from set SI (described earlier [50] and in Section 4.3 and Section 4.4). As a result, it is possible to build a multiple alignment with good accuracy using PWMm. Since the sequences in set
SI can have different lengths, we calculate the average length
of the sequence in the set and create a PWM in the range from
to
.
Unlike the method described in [18], here we did not combine all sequences from set SI into one large sequence to calculate the global alignment with the PWM but instead aligned the PWM with each SI sequence separately, which significantly improved the quality of the MSA. Furthermore, the alphabet of SI sequences (Ain) was expanded from 4 to 20 symbols, corresponding to the number of amino acids, so that protein sequences could be aligned. The scheme of the algorithm is shown in Figure 1.
We start with N sequences forming set SI (step 1). In step 2, we create a set (Q) of random PWMs, each of which serves as a random image IMAR. In step 3, we use a genetic algorithm to optimize the PWMs from set Q and determine PWMm = IMAm. In step 4, we create the multiple alignment MAm of the sequences from set SI that corresponds to the found PWMm. Finally, in step 5, we evaluate the statistical significance of the alignment by the Monte Carlo method.
4.2. Creation of Set Q of Random PWMs
To create set Q of random PWMs, PWM rows were assigned to the 20 amino acids and columns to integers from 1 to L; thus, each PWM had 20 rows and L columns. A random PWM was obtained from a random amino acid sequence S1 of length L × k (k = 10^3), in which the frequencies of amino acids corresponded to those in the sequences of set SI. Then, a sequence containing the integers from 1 to L was generated, copied k times, and joined in tandem to yield sequence S2.
Next, we filled in frequency matrix M(s1(i), s2(i)) = M(s1(i), s2(i)) + 1 for all i from 1 to L × k, determined sums
and
, and estimated probability
. After this, each PWM element was calculated as
. The procedure was repeated 500 times, yielding set
Q (step 2,
Figure 1).
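The construction of one random PWM can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `random_pwm`, the residue ordering in `AMINO_ACIDS`, and in particular the normalization of each cell (a normal-approximation z-score of the observed count against the count expected from the row and column sums) are assumptions made for this sketch and may differ from the formulas used by the authors.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order, for illustration only


def random_pwm(L, residue_freqs, k=1000, rng=None):
    """Build one random 20 x L matrix from sequences S1 and S2.

    residue_freqs maps each amino acid to its background frequency in set SI.
    """
    rng = rng or random.Random()
    n = L * k
    # S1: random residues drawn with the background frequencies.
    weights = [residue_freqs[a] for a in AMINO_ACIDS]
    s1 = rng.choices(AMINO_ACIDS, weights=weights, k=n)
    # S2: column numbers 1..L repeated k times in tandem (0-based here).
    s2 = [i % L for i in range(n)]

    row = {a: r for r, a in enumerate(AMINO_ACIDS)}
    M = [[0] * L for _ in AMINO_ACIDS]
    for a, j in zip(s1, s2):
        M[row[a]][j] += 1  # frequency matrix update

    x = [sum(r) for r in M]                                   # row sums
    y = [sum(M[i][j] for i in range(20)) for j in range(L)]   # column sums

    pwm = [[0.0] * L for _ in AMINO_ACIDS]
    for i in range(20):
        for j in range(L):
            p = x[i] * y[j] / (n * n)
            var = n * p * (1.0 - p)
            # ASSUMPTION: z-score of the observed count versus its expectation.
            pwm[i][j] = (M[i][j] - n * p) / math.sqrt(var) if var > 0 else 0.0
    return pwm


if __name__ == "__main__":
    uniform = {a: 1.0 for a in AMINO_ACIDS}
    Q = [random_pwm(30, uniform, k=100, rng=random.Random(s)) for s in range(5)]
    print(len(Q), len(Q[0]), len(Q[0][0]))
```

Repeating the call with different random states yields a set of matrices analogous to set Q; the transformation to fixed R2 and Kd is not included in this sketch.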
Then, we transformed the matrices from set Q, which was necessary to ensure that the distribution of Fmax(i) (i = 1, 2, ..., N) was similar when sequences from set SI were aligned with each random PWM. For normalization purposes, two restrictions were imposed on PWMs:
where p1(i) and p2(i) are the probabilities of amino acids in sequence S1 and of symbols in sequence S2, respectively. Upon transformation, any PWM used in the algorithm must be reduced to the given R2 and Kd. The matrix transformation procedure is described in detail in [33].
However, R2 cannot be held constant: as L grows, the number of cells in the PWM increases, and with a fixed R2 the cell values would not stay at approximately the same order of magnitude (as needed) but would tend toward zero in order to satisfy Equation (1). Therefore, R2 was specified as a function of the period and the cardinality of the SI sequence alphabet: R2 = RL × Ain × L, where RL is a multiplier parameter that can be used to scale R2.
As the sequences in set SI may have different lengths, we created sets of matrices Q(L) with length L ranging from
to
(see Section 4.1). Each Q(L) set contained 500 PWMs.
4.3. Using a Genetic Algorithm to Optimize PWMs
Then, we calculated PWMm using a genetic algorithm described in detail in [18,33] (Figure 1, step 3). Briefly, the PWMs of sets Q(L), each of size |Q| = 500 (Section 4.2), were treated as organisms. To calculate the objective function for each PWM from set Q(L), a global alignment with each sequence from set SI was carried out. We calculated similarity function Fmax(l) at point (L1(l), L), where l is the sequence number in set SI, L is the number of columns in the PWM, and L1(l) is the length of the l-th sequence from set SI. Then, the mF value was calculated and used as the objective function for PWMs from set Q(L):

$$m_F(k) = \sum_{l=1}^{N} F_{\max}(l)$$

where k is the index of the matrix from set Q(L).
The PWMs were ranked by mF in descending order, and the matrix with the largest mF value, mF(1), was saved, whereas the two PWMs with the smallest mF values were excluded from set Q(L). Then, we created two children from the remaining PWMs; as before [18,33], each child was generated by gluing together two parent matrices chosen at random, with the probability of selecting a particular matrix increasing with its mF value. In addition, random mutations were introduced into 50 randomly selected PWMs (excluding the children created at this step) by replacing one random element with a random value uniformly distributed in the interval from −10 to 10. After these changes, a new Q(L) set was obtained, its matrices were compared with the sequences from set SI, and a new vector mF(k) (k = 1, 2, ..., 500) was calculated. The procedure of modifying set Q(L) was iterated until mF(1) had not increased over the last 10 cycles. As a result of running the genetic algorithm on set Q(L), mF(1) (denoted as mFL(1)) and the corresponding PWM were obtained.
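One generation of such a genetic algorithm might look like the following Python sketch. It is a simplified illustration, not the authors' implementation: the function name `ga_generation`, the rank-based parent weights, and the modelling of "gluing" as a single-point crossover along the columns are assumptions, and the matrices are assumed to have at least two columns.

```python
import random


def ga_generation(population, objective, rng, n_mutations=50, low=-10.0, high=10.0):
    """Run one generation; return (new_population, best_matrix).

    population is a list of matrices (lists of rows); objective plays the role of mF.
    """
    ranked = sorted(population, key=objective, reverse=True)
    best = ranked[0]          # matrix with the largest objective value is saved
    survivors = ranked[:-2]   # the two worst matrices are excluded

    # ASSUMPTION: selection probability grows with rank (best gets the largest weight).
    weights = list(range(len(survivors), 0, -1))
    children = []
    for _ in range(2):
        a, b = rng.choices(survivors, weights=weights, k=2)
        cut = rng.randrange(1, len(a[0]))  # ASSUMPTION: single-point column crossover
        children.append([ra[:cut] + rb[cut:] for ra, rb in zip(a, b)])

    # Mutate one random element in each of up to n_mutations surviving matrices.
    chosen = rng.sample(range(len(survivors)), min(n_mutations, len(survivors)))
    for idx in chosen:
        m = [r[:] for r in survivors[idx]]  # copy, so the saved best stays intact
        i = rng.randrange(len(m))
        j = rng.randrange(len(m[0]))
        m[i][j] = rng.uniform(low, high)
        survivors[idx] = m

    return survivors + children, best


if __name__ == "__main__":
    rng = random.Random(1)
    pop = [[[rng.uniform(-1, 1) for _ in range(6)] for _ in range(20)] for _ in range(12)]
    toy_objective = lambda m: sum(map(sum, m))  # stand-in for mF
    pop, best = ga_generation(pop, toy_objective, rng, n_mutations=4)
    print(len(pop), round(toy_objective(best), 3))
```

In a full run, the generation step would be repeated until the best objective value has not improved for 10 consecutive cycles, with the objective computed by aligning every sequence of set SI against each matrix.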
The iterative procedure was performed in each Q(L) set for L from to . Finally, we chose the L value for which mFL(1) was maximal; it was denoted as Lm and the corresponding matrix as PWMm.
4.4. Global Alignment of PWMs from Set Q and Sequences from Set SI
Next, sequence SI(l) of length L1(l) was aligned with a PWM of number k from set Q(L). Sequence SI(l) was denoted as S3 and its elements as s3(i), where i = 1, 2, ..., L1(l). We also used sequence S4 with elements s4(j), where j = 1, 2, ..., L.
To construct the alignment of sequences S3 and S4, we used the global alignment algorithm [51] with an affine gap penalty function, taking the values of a PWM from set Q(L) [33] instead of a substitution matrix. The similarity function F was calculated as:
where d and e are the gap opening and gap extension penalties, respectively.
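To make the idea concrete, the sketch below scores a global alignment of a sequence against the columns of a PWM with affine gap penalties, using the standard three-state (Gotoh-style) recursion. It is a generic textbook formulation written for illustration; the exact recursion of Formula (4), the boundary conditions, and the penalty values used in MAHDS may differ. The name `global_align_score` and the `row_index` mapping are hypothetical.

```python
NEG_INF = float("-inf")


def global_align_score(seq, pwm, row_index, *, d, e):
    """Return the best global score of seq against all columns of pwm.

    pwm[r][j] is the weight of residue row r at column j;
    d and e are the gap opening and gap extension penalties (positive numbers).
    """
    n, m = len(seq), len(pwm[0])
    # M: residue aligned to column; X: residue left unaligned; Y: column skipped.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Y[0][j] = -d - (j - 1) * e

    for i in range(1, n + 1):
        r = row_index[seq[i - 1]]
        for j in range(1, m + 1):
            # The PWM cell replaces the usual substitution-matrix score.
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pwm[r][j - 1]
            X[i][j] = max(M[i - 1][j] - d, X[i - 1][j] - e, Y[i - 1][j] - d)
            Y[i][j] = max(M[i][j - 1] - d, Y[i][j - 1] - e, X[i][j - 1] - d)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    toy_pwm = [[2.0, -1.0], [-1.0, 2.0]]   # two-letter toy alphabet, two columns
    rows = {"A": 0, "B": 1}
    print(global_align_score("AB", toy_pwm, rows, d=3.0, e=1.0))
```

Summing this score over all sequences of a set gives a quantity analogous to mF for one matrix; recovering the alignment itself would additionally require storing traceback pointers.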
Fmax(l) = F(L1(l), L) (Section 4.2) was used as the measure of similarity between sequences S3 and S4. We also filled in matrix F’, in which each cell (i, j) stores the cell from which cell (i, j) was reached when applying Formula (4). As the measure of similarity between the sequences SI(l) and the PWM from set Q(L), we chose:
The PWMm of length Lm (Section 4.3) was aligned with the sequences SI(l) using matrix F’, and the resulting alignments were used to construct the multiple alignment of sequences SI(l) (l = 1, 2, …, N), which was evaluated by mF(PWMm) calculated with Formula (6).
4.6. Estimating the Statistical Significance of Multiple Alignments
4.6.1. Assessing the Statistical Significance of MAm
To evaluate the statistical significance of the multiple alignment MAm obtained with PWMm (Figure 1, step 5), we used the Monte Carlo method. First, we generated 300 sets of random sequences Qr by randomly shuffling the residues in the SI sequences. For each Qr, we calculated mF(PWMm) using Formula (6); from these values, we obtained the arithmetic mean $\overline{m_F(PWM_m)}$ and variance D(mF(PWMm)) and, finally, Z:

$$Z = \frac{m_F(PWM_m) - \overline{m_F(PWM_m)}}{\sqrt{D(m_F(PWM_m))}} \quad (7)$$

where the first term in the numerator is the value obtained for the original set SI. A Z value greater than threshold Z0 indicated that the alignment obtained with PWMm was statistically significant.
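The Monte Carlo estimate can be sketched in Python as below. This is a minimal illustration assuming that the score of a sequence set is supplied as a callable (`score_set`, a hypothetical stand-in for mF computed with a fixed PWMm) and that the population variance of the random scores is used; the source does not state which variance estimator was applied.

```python
import random
import statistics


def shuffled_copy(sequences, rng):
    """Shuffle the residues within each sequence, preserving its composition."""
    result = []
    for s in sequences:
        chars = list(s)
        rng.shuffle(chars)
        result.append("".join(chars))
    return result


def z_score(sequences, score_set, n_random=300, rng=None):
    """Z value of the observed score against scores of shuffled sequence sets."""
    rng = rng or random.Random()
    observed = score_set(sequences)
    random_scores = [score_set(shuffled_copy(sequences, rng)) for _ in range(n_random)]
    mean = statistics.fmean(random_scores)
    variance = statistics.pvariance(random_scores)  # ASSUMPTION: population variance
    if variance == 0:
        raise ValueError("random scores have zero variance; Z is undefined")
    return (observed - mean) / variance ** 0.5


if __name__ == "__main__":
    # Toy score: number of identical neighbouring residues (not the MAHDS score).
    def toy_score(seqs):
        return sum(a == b for s in seqs for a, b in zip(s, s[1:]))

    z = z_score(["AAAAABBBBB"] * 5, toy_score, n_random=300, rng=random.Random(0))
    print(z > 0)
```

The resulting Z would then be compared with a chosen threshold Z0.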
4.6.2. Estimating the Statistical Significance of an Arbitrary MA
The algorithm described above can be applied to evaluate the statistical significance of any alignment. Let us take an MA with length K containing α sequences, each denoted as and its elements as (where k = 1, 2, ..., α and j = 1, 2, ..., K). Let us also introduce sequence S6 with elements s6(j), j = 1, 2, ..., K representing consecutive numbers from 1 to K; S6 would be used as column numbers for MA. First, we transform MA to remove those columns where the number of amino acids is less than α/2 and that of deletions is more than α/2; as a result, multiple alignment MA’ of length K’ is obtained. At the same time, sequence S6 is transformed by eliminating the numbers equal to those of the columns to be removed. As a result, we obtain sequence with length K’. Then, we calculate the amino acid frequency matrix for MA’ denoted as V(i,j), where i = 1, 2, …, 20 and j = 1, 2, …, K’; each element shows the number of type i residues at position j of MA’. Next, we calculate the PWM based on matrix V(i,j) as: , where U is the total number of amino acids in MA’, , X(i) is the number of type i residues, and Y(j) is the total number of residues in the j-th column of MA’. Then, we transform PWM(i,j) according to Formulas (2) and (3).
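The column-filtering step described above can be illustrated with a short Python sketch. It assumes that the alignment is given as equal-length strings with "-" marking deletions; the function name `drop_gap_rich_columns` is hypothetical, and columns with exactly half residues are kept here because the text removes only columns with fewer than α/2 residues.

```python
def drop_gap_rich_columns(alignment, gap="-"):
    """Remove columns in which fewer than half of the rows carry a residue.

    Returns the filtered alignment and the surviving 1-based column numbers,
    which play the role of sequence S6 after the transformation.
    """
    n_rows = len(alignment)
    width = len(alignment[0])
    keep = [
        j for j in range(width)
        if sum(row[j] != gap for row in alignment) >= n_rows / 2
    ]
    filtered = ["".join(row[j] for j in keep) for row in alignment]
    column_numbers = [j + 1 for j in keep]
    return filtered, column_numbers


if __name__ == "__main__":
    ma = ["AC-D", "A--D", "AC-E", "-C-E"]
    print(drop_gap_rich_columns(ma))
```

The frequency matrix V(i,j) would then be counted over the columns of the filtered alignment.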
After calculating PWM′(i,j), the weight of MA can be determined. First, we compute the sum for all k = 1, 2, ..., α and j = 1, 2, ..., K: . Next, we obtain the Del(i) vector, which shows the number of deletions with size i in MA. Finally, we calculate , which is considered as the MA weight. Then, statistical significance should be evaluated by estimating mean and variance D(FMA). For this, symbols characterizing deletions are removed from sequences, which are then randomly shuffled to yield sequences, where k = 1, 2, ..., α, and the statistical significance is determined for matrix PWM′(i,j) and sequences according to Formula (7), assuming that , N = K, and PWMm = PWM′(i,j). Thus, we calculate the Z value for MA, which enables comparison of different MAs by their statistical significance.
4.7. Creation of Artificial Sequences to Compare Different Methods of Constructing MAs
To compare the performance of MA methods T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK with that of the MAHDS algorithm, we used artificial sequences, which can be created with a given number of amino acid substitutions and indels for convenience of analysis. To generate an artificial sequence, we first created a random ancestor sequence Anc of length L and then a set of descendant sequences Des(i) (i = 1, 2, …, 100) by adding a given number of random substitutions and/or indels to Anc in a random order. In case of random substitutions without indels, the lengths of child sequences were the same and equal to L, and the number of substitutions in each Des(i) sequence with respect to Anc was denoted as s1. Let us show that the presence of s1 random substitutions in each Des(i) sequence leads to 2s1 random substitutions between any two Des sequences.
When one random substitution is made in sequence Anc, the probability that the residue at a given position j is the one changed is 1/L. Then, the probability that residue j has not been replaced after the s1 substitutions that produce sequence Des(1) is:

$$P_0 = \left(1 - \frac{1}{L}\right)^{s_1} \quad (8)$$
We assume that all amino acid substitutions occur with equal probability during the creation of the Des(1) sequence. For the remaining positions, which have been replaced at least once, the last replacement at a given position may restore the original amino acid. The probability of falling into this category is:

$$P_1 = \frac{1 - P_0}{A_{in}} \quad (9)$$

where $A_{in}$ is 20. Thus, the probability that residue j in sequence Des(1) matches the residue at the same position in Anc is:

$$P_m = P_0 + P_1 \quad (10)$$
The events described in Formulas (9) and (10) are applicable to all situations, when a residue at position j remains unchanged after random substitutions.
Now, let us consider the case when sequences Des(2) and Des(3) are independently generated from sequence Anc by making s2 random substitutions in each. Then, for both sequences, Formulas (8)–(10) can be written as $P_{02} = (1 - 1/L)^{s_2}$, $P_{12} = (1 - P_{02})/A_{in}$, and $P_{m2} = P_{02} + P_{12}$. Let Pm3 be the probability that the residues at the same position of Des(2) and Des(3) match. If Pm3 = Pm, then the evolutionary distance between Anc and Des(1) is equal to that between Des(2) and Des(3), which means that, on average, the same number of amino acids coincide between Anc and Des(1) and between Des(2) and Des(3). In Des(2) and Des(3), the proportion of residues unchanged compared to Anc is P02; then, the proportion of residues preserved at the same positions of both child sequences is $P_{02}^2$, since it is the probability for a residue to remain in both Des(2) and Des(3). Other amino acids matching in Des(2) and Des(3) include those that have been substituted with the same residue in both sequences, and the probability of such events is $(1 - P_{02})^2 / A_{in}$.
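Finally, there is a probability that a residue retained from Anc in Des(2) matches the one randomly replaced in Des(3), and vice versa. The sum of the two probabilities is $2P_{02}(1 - P_{02})/A_{in}$. Then, $P_{m3} = P_{02}^2 + (1 - P_{02})^2/A_{in} + 2P_{02}(1 - P_{02})/A_{in}$ and $P_{m3} = P_{02}^2 + (1 - P_{02}^2)/A_{in}$. If Pm3 = Pm, then $P_0 + (1 - P_0)/A_{in} = P_{02}^2 + (1 - P_{02}^2)/A_{in}$, i.e., $P_0 = P_{02}^2$. If the values for P02 and P0 are substituted, then:

$$\left(1 - \frac{1}{L}\right)^{s_1} = \left(1 - \frac{1}{L}\right)^{2 s_2}, \quad \text{i.e.,} \quad s_1 = 2 s_2 \quad (11)$$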
Formula (11) means that s2 random substitutions introduced into each Des(i) sequence relative to Anc are equivalent to 2s2 substitutions in a pairwise comparison of Des(i) sequences, provided that all child sequences in set Des(i) (i = 1, 2, ..., 100) are generated independently of each other. We used this property to calculate the average number of mutations between Des(i) sequences.
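The equivalence can be checked numerically with a few lines of Python. The sketch assumes the substitution model implied by the derivation: each substitution hits a uniformly chosen position and writes a residue drawn uniformly from all 20 amino acids, so it may restore the original one. The function names are hypothetical.

```python
def p_unchanged(L, s):
    """Probability that a given position is never hit by s random substitutions."""
    return (1.0 - 1.0 / L) ** s


def p_match_ancestor(L, s, alphabet=20):
    """Probability that a descendant residue equals the ancestral residue."""
    p0 = p_unchanged(L, s)
    return p0 + (1.0 - p0) / alphabet


def p_match_siblings(L, s, alphabet=20):
    """Probability that two independent descendants agree at a position."""
    p0 = p_unchanged(L, s)
    both_kept = p0 * p0
    both_replaced_same = (1.0 - p0) ** 2 / alphabet
    one_kept_other_restored = 2.0 * p0 * (1.0 - p0) / alphabet
    return both_kept + both_replaced_same + one_kept_other_restored


if __name__ == "__main__":
    L, s2 = 200, 150
    # Two siblings with s2 substitutions each versus ancestor/descendant with 2*s2.
    print(p_match_siblings(L, s2), p_match_ancestor(L, 2 * s2))
```

Under these assumptions, the two printed probabilities agree up to floating-point rounding for any L and s2, which is the content of the relation between the ancestor-descendant and descendant-descendant distances.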