Abstract
There are many different algorithms for calculating the distances between DNA chains, and different algorithms give different results. This paper does not consider which of the classical algorithms is better; instead, it shows the inconsistency of two classical algorithms, specifically those of Jaro–Winkler and Needleman–Wunsch. To do this, we consider distance matrices based on both of these algorithms. We explain that, ideally, the triangle formed by each triple of distances in a distance matrix should be acute-angled isosceles. In reality, this property is violated, and we can determine the badness of each such triangle. Two algorithms for determining distances are considered consistent when their sequences of badness values are arranged in the same order, and the more these orders differ, the less consistent the algorithms are. In this paper, we consider the distance matrices for the two mentioned algorithms, calculated for the mitochondrial DNA of 32 species of monkeys belonging to different genera. For them, 4960 triangles are formed in each matrix, and we calculate the values of the rank correlation between the two sequences of badness. We obtain very small values (with different methods of calculating the rank correlation, they do not exceed 0.14), which indicates the inconsistency of the two algorithms under consideration.
Keywords:
heuristic algorithms; DNA chains; distance matrix; Jaro–Winkler algorithm; Needleman–Wunsch algorithm; pair correlation
MSC:
62P10; 92B15
1. Introduction and Motivation
There are many different algorithms for calculating the distances between sequences of symbols of different natures, and, in particular, between DNA chains, which are the main subject of research in this article. At the same time, both biologists and specialists in applied mathematics consider some of the provisions used in these algorithms to be unshakable. (However, “Nothing in Biology Makes Sense Except in the Light of Evolution” (1973, Theodosius Dobzhansky). We believe that both the title of this essay and its content are directly related to the study of DNA sequences in general and to algorithms for calculating distances between them in particular.) Certainly, this applies to the once-calculated distance between the genomes of different species.
However, there are some important points to make about this case:
- Firstly, if we talk specifically about mammals, whose genomes are the object of the research of this paper, then one of three options is usually used as the object for analysis:
  - Mitochondrial DNA (mt DNA, which will be the main object of this research);
  - The “tails” of Y chromosomes;
  - The major histocompatibility complex.
- Secondly, different algorithms are used to determine the distances between genomes, and, in the author’s opinion, they are all modifications of the Levenshtein metric (sometimes significant modifications, as a result of which their relation to the Levenshtein metric is not always obvious). In this paper, we do not compare these different algorithms.
- Thirdly, the main difficulty encountered in calculating the distance between such sequences is their very long length. For example, the length of the human mt DNA sequence, i.e., very short DNA, exceeds 16,000 characters, while the total length of human DNA exceeds 3,000,000,000 characters. Therefore, it is impossible to solve real problems with the exact calculation of the Levenshtein distance, and all the algorithms used in them can be called heuristic.
- Fourthly, it is possible to conduct distance studies either before “combining triples of letters into one” or after such combining. However, for the task described in this paper, this is not fundamental.
- Fifthly, for 4 variants of nucleotides in the genome, natural selection results not in 64 variants of the triples but in 21 variants only. Each of these options can be considered an encoded letter. Moreover, as is written in the popular scientific literature (we shall not give specific references), at least four artificial amino acids have been designed, which can “on full grounds” enter artificial DNA chains, and the triples of amino acids in such artificial DNA chains can be replaced by fours or even fives.
Thus, different algorithms for determining the distances between DNA chains give different results. At the same time, the general trend is correct: for any adequate algorithm, the distance between the genomes of humans and chimpanzees is less than that between the genomes of humans and, e.g., elephants. However, we would like to obtain a more detailed, quantitative answer about this distance. The paper is devoted to one of the issues of this big topic.
However, when considering such different algorithms for determining the distances between genomes, the author has long had an assumption about the great inconsistency of the Jaro–Winkler and Needleman–Wunsch algorithms. The main subject of the paper is the quantitative verification of this hypothesis. We shall show in the paper that it is fulfilled, i.e., these algorithms are not consistent.
We calculate such a quantitative characteristic as follows: First, for the algorithm in question, we calculate the distance matrix between pairs of genomes. Note that, for example, for a matrix of dimension 32, we have only $\frac{32 \cdot 31}{2} = 496$ cells to fill in the matrix, but the algorithms mentioned above take quite a long time to run. Using the Needleman–Wunsch algorithm for mt DNA, we fill the 496 cells of the 32-dimensional matrix in about a day of operation of an average modern computer.
Next, in this matrix, we consider all the triangles. It is important to note that there are quite a lot of them. For example, for a matrix of dimension 32, there are $\binom{32}{3} = \frac{32 \cdot 31 \cdot 30}{6} = 4960$ triangles.
Each of the triangles has a special characteristic, the so-called badness, which will be discussed in detail in Section 2.
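As a quick check of the counts quoted above, the following minimal sketch computes the number of cells above the main diagonal and the number of triangles for 32 species. The variable names are illustrative and do not come from the paper.

```python
from math import comb

n_species = 32

# Unordered pairs of species: cells above the main diagonal of the matrix.
n_pairs = comb(n_species, 2)

# Unordered triples of species: triangles formed by three pairwise distances.
n_triangles = comb(n_species, 3)

print(n_pairs, n_triangles)
```

The two numbers agree with the 496 cells and 4960 triangles mentioned in the text.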
For two different algorithms, as a result of such constructions, we can calculate sufficiently long sequences of badness values for all triangles. Next, we calculate the values of the pair correlation for these sequences, and we believe that acceptably large values of the latter value (that is, or more) indicate that the algorithms are consistent.
However, in our situation, real calculations give results that are very far from such values. Specifically, with different methods of calculating the correlation coefficient, values for the examples we are considering are obtained in the range from to (see details below). Furthermore, it would not be an exaggeration to say that it is good that the correlation coefficients are positive in general. Thus, as we have already said, the Jaro–Winkler and Needleman–Wunsch algorithms are inconsistent.
At the end of this section, we note a few small general remarks:
- Firstly, everything said here is described in more detail below, but we do not provide detailed content by section in the introduction.
- Secondly, the technique considered in the paper, which can be called an algorithm for determining the consistency of algorithms for calculating distances between strings, is applicable to any pair of such distance calculation algorithms and to any set of species.
- Thirdly,
  - Algorithms for determining the distances between two specific strings (in particular, DNA sequences) are heuristic due to the total size of the data under consideration;
  - Algorithms for calculating the badness for triples of DNA sequences are therefore heuristic algorithms for analyzing heuristic algorithms;
  - Algorithms for determining consistency between two distance calculation algorithms can therefore be called heuristic algorithms for analyzing heuristic algorithms designed to analyze heuristic algorithms.
In other words, a “triple embedding” appears.
2. Preliminaries: DNA Chains, Their Distance and Statistical Characteristics
The theory presented in the paper related to the analysis of DNA sequences is based on the author’s previous works, among which it is primarily worth noting [1,2,3,4]. The standard concepts and formulas of mathematical statistics are consistent with the monographs [5,6].
Above, we mentioned the triangles formed by the distances between genomes; let us now explain where they come from. We continue the example of chimpanzees and humans but add a third very close species.
For this interesting example, let us consider the three following species: human (H), chimpanzee (C) and bonobo (B). According to biologists,
- The ancestors of both of these apes and the ancestors of humans diverged about 7,000,000 years ago;
- The ancestors of chimpanzees and bonobos diverged about 2,500,000 years ago.
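At the same time, the exact values are not particularly important. The only important thing is that the triangles formed by the corresponding three distances should ideally be acute-angled isosceles. Indeed, the distances H–C and H–B correspond to the same divergence time and should therefore be equal, while the distance C–B is smaller; a triangle with two equal sides and a shorter third side is isosceles and acute-angled. Moreover, all of the above must be fulfilled for any three species.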
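The geometric condition above can be checked numerically. The sketch below computes the angles of a triangle from its three sides using the law of cosines and tests whether the triangle is acute-angled isosceles. The function names, the tolerance, and the sample side lengths (which simply reuse the divergence times in millions of years as stand-in distances) are assumptions made for illustration, not values from the paper.

```python
import math

def triangle_angles(a, b, c):
    """Angles in degrees opposite sides a, b and c (law of cosines)."""
    def angle(opposite, s1, s2):
        cos_value = (s1 * s1 + s2 * s2 - opposite * opposite) / (2 * s1 * s2)
        # Clamp to [-1, 1] to guard against rounding and degenerate triples.
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_value))))
    return angle(a, b, c), angle(b, a, c), angle(c, a, b)

def is_acute_isosceles(a, b, c, rel_tol=1e-9):
    """True if at least two sides are equal (within rel_tol) and all angles are below 90 degrees."""
    isosceles = (math.isclose(a, b, rel_tol=rel_tol)
                 or math.isclose(b, c, rel_tol=rel_tol)
                 or math.isclose(a, c, rel_tol=rel_tol))
    acute = all(value < 90.0 for value in triangle_angles(a, b, c))
    return isosceles and acute

# Hypothetical stand-in distances: H-C = 7.0, H-B = 7.0, C-B = 2.5.
print(triangle_angles(7.0, 7.0, 2.5))
print(is_acute_isosceles(7.0, 7.0, 2.5))
```

For real distance matrices, exact equality of two sides is not expected, which is why a measure of deviation from this ideal is needed.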
Table 1.
Some triangles and options for their badness.
| Sides 1,2 | Angles 2 | Bad. (0) | Bad. (1) | Bad. (2) | Bad. (3) | Bad. (5) |
|---|---|---|---|---|---|---|
| , , | , , | |||||
| 0 | 0 | 0 | 0 | 0 | ||
| 0 | 0 | 0 | 0 | 0 | ||
| − | ||||||
| − |
1 We round it up to an integer of degrees; therefore, the sum may not be the same as 180. 2 , .
Thus, we consider distance matrices based on both of these algorithms. Certainly, in reality, the property that the triangles formed in these matrices are acute-angled isosceles is violated. Then, for each such triangle, we determine the numerical value of the badness. In the process of the calculations, several variants of such badness are considered. Examples for some specified triangles are shown in Table 1. Below, we will use the value of badness, indicated by “Bad. (0)”, which we consider to be the most adequate.
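The construction of a badness sequence from a distance matrix can be sketched as follows. The exact definition of “Bad. (0)” is not reproduced in this section, so the sketch uses a clearly hypothetical stand-in: the relative difference between the two longest sides, which is 0 for the ideal isosceles case. The function names and the toy matrix are likewise assumptions made for illustration.

```python
from itertools import combinations

def badness_stand_in(d1, d2, d3):
    """Hypothetical stand-in for badness (NOT the paper's Bad. (0)):
    relative difference between the two longest sides of the triangle."""
    longest, middle, _ = sorted((d1, d2, d3), reverse=True)
    if longest == 0:
        return 0.0
    return (longest - middle) / longest

def badness_sequence(matrix):
    """Badness of every triangle (i, j, k), i < j < k, of a symmetric distance matrix."""
    n = len(matrix)
    return [
        badness_stand_in(matrix[i][j], matrix[i][k], matrix[j][k])
        for i, j, k in combinations(range(n), 3)
    ]

# Toy symmetric matrix for four hypothetical species.
toy = [
    [0, 7, 7, 9],
    [7, 0, 3, 9],
    [7, 3, 0, 8],
    [9, 9, 8, 0],
]
sequence = badness_sequence(toy)
print(sequence)
```

Because the triples are always enumerated in the same order, the sequences obtained for two different distance matrices over the same species can be compared element by element.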
It should be noted that such matrices are used primarily by biologists, in particular in the popular science literature, but are little used in related mathematical modeling tasks. Of course, there are successful exceptions. As one of them, we note the most recent study [7]. Among earlier works, we note [8,9].
The author hopes that the presented work will be considered not only for the use of such distance matrices by biologists, but also, above all, as one of the applications for creating mathematical models and algorithms for working with such matrices.
Based on such matrices, the total badness of all triangles can be considered and it can be argued that algorithms with lower badness values are better than algorithms with higher values. However, such consideration is not the subject of this paper. Here, we consider another natural assumption. It can be assumed that two algorithms for determining distances are consistent when the sequences of badness values of their corresponding matrices are arranged in the same order, and the more these orders differ, the less consistent the algorithms are.
Two such sequences of badness values can be compared by applying rank correlation algorithms. At the same time, as in many statistical experiments, if we obtain a value exceeding, e.g., , then it could be argued that the considered sequences of badness are consistent, and, therefore, the two algorithms under consideration are also consistent. However, we do not obtain such values (see details below). We also note in advance that different classical variants for calculating rank correlation give approximately the same results; therefore, the specific algorithm of rank correlation is unimportant.
Now, let us move on to the description of the standard statistical characteristics we use, as well as their small variations. Sometimes, we use “more customary” notation. For example, we do not use “standard statistical” notation (we write instead), etc.
The two random variables under consideration are denoted by X and Y; their observed realizations are denoted in the same way with the corresponding subscripts, i.e., $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$.
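Firstly, let us formulate the usual definition of correlation. Recall that the pair correlation coefficient can be calculated using the usual formula:
$$r = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2 \, \sum_{i=1}^{n} (Y_i - \overline{Y})^2}},$$
where
$$\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \overline{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i.$$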
In our further tables and program fragments, this variant of the coefficient has the number 0.
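A minimal sketch of variant 0, the usual pair (Pearson) correlation coefficient, is given below. The function name and the error handling for constant sequences are illustrative choices and are not taken from the paper's program fragments.

```python
import math

def pearson(xs, ys):
    """Usual pair correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```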
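Secondly, let us formulate a modified Kendall correlation coefficient. For it, we define the number of discrepancies (“entropy coefficient”) as follows: a discrepancy holds if, for a pair $(i, j)$ where $i < j$, we have
$$\operatorname{sign}(X_i - X_j) \cdot \operatorname{sign}(Y_i - Y_j) = -1. \qquad (1)$$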
Let us denote the number of such discrepancies by , or simply E in the next formula. We should also note that the correlation calculated in any way between the usual Kendall correlation coefficient and our variant is always equal to 1 (“correlation between correlations”). This is easily obtained by simply considering the formulas.
Since the maximum possible number of such discrepancies is $\frac{n(n-1)}{2}$, where $n$ is the number of observed pairs, we define the modified Kendall correlation coefficient by
$$1 - \frac{2E}{n(n-1)/2} = 1 - \frac{4E}{n(n-1)}.$$
This value is equal to 1 in the case of 0 discrepancies, and is equal to $-1$ in the case of the maximum possible number of discrepancies. In our further tables and program fragments, this variant of the coefficient has the number 2.
Note that we could calculate this coefficient as follows: We define the “entropy coefficient” considered before for each pair of pairs by (1). Then, we calculate the sum of these coefficients and divide the result by the value already used earlier.
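The following sketch illustrates variant 2 under stated assumptions: a discrepancy is taken to be a pair of observations ordered in opposite ways by the two sequences (the standard discordant pair), and a pair with a coincidence in either sequence is not counted as a discrepancy. The function names are hypothetical.

```python
from itertools import combinations

def count_discrepancies(xs, ys):
    """Number of pairs (i, j), i < j, ordered oppositely by xs and ys."""
    return sum(
        1
        for i, j in combinations(range(len(xs)), 2)
        if (xs[i] - xs[j]) * (ys[i] - ys[j]) < 0
    )

def modified_kendall(xs, ys):
    """1 for no discrepancies, -1 for the maximum number n(n-1)/2 of them."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    max_discrepancies = n * (n - 1) // 2
    return 1.0 - 2.0 * count_discrepancies(xs, ys) / max_discrepancies

print(modified_kendall([1, 2, 3, 4], [1, 2, 3, 4]))
print(modified_kendall([1, 2, 3, 4], [4, 3, 2, 1]))
print(modified_kendall([1, 2, 3], [1, 3, 2]))
```

The direct double loop is quadratic in the sequence length, which is still manageable for sequences of a few thousand badness values.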
Different publications provide different versions of criticism of the Kendall criterion, but the authors of the current paper consider the following flaw to be the most important: it does not give very adequate results when there are many coincidences in the values of the considered random variables. Therefore, we also consider the following “strongly modified” Kendall correlation coefficient.
It is most convenient to consider it as a search over pairs of pairs, as in the last remark. However, unlike in (1), we also use the value 0 (not only 1 and $-1$). Specifically, the value 0 is selected if and only if the values of at least one of the random variables in the considered pairs match. In our further tables and program fragments, this variant of the coefficient has the number 3.
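One possible reading of variant 3 is sketched below: every pair of observations contributes 1, $-1$ or 0, where 0 is used whenever the two values coincide in at least one of the sequences, and the sum is divided by the number of pairs. The function names and the choice of divisor follow the remark above about dividing by the value used earlier; they are assumptions rather than the paper's program code.

```python
from itertools import combinations

def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def kendall_with_ties(xs, ys):
    """Sum of per-pair scores (1, -1, or 0 on a coincidence) over n(n-1)/2."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    total = sum(
        sign(xs[i] - xs[j]) * sign(ys[i] - ys[j])
        for i, j in combinations(range(n), 2)
    )
    return total / (n * (n - 1) / 2)

print(kendall_with_ties([1, 2, 3], [1, 2, 3]))
print(kendall_with_ties([1, 1, 2], [1, 2, 3]))
```

In the second call, the coincidence in the first sequence lowers the coefficient below 1 even though no pair is ordered oppositely.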
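Thirdly, the Spearman correlation coefficient is calculated in the usual way, i.e.,
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)},$$
where $d_i$ is the difference between the ranks of $X_i$ and $Y_i$.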
In our further tables and program fragments, this variant of the coefficient has the number 1.
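A minimal sketch of variant 1 is shown below. It assumes the textbook formula based on squared rank differences and assigns average ranks to coinciding values; note that this formula is exact only when there are no coincidences. The function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks; equal values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman coefficient via the squared rank differences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

print(spearman([10, 20, 30, 40], [1, 3, 2, 4]))
```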
Note in advance that, in Section 5, we will briefly describe and apply another way to calculate the rank correlation where it will have the number 4.
3. Problem Statement
Continuing what was said in the introduction, we formulate the problem statement. Based on general considerations, our expert assessment is that the Needleman–Wunsch algorithm [10] is much more adequate than the Jaro–Winkler algorithm [11]. However, we do not show this in this paper (many indirect arguments were given in our publications cited above), but rather show a less significant fact, namely, the inconsistency of these two algorithms.
By using correlation analysis, such inconsistency can be shown by simply calculating the rank correlation for the sequences of the corresponding elements of the two distance matrices, i.e., matrices for the Jaro–Winkler and Needleman–Wunsch algorithms. However, we do not consider such work to be important, for the following three reasons:
- Firstly, there are not very many such elements. In square matrices of size 32, we have 496 elements located above the main diagonal.
- Secondly, based on the calculations performed, we came to the conclusion that the results of such a correlation comparison are not very informative. The values of the correlation coefficient (with different methods of calculating it) do not differ much from (specifically, from to ), and this fact, apparently, does not allow us to draw unambiguous conclusions.
- Thirdly, we are considering a specific task (and not just comparing any two abstract matrices), and, as we noted in Section 2, our matrices must have an important special property, i.e., a small value of badness (we use the Bad. (0) value); moreover, in our case, these values should also be consistent for both matrices.
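The “simple” comparison mentioned above operates on the sequences of elements located above the main diagonal. A minimal sketch of extracting such a sequence is given below; the function name is hypothetical, and the small matrix is the 3 × 3 top-left corner of Table 3 (species 1 to 3).

```python
def upper_triangle(matrix):
    """Elements above the main diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

# Top-left 3 x 3 corner of Table 3.
corner = [
    [0, 541, 677],
    [541, 0, 635],
    [677, 635, 0],
]
print(upper_triangle(corner))

# For a 32 x 32 matrix, the sequence has 32 * 31 / 2 elements.
print(len(upper_triangle([[0] * 32 for _ in range(32)])))
```

Applying the same function to both distance matrices yields two sequences with matching positions, which can then be passed to any of the correlation coefficients.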
Thus, we use some more complex comparisons. We will talk about them in the next section.
4. Algorithms, Methods and Some More About the Motivation
In this section, we discuss different opinions about whether the conclusions that can be drawn based on the simplest study of the distances between DNA sequences are correct or not. In particular (and this is the simplest question, to which there is no definitive answer), whether the initial results themselves expressing the distances between genomes, i.e., the algorithms of Jaro–Winkler and Needleman–Wunsch (as well as several other algorithms not considered in this paper), are correct and adequate. In any case, the most basic conclusion is as follows: the research related to the consideration of algorithms for determining the distances between DNA strands should be continued.
On the one hand, it is worth adding that there are scientific publications where this difference is significantly greater (see [12] and many others); estimates of the difference between the genomes of humans and chimpanzees reach up to 19% (and this is only in works known to the author; see also some links in [12]). This is explained by the authors as follows: Geneticists allegedly sequenced “small pieces of chimpanzee DNA”, i.e., using conventional chemical laboratory procedures, they determined the sequence of the chemical symbols. Then, these small chains of “symbols” were connected to the human genome in those places where, in their opinion, they should match. After that, the human genome was removed and a chimpanzee pseudo-genome was obtained, which allegedly indicated a common kinship with humans. Thus, a mixed sequence was obtained, which, apparently, is not real. Hence, it is concluded that the real differences are significantly more than 1.
However, on the other hand, there are arguments for the fact that the resulting distances between genomes are more or less correct, and the grounds for the noticeable difference between humans and their closest relatives lie elsewhere.
According to the general DNA sequence, humans actually stand apart from other hominids. Moreover, this is not according to the “formal set of genes”, but according to their distribution on chromosomes. It is precisely the following factors that apply:
- Multiple chromosomal aberrations;
- The deletion of a huge section;
- The transition of another section to the other chromosome, due to which humans have one pair of autosomes fewer;
- The reversal of another section.
Most likely, it leads to a radical change in the phenotype. We consider the phenotype to be a set of internal and external features, properties and traits of a specific organism. There are some other definitions in the scientific literature.
This last, i.e., the change in the phenotype, can be described primarily by the following signs (also based on material taken from numerous scientific and popular science publications):
- The absence of a massive, protruding jaw in humans, and, consequently, a significantly different structure of the oral cavity, which is the most important resonator in speech formation;
- The structure of the nose (as well as the larynx) is significantly different;
- Lack of wool cover;
- Walking upright;
- Rebuilt work of sebaceous and sweat glands;
- Reconstruction of the upper part of the skull;
- Many other things that distinguish humans from anthropoids in general.
However, as can be understood, all of the above are general reflections at the level of “apparently”, not supported by specific genomic studies. At the same time, it is precisely this “lack of strength”, i.e., the inability to strictly prove the above dependencies, at least in the near future, that suggests the need to continue detailed studies of DNA strands, in particular, to analyze their similarity.
Such tasks remain and they will remain very relevant for a long time. As we said before, the research related to the consideration of algorithms for determining the distances between DNA strands should be continued. In particular, it is possible in future studies to try to algorithmically formulate the grounds for the strong difference between humans and great apes.
Thus, for the considered matrices corresponding to 32 species of monkeys of different genera, we consider two sequences of badness values for all the corresponding triangles (as we said above, there are 4960 of them) and calculate the rank correlation values for them.
We also repeat once again that we do not consider in this paper the issues related to determining which of the algorithms is better; we are considering only a method that can show the inconsistency of these algorithms.
5. Description of Computational Experiments and the Results
Firstly, let us list the species of monkeys we are considering (see Table 2). It is important to remark that all the species belong to different genera. Apparently, this fact leads to a more or less successful distribution of the elements of the distance matrix.
Table 2.
The considered monkey species in alphabetical order.
After that, we present the distances calculated for the mt DNA of these species in the form of two tables. Everything is considered for two different distance calculation algorithms. Namely, for our article, we have reviewed the algorithms of Jaro–Winkler and Needleman–Wunsch.
Table 3 is the calculated distance matrix for the Jaro–Winkler algorithm. The species numbers correspond to those shown in Table 2. The peculiarity of this algorithm is that it gives very close answers for these types; therefore, the three-digit numbers shown in the table correspond to three decimal places after . For instance, 541 means .
If it is necessary to verify the algorithms described by us and the calculation results, the values of the table elements can simply be copied from a PDF file and pasted to the computer program. The author can also submit them by e-mail.
Table 3.
The matrix obtained by applying the Jaro–Winkler algorithm.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 000 | 541 | 677 | 583 | 592 | 541 | 589 | 536 | 562 | 633 | 465 | 610 | 530 | 370 | 512 | 565 | 545 | 800 | 624 | 640 | 520 | 556 | 548 | 562 | 515 | 570 | 726 | 524 | 511 | 589 | 589 | 540 |
| 2 | 541 | 000 | 635 | 387 | 342 | 369 | 396 | 381 | 386 | 733 | 600 | 686 | 463 | 542 | 409 | 549 | 349 | 722 | 698 | 708 | 515 | 440 | 401 | 543 | 462 | 455 | 681 | 388 | 452 | 464 | 383 | 532 |
| 3 | 677 | 635 | 000 | 665 | 676 | 627 | 668 | 626 | 670 | 714 | 728 | 739 | 666 | 678 | 655 | 777 | 617 | 731 | 744 | 760 | 737 | 661 | 663 | 767 | 692 | 680 | 690 | 646 | 648 | 710 | 661 | 753 |
| 4 | 583 | 387 | 665 | 000 | 334 | 396 | 385 | 384 | 396 | 767 | 630 | 727 | 457 | 579 | 422 | 577 | 383 | 677 | 723 | 733 | 546 | 447 | 411 | 571 | 442 | 434 | 637 | 403 | 474 | 447 | 378 | 568 |
| 5 | 592 | 342 | 676 | 334 | 000 | 384 | 395 | 321 | 397 | 777 | 644 | 736 | 481 | 584 | 433 | 570 | 375 | 672 | 742 | 751 | 554 | 451 | 421 | 579 | 429 | 444 | 650 | 418 | 498 | 453 | 393 | 562 |
| 6 | 541 | 369 | 627 | 396 | 384 | 000 | 401 | 319 | 406 | 706 | 581 | 665 | 455 | 528 | 387 | 526 | 383 | 753 | 676 | 675 | 510 | 458 | 381 | 499 | 481 | 457 | 693 | 320 | 436 | 475 | 400 | 527 |
| 7 | 589 | 396 | 668 | 385 | 395 | 401 | 000 | 397 | 389 | 763 | 630 | 727 | 471 | 580 | 425 | 584 | 392 | 695 | 738 | 741 | 556 | 429 | 346 | 573 | 458 | 451 | 657 | 400 | 488 | 463 | 382 | 573 |
| 8 | 536 | 381 | 626 | 384 | 321 | 319 | 397 | 000 | 400 | 723 | 595 | 691 | 453 | 537 | 396 | 527 | 345 | 724 | 687 | 696 | 518 | 457 | 392 | 534 | 474 | 457 | 685 | 312 | 448 | 470 | 392 | 526 |
| 9 | 562 | 386 | 670 | 396 | 397 | 406 | 389 | 400 | 000 | 747 | 585 | 700 | 462 | 561 | 415 | 565 | 390 | 725 | 703 | 722 | 532 | 448 | 403 | 571 | 467 | 469 | 681 | 409 | 477 | 482 | 327 | 546 |
| 10 | 633 | 733 | 714 | 767 | 777 | 706 | 763 | 723 | 747 | 000 | 628 | 635 | 706 | 661 | 676 | 699 | 723 | 674 | 653 | 678 | 634 | 720 | 693 | 677 | 767 | 758 | 538 | 712 | 697 | 793 | 775 | 656 |
| 11 | 465 | 600 | 728 | 630 | 644 | 581 | 630 | 595 | 585 | 628 | 000 | 560 | 584 | 462 | 549 | 535 | 594 | 859 | 568 | 579 | 494 | 596 | 589 | 526 | 636 | 608 | 790 | 582 | 560 | 631 | 639 | 464 |
| 12 | 610 | 686 | 739 | 727 | 736 | 665 | 727 | 691 | 700 | 635 | 560 | 000 | 673 | 610 | 631 | 601 | 687 | 871 | 379 | 381 | 556 | 688 | 669 | 589 | 724 | 706 | 795 | 667 | 646 | 729 | 731 | 571 |
| 13 | 530 | 463 | 666 | 457 | 481 | 455 | 471 | 453 | 462 | 706 | 584 | 673 | 000 | 535 | 446 | 467 | 449 | 741 | 665 | 678 | 434 | 391 | 454 | 454 | 414 | 402 | 678 | 463 | 454 | 413 | 461 | 448 |
| 14 | 370 | 542 | 678 | 579 | 584 | 528 | 580 | 537 | 561 | 661 | 462 | 610 | 535 | 000 | 502 | 566 | 545 | 790 | 614 | 627 | 526 | 545 | 539 | 558 | 578 | 549 | 723 | 511 | 492 | 545 | 582 | 539 |
| 15 | 512 | 409 | 655 | 422 | 433 | 387 | 425 | 396 | 415 | 676 | 549 | 631 | 446 | 502 | 000 | 515 | 400 | 772 | 630 | 642 | 477 | 478 | 395 | 506 | 510 | 483 | 705 | 390 | 437 | 509 | 426 | 493 |
| 16 | 565 | 549 | 777 | 577 | 570 | 526 | 584 | 527 | 565 | 699 | 535 | 601 | 467 | 566 | 515 | 000 | 529 | 913 | 580 | 571 | 401 | 484 | 548 | 350 | 483 | 461 | 836 | 528 | 513 | 481 | 589 | 379 |
| 17 | 545 | 349 | 617 | 383 | 375 | 383 | 392 | 345 | 390 | 723 | 594 | 687 | 449 | 545 | 400 | 529 | 000 | 719 | 684 | 701 | 514 | 442 | 391 | 543 | 462 | 461 | 673 | 376 | 443 | 468 | 387 | 503 |
| 18 | 800 | 722 | 731 | 677 | 672 | 753 | 695 | 724 | 725 | 674 | 859 | 871 | 741 | 790 | 772 | 913 | 719 | 000 | 871 | 884 | 851 | 708 | 759 | 897 | 664 | 690 | 538 | 759 | 763 | 694 | 709 | 874 |
| 19 | 624 | 698 | 744 | 723 | 742 | 676 | 738 | 687 | 703 | 653 | 568 | 379 | 665 | 614 | 630 | 580 | 684 | 871 | 000 | 366 | 579 | 701 | 682 | 565 | 734 | 711 | 799 | 668 | 647 | 721 | 729 | 547 |
| 20 | 640 | 708 | 760 | 733 | 751 | 675 | 741 | 696 | 722 | 678 | 579 | 381 | 678 | 627 | 642 | 571 | 701 | 884 | 366 | 000 | 585 | 717 | 688 | 567 | 752 | 718 | 806 | 679 | 656 | 729 | 739 | 551 |
| 21 | 520 | 515 | 737 | 546 | 554 | 510 | 556 | 518 | 532 | 634 | 494 | 556 | 434 | 526 | 477 | 401 | 514 | 851 | 579 | 585 | 000 | 446 | 515 | 386 | 469 | 462 | 787 | 508 | 498 | 485 | 549 | 344 |
| 22 | 556 | 440 | 661 | 447 | 451 | 458 | 429 | 457 | 448 | 720 | 596 | 688 | 391 | 545 | 478 | 484 | 442 | 708 | 701 | 717 | 446 | 000 | 438 | 473 | 377 | 369 | 644 | 465 | 471 | 379 | 451 | 469 |
| 23 | 548 | 401 | 663 | 411 | 421 | 381 | 346 | 392 | 403 | 693 | 589 | 669 | 454 | 539 | 395 | 548 | 391 | 759 | 682 | 688 | 515 | 438 | 000 | 539 | 492 | 478 | 705 | 380 | 451 | 490 | 416 | 528 |
| 24 | 562 | 543 | 767 | 571 | 579 | 499 | 573 | 534 | 571 | 677 | 526 | 589 | 454 | 558 | 506 | 350 | 543 | 897 | 565 | 567 | 386 | 473 | 539 | 000 | 503 | 479 | 822 | 522 | 509 | 465 | 569 | 372 |
| 25 | 515 | 462 | 692 | 442 | 429 | 481 | 458 | 474 | 467 | 767 | 636 | 724 | 414 | 578 | 510 | 483 | 462 | 664 | 734 | 752 | 469 | 377 | 492 | 503 | 000 | 346 | 627 | 484 | 486 | 344 | 467 | 486 |
| 26 | 570 | 455 | 680 | 434 | 444 | 457 | 451 | 457 | 469 | 758 | 608 | 706 | 402 | 549 | 483 | 461 | 461 | 690 | 711 | 718 | 462 | 369 | 478 | 479 | 346 | 000 | 621 | 460 | 453 | 366 | 451 | 471 |
| 27 | 726 | 681 | 690 | 637 | 650 | 693 | 657 | 685 | 681 | 538 | 790 | 795 | 678 | 723 | 705 | 836 | 673 | 538 | 799 | 806 | 787 | 644 | 705 | 822 | 627 | 621 | 000 | 694 | 699 | 634 | 663 | 805 |
| 28 | 524 | 388 | 646 | 403 | 418 | 320 | 400 | 312 | 409 | 712 | 582 | 667 | 463 | 511 | 390 | 528 | 376 | 759 | 668 | 679 | 508 | 465 | 380 | 522 | 484 | 460 | 694 | 000 | 389 | 478 | 409 | 525 |
| 29 | 511 | 452 | 648 | 474 | 498 | 436 | 488 | 448 | 477 | 697 | 560 | 646 | 454 | 492 | 437 | 513 | 443 | 763 | 647 | 656 | 498 | 471 | 451 | 509 | 486 | 453 | 699 | 389 | 000 | 476 | 488 | 500 |
| 30 | 589 | 464 | 710 | 447 | 453 | 475 | 463 | 470 | 482 | 793 | 631 | 729 | 413 | 545 | 509 | 481 | 468 | 694 | 721 | 729 | 485 | 379 | 490 | 465 | 344 | 366 | 634 | 478 | 476 | 000 | 466 | 479 |
| 31 | 589 | 383 | 661 | 378 | 393 | 400 | 382 | 392 | 327 | 775 | 639 | 731 | 461 | 582 | 426 | 589 | 387 | 709 | 729 | 739 | 549 | 451 | 416 | 569 | 467 | 451 | 663 | 409 | 488 | 466 | 000 | 541 |
| 32 | 540 | 532 | 753 | 568 | 562 | 527 | 573 | 526 | 546 | 656 | 464 | 571 | 448 | 539 | 493 | 379 | 503 | 874 | 547 | 551 | 344 | 469 | 528 | 372 | 486 | 471 | 805 | 525 | 500 | 479 | 541 | 000 |
The following Table 4 is the calculated distance matrix for the Needleman–Wunsch algorithm. The species numbers also correspond to those shown in Table 2.
Table 4.
The matrix obtained by applying the Needleman–Wunsch algorithm.
This algorithm gives not very close answers for these types; therefore, the three-digit numbers shown in the table correspond to three decimal places after (not ). For instance, 375 means . It is important to note that such a 10 times increase in values does not change any of the values of the badness of the triangles we are considering. Indeed, considering the first triangle of Table 3, with the sides , and , we can say that its badness is exactly equal to the badness of the triangle with the sides , and .
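The scale invariance noted above can be illustrated numerically: the angles of a triangle depend only on the ratios of its sides, so multiplying all three sides by 10 leaves them unchanged. The sketch below uses the integer entries of Table 3 for species 1, 2 and 3 (541, 677 and 635) as side lengths; the function name is an illustrative assumption.

```python
import math

def triangle_angles(a, b, c):
    """Angles in degrees opposite sides a, b and c (law of cosines)."""
    def angle(opposite, s1, s2):
        cos_value = (s1 * s1 + s2 * s2 - opposite * opposite) / (2 * s1 * s2)
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_value))))
    return angle(a, b, c), angle(b, a, c), angle(c, a, b)

sides = (541, 677, 635)
scaled_sides = tuple(10 * side for side in sides)

original = triangle_angles(*sides)
scaled = triangle_angles(*scaled_sides)
print(original)
print(scaled)
```

Any badness measure defined through angles or through ratios of sides is therefore unaffected by the common scaling factor of a table.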
As follows from the previous material, we can work with Table 3 and Table 4 (they are given after the text of the paper), as well as with any other tables built on the same principle, simply as with tables of integers: the values of badness that we are interested in will be the same.
In general, all the calculation results are shown in Table 5. The column designations are clear; they are related to the options described above for calculating the rank correlation. The string designations have the following meaning:
- “Simple” means counting sequences of matrix elements above the main diagonal, while “main” means counting sequences of badness (Bad. 0) of triangles;
- “With” (unlike “without”) means that we used normalization before calculations. As usual, normalization is what we call the linear mapping of all the received data into the segment .
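The normalization step can be sketched as a linear (min-max) mapping. The target segment is not legible in this section, so the sketch assumes the segment [0, 1] as default parameters; this is an assumption, as are the function name and the handling of constant input.

```python
def normalize(values, low=0.0, high=1.0):
    """Linearly map values onto [low, high]; [0, 1] is an assumed default."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values coincide; map them to the lower end of the segment.
        return [low for _ in values]
    scale = (high - low) / (v_max - v_min)
    return [low + (value - v_min) * scale for value in values]

# Entries of Table 3 for the pairs of species (1, 2), (1, 3) and (2, 3).
print(normalize([541, 677, 635]))
```

Since the mapping is linear and increasing, it preserves the order of the values, so rank-based coefficients are not expected to change under it.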
In the next section, we will discuss the numerical results obtained and some conclusions.
Table 5.
The calculation results.
| Option | Corr-0, Usual | Corr-1, Spearman | Corr-2, Kendall+ | Corr-3, Kendall++ |
|---|---|---|---|---|
| simple, with | ||||
| main, without | ||||
| main, with |
At the end of this section, we note that, in a recently published work [13], our method of calculating rank correlation was given. See its detailed description in that paper. This method differs from all “classical” methods and at the same time gives adequate results in all fields of application known to us. The first computational experiments show its good applicability in the subject area considered in this paper. For example, the values of the rank correlation in all three calculation variants (i.e., in three lines) for the same set of genomes and a pair of Needleman–Wunsch and Jaro–Winkler algorithms produce results not exceeding , while the pair of the values obtained by the Needleman–Wunsch and Damerau–Levenshtein [14] algorithms are greater than . However, of course, computational experiments related to this method of calculating the rank correlation should be continued.
6. Discussion and Conclusions
Summarizing some of the thoughts of this paper, we can formulate the following. The reported difference between genomes varies widely across studies, although the vast majority of both scientific and popular science publications give a distance between the genomes of humans and chimpanzees in the range from % to 2% (i.e., a similarity from 98% to %). For example, according to [15], the genomes of humans and chimpanzees are “identical by more than %”, and this statement is very often quoted “as the ultimate truth”. However, for the sake of completeness, in the next section, we give even more detailed reflections on the topic of specific values of genome proximity.
Regarding the various rank correlation algorithms used in this article, we note that the different ways of calculating them (including the method described in [13]) give very interesting results on the following example, specially selected by the author:
1001 1002 1003 1004 1005 1006 2001 2002 2003 2004 2005 2006
1006 1005 1004 1003 1002 1001 2006 2005 2004 2003 2002 2001
(The corresponding elements of the two rows form the pairs of elements.)
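
To make the example concrete, the following minimal sketch evaluates three textbook coefficients on the two rows above: the Pearson correlation, Spearman's rank correlation, and Kendall's tau-a. These are the standard definitions, which may differ from the paper's Corr-0 to Corr-3 variants and from the method of [13]; all function names are illustrative.

```python
from itertools import combinations
from math import sqrt

row_x = [1001, 1002, 1003, 1004, 1005, 1006, 2001, 2002, 2003, 2004, 2005, 2006]
row_y = [1006, 1005, 1004, 1003, 1002, 1001, 2006, 2005, 2004, 2003, 2002, 2001]


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """Ranks 1..n of the values (assumes there are no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


print(pearson(row_x, row_y), spearman(row_x, row_y), kendall_tau_a(row_x, row_y))
```

With these standard definitions, the three coefficients come out at roughly 0.99998, 0.51 and 0.09, respectively. The order is reversed inside each block of six but preserved between the two blocks, so the three coefficients weigh the same data very differently.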
From our point of view, this example, like some other specially selected ones, shows the need both for special algorithms for calculating rank correlation and for improvements to the existing ones. Therefore, in subsequent publications, we intend to return to this example and its connection with the methods of calculating rank correlation considered here.
Here are some references to biologists’ works that use distance matrices between genomes. The following are recent works that are not related to the study of mammalian mitochondrial DNA: [16,17,18]. However, as we have already noted, the application of mathematical methods and the creation of algorithms (including heuristic ones) for analyzing such matrices are reflected in a very small number of publications; we have already cited [7,8,9].
Thus, the main work performed is as follows. Based on the above tables of calculation results, we can say that the method we described as “simple” can hardly answer the question we posed about the consistency of the two algorithms: correlation values of approximately 0.5, as a rule, do not say much. However, everything is clarified by a more complex method that examines the rank correlation of the badness of all the triangles under consideration. On the same input data, the pair of Needleman–Wunsch and Damerau–Levenshtein algorithms yields large correlation values, whereas the pair of Needleman–Wunsch and Jaro–Winkler algorithms yields very small ones (not exceeding 0.14).
Therefore, we believe that we have shown the inconsistency between two well-known algorithms for determining the distances between genomes, namely, the algorithms of Jaro–Winkler and Needleman–Wunsch. Moreover, we assume (this is not yet fully confirmed) that the Needleman–Wunsch algorithm is significantly more adequate than the Jaro–Winkler one.
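
The triangle-based comparison summarized above can be sketched as follows. This is only an illustration under stated assumptions: the badness formula below is a hypothetical placeholder (it vanishes exactly when the two longest sides are equal, i.e., for an acute-angled isosceles triangle) and is not necessarily any of the variants used in the paper; the function names are also illustrative.

```python
from itertools import combinations


def triangle_badness(d_ij, d_ik, d_jk):
    """Placeholder badness of one triangle (an assumption, not the paper's formula).

    With sides sorted as a >= b >= c, it returns (a - b) / a, which is 0
    exactly when the two longest sides are equal.
    """
    a, b, _ = sorted((d_ij, d_ik, d_jk), reverse=True)
    return (a - b) / a if a > 0 else 0.0


def badness_sequence(matrix):
    """Badness of every triangle (i, j, k), i < j < k, of a distance matrix."""
    n = len(matrix)
    return [
        triangle_badness(matrix[i][j], matrix[i][k], matrix[j][k])
        for i, j, k in combinations(range(n), 3)
    ]


# Synthetic 32 x 32 matrix, used only to check the number of triangles.
toy_matrix = [[abs(i - j) for j in range(32)] for i in range(32)]
print(len(badness_sequence(toy_matrix)))  # number of triples out of 32 objects
```

Applying `badness_sequence` to the two distance matrices gives two sequences of equal length, indexed by the same triples of species, and the consistency of the two algorithms is then judged by the rank correlation between these sequences. For 32 species, each sequence has 4960 entries, as stated in the paper.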
Here are possible directions for continuing the work described in the paper, including outcomes that can be drawn based on its material:
- We hope to obtain a distance matrix for all species of monkeys (500 to 850 species, according to various sources); as a first step, this will rely on algorithms for restoring a partially filled matrix.
- For this problem, it is best to use the Needleman–Wunsch algorithm and to ignore the other algorithms described.
- The author believes that the following task is very important: based on the given distance matrix, to examine all five variants of badness and choose “the best” of them. In previous papers and in Section 2, it was said that, ideally, this value should be equal to 0. Then, “the best” badness can be obtained by minimizing a linear combination of the considered variants. Of course, it is pointless to consider functions that are identically zero. Therefore, in our model, we consider a linear combination of several of the above badness functions.
- We hope to continue the consideration of the tasks described in the paper using our algorithm for calculating rank correlation [13], which can be called corr-4.
Funding
This work was partially supported by a grant from the scientific program of Chinese universities, the “Higher Education Stability Support Program” (section “Shenzhen 2022—Science, Technology and Innovation Commission of Shenzhen Municipality”).
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Acknowledgments
This work was partially supported by a grant from the scientific program of Chinese universities, the “Higher Education Stability Support Program” (section “Shenzhen 2022—Science, Technology and Innovation Commission of Shenzhen Municipality”). The author also expresses gratitude to post-graduate students Li Jiamian and Mu Jingyuan (Shenzhen MSU–BIT University), who obtained the values in Tables 3 and 4 by independently implementing these algorithms.
Conflicts of Interest
The author declares no conflicts of interest.
References
1. Melnikov, B.; Pivneva, S.; Trifonov, M. Various algorithms, calculating distances of DNA sequences, and some computational recommendations for use such algorithms. CEUR Workshop Proc. 2017, 1902, 43–50.
2. Melnikov, B.; Melnikova, E.; Pivneva, S.; Trenina, M. An approach to analysis of the similarity of DNA-sequences. CEUR Workshop Proc. 2018, 2212, 67–72.
3. Melnikov, B.; Zhang, Y.; Chaikovskii, D. An inverse problem for matrix processing: An improved algorithm for restoring the distance matrix for DNA chains. Cybern. Phys. 2022, 11, 217–226.
4. Melnikov, B.; Chaikovskii, D. On the application of heuristics of the TSP for the task of restoring the DNA matrix. Front. Artif. Intell. Appl. 2024, 385, 36–44.
5. Lagutin, M. Visual Mathematical Statistics; BINOM. Laboratoriya Znaniy: Moscow, Russia, 2012; 472p. (In Russian)
6. Wasserman, L. All of Statistics: A Concise Course in Statistical Inference; Springer Science & Business Media: Berlin, Germany, 2013; 442p.
7. Young, S.; Gilles, J. Use of 3D chaos game representation to quantify DNA sequence similarity with applications for hierarchical clustering. J. Theor. Biol. 2024, 5967, 111972.
8. Ballester, P.J.; Richards, W.G. Ultrafast shape recognition to search compound databases for similar molecular shapes. J. Comput. Chem. 2007, 28, 1711–1723.
9. Bodenhofer, U.; Bonatesta, E.; Horejs-Kainrath, C.; Hochreiter, S. Msa: An R package for multiple sequence alignment. Bioinformatics 2015, 31, 3997–3999.
10. Needleman, S.; Wunsch, C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48, 443–453.
11. Winkler, W. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Survey Research Methods Sections, Anaheim, CA, USA, 6–9 August 1990; American Statistical Association: Alexandria, VA, USA, 1990; pp. 354–359.
12. Cohen, J. Relative differences: The myth of 1%. Science 2007, 316, 1836.
13. Melnikov, B.; Lysak, T. On some algorithms for comparing models of femtosecond laser radiation propagation in a medium with gold nanorods. Cybern. Phys. 2024, 13, 261–267.
14. Levenshtein, V. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
15. Polavarapu, N.; Arora, G.G.; Mittal, V.; McDonald, J. Characterization and potential functional significance of human-chimpanzee large INDEL variation. Mob. DNA 2011, 2, 13.
16. Sampaio, J.R.; Oliveira, W.D.d.S.; Junior, L.C.d.S.; Nascimento, F.d.S.; Moreira, R.F.C.; Ramos, A.P.d.S.; Santos-Serejo, J.A.d.; Amorim, E.P.; Jesus, R.D.M.d.; Ferreira, C.F. Diversity of Improved Diploids and Commercial Triploids from Musa spp. via Molecular Markers. Curr. Issues Mol. Biol. 2024, 46, 11783–11796.
17. Memon, J.; Patel, R.; Patel, B.N.; Patel, M.P.; Madariya, R.B.; Patel, J.K.; Kumar, S. Genetic diversity, population structure and association mapping of morphobiochemical traits in castor (Ricinus communis L.) through simple sequence repeat markers. Ind. Crops Prod. 2024, 221, 119348.
18. Mansueto, L.; Tandayu, E.; Mieog, J.; Garcia-de Heer, L.; Das, R.; Burn, A.; Mauleon, R.; Kretzschmar, T. HASCH—A high-throughput amplicon-based SNP-platform for medicinal cannabis and industrial hemp genotyping applications. BMC Genom. 2024, 25, 818.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).