An Application of the Eigenproblem for Biochemical Similarity

Abstract: Protein alignment finds its application in refining results of sequence alignment and understanding protein function. A previous study aligned single molecules by minimizing the sums of the squares of eigenvalues obtained for the antisymmetric Cartesian coordinate distance matrices D_x and D_y. This is used in our program to search for similarities between amino acids by comparing the sums of the squares of eigenvalues associated with the D_x, D_y, and D_z distance matrices. These matrices are obtained by removing atoms that could lead to low similarity. Candidates are aligned, and trilateration is used to attach all previously stripped atoms. A TM-score is the scoring function that chooses the best alignment from supplied candidates. Twenty essential amino acids that take many forms in nature are selected for comparison. The correct alignment is taken into account most of the time by the alignment algorithm. It was numerically detected by the TM-score 70% of the time, on average, and 15% more cases with close scores can be easily distinguished by human observation.


Introduction
Just visualizing two simple similar structures leads to an immediate detection of patterns. Similarity is convenient for humans, but to power automatic decision mechanisms on a computer, it must be measurable. It is mostly used for comparing proteins, but the growing number of PDB structures (currently over 180,000) is many orders of magnitude higher than what the human eye can compare. Because of this large number, it takes days even for current programs to search the database for a query structure. A more reasonable time can be achieved by developing new algorithms [1].
Protein alignment finds its application in refining results of sequence alignment and understanding protein function [2,3]. Choosing the alignment that is most geometrically similar is an easier task than evaluating its biological significance [4]. The pursuit of the best method is in progress, with multiple programs developed during the past decades:

• CAB-Align uses the residue-residue contact area to identify regions of similarity [5].
• Caretta uses rotation-invariant technique signals of distances derived from overlapping contiguous stretches of residues to find an initial superposition [6].
• LS-align generates fast and accurate atom-level structural alignments of ligand molecules through an iterative heuristic search of the target function that combines comparisons of inter-atom distance with mass and chemical bonds [8].
• MATT uses a fragment-based approach that allows for local flexibility between fragment pairs from two input structures and then a dynamic programming algorithm to assemble these intermediate pairs [9].
• TM-align uses the length-independent TM-score as a measure of similarity between two proteins in a dynamic programming approach [10].

Symmetry 2021, 13, 1849
The 3D variant of the distance matrix alignment method (DALI) uses rotation and translation in order to achieve a smaller distance between equivalent points in the two molecules [14].
In a previous study, the eigenproblem was employed to achieve the proper alignment of single molecules, or the mirror of the proper alignment, and this can be exploited to reduce the number of rotations for which a scoring function needs to run [15].
The eigenproblem is defined in the literature as follows: given the square matrix A of order n, λ ∈ ℂ is called an eigenvalue of the matrix A and X ≠ 0 its associated eigenvector if the relationship AX = λX is satisfied. The matrix λI − A is then singular (that is, det(λI − A) = 0), where I is the unit matrix of order n. The solutions of the equation det(λI − A) = 0 are the eigenvalues of the matrix A.
The determinant det(λI − A) is called the characteristic polynomial (ChP) associated with the matrix A. It has a degree equal to the order of the matrix so that the eigenvalues of the matrix A are its roots.
The eigenproblem in relation to geometrical alignment was stated before in the context of surface analysis [16] and control, and can go in another direction in the context of amino acids. The subject of this study is a solution to the eigenproblem of amino acid alignment. The Cartesian system is rotated, and eventually translated and reflected, until the structure arrives at a position characterized by the highest absolute values of the eigenvalues observed on the Cartesian coordinates.
The aim of this study is to find the best geometric alignment of 20 selected amino acids with regard to each other. An extension to the previous study described by Jäntschi [15] has been elaborated. Sums of the squares of eigenvalues (S_T = −2S_x − 2S_y − 2S_z) for all three Cartesian coordinate distance matrices (D_x, D_y, and D_z) are compared. By removing atoms, smaller D_x, D_y, and D_z matrices are obtained and more S_T sums are added to the comparison. Percentage similarities are found between these sums. Candidates are aligned by the eigenproblem algorithm, and trilateration is used to attach all previously stripped atoms. To verify, a TM-score is run on the resulting full-structure candidates.

Materials and Methods
In [15], it was shown that the Cartesian distance matrix is antisymmetric and therefore its eigenvalues are purely imaginary, as well as the fact that the best alignment of a molecule is obtained for the minimum value of the sum of the squares of eigenvalues of the Cartesian distance matrix.
Thus, the angle of rotation of the structure around an axis must be found for which the minimum of this sum is obtained. One method of finding the angle of rotation around an axis for the best alignment is as follows: in the case of an amino acid with 5 atoms, we denote the vertices of the graph corresponding to the organic compound by $V_i(x_i, y_i, z_i)$, $i = 1, \dots, 5$. We want to find the optimal angle of rotation around the Oz axis, for example. The characteristic polynomial associated with the matrix of Cartesian distances on Ox can be written in this way:

$$\mathrm{ChP}(\lambda) = \lambda^5 + \lambda^3 \sum_{1 \le j < i \le 5} (x_i - x_j)^2,$$

which leads to the problem of finding the rotation angle in the xOy plane so as to obtain the maximum value of the sum

$$S_x = \sum_{1 \le j < i \le 5} (x_i - x_j)^2.$$

Because the term $(x_i - x_j)^2$ becomes maximum when $\angle(V_j V_i, Ox) = 0$, we calculate the amount $S_x$ using the law of motion of the rotation of a body about a fixed axis, in which $\varphi$ takes, in turn, the values $\angle(V_j V_i, Ox)$; $j = 1, \dots, 4$; $i = 2, \dots, 5$; $j < i$.
Using the interpolation method, we find the value of the angle of rotation around the Oz axis. Similarly, we proceed to find the angle of rotation of the structure around one of the other two axes.
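The search for the rotation angle can be illustrated with a small sketch. Since the sum of the squares of the eigenvalues equals −2S_x, minimizing it is the same as maximizing S_x. The code below is only an illustration under stated assumptions: it replaces the interpolation method with a plain grid search over the angle, the function names (`s_x`, `rotate_z`, `best_angle_z`) are hypothetical, and the coordinates are invented rather than taken from any amino acid.

```python
import math
from itertools import combinations

def s_x(points):
    """S_x: sum over all atom pairs of the squared difference of x coordinates."""
    return sum((p[0] - q[0]) ** 2 for p, q in combinations(points, 2))

def rotate_z(points, theta):
    """Rotate every (x, y, z) point by theta radians around the Oz axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def best_angle_z(points, steps=3600):
    """Grid search on [0, pi) for the angle that maximizes S_x.

    A grid search is an assumption made here for simplicity; it stands in
    for the interpolation method mentioned in the text.
    """
    angles = [math.pi * k / steps for k in range(steps)]
    return max(angles, key=lambda t: s_x(rotate_z(points, t)))

# Hypothetical structure: five atoms on the line y = x in the xOy plane.
atoms = [(float(k), float(k), 0.0) for k in range(5)]
theta = best_angle_z(atoms)
aligned = rotate_z(atoms, theta)
```

For this toy input the search turns the line of atoms onto the Ox axis, which doubles S_x compared with the starting orientation. The same search could be repeated around another axis.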
The nonzero eigenvalues of the associated Cartesian coordinate distance matrix D_x are always a pair of conjugate purely imaginary solutions: λ_1^2 = λ_2^2 = −S_x. Sums of the form S_T = −2S_x − 2S_y − 2S_z, associated with the D_x, D_y, and D_z matrices, are compared in order to find similarities.
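The relation between the eigenvalues and S_x can be checked numerically without an eigenvalue solver, because the sum of the squares of the eigenvalues of any square matrix equals the trace of its square. The following sketch, with hypothetical helper names and invented coordinates, builds the antisymmetric matrix D_x, confirms that trace(D_x²) = −2S_x, and assembles S_T from the three axes.

```python
from itertools import combinations

def coord_distance_matrix(values):
    """Antisymmetric matrix with entries d[i][j] = values[i] - values[j]."""
    return [[a - b for b in values] for a in values]

def trace_of_square(m):
    """trace(M @ M), which equals the sum of the squared eigenvalues of M."""
    n = len(m)
    return sum(m[i][k] * m[k][i] for i in range(n) for k in range(n))

def pair_sum(values):
    """S = sum over pairs of (v_i - v_j)^2, e.g. S_x when given x coordinates."""
    return sum((a - b) ** 2 for a, b in combinations(values, 2))

def s_total(points):
    """S_T = -2*S_x - 2*S_y - 2*S_z for a list of (x, y, z) tuples."""
    return sum(-2.0 * pair_sum([p[axis] for p in points]) for axis in range(3))

# Hypothetical coordinates, not taken from the paper's tables.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.2, 0.0), (2.1, 1.4, 0.1), (3.3, 1.2, -0.2)]
xs = [p[0] for p in atoms]
dx = coord_distance_matrix(xs)
sum_sq_eigs_x = trace_of_square(dx)   # equals -2 * S_x
```

Because only one conjugate pair of eigenvalues is nonzero, each of the two squared eigenvalues contributes −S_x, which is why the factor 2 appears in S_T.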
Starting from the eigenproblem approach, 20 essential amino acids that take many forms in nature are selected from available databases.

• 3D structural data for heavy atoms
• 3D distance matrix for heavy atoms

Tables 1-3 depict the Cartesian coordinate distance matrices for heavy atoms. They are antisymmetric, so their eigenvalues, in Table 4, are imaginary.

Table 4, row [Dx]: 6.065i, −6.065i

It can be observed that, unlike the eigenvalues of a symmetric matrix, we obtain a single pair of conjugate imaginary numbers regardless of the number of atoms in the compound. Another advantage of this approach is that, as shown in Table 5, the polynomial can be expressed with real-valued coefficients as a product of a polynomial of degree 2 and a monomial of degree (n − 2), leading to a faster response from the program.

Making use of the eigenproblem approach (named the OrigEig function), the other amino acids are aligned to glycine. Candidates with a lower number of atoms than the original are processed while searching for S_T similarities. The rest of the atoms are later added using a trilateration algorithm taken from the literature [17]. Some capabilities are added, such as importing original data (*.sdf or *.xyz, by the impCart function); performing *.sdf to *.xyz file conversion; removing hydrogen atoms for convenience; exporting all compared rotated structures as *.xyz (by the writexyz function); a scoring function based on the TM-score; and the creation of *.xls files. The code and its explanation can be found in the Supplementary Materials section, and a schematic overview is available in Figure 1. The requirements for this application are:

• The "in" and "results" directories, the former containing an "xyz" directory and the latter containing "aligned," "rotated," and "tables" directories
• Geometrically optimized amino acid *.xyz or *.sdf files, located in the "in" folder
• The name of the file representing the selected reference amino acid, or the number associated with the file (1 representing the first file in the "in" directory)

After the requirements are met, the original eigenproblem algorithm is run in order to be sure that the starting point of the program is a good initial alignment. Then all possible combinations with a smaller number of atoms are found by eliminating atom by atom in the AllE function. Eigenvalues are found for each combination without rotating the candidates. S_T sums are compared until the input variables are satisfied or all combinations with a minimum of three atoms are compared. Candidates are aligned by the original eigenproblem approach, rotations by π/2 that may also give a good alignment are taken into consideration, and trilateration is run. Since the TM-score compares distances between atoms of molecules, candidates are translated on top of the reference structure. Good final candidates are exported.
The following tables are exported as *.xls files in the "results\tables" directory:

1. 3D structural data for heavy atoms as T1
2. 3D distance matrix for heavy atoms as T2
4. Eigenvalues for the above Cartesian coordinate distance matrices as T6
5. Polynomials for the same Cartesian coordinate distance matrices as T7
6. A table containing data such as Table A1 available in Appendix A, but no images, named Tscore

The following files are exported as *.xyz geometry files:

• Initial *.sdf files are converted in the "in\xyz" directory.
• In the "results\aligned" directory, the results from the original eigenproblem program are exported.
• In the "results\rotated" directory, all *.xyz files related to the Tscore table can be found.

Results
Eigenvalues of all combinations of atoms are computed for each structure. The −2S_x, −2S_y, and −2S_z values of the D_x, D_y, and D_z matrices for aligned glycine are −73.557, −27.349, and −0.004, respectively, giving the sum S_T = −100.91.
Comparing alanine 005950.sdf to glycine, six possible combinations of five atoms can be found, the fifth having the sum closest to −100.91, as seen in Table 6. All possible candidates are parsed by the moreData function in the search for a lower percentage difference between S_T sums (in the indx function). The targeted percentage difference is defined by Num.low. A multiplier, Num.M, is chosen to extend the search range at the cost of time, since the best alignment might not necessarily be the one with the lowest difference between sums. In this case, the following three are chosen by the program: 1, 3, and 5.
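The candidate search described above can be sketched as follows. This is an illustration only, not the authors' MATLAB code: the function names are hypothetical, the coordinates are invented, and the way `num_low` and `num_m` interact is one possible reading of the Num.low and Num.M parameters (accept everything within the target percentage, and add extra candidates only when a single one qualifies).

```python
from itertools import combinations

def s_total(points):
    """S_T = -2*S_x - 2*S_y - 2*S_z for a list of (x, y, z) tuples."""
    total = 0.0
    for axis in range(3):
        vals = [p[axis] for p in points]
        total += -2.0 * sum((a - b) ** 2 for a, b in combinations(vals, 2))
    return total

def rank_subsets(points, size, s_ref):
    """All atom subsets of the given size, sorted by percentage difference
    between their S_T and the reference value s_ref."""
    ranked = []
    for idx in combinations(range(len(points)), size):
        st = s_total([points[i] for i in idx])
        ranked.append((abs(st - s_ref) / abs(s_ref) * 100.0, idx, st))
    return sorted(ranked)

def select(ranked, num_low, num_m):
    """Keep candidates within num_low percent; if only one qualifies,
    extend the selection by num_m next-best candidates (assumed semantics)."""
    chosen = [r for r in ranked if r[0] <= num_low]
    if len(chosen) == 1:
        chosen = ranked[:1 + num_m]
    return chosen

# Hypothetical six-atom structure compared against the glycine sum from the text.
six_atoms = [(0.0, 0.0, 0.0), (1.4, 0.3, 0.1), (2.5, -0.6, 0.0),
             (3.7, 0.2, -0.3), (2.4, -1.9, 0.2), (4.6, 1.3, 0.4)]
ranked = rank_subsets(six_atoms, 5, -100.91)
```

A six-atom structure yields six five-atom subsets, matching the count given for alanine. Widening the selection costs more alignments and scoring runs later, which is the time trade-off mentioned in the text.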
The eigenproblem approach is used on the chosen candidates to obtain an eigenvalue-wise rotation alignment. It is suggested that compounds are obtained in their correct alignment or in the mirror of the proper alignment [15]. The search is extended to these possibly good rotations (by the first "for" instruction of the align function). To obtain the positions of the remaining unmatched, unaligned atoms, a trilateration algorithm taken from the literature [17] is used (receiving data from the rest of the align function).
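To make the trilateration step concrete, the sketch below recovers the position of one removed atom from its distances to four already aligned atoms. It is a generic textbook formulation and not necessarily the algorithm of [17]: subtracting the sphere equation of the first anchor from the others gives a 3 × 3 linear system, solved here by Cramer's rule. The function names, the choice of four anchors, and the coordinates are assumptions made for illustration.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def trilaterate(anchors, dists):
    """Locate a point from its distances to four non-coplanar anchors.

    From |x - p_i|^2 = d_i^2, subtracting the equation for anchor 0 gives
    2 (p_i - p_0) . x = |p_i|^2 - |p_0|^2 - d_i^2 + d_0^2 for i = 1, 2, 3.
    """
    p0, d0 = anchors[0], dists[0]
    n0 = sum(c * c for c in p0)
    a, b = [], []
    for p, d in zip(anchors[1:4], dists[1:4]):
        a.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        b.append(sum(c * c for c in p) - n0 - d * d + d0 * d0)
    det = det3(a)
    if abs(det) < 1e-12:
        raise ValueError("anchors are coplanar; the position is ambiguous")
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]          # Cramer's rule: replace one column by b
        solution.append(det3(m) / det)
    return tuple(solution)

# Hypothetical aligned atoms and one atom to re-attach.
anchors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
true_pos = (0.3, -0.4, 1.2)
dists = [math.dist(true_pos, p) for p in anchors]
recovered = trilaterate(anchors, dists)
```

With only three anchors, two mirror-image positions satisfy the distances, which is one reason a fourth reference atom is used in this sketch.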
Since one of these rotations should lead to a good superposition of the two amino acids, the mean values on each of the axes are found for selected atoms of both structures. The selection is based on atoms indexed in the candidate search presented in Table 6. Subtracting for each of the axes, the candidate structure is translated on top of glycine (by the trans function).
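The translation step amounts to matching the mean positions of the selected atoms. A minimal sketch, assuming hypothetical function names and invented coordinates (the actual trans function may differ in detail):

```python
def centroid(points, idx):
    """Mean x, y, z of the atoms whose indexes are listed in idx."""
    n = len(idx)
    return tuple(sum(points[i][k] for i in idx) / n for k in range(3))

def translate_onto(candidate, cand_idx, reference, ref_idx):
    """Shift the whole candidate so that the centroid of its selected atoms
    coincides with the centroid of the selected reference atoms."""
    cc = centroid(candidate, cand_idx)
    rc = centroid(reference, ref_idx)
    shift = tuple(rc[k] - cc[k] for k in range(3))
    return [tuple(p[k] + shift[k] for k in range(3)) for p in candidate]

# Hypothetical data: a 3-atom reference and a 4-atom candidate,
# of which atoms 0, 1 and 3 were selected in the candidate search.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
candidate = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (9.0, 9.0, 9.0), (5.0, 6.0, 5.0)]
moved = translate_onto(candidate, [0, 1, 3], reference, [0, 1, 2])
```

All candidate atoms are shifted, including those that were not selected, so atoms attached later by trilateration keep their relative positions.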
For the resulting candidate combinations, distances are found between pairs of a number of atoms. The MATLAB function matchpairs is used to find atoms that will be superposed, based on a linear assignment problem that allows for minimum-cost solutions. These pairs are introduced into a scoring function chosen from the literature, in this case the geometric part of UniAlign-TMscore [2]. All these are executed by the choice function. One change was made since our chosen structures contain a small number of atoms: the 15 subtraction was set to 0 so that we obtained a positive distance under the square root of the empirical scaling factor for distance normalization, d0. This can be modified in empi3. Other scoring functions may be applied. The best result for alanine is superposed in Figure 2 in tube style, on top of glycine, which is presented in ball-and-stick-with-noncolored-bond style. The best score for each compared structure is exported to the final results in Table A1 available in Appendix A. Using another parameter (Num.low2), scores close to it are added. Elements selected for candidates with fewer atoms are presented in the table since they help make an easy choice between close scores. A *Tscore.xls file is generated at the end of the choice function.
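The pairing and scoring steps can be sketched in a few lines. The example is an approximation under stated assumptions: it uses the commonly published TM-score form, with d0 = 1.24 (L − offset)^(1/3) − 1.8 and offset = 15 in the standard definition, and it replaces MATLAB's matchpairs with a brute-force search over permutations, which is only practical for structures with a handful of atoms. The function names and coordinates are hypothetical, and the geometric part of UniAlign-TMscore may differ in detail.

```python
import math
from itertools import permutations

def match_pairs(reference, candidate):
    """Minimum-total-distance assignment of each reference atom to a distinct
    candidate atom (brute force; a stand-in for a linear assignment solver).
    The candidate must have at least as many atoms as the reference."""
    n = len(reference)
    best = min(
        permutations(range(len(candidate)), n),
        key=lambda perm: sum(math.dist(reference[i], candidate[j])
                             for i, j in enumerate(perm)),
    )
    return list(enumerate(best))

def tm_like_score(reference, candidate, offset=0.0):
    """TM-score-style value in (0, 1]; offset=0 mirrors the change described
    in the text, offset=15 would be the standard protein setting."""
    l_ref = len(reference)
    if l_ref - offset <= 0:
        raise ValueError("L - offset must be positive for the cube root")
    d0 = 1.24 * (l_ref - offset) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        raise ValueError("d0 must be positive")
    pairs = match_pairs(reference, candidate)
    total = sum(1.0 / (1.0 + (math.dist(reference[i], candidate[j]) / d0) ** 2)
                for i, j in pairs)
    return total / l_ref

# Hypothetical five-atom reference; the candidate is the same set in reverse order.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.3, 0.0),
             (3.6, 1.2, 0.1), (1.9, 2.5, -0.1)]
same = list(reversed(reference))
shifted = [(x + 0.5, y, z) for x, y, z in reference]
score_same = tm_like_score(reference, same)
score_shifted = tm_like_score(reference, shifted)
```

With five atoms the standard offset of 15 would leave a negative value under the root, which is why the function rejects it and why the text sets the subtraction to 0.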

Discussion
The TM-score can be used to select a best match from all candidates found by the eigenproblem algorithm, as seen in Tables A1 and A2 of Appendix A. Of the total of 19 amino acids aligned to glycine, 13 results are singular high-confidence alignments, of which 11 give a high TM-score.Another three (cysteine, lysine, and arginine) give two possible good results each, and the TM-score can be used to distinguish the best one.
The program makes some mismatches. For example, in the case of glutamine 005961, the best score is found for a four-atom alignment instead of the correct five-atom alignment case number 483. Another difficulty can be observed in the cases of tryptophan 006305 and glutamic acid 033032, where a small score is given to the aligned case numbers 4/115, which are the only ones with elemental similarities, as depicted in Table 7.
Cysteine is the second amino acid taken as a reference for alignment, and all the candidates that our program outputs are depicted in Table A2 of Appendix A. From the total of 19 amino acids aligned to cysteine, six results can be chosen by the highest TM-score, of which two are singular results.Another five give two or more possible good results each, and the TM-score can be used to distinguish the best one.
The following eight mismatches are presented for cysteine, of which the first four are available in Table 8:

• In the case of alanine 005950, a small score is given to the aligned case number 269, which is the only one with elemental similarity.
• For valine 006287, threonine 006288, and arginine 006322, the best scores are found for candidates with a lower number of aligned atoms. The best candidates with more aligned atoms are 006287-1, 006288-1, and 6322-19.
• The outputs for aspartic acid 005960, lysine 005962, histidine 006274, and tryptophan 006305 did not contain the expected alignments.
As stated above, a parameter is introduced such that close scores are not ignored. In this case, a score of at least 80% of the maximum is accepted for output. This percentage can be set in the Num.low2 parameter. This is needed so that the best alignment is included in the results even when it is not the one with the highest TM-score.
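The effect of this threshold can be shown with a short sketch. The function name, candidate labels, and scores below are hypothetical; only the 80% rule comes from the text.

```python
def keep_close_scores(scored, num_low2=80.0):
    """Return (label, score) pairs whose score is at least num_low2 percent
    of the best score, ordered from highest to lowest score."""
    best = max(score for _, score in scored)
    cutoff = best * num_low2 / 100.0
    kept = [(label, score) for label, score in scored if score >= cutoff]
    return sorted(kept, key=lambda item: -item[1])

# Hypothetical candidates and TM-scores for one aligned structure.
scored = [("cand-1", 0.62), ("cand-3", 0.51), ("cand-5", 0.40), ("cand-7", 0.50)]
exported = keep_close_scores(scored)
```

Here the cutoff is 0.496, so three of the four candidates are kept for inspection, and the atom types of each can then be used to choose between them.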
Another easy way to choose from these candidates is to view the chosen elements and eliminate candidates that might have close numerical scores but wrong atom types. Other scoring functions or a combination of such means could lead to even better results. The use and applicability of the eigenproblem go beyond the alignment of molecules [15] and biochemical similarity. Recent reports include analysis of regular graphs for their properties, including eigen-spectra and automorphisms [18]; molecular topology [19-22]; characteristic equations and principal component decomposition [23]; algebraic topology and generalized Bertrand curves [24]; treatment of fuzzy decisions [25] and tridiagonal matrices [26]; commutator tables and Laplacian [27]; and systems of differential [28] and integro-differential [29] equations, while challenging problems appear in polynomial root evaluation [30] and the characteristic equation of a square matrix of a great order [31].

Conclusions
An application of the eigenproblem was elaborated, aiming to find the best geometric alignment of selected amino acids with regard to each other.
We can conclude that the best alignment does not obey a strict trend. Close results of the same algorithm should be taken into account. Even after running a scoring function, the alignment with the highest score is not always the best alignment.
To reduce the number of rotations for which a scoring function is run, the present algorithm needs to be restricted with a few parameters. In addition, a combination of multiple approaches could lead to faster results.
Taking glycine as a reference, 84% of the best alignments can be numerically identified by a scoring function such as the TM-score, with 68% exported as single candidates, meaning that the restrictive parameters are relevant to the present comparison. For cysteine, only 58% can benefit from the presented scoring function. An extensive database would reveal a logical way of choosing them and help training for machine learning.
After running the present algorithm with the other amino acids as a reference, the correct alignment was numerically detected by the TM-score 70% of the time, on average, and 15% more cases with close scores can be easily distinguished by human observation. The present algorithm can be sped up by full vectorization. Machine learning needs to be added to scoring functions as a means to reduce the impact of limited description capabilities and predetermined theory-inspired functional form. These shortcomings can be solved by not imposing a strict algorithm but letting machine learning capture properties that are hard to model because of many unmeasured, unknown, or undiscovered quantitative structure-activity relationships (QSAR). Machine learning can assimilate the fast-growing volume of high-quality structural and interaction data found in the literature.

Figure 1. A schematic overview of the algorithm.

Figure 2. A 3D view of the best alignment of alanine to glycine.

• Input variable Num.M, which defines how many extra candidates can be taken into consideration in case Num.low is satisfied by only one candidate
• Input variable Num.low, which defines the target percentage differences between S_T of two candidates in order to accept and stop searching for candidates with fewer atoms
• Input variable Num.low2, which defines the percentage of the maximum found TM-score such that even lower-scored candidates are exported in *.xls tables and *.

Table 6. All combinations of five atoms in the case of alanine and their S_T sums.

Table 7. 3D views of the problematic choice of alignments for glycine.

Table 8. 3D views of the problematic choice of alignments for cysteine.

Table A1. All structures aligned to glycine; their candidate indexes as exported by the program in *.xyz format; TM-scores; selected elements; and −2S_x, −2S_y, and −2S_z.

Columns: 3D Views of Alignment | Aligned Structure and Index | TM-Score | Selected Atoms from 000750 | −2S_x, −2S_y, and −2S_z of the Reference Candidate | Selected Atoms from the Aligned Structure

OONCC | −53.2066 | −1.5086 | −15.0372

Table A2. All structures aligned to cysteine; their candidate indexes as exported by the program in *.xyz format; TM-scores; selected elements; and −2S_x, −2S_y, and −2S_z.