Three-Dimensional Graph Matching to Identify Secondary Structure Correspondence of Medium-Resolution Cryo-EM Density Maps

Cryo-electron microscopy (cryo-EM) is a structural technique that has played a significant role in protein structure determination in recent years. Compared to the traditional methods of X-ray crystallography and NMR spectroscopy, cryo-EM is capable of producing images of much larger protein complexes. However, cryo-EM reconstructions are limited to medium-resolution (~4–10 Å) for some cases. At this resolution range, a cryo-EM density map can hardly be used to directly determine the structure of proteins at atomic level resolutions, or even at their amino acid residue backbones. At such a resolution, only the position and orientation of secondary structure elements (SSEs) such as α-helices and β-sheets are observable. Consequently, finding the mapping of the secondary structures of the modeled structure (SSEs-A) to the cryo-EM map (SSEs-C) is one of the primary concerns in cryo-EM modeling. To address this issue, this study proposes a novel automatic computational method to identify SSEs correspondence in three-dimensional (3D) space. Initially, through a modeling of the target sequence with the aid of extracting highly reliable features from a generated 3D model and map, the SSEs matching problem is formulated as a 3D vector matching problem. Afterward, the 3D vector matching problem is transformed into a 3D graph matching problem. Finally, a similarity-based voting algorithm combined with the principle of least conflict (PLC) concept is developed to obtain the SSEs correspondence. To evaluate the accuracy of the method, a testing set of 25 experimental and simulated maps with a maximum of 65 SSEs is selected. Comparative studies are also conducted to demonstrate the superiority of the proposed method over some state-of-the-art techniques. The results demonstrate that the method is efficient, robust, and works well in the presence of errors in the predicted secondary structures of the cryo-EM images.


Introduction
Proteins are essential components of all organisms and perform most of the tasks in living species. To study the relationship between protein structure and function, it is necessary to have access to precise three-dimensional (3D) structural information [1]. Hence, understanding protein structure is of great interest to biologists. Traditionally, protein structures have been obtained using experimental techniques such as X-ray crystallography and NMR spectroscopy. X-ray crystallography has been used to study thousands of protein complexes that are crystallizable. NMR spectroscopy is limited to small molecules with a molecular mass of less than 50 kDa. Therefore, neither of these techniques can be used to study molecular complexes as they are found in nature, in their near-native state [2]. More recently, cryo-electron microscopy (cryo-EM) has emerged as an experimental technique that addresses most of the scalability concerns of the traditional techniques by being able to image large macromolecular complexes, such as ribosomes and viruses, in their native conformations. This widely used technique does not require crystallization before data acquisition and is applicable to molecules larger than ~100 kDa [3,4]. In recent years, there have been significant advances in cryo-EM imaging techniques [5]. However, for some cases, the cryo-EM reconstructions are limited to medium resolution (~4-10 Å), where the secondary structure elements can be computationally and visually identified, but not the individual amino acid residues [6]. This lack of atomic-level resolution leads to many computational challenges for protein 3D structure determination. For density maps at high resolution (~2-4 Å), the backbone is recognizable, and the protein structure at the atomic level can be directly derived.
However, for low-resolution (~10-25 Å) or medium-resolution (~4-10 Å) density maps, the backbone of the protein and the atomic information cannot be directly obtained from the cryo-EM maps. This limitation has motivated the development of many computational methods that use the medium-resolution cryo-EM map to collect protein structural information [7][8][9][10][11][12][13][14][15]. The cryo-EM modeling pipeline involves several major steps, such as extracting the secondary structure elements from a cryo-EM density map and matching them to a sequence/model, the Cα placement of SSEs, building an atomic structure, and structure optimization [6]. One of the most challenging and critical steps is finding the mapping of the secondary structures of the modeled structure to the cryo-EM map, because this step provides the initial anchor points for locating the Cα atoms and constructing the protein backbone. The precise identification of SSEs correspondence makes it possible to produce an accurate initial 3D structure of a protein that can be refined further by later steps in the model-building pipeline.
At medium resolution, the analysis of cryo-EM maps relies on the availability of known protein structures obtained by other high-resolution experimental methods (X-ray crystallography, NMR). When an atomic structure from other sources is not accessible, a de novo modeling approach can be used [9][16][17][18][19][20]. Abeysinghe et al. [16] addressed the α-helix correspondence problem through shape matching by modeling both the 1D sequence and the 3D volume as attributed relational graphs. Furthermore, they developed Gorgon [21], an interactive molecular modeling toolkit with a visualization platform. Al Nasr et al. developed a weighted directed graph to solve the secondary structure assignment and presented an approach to enumerate the top-ranked topologies instead of enumerating all possible topologies [18]. The authors conducted another study, DP-TOSS, to solve the topology determination problem on a layered graph by incorporating dynamic programming into a constrained k-shortest path algorithm [19]. DP-TOSS was compared with Gorgon in our previous study [19]. The results indicated that DP-TOSS was superior to Gorgon. Afterwards, Biswas et al. [22] enhanced the performance of DP-TOSS by combining the information from multiple secondary structure prediction servers. They used several kinds of structural information, such as the length of secondary structures, the loop length, and the skeleton between two secondary structure traces, as a scoring function. Al Nasr et al. further enhanced the accuracy of DP-TOSS with more efficient scoring: a skeleton-based scoring function, a geometry-based function, and a multi-well potential energy-based function [20].
When a high-resolution structure is available for a cryo-EM map of insufficient resolution, fitting methods, which are categorized into flexible and rigid-body fitting, can be used to derive the atomic structure from the cryo-EM map [9,12,14,17]. Early studies concentrated on searching for the position and orientation of a protein's secondary structure components that best overlap with the SSEs extracted from a cryo-EM density map [23][24][25][26]. Dou et al. proposed a flexible fitting of an atomic structure into a cryo-EM map that is guided by the correspondences between α-helices in the atomic model and the cryo-EM map [27]. In the work of [28], a computational method is presented to quantify the agreement between two sets of central axes of α-helices, one from atomic structures and one from cryo-EM maps. It uses an arc-length association strategy to characterize the lateral and the longitudinal differences of the two axes.
Our approach in this study is to introduce a novel geometrical matching approach to find the correct matches between SSEs-C and SSEs-A (SSEs correspondence). The central theme of our approach is to cast the SSEs mapping problem as that of three-dimensional graph matching. For this purpose, the SSEs matching problem is formulated as a 3D vector matching problem in Cartesian coordinate space. Then, the 3D vector matching problem is transformed into a 3D graph matching problem. To solve the 3D graph matching problem, three novel mathematical-based features, as well as two robust statistical scoring functions, are proposed. Finally, to obtain the final SSEs assignment among all possible ones, a similarity-based voting algorithm combined with the PLC concept is developed. Furthermore, the results show the superiority of the proposed method compared to some of the state-of-the-art techniques.

Materials and Methods
In this section, an automatic assignment method for finding the SSEs correspondence in three-dimensional space is proposed. An overview of the method is illustrated in Figure 1. The method takes the modeled structure and the medium-resolution cryo-EM density map as inputs (Figure 1a,b) and produces the SSEs correspondence as output. Initially, in the preprocessing step, the α-helices and β-strands from the modeled structure (SSEs-A) and the cryo-EM map (SSEs-C) are extracted (Figure 1c,d). Then, the extracted SSEs from both the structure and the map are represented as vectors in the three-dimensional Cartesian coordinate system (Figure 1e,f). After that, using three mathematical-based features (i.e., angle, Euclidean distance, and relative length), the 3D vector matching problem is transformed into a 3D graph matching problem (Figure 1g,h). To solve the 3D graph matching problem, two robust statistical scoring functions, the Bhattacharyya distance (BD) and the modal assurance criterion (MAC), are proposed. At the end, a similarity-based voting algorithm is developed (Figure 1i) to extract the SSEs correspondence.
Figure 1. Different stages of the framework pipeline: (a) the inputs, including the modeled structure (PDB ID: 1BJ7, chain A) visualized by Chimera [29]; (b) the density map simulated at 10 Å resolution using protein structure 1BJ7 and Chimera package [29]; (c) the secondary structure elements extracted from the 3D modeled structure in the preprocessing step (SSEs-A); (d) the secondary structure elements extracted from the cryo-EM density map (SSEs-C); (e) the 3D vectors constructed based on the extracted SSEs-A; (f) the 3D vectors constructed based on the extracted SSEs-C; (g,h) the 3D graphs are constructed; (i) the similarity-based voting algorithm is proposed as a decision making strategy for finding the SSEs correspondence; (j) the secondary structure elements correspondence.

Preprocessing
In this step, the model, generated by I-TASSER [30][31][32][33], and the cryo-EM density map are used as initial inputs and the geometrical features are returned as outputs. Generally, protein modeling can be performed using various modeling tools such as Modeller [34], AlphaFold [35,36], RaptorX [37][38][39], and I-TASSER. I-TASSER (Zhang-Server) and AlphaFold (A7D) are two efficient and robust methods, which are based on deep residual-convolutional networks. AlphaFold utilizes artificial intelligence and deep learning methods to generate the 3D structure of proteins. The AlphaFold framework is based on a deep two-dimensional convolutional residual network that enables it to create high-accuracy structures even for sequences with few homologous sequences. I-TASSER is developed for automated protein structure prediction and constructs models by collecting high-scoring structural templates through threading approaches. Its hierarchical architecture is composed of four steps: threading, structural assembly, model selection, and structure-based functional annotation. I-TASSER finds protein templates with similar super-secondary structures in the Protein Data Bank (PDB) through LOMETS [40,41]. Then, the segments extracted from the templates are reassembled through replica-exchange Monte Carlo simulations. The quality of the generated model is assessed based on the reliability of the threading templates and the convergence parameters of the structural assembly. The server was successful in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition in recent years. Hence, in this study, the authors opted for I-TASSER, which is available at (https://zhanggroup.org/I-TASSER/, accessed on 30 September 2021), due to its simplicity and high accuracy.
The geometrical features are Cartesian coordinate voxels of the SSEs (α-helices and β-strands). For clarity, the α-helices and β-strands are the primary elements of the secondary structures, as illustrated in Figure 2. These elements are formed by amino acid residues. Each residue consists of four primary atoms (N, Cα, C, and O). The Cα atom is the most important one in the backbone of the SSEs. For the first input (i.e., the 3D model), all the Cα coordinates of the SSEs-A (the geometrical location of the backbone alpha carbon of the α-helices and β-strands) are extracted. The second input is the cryo-EM map. In a medium-resolution cryo-EM map, the secondary structure components can be observed as density rods [17]. Various computational methods, such as SSEhunter [42], SSELearner [43], SSETracer [44], and Emap2sec [45], have been developed to detect the position, orientation, and length of α-helices and β-strands in cryo-EM images. In this study, the Cartesian coordinate voxels of the SSEs-C have been extracted using SSETracer [44].
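The extraction of Cα coordinates from the modeled structure can be sketched as follows. This is a minimal illustration that assumes the model is available as a standard fixed-column PDB file; the function name `extract_ca_coordinates` and the `chain_id` parameter are hypothetical and are not taken from the study.

```python
def extract_ca_coordinates(pdb_lines, chain_id="A"):
    """Return [(residue_number, (x, y, z)), ...] for the C-alpha atoms of one chain.

    Hypothetical helper: the study does not describe its own parser.
    """
    ca_atoms = []
    for line in pdb_lines:
        # Fixed-column PDB ATOM record (1-based columns):
        # atom name 13-16, chain ID 22, residue number 23-26,
        # x 31-38, y 39-46, z 47-54.
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA" or line[21] != chain_id:
            continue
        res_num = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        ca_atoms.append((res_num, xyz))
    return ca_atoms
```

Restricting the result to the residues that belong to α-helices and β-strands would additionally require the secondary structure ranges of the model, which are not handled in this sketch.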
Biomolecules 2021, 11, x 5 of 19

Construction of 3D Vectors from SSEs-A and SSEs-C
This study aims to find the correspondence between the α-helices and β-strands detected in the cryo-EM map and those extracted from the modeled structure. To this end, the SSEs extracted from the map and the 3D model are converted to 3D vectors in the Cartesian coordinate system. For visualization, a simple α-protein, 1FLP (PDB ID), is selected from the data set of interest, as demonstrated in Figure 3. The start and end voxels of the SSEs-A have been utilized to construct the 3D vectors (Figure 3a,b). Since no information about the Cα atoms is available in the medium-resolution cryo-EM map, the coordinate voxels of the central axis of the SSEs-C have been used to construct the 3D vectors (Figure 3c,d).
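The conversion of one SSE into a 3D vector can be sketched as below. The sketch assumes that each SSE is given as an ordered list of points: the Cα coordinates for an SSEs-A element, or the central-axis voxels for an SSEs-C element. The function name `sse_vector` and the returned dictionary layout are illustrative choices, not part of the original method description.

```python
import math


def sse_vector(points):
    """Build a 3D vector from the first to the last point of one SSE.

    points: ordered list of (x, y, z) tuples (C-alpha atoms or axis voxels).
    """
    start, end = points[0], points[-1]
    direction = tuple(e - s for s, e in zip(start, end))
    length = math.sqrt(sum(d * d for d in direction))
    midpoint = tuple((s + e) / 2.0 for s, e in zip(start, end))
    return {
        "start": start,
        "end": end,
        "direction": direction,
        "length": length,
        "midpoint": midpoint,
    }
```

Only the two end points define the vector here; intermediate points are ignored, so a strongly curved helix is approximated by its chord.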

Figure 3. (a) The atomic model of 1FLP (PDB ID) shown with Chimera [29]; (b) each α-helix in the atomic model is considered as a helix vector (HV) in the Cartesian coordinate system ($R^3_{SSEs-A}$); (c) the cryo-EM density map and the SSEs-C detected on it. The map is simulated at 10 Å resolution using protein structure 1FLP (PDB ID). The location of SSEs-C is illustrated as purple cylinders with Gorgon [21]; (d) extracted SSEs-C on the map considered as stick vectors (SV) in three-dimensional Cartesian space ($R^3_{SSEs-C}$).

Three-Dimensional Vector Matching
In order to solve the vector matching problem, three effective mathematical-based features, which are the angle, the Euclidean distance, and the relative length, are proposed. These features are computed with the aid of all vectors in $R^3_{SSEs-A}$ and $R^3_{SSEs-C}$. Afterward, the 3D vector matching problem is transformed into the 3D graph matching problem based on the extracted features. The construction of the graph is elaborated in the following.

Construction of Weighted Fully Connected Graphs of SSEs-A and SSEs-C
Based on the problem at hand, the central idea of the method is to find the correspondence between the constructed 3D vectors of $R^3_{SSEs-A}$ and $R^3_{SSEs-C}$. Hence, two weighted fully connected graphs (i.e., $G_{SSEs-A}$ and $G_{SSEs-C}$) have been constructed from $R^3_{SSEs-A}$ and $R^3_{SSEs-C}$. Figure 4 illustrates the transformation of the 3D vectors to the 3D graphs. For the sake of simplicity, only the relevant edges of one node in the weighted fully connected graphs are illustrated.
Let $A = (A_1, A_2, \ldots, A_m)$ be the set of SSEs-A detected from the atomic structure and $C = (C_1, C_2, \ldots, C_n)$ be the set of SSEs-C extracted from the cryo-EM map. The weighted graphs of SSEs-A and SSEs-C are undirected fully connected graphs, each represented as a 4-tuple: $G_{SSEs-A} = (N_A, E_A, V_A, W_A)$ and $G_{SSEs-C} = (N_C, E_C, V_C, W_C)$, respectively. Since both graphs are constructed in the same way, only the construction of $G_{SSEs-A}$ is elaborated below.
Given $G_{SSEs-A} = (N_A, E_A, V_A, W_A)$, the first element, $N_A$, is a nonempty set of nodes that represent the vectors of SSEs-A in 3D space. $|N_A|$ denotes the number of nodes, which is equal to the number of vectors in $R^3_{SSEs-A}$. The second element, $E_A$, is the set of edges representing all possible interactions between nodes. The third element, $V_A$, is a set of node labels defined by the spatial positions of the Cα atoms. Each SSEs-A node is labeled with a pair of coordinate triples ⟨x, y, z⟩ taken from the start and end points of its vector, that is, the first and the last Cα coordinate voxels of the SSE, which correspond to the start and end voxels of the SSEs-A vector (HV). The last element, $W_A$, assigns weights to the edges of the graph according to the mathematical-based features. The three graphs built from the three mathematical-based features are as follows:
i. Angle-based fully connected graph ($G^{Angle}_{SSEs-A}$): This graph uses the angle between vectors to assign weights to the edges. The weight of each edge is calculated from the angle between the two vectors it connects (Equation (1)).
ii. Euclidean distance-based fully connected graph ($G^{ED}_{SSEs-A}$): This graph uses the Euclidean distance (ED) metric to assign weights to the edges. The weight of each edge is the Euclidean distance between the midpoints of the two vectors (Equation (2)).
iii. Relative length-based fully connected graph ($G^{RL}_{SSEs-A}$): This graph determines the weight of each edge from the relative length (RL) of the two vectors, computed based on Equation (3).
According to the three graphs described above, three weighted adjacency matrices have been constructed for $G_{SSEs-A}$. Following the same principle, three graphs and three weighted adjacency matrices have been constructed for $G_{SSEs-C}$. The $G_{SSEs-A}$ and $G_{SSEs-C}$ matrices are m × m and n × n, respectively. The matrices have the following characteristics:
• All entries on the main diagonal are zero ($x_{ii} = 0$);
• The matrices are symmetric ($x_{ij} = x_{ji}$).
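The construction of the three weighted adjacency matrices can be sketched as follows. Equations (1)-(3) are not reproduced in this text, so the weight functions below are assumptions: the angle is taken as the arccosine of the normalized dot product (in degrees), the distance is measured between vector midpoints as stated above, and the relative length is taken as the ratio of the shorter to the longer vector, which is only a plausible stand-in for Equation (3). All function names are hypothetical.

```python
import math


def _direction(vec):
    (sx, sy, sz), (ex, ey, ez) = vec
    return (ex - sx, ey - sy, ez - sz)


def _norm(d):
    return math.sqrt(sum(c * c for c in d))


def _midpoint(vec):
    start, end = vec
    return tuple((s + e) / 2.0 for s, e in zip(start, end))


def angle_weight(u, v):
    """Angle between two vectors in degrees (assumed form of Equation (1))."""
    du, dv = _direction(u), _direction(v)
    cos_t = sum(a * b for a, b in zip(du, dv)) / (_norm(du) * _norm(dv))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def distance_weight(u, v):
    """Euclidean distance between the midpoints of two vectors."""
    return math.dist(_midpoint(u), _midpoint(v))


def relative_length_weight(u, v):
    """Shorter length divided by longer length (assumed stand-in for Equation (3))."""
    lu, lv = _norm(_direction(u)), _norm(_direction(v))
    return min(lu, lv) / max(lu, lv)


def build_adjacency(vectors, weight_fn):
    """Symmetric n x n weight matrix with a zero main diagonal.

    vectors: list of (start, end) pairs of (x, y, z) tuples.
    """
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = weight_fn(vectors[i], vectors[j])
    return matrix
```

Calling `build_adjacency` once per weight function on the SSEs-A vectors and once per weight function on the SSEs-C vectors yields the six matrices used in the scoring step.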
In the following phase of the study, to compute the similarity of the nodes between the $G_{SSEs-A}$ and $G_{SSEs-C}$ graphs, two robust statistical scoring functions, BD and MAC, have been proposed. The Bhattacharyya distance (BD) computes the distance between two probability distributions or variables based on the statistical moments of the data [46]. This statistical indicator has been widely applied in signal processing [47], image processing [48], speaker recognition [49], and pattern recognition [50]. In this study, the metric is used to measure the geometrical similarity and to calculate the distance between all nodes of the $G_{SSEs-A}$ and $G_{SSEs-C}$ graphs. For clarity, suppose that $r^{SSEs-A}_i$ indicates the weights of all adjacent edges for the ith SSEs-A node and $r^{SSEs-C}_j$ indicates the weights of all adjacent edges for the jth SSEs-C node. The similarity score between these two nodes of $G_{SSEs-A}$ and $G_{SSEs-C}$ is computed with the BD formula (Equation (4)). The calculated distance score (BD) determines the relative closeness of two nodes in two peer graphs. The BD scoring function varies between 0 and 1 (0 ≤ BD ≤ 1), where BD = 0 indicates two highly similar nodes and larger values indicate less similar nodes. We applied the BD scoring function to all nodes of the three pairs of peer graphs (i.e., $\langle G^{Angle}_{SSEs-A}, G^{Angle}_{SSEs-C} \rangle$, $\langle G^{ED}_{SSEs-A}, G^{ED}_{SSEs-C} \rangle$, $\langle G^{RL}_{SSEs-A}, G^{RL}_{SSEs-C} \rangle$) to obtain the initial correspondence set for each pair of graphs.
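A moment-based BD score between two rows of peer adjacency matrices can be sketched as follows. Equation (4) is not reproduced in this text, so the sketch assumes the common closed form for two univariate normal distributions, which uses only the mean and variance of each row. This form is non-negative but not bounded above by 1, so any rescaling to the 0-1 range stated above would be an additional step that is not shown. The names `bhattacharyya_distance` and `best_matches`, and the `eps` guard against zero variance, are hypothetical.

```python
import math


def mean_and_variance(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def bhattacharyya_distance(row_a, row_c, eps=1e-12):
    """Bhattacharyya distance between two rows, assuming each is normally distributed."""
    mu_a, var_a = mean_and_variance(row_a)
    mu_c, var_c = mean_and_variance(row_c)
    # Avoid division by zero for constant rows.
    var_a, var_c = max(var_a, eps), max(var_c, eps)
    term_var = 0.25 * math.log(0.25 * (var_a / var_c + var_c / var_a + 2.0))
    term_mean = 0.25 * (mu_a - mu_c) ** 2 / (var_a + var_c)
    return term_var + term_mean


def best_matches(matrix_a, matrix_c, score=bhattacharyya_distance):
    """For each SSEs-A row, return the index of the SSEs-C row with the smallest distance."""
    return [
        min(range(len(matrix_c)), key=lambda j: score(row_a, matrix_c[j]))
        for row_a in matrix_a
    ]
```

Because only the mean and variance of each row enter the score, rows of different lengths (m ≠ n) can be compared directly. The result of `best_matches` is not guaranteed to be one-to-one, which is one reason a later decision step is needed.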
The second proposed scoring function, the modal assurance criterion (MAC), is a robust statistical metric that provides a measure of consistency between two linear arrays [51,52]. The metric originates in modal analysis, where it computes a measure of consistency between the experimental and the analytical modal arrays. In this study, the MAC is used as a scoring function to calculate the similarity of nodes in each pair of peer graphs based on Equation (5). Similar to the BD scoring function, the MAC metric takes two rows (i.e., $r^{SSEs-A}_i$ and $r^{SSEs-C}_j$) of two peer matrices (e.g., $G^{Angle}_{SSEs-A}$, $G^{Angle}_{SSEs-C}$) as inputs and computes the similarity score. The generated similarity score is in the range of 0-1, where a score of zero indicates no consistency between the two peer nodes of the graphs, and a score of one indicates complete consistency.
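The MAC score between two rows can be sketched with the standard definition from modal analysis, the squared dot product divided by the product of the two self dot products. Equation (5) is not reproduced in this text, so its agreement with this form is an assumption. The standard definition requires arrays of equal length; how rows from an m × m and an n × n matrix are aligned when m ≠ n is not described here, so this sketch simply rejects unequal lengths. The function name is hypothetical.

```python
def modal_assurance_criterion(row_a, row_c):
    """Standard MAC between two equal-length real arrays; the result lies in [0, 1]."""
    if len(row_a) != len(row_c):
        raise ValueError("MAC needs arrays of equal length")
    dot_ac = sum(a * c for a, c in zip(row_a, row_c))
    dot_aa = sum(a * a for a in row_a)
    dot_cc = sum(c * c for c in row_c)
    if dot_aa == 0 or dot_cc == 0:
        # An all-zero row carries no shape information.
        return 0.0
    return (dot_ac * dot_ac) / (dot_aa * dot_cc)
```

The score is invariant to scaling either row by a constant, so two rows with the same shape but different magnitudes still score 1.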
After applying the two aforementioned distance/similarity scoring functions on the three peer graphs, three candidate SSEs correspondence sets were generated. To extract the final SSEs correspondence among the three obtained candidate SSEs correspondence sets, a similarity-based voting algorithm has been developed.

Similarity-Based Voting Algorithm (SimVA)
The similarity-based voting algorithm (SimVA) has been proposed as a decision-making strategy to extract the final SSEs correspondence from the three generated correspondence sets. SimVA takes the three obtained correspondence sets as inputs and generates the final SSEs correspondence as output. The final correspondences are extracted in three steps: (i) unanimous voting, (ii) majority voting, and (iii) the principle of least conflict (PLC). These steps are presented in detail below.

Unanimous Voting
In this step, the SimVA algorithm accepts an assignment if it appears in all three candidate correspondence sets. In other words, if the ith SSEs-A matches the jth SSEs-C according to all three mathematical-based features (angle, Euclidean distance, and relative length), the assignment is considered highly reliable and is reported as an acceptable assignment.

Majority Voting
This routine accepts an assignment when it appears in two of the three candidate correspondence sets. For example, if the ith SSEs-A matches the jth SSEs-C according to two of the three mathematical-based features, it is considered an acceptable assignment and is inserted into the final correspondence set.

Principle of Least Conflict
The main idea behind the principle of least conflict (PLC) is to resolve the assignments that remain unselected after the two previous steps. In this step, the assignment with the minimum conflict is identified and selected as an acceptable assignment. The minimum conflict assignment is the <SSEs-A, SSEs-C> pair that has the least conflict with the other pairs. As an example, if the 1st SSEs-A should match the 4th SSEs-C (i.e., the pair <1, 4> is a true assignment), all the other assignments for the 1st SSEs-A (e.g., <1, 2>, <1, 3>, . . . <1, n>) are considered conflict pairs. Likewise, for the 4th SSEs-C, all assignments other than <1, 4> are also in conflict (e.g., <2, 4>, <3, 4>, . . . <m, 4>). After the conflict pairs have been detected for all assignments, the number of conflict pairs for each assignment is counted and the pair with the minimum number of conflicts is selected as an acceptable assignment. This concept allows the SimVA algorithm to continue in any iteration in which the two voting routines cannot produce an assignment. At the end, all the acceptable assignments obtained from the SimVA algorithm constitute the final SSEs correspondence.
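The three SimVA steps can be sketched as follows. This is one possible reading of the description above, not the authors' implementation: each candidate correspondence set is assumed to be a dictionary mapping an SSEs-A index to an SSEs-C index, conflicts are counted only among the candidate pairs that are still unassigned, and ties are broken by sorted pair order. The function name `simva` is hypothetical.

```python
from collections import Counter


def simva(candidate_sets):
    """Combine three candidate correspondence sets into one final correspondence.

    candidate_sets: three dicts mapping SSEs-A index -> SSEs-C index.
    """
    votes = Counter()
    for mapping in candidate_sets:
        votes.update(set(mapping.items()))

    final, used_a, used_c = {}, set(), set()

    def is_free(pair):
        return pair[0] not in used_a and pair[1] not in used_c

    def accept(pair):
        final[pair[0]] = pair[1]
        used_a.add(pair[0])
        used_c.add(pair[1])

    # Steps (i) and (ii): unanimous pairs (3 votes), then majority pairs (2 votes).
    for needed in (3, 2):
        for pair in sorted(p for p, v in votes.items() if v == needed):
            if is_free(pair):
                accept(pair)

    # Step (iii): principle of least conflict on the remaining candidates.
    while True:
        remaining = sorted(p for p in votes if is_free(p))
        if not remaining:
            break

        def conflicts(pair):
            # A conflicting pair shares the SSEs-A or the SSEs-C index.
            return sum(
                1
                for q in remaining
                if q != pair and (q[0] == pair[0] or q[1] == pair[1])
            )

        accept(min(remaining, key=conflicts))
    return final
```

Each accepted pair removes its SSEs-A and SSEs-C indices from further consideration, so the returned correspondence is one-to-one even when the candidate sets are not.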

Results
This section presents experiments designed to evaluate the robustness of the presented method. The effectiveness of the method was validated on 25 experimental and simulated cryo-EM maps in terms of precision, sensitivity, F-measure, and accuracy. The validity of the proposed approach was assessed by comparing the SSEs correspondence computed by the method presented in this study with the native correspondence (true SSEs correspondence). The native correspondence is obtained by manually labeling the SSEs in the density map based on the known atomic structure (for simulated data) or a structural homolog (for experimental data). The accuracy, precision, sensitivity, and F-measure are calculated from this comparison.
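The four evaluation measures can be sketched with their standard definitions in terms of true/false positives and negatives. The formulas used in the study are not reproduced in this text, and the way assignments are counted as TP, FP, TN, or FN is not specified, so both the formulas and the function name `correspondence_metrics` are assumptions.

```python
def correspondence_metrics(tp, fp, tn, fn):
    """Standard accuracy, precision, sensitivity, and F-measure from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + sensitivity:
        f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }
```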

Experimental and Simulated Cryo-EM Density Maps
The efficiency and accuracy of the automatic method were tested using 25 α-β proteins. The data set of interest consists of 10 experimental and 15 simulated cryo-EM maps. The experimental cryo-EM maps, which are reported in Table 1, were obtained from the Electron Microscopy Data Bank (EMDB) [53], with resolutions ranging from 3.7 to 8.9 Å.
Table 1 footnotes: a the EMDB ID of the protein used in the test; b the PDB ID of the protein used in the test. β-containing proteins are marked with *; c the protein chain; d the number of amino acid residues in the sequence; e the total number of secondary structure elements (α-helices and β-strands) in the atomic structure; f the total number of secondary structure elements (α-helices and β-strands) extracted from the cryo-EM map; g the resolution of the experimental map in angstrom (Å).
The simulated maps, which are represented in Table 2, are synthesized at 10 A resolution using the Chimera package [29], and the structure of the proteins were downloaded from the Protein Data Bank (PDB) (https://www.rcsb.org/, accessed on 30 September 2021) [54].
In the dataset of interest, the lengths of the proteins range from 117 (PDB ID: 3FIN) to 1703 (PDB ID: 6UXW) amino acid residues. The largest test case (PDB ID: 5KBU) in this dataset includes 65 SSEs-A and 54 SSEs-C. Therefore, the selected data set is appropriate to evaluate the robustness and effectiveness of the method in handling large samples. a the name of the protein; b the PDB ID of the protein used in the test. β-containing proteins are marked with *; c the Uniport ID of the protein; d the protein chain; e the number of amino acid residues in the sequence; f the total number of secondary structure elements (α-helices and β-strands) in the atomic structure; g the total number of secondary structure elements extracted from the cryo-EM map.

Performance Comparison of Two Scoring Functions
As described in the earlier section, three pairs of peer graphs from SSEs-A and SSEs-C (i.e., $\langle G^{\text{Angle}}_{\text{SSEs-A}}, G^{\text{Angle}}_{\text{SSEs-C}} \rangle$, $\langle G^{\text{ED}}_{\text{SSEs-A}}, G^{\text{ED}}_{\text{SSEs-C}} \rangle$, and $\langle G^{\text{RL}}_{\text{SSEs-A}}, G^{\text{RL}}_{\text{SSEs-C}} \rangle$) have been constructed based on the three mathematical-based features. To measure the similarity of the nodes in each pair of peer graphs, two statistical scoring functions, BD and MAC, have been utilized. To assess the quality of the algorithm, we evaluated the three proposed mathematical-based features using the BD and MAC scoring functions. The accuracy of the resulting SSEs correspondence sets (angle-, ED-, and RL-based correspondence sets) is calculated based on Equation (6) and reported in Table 3.
As can be seen in Table 3, the average accuracy of the angle-, ED-, and RL-based correspondence sets with the BD scoring function is 53.20%, 69.39%, and 50.63%, respectively. For the MAC scoring function, these values are 57.59%, 70.58%, and 53.76%, respectively. The results indicate that the MAC metric is more reliable than BD in measuring the similarity of the nodes of the graphs.
To extract the final SSEs correspondence set from the three resulting correspondence sets, the SimVA algorithm has been designed and implemented. In the following, the effectiveness of the developed algorithm is assessed on the experimental and simulated cryo-EM maps.
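To make the idea of a feature-based peer graph concrete, the sketch below builds a fully connected weighted graph whose edge weights are pairwise angles between SSE vectors. It is an illustration under stated assumptions rather than the authors' construction: each SSE is assumed to be given as a non-zero 3D direction vector, the angle feature is assumed to be the ordinary angle between two such vectors, and the names `angle_between` and `angle_graph` as well as the adjacency-matrix representation are hypothetical.

```python
import math


def angle_between(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def angle_graph(vectors):
    """Fully connected weighted graph as a symmetric matrix of angles.

    Node i stands for the i-th SSE vector; entry [i][j] is the weight of
    the edge between nodes i and j. The diagonal is left at 0.0.
    """
    n = len(vectors)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            graph[i][j] = graph[j][i] = angle_between(vectors[i], vectors[j])
    return graph


# Hypothetical SSE direction vectors, for illustration only.
sse_vectors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
for row in angle_graph(sse_vectors):
    print([round(w, 2) for w in row])
```

Building one such matrix for SSEs-A and another for SSEs-C gives a pair of graphs whose nodes can then be compared by a scoring function.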
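As a rough intuition for combining three feature-based correspondence sets, the sketch below applies a plain majority vote. This is a deliberately simplified stand-in and not the SimVA voting routines, which are defined in the methods section. The function name `majority_vote`, the `min_votes` parameter, and the greedy one-to-one selection with ties broken by pair order are all assumptions of this example.

```python
from collections import Counter


def majority_vote(correspondence_sets, min_votes=2):
    """Keep (a, c) pairs proposed by at least `min_votes` of the sets.

    Pairs are processed from most to least voted, and a pair is accepted
    only if neither its SSEs-A index nor its SSEs-C index is already used,
    so the result is one-to-one. Returns a dict {a_index: c_index}.
    """
    votes = Counter(
        pair for pairs in correspondence_sets for pair in set(pairs)
    )
    accepted = {}
    used_c = set()
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    for (a, c), count in ranked:
        if count < min_votes:
            break  # all remaining pairs have fewer votes
        if a not in accepted and c not in used_c:
            accepted[a] = c
            used_c.add(c)
    return accepted


# Hypothetical angle-, ED-, and RL-based correspondence sets.
angle_set = [(1, 4), (2, 3), (3, 5)]
ed_set = [(1, 4), (2, 3), (3, 6)]
rl_set = [(1, 2), (2, 3), (3, 5)]
print(majority_vote([angle_set, ed_set, rl_set]))
```

Pairs that fail to reach the vote threshold are left unassigned here; in the method described in this paper, such leftover cases are what the PLC step is meant to handle.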

Impact of the SimVA Algorithm on the SSEs Correspondence Result
To improve the efficiency of the matching process, the SimVA algorithm has been proposed. The SimVA algorithm extracts the final SSEs correspondence based on a feature integration strategy. Here, the accuracy of the SimVA algorithm using the two scoring functions, BD and MAC, is analyzed. Table 4 compares the performance of the method before and after incorporating the SimVA algorithm. A comparison of the results reported in Table 4 shows that for 24 out of 25 test cases, the accuracy is improved by incorporating the SimVA algorithm. The total average accuracy obtained from the three mathematical-based features using BD and MAC is 57.74% and 61.51%, respectively. After incorporating the SimVA algorithm in the final step, the total average accuracy using BD and MAC is 76.17% and 76.09%, respectively. Thus, incorporating the SimVA algorithm led to improvements of 18.43 and 14.58 percentage points in the accuracy of the method.
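The improvement figures quoted above are differences between average accuracies, that is, gains in percentage points rather than relative gains. The short check below reproduces them from the averages stated in the text; the helper name `gain_in_points` is hypothetical.

```python
def gain_in_points(before, after):
    """Absolute improvement in percentage points, rounded to two decimals."""
    return round(after - before, 2)


# Average accuracies (in %) as stated in the text.
bd_gain = gain_in_points(57.74, 76.17)   # 18.43
mac_gain = gain_in_points(61.51, 76.09)  # 14.58
print(bd_gain, mac_gain)
```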

Assessment of the Method
To analyze the robustness of the method, four performance measurements (precision (P), sensitivity (S), F-measure (F), and accuracy (A)) were used. Figure 5 shows the performance of the method in terms of these measurements on the data set of interest. As can be observed in Figure 5, for most of the proteins in the data set, the accuracy obtained with SimVA_MAC is more than 70%. The results show that the method is robust and works well even in the presence of errors and uncertainties in the SSEs extracted from the cryo-EM images. This is a valuable outcome of this study.

Comparison of Method with DP-TOSS
In this section, the accuracy of the SimVA algorithm using the two scoring functions, BD and MAC, is compared with that of DP-TOSS [20]. Many approaches have recently been developed to solve the SSEs mapping problem for medium-resolution cryo-EM maps, as discussed in the introduction. Here, the proposed method is compared with the latest version of DP-TOSS. As can be seen in Table 5, the average accuracy on the data set of interest for DP-TOSS, SimVA_BD, and SimVA_MAC is 61.35%, 76.17%, and 76.09%, respectively.
Based on these results, it can be concluded that SimVA is more accurate than DP-TOSS. More specifically, the accuracy improvements of the proposed method over DP-TOSS using BD and MAC are 14.82 and 14.74 percentage points, respectively. Furthermore, SimVA is able to work on a large protein with a total of 65 SSEs (PDB ID 5KBU). This is one of the valuable achievements of this study, since it addresses the problem of handling large complex proteins with many secondary structure elements. Working on large complex proteins has been a challenging issue in recent studies [18][19][20][54]. As reported in the state-of-the-art studies, the largest protein in their data sets includes 33 SSEs-A and 20 SSEs-C. In the current study, we were able to run the automatic method on two large experimental cryo-EM maps, 6UXW (PDB ID) and 5KBU (PDB ID), which consist of 1034 and 1703 amino acids, respectively.

Runtime of the Method
The proposed automatic matching algorithm consists of four main steps. The first step is to extract the SSEs from the two sources of information (i.e., PDB and map), the second step is to construct the 3D vectors from the extracted SSEs, the third step is to transform the 3D vectors into the 3D graphs, and the last step is to apply the similarity-based voting algorithm to obtain the final SSEs correspondence. Here, the runtime of the method has been computed for the last three steps. The total runtime was measured on a MacBook Pro with a 2.2 GHz 6-core Intel Core i7 processor and 16 GB of memory. The running time of the method on the benchmark data set is illustrated in Figure 6.
As can be observed in Figure 6, the runtime of the algorithm increases as the number of SSEs-A increases. For example, the shortest running time (0.46 s) was obtained for the protein 1BZ4 (PDB ID) with 5 SSEs-A, and the longest running time (10.58 s) for the protein 5KBU (PDB ID) with 65 SSEs-A.

Discussion and Conclusions
Cryo-EM has played an increasing role in the structure determination of molecular complexes in recent years. Despite many advances in cryo-EM technologies, in some cases, the resolution of the generated maps ranges between 4 Å and 10 Å. Therefore, a medium-resolution cryo-EM map may not be adequate to directly determine the atomic structure of the protein. At medium resolution, the secondary structure elements have been extracted and visualized by various methods. In this study, an automatic assignment method has been developed to find the mapping of the secondary structures of the modeled structure to the cryo-EM map. Knowing this assignment allows us to form an initial hypothesis on the structure of the protein backbone. The key idea of the 3D matching strategy proposed in this study is to represent the SSEs extracted from the density map and from the modeled structure in a common way, and then build up the correspondence between these two representations. Our common representation is a 3D weighted, fully connected graph, with nodes representing the SSEs and edges representing the connectivity between the SSEs. The key contributions of the geometrical matching method can be summarized as follows: (i) modeling the SSEs as geometrical vectors in 3D space, (ii) transforming the 3D vectors into 3D graphs based on the proposed mathematical-based features, (iii) introducing two robust statistical scoring functions, BD and MAC, to measure the similarity of the nodes of the graphs, and (iv) developing the similarity-based voting algorithm combined with the PLC concept to find the true correspondence. It is important to mention that the SSEs correspondence may not be a bijection. Due to the noise and uncertainty in a typical map, SSEs detection algorithms may fail to find the location of all the SSEs within the map and may also identify false SSEs.
We demonstrated the performance of the method on the simulated as well as experimental data sets in the presence of errors. Comparative studies have also been conducted to demonstrate the superiority of the 3D matching method over some of the existing state-of-the-art techniques. The results show that the automatic method is highly efficient (76.09% overall accuracy) and works well for large cryo-EM maps. Moreover, the key strength of the matching method is that it does not require any prior segmentation of the density map and does not need skeleton data to obtain the SSEs correspondence. Besides, the automatic method is able to work on the large cryo-EM data (PDB ID 5KBU) containing 65 SSEs-A and 54 SSEs-C with 81.62% accuracy in less than 11 s.

Code Availability
The source code and data of the method are publicly available at https://github.com/Bahareh-Behkamal/Match_SSEs_CryoEM, accessed on 20 November 2021. Moreover, the instructions for using the method can be found in the shared readme file.

Conflicts of Interest:
The authors declare no conflict of interest.