Finding High-Quality Metal Ion-Centric Regions Across the Worldwide Protein Data Bank

As the number of macromolecular structures in the worldwide Protein Data Bank (wwPDB) continues to grow rapidly, more attention is being paid to the quality of its data, especially for use in aggregated structural and dynamics analyses. In this study, we systematically analyzed 3.5 Å regions around all metal ions across all PDB entries with supporting electron density maps available from the PDB in Europe. All resulting metal ion-centric regions were evaluated with respect to four quality-control criteria involving electron density resolution, atom occupancy, symmetry atom exclusion, and regional electron density discrepancy. The resulting list of metal binding sites passing all four criteria possesses high regional structural quality and should be beneficial to a wide variety of downstream analyses. This study demonstrates an approach for the pan-PDB evaluation of metal binding site structural quality with respect to underlying X-ray crystallographic experimental data represented in the available electron density maps of proteins. For non-crystallographers in particular, we hope to change the focus and discussion of structural quality from a global evaluation to a regional evaluation, since all structural entries in the wwPDB appear to have regions of both high and low structural quality.


Introduction
Metal ions are important components in biological processes, especially at the biochemical and cellular levels. An estimated 30% to 40% of proteins across the combined proteome of the biosphere bind at least one metal ion [1,2]. Protein metal binding is part of many biochemical mechanisms including signal transduction, enzyme catalysis, and protein structural integrity [3][4][5]. The local protein structure environment around bound metal ions can provide clues to the biochemical and cellular function of the binding [6][7][8] and reveal how sequence-based structural changes modulate metal binding [9,10]. However, the quality of 3D protein structural data around metal binding sites can vary dramatically from structure to structure, and especially from region to region [8,11]. Therefore, when analyzing metal binding site structure and dynamics, the quality of the utilized worldwide Protein Data Bank (wwPDB) [12] entries should be evaluated, especially in the metal binding site region [2,8,13]. Moreover, the presence of non-biologically bound metal ions in wwPDB entries due to crystallization conditions and artifacts makes this evaluation imperative. The potential impact of crystallization artifacts on computed ligand binding affinities has already been demonstrated [14].
Also, current methods and tools for this regional evaluation around metal ions, focused only on the PDB structural entry itself, have proven useful for weeding out some metal binding sites with poor regional structural quality [13]. The best approach for identifying and evaluating a modeled metal ion is during model building and refinement using occupancy, B factor, anomalous scattering, and the chemical environment [15]. However, after a structure is deposited into the wwPDB, comparison to the raw electron density itself represents the best standard of evaluation against experimental data that can demonstrate the reliability and usability of a given metal binding site region [8,16], given the data most often available in the wwPDB. These comparisons of the metal binding site structure to the underlying electron density data have been facilitated by structure factor deposition requirements of the wwPDB since 2008, by the electron density maps made available previously by the Uppsala Electron Density Server [17], and now by the PDB in Europe (PDBe) [18]. Still, this electron density evaluation of regional structural quality has been a tedious process done by manual visual inspection, without objective metrics of quality. To alleviate these shortcomings in electron density evaluation, we have developed new analysis and evaluation methods in a Python package called pdb-eda [19], which facilitate the systematic quality control of protein structural regions of interest across large numbers of wwPDB entries and their corresponding electron density maps. In this study, we apply pdb-eda to a systematic electron density analysis of all metal binding sites containing a bound metal ion. This analysis provides an evaluation of the quality of metal binding sites in the wwPDB based on the metrics of regional structural quality with respect to the underlying electron density data used to derive the metal binding site structure. 
Our goal is to provide an approach for evaluating metal binding sites against experimental electron density data that could improve the outcomes for a wide variety of downstream structural, dynamic, and functional analyses. Potential downstream analyses that could benefit include, but are not limited to, molecular dynamics simulations [20], molecular mechanics and quantum mechanical calculations [21,22], and molecular docking [23,24]; however, any potential downstream analysis of metalloprotein structure that treats the PDB entry as ground truth would benefit from this type of experimental evaluation of a region of interest. Moreover, we demonstrate our evaluation approach with the generation of a current set of metal binding sites that are of high regional quality, also enabling others to screen this set with more stringent criteria specific to their data analysis needs or even to regenerate the set with a future version of the wwPDB.

Methods
Structural data from wwPDB listed on 3 July 2018 were used for the analysis. Their electron density data, if available, were acquired from the PDBe website. We used version 1.0 of the pdb-eda Python package [19] to analyze all downloaded PDB entries and matching electron density maps. Metal ions were detected across these PDB entries and filtered against four major quality control criteria: (1) Electron density resolution less than or equal to 2.5 Å; (2) Atom occupancy greater than or equal to 0.9; (3) No symmetry atoms within 3.5 Å; (4) The sum of discrepant electrons within a 3.5 Å region surrounding the metal ion point position is less than the data-derived cutoff.
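The four criteria above can be read as a single pass/fail test per metal ion. The following is a minimal sketch of that test, not the pdb-eda implementation: the `MetalSite` record, its field names, and `passes_all_criteria` are hypothetical and introduced only for illustration, and the discrepancy cutoff is passed in because it is derived from the data.

```python
from dataclasses import dataclass

# Thresholds taken from the four criteria listed in the text.
MAX_RESOLUTION = 2.5    # Å, criterion 1
MIN_OCCUPANCY = 0.9     # criterion 2
SYMMETRY_RADIUS = 3.5   # Å, criterion 3


@dataclass
class MetalSite:
    """Hypothetical per-site record; the field names are illustrative only."""
    resolution: float             # resolution of the parent entry, Å
    occupancy: float              # occupancy of the metal ion
    nearest_symmetry_atom: float  # distance to the closest symmetry atom, Å
    discrepancy_sum: float        # summed |discrepant electrons| within 3.5 Å


def passes_all_criteria(site: MetalSite, discrepancy_cutoff: float) -> bool:
    """Return True only if the site satisfies all four criteria."""
    return (
        site.resolution <= MAX_RESOLUTION
        and site.occupancy >= MIN_OCCUPANCY
        # "No symmetry atoms within 3.5 Å": the closest one must lie beyond it.
        and site.nearest_symmetry_atom > SYMMETRY_RADIUS
        and site.discrepancy_sum < discrepancy_cutoff
    )
```

A site with no symmetry atoms nearby can be represented with `nearest_symmetry_atom=float("inf")`, which passes criterion 3.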
The resolution and occupancy information were retrieved directly from the PDB structure entry file in PDB Molecular Format (ent). We considered a resolution less than or equal to 2.5 Å and an occupancy greater than or equal to 0.9 as meeting an acceptable level of quality for most downstream structural and dynamic studies, since water and small ligands are typically visible below this resolution. Symmetry-related atoms were calculated from the REMARK SMTRY records in the PDB structure data, taking nearby asymmetric units into account. Atom-atom distances between a metal ion and all symmetry-related atoms were computed, and a metal ion was filtered out if any symmetry atom point position was present within 3.5 Å of the metal ion point position. Electron density maps were analyzed using our Python package, pdb-eda. This package provides methods for converting the electron density discrepancies in the experimentally observed minus calculated (Fo-Fc) difference density maps into numbers of discrepant electrons when a significant protein component exists in the entry. The sum of the absolute values of both positive and negative electron discrepancies was computed for all significant discrepancies within 3.5 Å of the metal. Significant discrepancies were identified with a cutoff of 3 standard deviations, based on each individual electron density map, which is the commonly accepted cutoff for visualizing significant electron density map discrepancies. After filtering by criteria 1 and 2, we derived the distribution of all metal ion electron discrepancy sums and filtered out outliers based on a cutoff of 2 standard deviations. With the resulting filtered distribution, we set the maximum electron discrepancy sum cutoff to 1 standard deviation of this distribution. The electron density overlay graphs were prepared using the LiteMol Viewer [25].
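The derivation of the data-driven cutoff described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in the study: the function name and parameters are hypothetical, outliers are taken to be sums lying more than 2 standard deviations from the mean in a single pass, and "1 standard deviation of this distribution" is read as the standard deviation itself (measured from zero, not added to the mean).

```python
from statistics import fmean, pstdev


def data_derived_cutoff(sums, outlier_sd=2.0, cutoff_sd=1.0):
    """Derive a maximum discrepancy-sum cutoff from the observed sums.

    Assumptions (not confirmed by the text): outliers are values more than
    `outlier_sd` population standard deviations from the mean, removed in a
    single pass; the cutoff is `cutoff_sd` times the standard deviation of
    the remaining values.
    """
    values = [float(s) for s in sums]
    mean, sd = fmean(values), pstdev(values)
    # Keep only the values inside the outlier window.
    kept = [v for v in values if abs(v - mean) <= outlier_sd * sd]
    return cutoff_sd * pstdev(kept)
```

Other readings, such as an iterative outlier removal or a cutoff of mean plus one standard deviation, would need only small changes to this function.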
All results and code used to generate the results for this study are available on FigShare: https://doi.org/10.6084/m9.figshare.8044451. All code was run on a 20-core Intel(R) Xeon(R) E5-2670v2 CPU with hyperthreading and 256 GB RAM, utilizing all hyperthreaded cores. It took roughly 2 days to analyze the first three criteria, while it took roughly 14 days to analyze the electron density criterion for the entire PDB.

Results
We started with a list of 141,616 usable PDB entries, and 53,146 of them contained at least one metal ion, including both biological and non-specific metal ions. This proportion was about 38%, which agrees with other studies and predictions [1,2]. In this study, we considered four major criteria in filtering "high-quality" metal ions, usable for downstream structural and dynamic analyses: resolution, occupancy, presence of symmetry atoms, and significant discrepancies in terms of numbers of electrons. Figure 1 shows both high- and low-quality examples based on these four criteria, as illustrated in an overlay of the electron density map over the structural model. Table 1 shows a tabulation of 56 different elemental types of metal ions observed in the wwPDB, with respect to four quality-control criteria. Zinc is present in the highest number of PDB entries, while magnesium has the highest number of metal sites. This is probably due to the presence of large numbers of magnesium ions in certain PDB entries, such as those of the ribosome [26]. Overall, nine metals had over 1000 examples across the PDB that passed the four criteria. An additional eight metals had over 100 examples that passed all four criteria. For the rest of this study, we look at each of the four criteria in more detail, with respect to five of the most important and common metal ions in biochemistry: zinc, calcium, iron, manganese, and copper.

X-Ray Crystallographic Resolution
The electron density resolution represents an overall metric of structural quality for an X-ray crystallographic structure. Structural entries with a resolution below 2.5 Å are generally considered usable for many structural and dynamics analyses. Figure 2 illustrates the distribution of resolution for the top five most essential metals in biology. In general, the overall and individual metal ions have similar distributions. The distribution for manganese has several spikes, which is mainly due to the over-representation of replicate values from structural entries with large numbers of manganese ions. This filter removes about 34% of all metal ions.

Occupancy and Symmetry-Related Atoms
The majority of metal ions have an occupancy of 1. However, there are two general cases where low occupancy occurs. In the first general case, when there is more than one conformation available during the structure determination, multiple conformations (typically two) are often kept in the data and are often marked as "ALT". Thus, different conformations will only possess the metal ions with partial occupancy. Typically for two conformations, the occupancy will be 0.5 for each conformation. For the second general case, only a fraction of the repeating unit cells in the protein crystal has the observed metal ion, and the occupancy will represent the percentage of the crystal structure with a bound metal ion. In either case, low occupancy sites can be considered low quality for aggregated analyses, since only a fraction of the experimental data supports the given model of the metal ion position. A filter of 0.9 occupancy removes about 10% of all metal ions and is consistent for most individual metal ions. Therefore, this criterion only removes a minority of metal ion sites.
Crystal contacts can act as an artifact that affects the binding of a metal ion, especially on the surface of a protein structure. Such a site may represent false binding that does not exist biologically, i.e., once the crystal packing environment is no longer available. Also, crystal contacts can affect protein-ligand binding [14]. Our study demonstrates that only about 7% of metal ion sites are filtered out by a 3.5 Å symmetry atom exclusion criterion.
Figure 1 demonstrates why electron density maps can be extremely useful for validating high-quality regions within protein structures. As described in the methods, we used our pdb-eda Python package to compare every metal binding site to its Fo-Fc electron density map. Figure 3 shows the distributions of the sum of absolute electron disagreement within a 3.5 Å radius of each metal ion. Overall and individual metal ion distributions are very similar, justifying the calculation of a single data-driven cutoff from the overall distribution. The final data-driven cutoff for the sum of absolute electron discrepancy is approximately 19.3 electrons, based on one standard deviation of the Figure 3F distribution with outliers removed. This is a purposely conservative quality control criterion, representing roughly two water molecules' worth of electron discrepancy. However, only 24.8% of metal ion sites with usable electron density data (221,494) are filtered out by this criterion. In comparison, we also calculated a background difference density based on the average absolute significant electron discrepancy per Å³ over the whole density map, which was then multiplied by the volume of a 3.5 Å radius sphere. The resulting background discrepancy distribution across all metalloprotein structures is shown in Figure S1. The average number of electrons for this distribution is 1.9e, whereas the average for the metal ion sites distribution (Figure 3F) is 18.4e.
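The background value described above is a per-volume average scaled to the same spherical region used around each metal ion. A minimal sketch of that arithmetic follows; the function and parameter names are hypothetical, and the per-Å³ average is assumed to have been computed from the map beforehand.

```python
import math


def background_discrepancy(mean_abs_discrepancy_per_A3: float,
                           radius: float = 3.5) -> float:
    """Scale a whole-map average (electrons per Å³) to a sphere of `radius` Å."""
    sphere_volume = 4.0 / 3.0 * math.pi * radius ** 3  # about 179.6 Å³ for 3.5 Å
    return mean_abs_discrepancy_per_A3 * sphere_volume
```

Because a 3.5 Å sphere holds roughly 179.6 Å³, a whole-map average of 0.01 electrons per Å³ corresponds to a background of about 1.8 electrons in this sketch.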
Therefore, the regions around metal binding sites typically have a higher number of structural discrepancies. These discrepancies are due to experimental distortions around metal ions [27]. One possible explanation is heterogeneity in the metal ion oxidation state at a particular binding site across the crystal. Furthermore, these distortions are more pronounced around metal ion clusters, such as iron-sulfur clusters [27]. Thus, we defined metal ion clusters as any metal ion that has another metal ion within a distance of 3 Å [28,29], and then performed a similar analysis. The distribution of the electron discrepancy for metal ion clusters is shown in Figure S2. It demonstrates a very similar distribution to all categories in Figure 3, but with a higher average discrepancy of 29.6e. With the 19.3e maximum discrepancy criterion, 36.9% of cluster metal sites are filtered out. The higher level of electron discrepancy around metal ions and especially metal ion clusters emphasizes the importance of this study in finding high-quality metal-centric regions for potential downstream studies.
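The cluster definition used above (a metal ion with another metal ion within 3 Å) amounts to a pairwise distance check. The sketch below illustrates it with hypothetical names and plain coordinate tuples as input; it is not taken from pdb-eda and ignores symmetry-related copies.

```python
from math import dist


def cluster_metal_indices(coords, max_distance=3.0):
    """Return indices of metal ions with another metal ion within `max_distance` Å.

    `coords` is a sequence of (x, y, z) positions, one per metal ion in an entry.
    """
    clustered = []
    for i, a in enumerate(coords):
        # Compare against every other metal ion, skipping the ion itself.
        if any(dist(a, b) <= max_distance
               for j, b in enumerate(coords) if j != i):
            clustered.append(i)
    return clustered
```

The all-pairs comparison is quadratic in the number of metal ions, so for entries with very many ions a spatial index would likely be preferable.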

Discussion
As illustrated by previous studies, regional structural quality affects the usability of bound ligand structure, including bound metal ions, for accurately interpreting structural, dynamic, and chemical properties of ligand binding sites [14,30,31]. Moreover, from the comparison of electron discrepancy distributions in Figure 3F and Figure S1, metal binding regions clearly have a higher amount of electron discrepancy than the structural background. Therefore, steps should be taken to ensure the quality of metal binding sites for various downstream structural and dynamics analyses. As the number of structures available in PDB continues to grow dramatically every year, more attention is being paid to ensuring that only high-quality datasets are used in these studies. Toward this goal, we have developed new methods in the open source pdb-eda Python package that enable the evaluation of regional structural quality with respect to the underlying experimental data.
In this study, we demonstrate the use of electron density maps for a systematic evaluation of the regional structural quality around all metal binding sites in the PDB with matching electron density maps provided by the PDBe. This is one of four criteria used for evaluating the structural quality of metal binding sites for the purpose of generating large high-quality datasets for downstream analyses. The maximum resolution criterion ensures a baseline quality of the overall structure. The combination of a minimum 0.9 occupancy criterion with a 3.5 Å symmetry atom exclusion criterion should remove most bound crystallographic artifact metal ions present in the structures, as they tend to be either inconsistently present across the asymmetric units and/or nonspecifically bound to the surface of the protein and near symmetry atoms. However, there could still be instances where a metal ion from the crystallization buffer is bound to the protein in a non-biological manner with high affinity that is comparable to bound metal ions that are biologically relevant. Distinguishing such cases requires much closer examination, often involving the use of biochemical assays, and is beyond the scope of this study.
As demonstrated in Figure 3, metal ions and especially metal ion clusters (Figure S2) have higher levels of electron discrepancy due to the experimental distortions around metal ions. Therefore, the maximum electron density discrepancy criterion was derived from the metal ion electron discrepancy distribution itself (i.e., one standard deviation representing 19.3e). For commonly bound metal ions, additional criteria can be used for quality control [13], including the evaluation of bond lengths between ligating atoms (using the coordination chemistry definition of ligand) and the metal ion and coordination shell considerations [8,29,32]. However, several of these evaluations require the accurate identification of ligating atoms and are complicated by the wide variation in coordination geometries. Moreover, these additional criteria cannot be practically applied to all 56 elemental types of metals analyzed in this study, given the current examples available in the PDB. Therefore, we limited our method to four criteria that could be straightforwardly applied to all elemental types of metal ions currently present in the PDB.
A full list of metal binding sites that pass all four criteria utilized in this study can be found in Table S1, along with the values used for criterion evaluation. In addition, all code used to generate this table is available in a FigShare repository along with all metal binding sites evaluated in this study. Therefore, metal binding sites can be re-evaluated against modified criteria that match the quality-control requirements of a given downstream analysis. Also, with this code, future versions of the PDB can be analyzed in a similar manner to regenerate an updated list of metal binding sites.
In conclusion, we have demonstrated an approach for the pan-PDB evaluation of metal binding site structural quality with respect to underlying X-ray crystallographic experimental data represented in available electron density maps of proteins. Especially for non-crystallographers, we hope to change the focus and discussion of structural quality from a global evaluation to a regional evaluation, since all structural entries in the wwPDB appear to have regions of both high and low structural quality.
Supplementary Materials: The following are available online. Table S1: A list of high-quality metal ions that pass all four criteria; Figure S1: Distribution of the average absolute electron discrepancy sum within a 3.5 Å radius sphere volume; Figure S2: Distribution of electron discrepancy within 3.5 Å of the metal ion for metal ions within 3 Å of another metal ion.