1. Introduction
Many proteins require a cofactor to function correctly, and present a region of their surface which has an affinity for that cofactor. Of the metallic cofactors, zinc is one of the most common. Approximately 10% of proteins require zinc to function [1] and so have at least one zinc binding site, making it the second-most prevalent metal in biological systems, after iron. In proteins, it typically performs either a rôle in catalysis (despite, or more likely because of, its lack of variable redox states), or in stabilising a region of the protein [2].
Many proteins are known to bind zinc because their full three-dimensional structure has been solved in the presence of zinc, revealing a zinc binding site. It would nevertheless be useful to be able to determine whether a protein binds zinc without needing to solve such a structure. There are experimental means of doing so, but computational approaches offer a more convenient means of performing initial searches at greater scale and speed. These would take either the protein’s sequence or a structure of some kind (a hypothetical model, an experimental structure generated in the absence of zinc, or an experimental structure solved at low resolution where a zinc cannot be identified, perhaps because of the presence of heavy metals used for isomorphous replacement) and try to predict whether the protein binds zinc and, if so, where.
There have been numerous studies in this area in the past. Early attempts at predicting zinc binding from sequence were largely done manually, such as by identifying the ‘C…C…H…H’ (cys-cys-his-his) motif as being a characteristic indicator of zinc binding [3,4], or by identifying approximate spacing patterns typical of catalytic binding sites, the so-called ‘short and long spacers’ [5]. As the number of available sequences grew and this manual approach became infeasible, sequence alignment with known zinc binding proteins became a useful tool for discovering new zinc binding sites [6,7]. Resources such as PROSITE [8] refine manual motif searching by providing motifs for zinc binding in a number of homologous families. At the time of writing, there are 70 motifs for zinc fingers, one for zinc-containing alcohol dehydrogenases, two for the copper/zinc superoxide dismutase signature, two for zinc carboxypeptidases and one for the zinc import ATP-binding protein znuC family.
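As a rough illustration of motif-based searching, the following Python sketch scans a sequence for a ‘C…C…H…H’-style pattern using a regular expression. The spacings used here (2-4 residues between the cysteines, a 12-residue spacer, 3-5 residues between the histidines) are illustrative assumptions loosely resembling a classical zinc finger; they are not a PROSITE entry, and the names `C2H2_LIKE` and `find_motifs` are hypothetical.

```python
import re

# Assumed, simplified C2H2-like pattern (not taken from PROSITE):
# Cys, 2-4 residues, Cys, 12 residues, His, 3-5 residues, His.
C2H2_LIKE = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_motifs(sequence):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 0-based and `end` is exclusive, as in Python slicing.
    """
    return [
        (match.start(), match.end(), match.group())
        for match in C2H2_LIKE.finditer(sequence.upper())
    ]


# A made-up sequence fragment used only to exercise the pattern.
example = "MKPYKCPECGKSFSQKSNLQKHQRTHTGEK"
hits = find_motifs(example)
```

A match of this kind only flags a candidate; it says nothing about whether the four residues are actually positioned to coordinate a zinc atom.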
By the early 2000s, machine learning had become the typical approach for identifying possible metal binding sites. In this approach, algorithms are trained on a dataset of known zinc binding sites in order to identify for themselves what the characteristic properties of zinc binding are, rather than having a human manually identify what those properties might be. Typical algorithms used in the past include Support Vector Machines (SVMs) [9,10,11] and Random Forests [12,13]. In recent years, deep learning, which relies on multi-layer neural networks to represent the inputs at multiple layers of abstraction, has been used more widely [14,15].
Predicting zinc binding from structure has proceeded in a similar fashion, although the nature of structural data means that it has taken longer for there to be enough data to justify the use of machine learning techniques. Early efforts relied on human-observed characteristics of zinc binding sites, such as the ‘hydrophobicity contrast function’, which used the fact that metal binding sites tend to be composed of an inner shell of hydrophilic atoms such as nitrogen and sulphur, which is, in turn, surrounded by a stabilising shell of hydrophobic atoms [16,17]. As the number of available structures grew, geometric patterns were also observed, both by humans and by machine learning algorithms [17,18,19,20]. As with the sequence prediction models, the complexity of the algorithms, and of the zinc binding site features, has grown with the increase in available training data.
One recurring feature, particularly in the sequence-based predictive models, is the focus on zinc binding residues rather than zinc binding sites. In most cases, the entity examined by the predictive model is the individual residue, often with a surrounding linear sequence ‘window’ of residues. The model then assigns a probability as to whether that residue is a zinc binding residue. As outlined above, this approach has had a measure of success, but it is a somewhat artificial concept. There is, after all, no such thing as a zinc-binding residue in isolation. The individual residues of a high-affinity zinc binding site of the kind considered here are only zinc-binding when the other residues are present, and conversely many non-zinc-binding residues could bind zinc if other residues were present in the correct locations. It is particular combinations of residues, not individual residues, which are zinc binding—an important fact not usually considered in research of this kind.
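The per-residue framing described above can be sketched as follows: each candidate residue is extracted together with a fixed-width linear window, and each window would then be scored independently by a classifier. This is a minimal sketch of the general idea rather than the method of any cited study; the window half-width, the padding character, the function name `residue_windows` and the choice of Cys, His, Asp and Glu as candidates are all assumptions made for illustration.

```python
def residue_windows(sequence, half_width=3, candidates="CHDE", pad="-"):
    """Return (index, residue, window) for every candidate residue.

    Each window is centred on the candidate and spans `half_width`
    residues on either side; positions beyond the sequence ends are
    filled with `pad` so that all windows have the same length.
    """
    padded = pad * half_width + sequence + pad * half_width
    width = 2 * half_width + 1
    windows = []
    for index, residue in enumerate(sequence):
        if residue in candidates:
            # In the padded string the residue sits at index + half_width,
            # so the slice starting at `index` is centred on it.
            windows.append((index, residue, padded[index:index + width]))
    return windows


# Each window would be passed to a classifier on its own, with no
# knowledge of which other residues might complete the binding site.
example_windows = residue_windows("ACDEFGH", half_width=2)
```

Because every window is scored in isolation, nothing in this representation records which residues would have to act together to form a site, which is the limitation noted above.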
Another commonality is the treatment of zinc binding sites as a single category, and the presumption of properties that are common to them all regardless of the residues of which they are composed. This may well be sufficient, particularly as there are essentially only four residues that make up the vast majority of zinc binding sites, but it is possible that properties used for prediction have much tighter distributions within particular sub-categories of zinc binding sites.
Previously, we created ZincBindDB [21], a database of zinc binding sites. This resource continuously collates all zinc atoms found in the Protein Data Bank [22], identifies their binding sites (where appropriate), and stores them in a centralised database along with useful properties such as their protein sequence and how different sites cluster together. Sites are classified into ‘families’, not based on homology, but based on the residue composition of the site: the C4 family contains binding sites with four cysteines, H3 those with three histidines, and so on. These data are available over the web via a web ‘application programming interface’ (API), and using a web interface which provides three-dimensional graphical representations of all the binding sites. As of July 2020, there were 35,811 zinc binding sites in ZincBind, originating from 16,635 PDB structures.
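The composition-based family naming can be illustrated with a short Python sketch that derives a label from the one-letter codes of a site's liganding residues. It reproduces the two examples given above (C4 and H3); the alphabetical ordering of residue codes, the explicit count for residues that occur once, and the function name `site_family` are assumptions made for illustration and are not taken from the ZincBindDB implementation.

```python
from collections import Counter


def site_family(liganding_residues):
    """Build a composition label such as 'C4' or 'H3'.

    `liganding_residues` is an iterable of one-letter amino acid codes,
    one per residue that coordinates the zinc atom. Codes are sorted
    alphabetically (an assumed convention) and each is followed by the
    number of times it occurs in the site.
    """
    counts = Counter(code.upper() for code in liganding_residues)
    return "".join(f"{code}{counts[code]}" for code in sorted(counts))


four_cysteines = site_family("CCCC")
three_histidines = site_family(["H", "H", "H"])
mixed_site = site_family("HCHC")
```

Under these assumptions, a site with two cysteines and two histidines is labelled the same way regardless of the order in which its residues appear in the sequence.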
We have now used this single, definitive dataset of zinc binding sites to train predictive models of zinc binding. Here, we present models which are trained to detect entire zinc binding sites, rather than just zinc binding residues, and each predictive model is trained to detect a particular family of zinc binding sites. There are distinct models for sequence and for structure, and predictions can be made via the ZincBind website, or via the ZincBindPredict GraphQL API.