In Silico Prediction of Protein Adsorption Energy on Titanium Dioxide and Gold Nanoparticles

The free energy of adsorption of proteins onto nanoparticles offers an insight into the biological activity of these particles in the body, but calculating these energies is challenging at the atomistic resolution. In addition, structural information of the proteins may not be readily available. In this work, we demonstrate how information about adsorption affinity of proteins onto nanoparticles can be obtained from first principles with minimum experimental input. We use a multiscale model of protein–nanoparticle interaction to evaluate adsorption energies for a set of 59 human blood serum proteins on gold and titanium dioxide (anatase) nanoparticles of various sizes. For each protein, we compare the results for 3D structures derived from experiments to those predicted computationally from amino acid sequences using the I-TASSER methodology and software. Based on these calculations and 2D and 3D protein descriptors, we develop statistical models for predicting the binding energy of proteins, enabling the rapid characterization of the affinity of nanoparticles to a wide range of proteins.


Introduction
As the growth of nanotechnology accelerates, it can be expected that the number of new nanomaterials with unknown properties increases day after day. In the 21st century, the interaction of humans with NPs is commonplace, with NPs used in many household chemicals like paints and lotions [1] or novel medicines [2]. Furthermore, humans are constantly exposed to NPs from unintentional sources, including exhaust fumes from internal combustion engines [3], construction sites [4] and waste processing [5]. The protection of the health of not only the affected workers but of the general population is of growing concern.
To reduce the health risks coming from such exposure, one should identify the resulting adverse outcomes [6], trace them back to the underlying bio-nano interactions [7], and then optimise the nanomaterial itself or the relevant process to avoid the properties of concern. Ideally, this process would be performed at the stage of material's design or production. One of the essential stages of bio-nano interaction is the formation of a biomolecular shell around a NP entering the body. This shell is known as the nanoparticle (NP) protein corona and is the biological signature of the NP that can be related to the NP physicochemical properties [8]. The contents of this corona determine which host cells interact with the NP, the type of interaction that occurs, and any adverse effects [9]. The composition of the corona will depend on many factors: the kind of the biomolecular medium which the NP entered on the way, the adsorption dynamics between different species of proteins, and many other factors.
The most informative quantity that describes the NP-protein interaction is the adsorption free energy between the NP and the protein. This quantity controls whether a protein will adhere to the NP Table 1. Characteristics of the two nanomaterials considered in this paper, detailing the parameters required for the atomistic simulations and the optical characteristics required for the calculation of Hamaker constants.

Proteins
A list of all human blood proteins with a concentration in the serum above 10 ng/mL was obtained from the literature [13]. This list was parsed through the Research Collaboratory for Structural Bioinformatics (RCSB) mapping tool [14] to convert the UniProtIds obtained from the list into the relevant entry in the Protein DataBank (PDB) [11]. Most UniProtIds matched to more than one entry in this database due to crystallographers studying variations of the protein with different mutations, other molecules present as a complex, or different sections of the protein. A manual review of each set of PDB structures associated with a given UniProt ID was performed to select the structure with the greatest coverage of the AA sequence and as few ligands or complexed molecules as possible to ensure that the protein structure was close to that which can be predicted directly from the sequence. The exact function of a specific protein was irrelevant for this study.
To generate structures directly from the AA sequence, Iterative Threading Assembly Refinement (I-TASSER) version 5.1 was used [12]. In brief, I-TASSER is a bioinformatics method that estimates a 3D structure model of protein molecules through AA sequences and is described in more detail elsewhere [15,16]. The input for I-TASSER is a FASTA sequence file, which for this work was obtained by downloading the FASTA sequence provided by the RCSB PDB site for each PDB file. As I-TASSER is currently unable to distinguish between different chains within a FASTA sequence input, structures were calculated only for proteins with a single chain and experimental structures with multiple chains removed from the study. I-TASSER often generates more than one structure file for each protein, as many possible sequence-matches can be found. Each model produced by I-TASSER is given a confidence score (c-score). This c-score ranges from −5 to +2, where the negative values indicate that I-TASSER believes it has predicted a poor-quality 3D structure, which is unlikely to match reality. Here, we simply select the model with the highest c-score.
Ideally, the PDB structure would contain precisely one set of co-ordinates for each residue present in the AA sequence. However, due to the limits of experimental resolution and the residual thermal motion of AA side chains, this is not always possible, and PDB files may provide multiple possible locations for a single residue or indeed omit a residue (or a sequence of residues) entirely. The former is indicated using the occupancy field in the entry for a residue, which lists the probability for a given residue to be in the location stated. If the residue cannot be localised at all, the entry is simply missing, although the residue itself still contributes to the overall structure of the PDB. Structures obtained using I-TASSER, on the other hand, do not suffer from these issues and all AAs present in the sequence will be assigned a location in the final predicted structure. Thus, it may appear that an I-TASSER structure for a given protein has a different number of residues present to the PDB structure for the same protein.
To take this into account, we also produce a "masked" version of each I-TASSER structure in which residues that cannot be located in the PDB structure are hidden to produce a computational structure with the same residues present as the PDB structure. This masking is performed using the sequence alignment tools provided in the BioPython library [17], which is further used to calculate the root mean square deviation (RMSD) between the PDB and (masked) I-TASSER structures. Finally, we note that the replacement of methionine by selenomethionine is not taken into account when processing PDB files and generating I-TASSER structures, as this is a random, infrequent event. The proteins considered in this study therefore consist of only the 20 canonical amino acids in their standard states. Nanomaterials 2020, 10,1967 4 of 21

Protein Descriptors
Previous studies demonstrated that statistics of adsorbed proteins and their weighted properties are predictive of a NP's biological activity [9]. Obviously, the type of proteins the NP predominantly adsorbs reflects its own properties responsible for bio-nano interactions. For the sequence-derived descriptors, we used the pepstat program of the European Molecular Biology Open Software Suite (EMBOSS, v6.6.0, EMBOSS group, international) [18]. Several additional 3D descriptors were obtained from the protein structures, ranging from geometrical properties such as the surface area, volume, and sphericity of the protein to chemical identifiers such as the numbers of different types of amino acids on the surface. In the latter case, we evaluate both counts of individual types of AAs and of groups such as charged or aromatic AAs. A full description of the predictors is provided in Appendix A.

Adsorption Free Energy Calculation
The UA method was used to calculate the interaction energy between each protein and a spherical NP. A full theoretical description of the methodology can be found in [10,19,20] and all Supplementary Materials for this article are available online. The multiscale modelling performs tiered summation of the pairwise interactions between amino acids of the protein and the nanomaterial, treating each AA as a single bead located at the α-carbon position. The position of the protein relative to the NP is determined by the distance h between the centre-of-mass of the protein and the surface of the NP, and a pair of angles φ, θ defining the orientation of the protein. From these and the structure of the protein, the distance of each AA bead from the centre of the NP can be calculated, with the distance for bead i denoted as r i (h, φ, θ) = r i . Each AA-NP interaction is computed from three contributions: a long-range term arising from the van der Waals force acting between the bulk of the NP and the AA, a short-range surface potential computed through atomistic simulations, and an electrostatic (screened Coulomb) interaction. Each of these interactions depends on factors such as the density of the nanomaterial, charged patches on the surface of the protein and the size of the NP. Here, we calculate binding energies only for spherical NPs defined by their radius R and neglect the electrostatic interaction. We approximate the long-range interaction between an AA bead and the NP through the Hamaker potential obtained by integration over the volumes of the NP and AA bead, where R AA is the radius of the AA bead, calculated as discussed in Appendix B, D is the distance between the centre of the AA bead and the centre of the NP, A 123 is the Hamaker constant for interaction of the nanomaterial with the AA through water calculated as described in Appendix B and in previous work [10,21], and r c is the cutoff range for the surface potential of mean force (PMF) [10]. At short range (i.e., at distances less than r c ), part of the volume of the NP included in the above expression is already accounted for in the surface potential. We thus correct this expression by subtracting the potential for the lens segment formed by the region of the NP included in the short-range surface potential, resulting in [10], The total core potential is then obtained by summing over the potential for AA beads present in the protein, U c (r, φ, θ, R) = i α i U c,i (r i , R), where α i is the occupancy of the AA bead as extracted from the PDB file. Likewise, the short-range potential is calculated through summation of the short-range PMF for each bead, U s (r, φ, θ, R) = i α i U s,i (r i , R) [10,20,21]. The total potential energy U as a function of the NP-protein distance, radius of the NP and orientation of the protein is then given by, In the UA methodology, the adsorption free energy for a fixed orientation is evaluated from this expression, then a Boltzmann-weighted average over all orientations is performed to obtain the adsorption free energy [10,20,21]. Surface adsorption calculations were performed between all the proteins in the list (both the experimentally-derived and computationally-derived PDB structures) on two different nanomaterials: Au and TiO 2 . The calculations were performed for four different NP radii: 5 nm, 50 nm, 100 nm, and 200 nm. The physicochemical properties of each material can be found in Table 1. The PMFs used to define the surface contribution to the binding energy for TiO 2 are calculated as described in [21] and for Au were calculated and reported in [10] and also presented in the Supplementary Materials, Figures S1 and S2.

Adsorption Affinity Ranking
To assign a numeric value to the accuracy of the affinity rankings obtained from computational structures relative to the experimental structures, we calculate the Kendall tau distance between these two lists and normalise this value to obtain the Kendall tau coefficient [22] where d K is the number of times adjacent elements in list B that must be swapped to produce the same ordering as list A, and N is the number of entries in the list. Under this normalization convention, the coefficient is equal to 1 if lists A and B are originally in the same ordering and -1 if one is in reverse order compared to the other. A coefficient of 0 indicates no correlation between the ordering of the two lists.

Prediction of Adsoprtion Energy from Protein and NP Descriptors
A further goal is the prediction of the binding energy of a protein to an NP directly from physical properties of the protein, without requiring the calculation of the binding energy via UA or some other method. To do so, we require both a training set of known sets of values of predictors and binding energies, and a model to link the two together. The simplest such model would be a linear model, in which each of these predictors is weighted by some coefficient and the sum of these weighted predictors gives the binding energy. A key disadvantage of linear models is that they are typically relevant only over a narrow range of parameters and become increasingly inaccurate outside this range. Consider, for example, the binding energy of a protein to an NP of radius R. It is reasonable to assume that generally this binding energy becomes larger in magnitude as R increases, but for very large values of R most of the additional NP volume is sufficiently far from the protein that it contributes very little to the binding, and thus the binding energy should saturate for sufficiently large R. Conversely, a linear model would simply predict that the binding scales indefinitely as the NP becomes larger. Likewise, the same argument applies for why the binding energy should not scale indefinitely as the protein grows larger for a fixed size of NP. Clearly, identifying a suitable non-linear function to describe the binding energy is a non-trivial problem. Thus, we turn to a machine learning approach and train an artificial neural network (ANN) to predict the binding energy based on the protein predictors and nanomaterial properties.
Given the very large set of protein predictors, overfitting of the ANN is a potential issue. Many of these predictors are correlated with each other, for example, a protein with many residues of a specific type will typically have a higher molecular weight. We therefore first apply a principle components analysis (PCA) to obtain a smaller set of protein predictors. The PCA is a linear transformation of the original predictors to a set of new predictors with two useful properties. Firstly, each of these predictors are uncorrelated from each other, and secondly each are associated with a value describing how much Nanomaterials 2020, 10,1967 6 of 21 of the variance of the original predictors they describe. Thus, by selecting the PCA predictors that describe the most variance of the original predictors and discarding the rest, we obtain a more efficient representation of the proteins. We perform the PCA on the z-scores obtained for each predictor, that is, each predictor is normalised to have zero mean and a standard deviation of 1.
We represent the NP size in terms of ln(R) and a simple categorical value 0 for gold and 1.0 for titania. We divide the binding energies by the average Hamaker constant for the material in question to produce a dimensionless value with similar ranges for both gold and titania. This average Hamaker constant is obtained by averaging over all the Hamaker constants for the NP-AA interactions with weights given by the Dayhoff statistic w AA for that AA, In order to avoid the requirement to manually select all parameters required for the network, we rely on the automated procedures available in Mathematica 12.1 via the Predict routine, specifying only the maximum depth for the network [23]. We initially scan over the number of layers and number of PCA variables to include to optimise these values by finding the point at which there is no further improvement in the R 2 coefficient between the input and predicted values, then use these optimised values to generate a final network, generated as follows. To allow for a validation of the final results, we randomly divide the data set of binding energies such that two-thirds is available for training of the network and the remaining third is used for final validation. To ensure we produce a robust final result and have some measure of the uncertainty in the predicted values, we employ a bootstrap aggregation method in which we train multiple networks, each using a different random sample with replacement (bootstrap sample) of the training set. A prediction for the binding energy for a given set of protein predictors is then obtained by passing the predictors to each of these networks and averaging over their results, with an uncertainty given by the standard deviation of these predictions. We employ 30 such networks to produce this ensemble, finding that increasing the number does not significantly alter the final results.

Protein Adsorption Energies
We used the UA methodology described in [10] to compute binding energies of 59 blood serum proteins on spherical TiO 2 and Au NPs of radii 5, 50, 100, and 200 nm, using both the experimentally-derived structures obtained from the PDB and the structures predicted using I-TASSER method, see Methods section for details. To enable a direct comparison, in this section we show only results for the I-TASSER structure with a mask applied to exclude residues that were absent from the PDB structure. Tables 2 and 3 show the free energy values of five of these proteins for gold and titania NP respectively over a range of radii: 5, 50, 100, and 200 nm, calculated for both PDB and masked I-TASSER structures. All the results for the complete set, including unmasked I-TASSER structures, are available in the Supplementary Materials, Table S1.
The adsorption energies shown in Tables 2 and 3 demonstrate that the two materials have different ranges for the calculated binding energies. The Au NPs have a range between −100 to −350k B T, whereas the TiO 2 NP show energies in the range between 0 to −50k B T. The strongest binding energy on 50 nm gold NP was found for 2NSM protein with binding energy equal to −288k B T and the most weakly binding protein was 5IR3 with binding energy approximately −137k B T. For 50 nm TiO 2 NP, the strongest binding was found around −27k B T for 6JE7, and the lowest was 1GQV with a binding free energy of −6k B T. For most proteins, the adhesion becomes stronger as the radius of the NP increases and we note that proteins with a greater number of AAs typically also bind more strongly. Both of these effects, however, saturate at sufficiently large radii or numbers of residues if only one of these is varied.  Building on this observation, an analysis of correlations between the protein descriptors and the adsorption energy allows us to single out the most important variables. For Au NPs, the most correlated descriptors are Molecular Weight, Molar Fractions of various AA types (non-polar, charged, small, tiny, aromatic), Surface Area, Volume, and Sphericity. Obviously, there is some redundancy in this list as the surface, volume and mass as well as AA counts are not completely independent of each other. It is also clear that the interaction is dominated by the van der Waals attraction, which is additive and thus increases with the increase of the protein size, although not indefinitely. For TiO 2 NPs, the most significant variables are Surface GLU, Surface LYS, Surface Tiny, Surface Charged, Surface Acidic, and GLU Number. Here, again, we see some redundancy. The variables reflect the dominating contribution of charge-charge interactions at the NP surface.

Impact of Structural Error on Binding Energies
The protein structures predicted by I-TASSER are assigned a confidence score (c-score) indicating how reliable the predictions are thought to be. This c-score is known to be correlated to the deviation between experimental structures and those generated by I-TASSER. Given that the protein adsorption energy is a function of the structure, it is of interest to see the extent to which errors in the predicted structure lead to an error in the predicted energy. We analysed the relative error in the binding energy for the TiO 2 and Au NPs for each of the proteins as a function of their c-score and the RMSD between the I-TASSER and PDB structures. We found that the relative error is typically quite low and does not show any obvious correlation to either the c-score of the predicted structure or the RMSD, indicating that the UA approach is robust against small deviations in the protein structure. The results are slightly more dispersed for the small TiO 2 NP than the Au NP. We attribute this to the decreased relevance of the Hamaker interaction for the small TiO 2 NP relative to the Au NP. This interaction is reasonably insensitive to the structure of the protein, and so if it is strong the binding energy is less strongly affected by alterations to the structure of the protein.
To further investigate the error in the binding energy caused by small errors in the location of residues, we generated an additional set of structures based on protein 1AX8. In each of these structures, the locations of the alpha carbon atoms are perturbed by small displacements drawn from a normal distribution with zero mean and standard deviation chosen to produce a known value of the RMSD relative to the initial structure. The binding energies of these structures were calculated on 5 nm anatase particles due to the high sensitivity of a small NP to the exact structure of the protein. These results are shown in Figure 1. We find that a small RMSD error of up to 1 Å does not significantly alter the mean binding energy, but does induce a large spread of binding energies, suggesting that it may be necessary to average over multiple structures if possible. As the RMSD grows larger, the spread in binding energies increases and the mean energy begins to become less strongly binding, which we attribute to the decreased density of the protein at higher RMSDs and increased likelihood of a single residue dominating the binding due to preventing the NP from making more contacts. error in the binding energy for the TiO2 and Au NPs for each of the proteins as a function of their cscore and the RMSD between the I-TASSER and PDB structures. We found that the relative error is typically quite low and does not show any obvious correlation to either the c-score of the predicted structure or the RMSD, indicating that the UA approach is robust against small deviations in the protein structure. The results are slightly more dispersed for the small TiO2 NP than the Au NP. We attribute this to the decreased relevance of the Hamaker interaction for the small TiO2 NP relative to the Au NP. This interaction is reasonably insensitive to the structure of the protein, and so if it is strong the binding energy is less strongly affected by alterations to the structure of the protein.
To further investigate the error in the binding energy caused by small errors in the location of residues, we generated an additional set of structures based on protein 1AX8. In each of these structures, the locations of the alpha carbon atoms are perturbed by small displacements drawn from a normal distribution with zero mean and standard deviation chosen to produce a known value of the RMSD relative to the initial structure. The binding energies of these structures were calculated on 5 nm anatase particles due to the high sensitivity of a small NP to the exact structure of the protein.
These results are shown in Figure 1. We find that a small RMSD error of up to 1 Å does not significantly alter the mean binding energy, but does induce a large spread of binding energies, suggesting that it may be necessary to average over multiple structures if possible. As the RMSD grows larger, the spread in binding energies increases and the mean energy begins to become less strongly binding, which we attribute to the decreased density of the protein at higher RMSDs and increased likelihood of a single residue dominating the binding due to preventing the NP from making more contacts.

Prediction of Ranking by Binding Affinity
The UA multiscale methodology for the calculation of binding energies [10] contains several approximations, such that the output binding energy may contain inaccuracies. One of the major approximations is the assumption of rigidity of the protein 3D structure used as an input. While this approximation is crucial to enable a high throughput scanning of proteins, we would like to reduce the dependence of the result on other factors as much as possible. One specific concern is the veracity of the protein 3D structure used in the calculation. Since accurate experimental 3D structures are not known for most proteins, we must rely on a structure obtained computationally. While this necessary approximation leads to errors in the energy evaluation, we hope that the relative errors are small as

Prediction of Ranking by Binding Affinity
The UA multiscale methodology for the calculation of binding energies [10] contains several approximations, such that the output binding energy may contain inaccuracies. One of the major approximations is the assumption of rigidity of the protein 3D structure used as an input. While this approximation is crucial to enable a high throughput scanning of proteins, we would like to reduce the dependence of the result on other factors as much as possible. One specific concern is the veracity of the protein 3D structure used in the calculation. Since accurate experimental 3D structures are not known for most proteins, we must rely on a structure obtained computationally. While this necessary approximation leads to errors in the energy evaluation, we hope that the relative errors are small as some of the contributions to the interaction do not depend on the detail of 3D structure. Moreover, the approximations are likely to affect most proteins equally, such that the ranking of proteins by affinity to the NP produced by UA is likely a reasonable estimate of the true ranking. In Table 4, we give the 10 most strongly binding proteins for gold and titania NPs of radius 200 nm ordered in terms of their affinity to the NP for both the experimental and computational (masked) structures.
In general, although some proteins appear in similar positions between the two lists for a given NP, overall there is not a perfect agreement. Nonetheless, it can be seen that e.g., the experimental structure most strongly binding to anatase NPs takes second place in terms of the I-TASSER structures. Given the results shown in Figure 1, it is reasonable to expect some degree of error due to the fact that the binding energies are relatively sensitive to perturbations in the structure of the protein, especially if this typical error is larger than the difference in binding energies between different proteins.
To quantify this correspondence, we calculate the normalised Kendall tau distance between the rankings of all proteins obtained for the experimental and computational structures for each of the eight NPs, as calculated using Equation (4) and presented in Table 5. The coefficients obtained are typically on the order of 0.6, indicating that although there is a correlation between the binding energies obtained for experimental and computational structures, the exact ranking may differ. A clear outlier is the lower value of 0.48 for the 5 nm gold NP, which we attribute to the increased significance in minor variations of the protein structure for binding to small NPs as this may alter which residues are in close-contact with the NP. A similar effect is observed to a lesser extent for the TiO 2 NP, providing further confirmation that the poorer agreement is due to the combination of size of the NP with the size of the specific cavities on the rigid protein globule. Overall for Au 200 nm NP, 7 out of 10 strongest binding proteins coincide for the two methods, for TiO 2 200 nm NP 5 out of top 10 are the same. Having confirmed that the ranking of proteins by the binding affinity is mostly consistent between the experimental and computational structures, we next turn to evaluating if there is a simple relationship between the values of the binding energies obtained from computational and experimental structures. Despite the fact that there is not a one-to-one correspondence between the ranking of experimental and computational proteins for a given NP, we nonetheless observe a linear correlation between these two binding energies for both nanomaterials over all radii as shown in Figure 2. We also observe that the relative error between the binding energies for the experimental and computational structures are quite small, and the vast majority fall within ±20%. The R 2 statistics for the best-fitting linear relationship between experimental and computational binding energies are listed in Table 6 for each NP. These are again in most cases around the 0.6-0.8 range, reflecting the fact that the binding energy may vary due to relatively small changes in the structure. experimental structures. Despite the fact that there is not a one-to-one correspondence between the ranking of experimental and computational proteins for a given NP, we nonetheless observe a linear correlation between these two binding energies for both nanomaterials over all radii as shown in Figure 2. We also observe that the relative error between the binding energies for the experimental and computational structures are quite small, and the vast majority fall within ±20%. The R statistics for the best-fitting linear relationship between experimental and computational binding energies are listed in Table 6 for each NP. These are again in most cases around the 0.6-0.8 range, reflecting the fact that the binding energy may vary due to relatively small changes in the structure.

Metamodel of Adsorption
Given the relatively high correlation between the binding energies obtained using experimental and computational structures, even when the predicted confidence in the structure is low, it is reasonable to assume that the binding energy is not highly dependent on the exact structure of the protein in general. Thus, we proceed with the construction of a model that includes the structure of the protein indirectly using the predictors discussed in Section 2.3. To enable ab initio prediction of protein binding energies, we calculate these predictors using the I-TASSER structures without masking of the proteins, that is, using the full sequence modelled, and train the network on the energies predicted using UnitedAtom for these structures. To prepare input for the neural network approach, we first perform principal component analysis (PCA) of the set of protein predictors to

Metamodel of Adsorption
Given the relatively high correlation between the binding energies obtained using experimental and computational structures, even when the predicted confidence in the structure is low, it is reasonable to assume that the binding energy is not highly dependent on the exact structure of the protein in general. Thus, we proceed with the construction of a model that includes the structure of the protein indirectly using the predictors discussed in Section 2.3. To enable ab initio prediction of protein binding energies, we calculate these predictors using the I-TASSER structures without masking of the proteins, that is, using the full sequence modelled, and train the network on the energies predicted using UnitedAtom for these structures. To prepare input for the neural network approach, we first perform principal component analysis (PCA) of the set of protein predictors to eliminate the high degree of redundancy between the predictors. The coefficients describing the contributions to each of the first 10 of the resulting variables from each of the protein predictors are provided in Supplementary Materials, along with the mean and standard deviation required to convert the predictor to its associated z-score. The proteins vary significantly in structure and composition and this is reflected in the percentages of variation captured by each PCA variable, with the most significant variable covering 30% of the variation. This procedure produces a dimensionless set of input variables, together with the material index (0 for gold, 1 for titania) and the logarithm of the radius of the NP expressed in nanometres. If a larger set of materials is available, one could replace this index by a physics material descriptor to make a universal model. We divide the binding energies by the average Hamaker constant for that material calculated using Equation (5), A Au = 72.1k B T, A Anatase = 8.78k B T, such that the network predicts a dimensionless variable with a similar range for both materials.
To optimise the neural network with respect to the depth of the network and number of PCA variables to use as input, we perform a brute-force search over these two parameters and calculate the R 2 coefficient between the predicted and input binding (dimensionless) energies. The results are shown in Figure 3, from which it can be seen that R 2 effectively saturates above four layers and quite rapidly with respect to the number of PCA variables included. A network with too many free parameters, e.g., number of layers and number of variables, is prone to overfitting, limiting its ability to make meaningful predictions for variables outside the training set. Thus, we select four layers and the first three PCA variables as a trade-off between accuracy and preventing overfitting. Nanomaterials 2020, 10, x FOR PEER REVIEW 11 of 21 eliminate the high degree of redundancy between the predictors. The coefficients describing the contributions to each of the first 10 of the resulting variables from each of the protein predictors are provided in Supplementary Material, along with the mean and standard deviation required to convert the predictor to its associated z-score. The proteins vary significantly in structure and composition and this is reflected in the percentages of variation captured by each PCA variable, with the most significant variable covering 30% of the variation. This procedure produces a dimensionless set of input variables, together with the material index (0 for gold, 1 for titania) and the logarithm of the radius of the NP expressed in nanometres. If a larger set of materials is available, one could replace this index by a physics material descriptor to make a universal model. We divide the binding energies by the average Hamaker constant for that material calculated using Equation (5), ⟨ ⟩ = 72.1k , ⟨ ⟩ = 8.78k , such that the network predicts a dimensionless variable with a similar range for both materials.
To optimise the neural network with respect to the depth of the network and number of PCA variables to use as input, we perform a brute-force search over these two parameters and calculate the R 2 coefficient between the predicted and input binding (dimensionless) energies. The results are shown in Figure 3, from which it can be seen that R 2 effectively saturates above four layers and quite rapidly with respect to the number of PCA variables included. A network with too many free parameters, e.g., number of layers and number of variables, is prone to overfitting, limiting its ability to make meaningful predictions for variables outside the training set. Thus, we select four layers and the first three PCA variables as a trade-off between accuracy and preventing overfitting. We then train an ensemble of 30 networks on bootstrap samples of the training set of two-thirds of the data, with the first three PCA variables, NP material and NP radius as input and the scaled binding energy as output. This training set is selected at random such that all proteins and NPs of all radii for both materials are sampled in the initial set. The results are shown in Figure 4, showing predictions for both the training set and the validation set consisting of the remaining data. For clarity and to distinguish between the TiO2 and Au NPs, we have rescaled the energies back from the dimensionless units used by the network to units of kBT. We also plot histograms of the relative errors in the binding energy for these two sets; it can be seen that there is a slight tendency to underpredict the binding energy in both training and validation sets, but overall the relative error is quite small. The agreement is worse for the validation set, as expected, but indicates that the ANN model is capable of predicting binding energies within a reasonable margin of error. In terms of the ranking of binding energies (again in the dimensionless units output from the network to enable a fair comparison between different materials), we find Kendall tau coefficients of 0.77 for the training set and 0.68 for the validation set, confirming that the ANN approach successfully predicts the correct ordering for most proteins by binding affinity to the NP. We then train an ensemble of 30 networks on bootstrap samples of the training set of two-thirds of the data, with the first three PCA variables, NP material and NP radius as input and the scaled binding energy as output. This training set is selected at random such that all proteins and NPs of all radii for both materials are sampled in the initial set. The results are shown in Figure 4, showing predictions for both the training set and the validation set consisting of the remaining data. For clarity and to distinguish between the TiO 2 and Au NPs, we have rescaled the energies back from the dimensionless units used by the network to units of k B T. We also plot histograms of the relative errors in the binding energy for these two sets; it can be seen that there is a slight tendency to underpredict the binding energy in both training and validation sets, but overall the relative error is quite small. The agreement is worse for the validation set, as expected, but indicates that the ANN model is capable of predicting binding energies within a reasonable margin of error. In terms of the ranking of binding energies (again in the dimensionless units output from the network to enable a fair comparison between different materials), we find Kendall tau coefficients of 0.77 for the training set and 0.68 for the validation set, confirming that the ANN approach successfully predicts the correct ordering for most proteins by binding affinity to the NP. Since the network is trained on a range of NP sizes, it is of interest to see if it can successfully interpolate between these to be able to predict the binding energy of a protein to an NP of radius between 1-200 nm other than those sizes provided in the initial training set. In Figure 5, we plot the binding energy as a function of radius for a given protein (1AX8) with the binding energies at known radii shown as points. The model has produced a physically realistic interpolation in most cases, that is, the behaviour is reasonably smooth, indicating that the network is not overfitting to the data and can be used to produce a first estimate of the binding energies for a wider range of radii. In this case, however, the model fails to accurately predict the binding energy for this protein to either material within one standard deviation. This is likely due to the limited size of the training set and the simple variable used to distinguish between gold and titania NPs; with a more physically meaningful variable, it is likely that the accuracy can be increased further.  Since the network is trained on a range of NP sizes, it is of interest to see if it can successfully interpolate between these to be able to predict the binding energy of a protein to an NP of radius between 1-200 nm other than those sizes provided in the initial training set. In Figure 5, we plot the binding energy as a function of radius for a given protein (1AX8) with the binding energies at known radii shown as points. The model has produced a physically realistic interpolation in most cases, that is, the behaviour is reasonably smooth, indicating that the network is not overfitting to the data and can be used to produce a first estimate of the binding energies for a wider range of radii. In this case, however, the model fails to accurately predict the binding energy for this protein to either material within one standard deviation. This is likely due to the limited size of the training set and the simple variable used to distinguish between gold and titania NPs; with a more physically meaningful variable, it is likely that the accuracy can be increased further. Since the network is trained on a range of NP sizes, it is of interest to see if it can successfully interpolate between these to be able to predict the binding energy of a protein to an NP of radius between 1-200 nm other than those sizes provided in the initial training set. In Figure 5, we plot the binding energy as a function of radius for a given protein (1AX8) with the binding energies at known radii shown as points. The model has produced a physically realistic interpolation in most cases, that is, the behaviour is reasonably smooth, indicating that the network is not overfitting to the data and can be used to produce a first estimate of the binding energies for a wider range of radii. In this case, however, the model fails to accurately predict the binding energy for this protein to either material within one standard deviation. This is likely due to the limited size of the training set and the simple variable used to distinguish between gold and titania NPs; with a more physically meaningful variable, it is likely that the accuracy can be increased further.

Discussion
In this paper, we computationally predict a set of 59 human blood serum proteins structures from their AA sequences using the I-TASSER software suite. The binding energies for these structures on Au and TiO 2 NPs over a range of radii are in relatively good agreement with the values calculated for the experimental structures obtained from the PDB. From the resulting R 2 coefficients (typically greater than 0.6), it can clearly be seen that there is a strong linear correlation between the binding energies computed for protein structures taken from the PDB and calculated using I-TASSER, with slightly better agreement found for larger NPs. We attribute this to the increased relevance of bulk van der Waals forces, which are less strongly influenced by the exact structure of the protein and depend more on its overall volume. In general, the good agreement between the binding energies obtained for computational and experimental structures, both in terms of ranking the protein-NP affinity and the linear correlation between these two energies, confirms that in the absence of an experimental structure the binding energy of a single-chain protein can be reasonably well estimated from a structure obtained using I-TASSER. Indeed, given the fact that many experimental sequences are missing residues or do not cover the entire protein, it may even be preferable to use a computational structure, or multiple structures if these are available to allow for an averaging over small errors in the placement of residues. This good agreement does not necessarily imply that the binding energy obtained using either of these two types of structures is accurate to the binding energy that could be found experimentally or using a more complex simulation. Rather, it indicates that the accuracy of the UA model is not significantly impacted by using a protein structure obtained via I-TASSER.
As mentioned earlier, both nanomaterials tend to have different ranges for the calculated binding energies as can be seen in Figure 1. The Au NPs have a range between −100 to −350k B T, whereas the TiO 2 NPs have an energy range between 0 to −50k B T. The main reason for this is the differing strength of the total van der Waals interaction for each nanomaterial, as parameterised by the Hamaker constant, and different degree of hydrophilicity of the two materials. In the UA model employed, the Hamaker constant is calculated between each nanomaterial and AA residue and used to determine the strength of the long-range interaction. Typically, the Hamaker constant for an Au-AA interaction is on the order of 70k B T (174.58 kJ·mol −1 ), an order of magnitude larger than that for a TiO 2 interaction 7k B T (17.458 kJ·mol −1 ). Likewise, the potentials of mean force describing the surface interaction between an NP and an AA are also typically much stronger for Au than for TiO 2 , reflecting the strong preferential water adsorption at the TiO 2 surface. For most AA on TiO 2 , their preferred position is some distance away from the solid, allowing for a water layer to stay in between. Titania materials are known for their extreme hydrophilicity [19,21]. Regardless of the material, we find that the absolute values of the binding energies calculated for experimental and computational structures are reasonably close to each other.
The calculated binding energies can, in theory, be used for predicting the content of NP protein coronas [24,25]. However, a direct comparison of the ranking predicted from the adsorption energies to the experimentally measured protein abundances in the corona would not produce meaningful results. Firstly, the actual abundance of the protein in the corona would depend on the concentration of the specific protein in the medium. These concentrations may differ by several orders of magnitude, and the entropic contribution could dominate the adsorption rate such that a protein that is weakly binding but highly abundant in the medium could be more strongly present in the corona than a protein that is less abundant but more strongly binding. In addition, for materials like gold, the NP-protein interaction leads to irreversible adsorption, again favouring proteins that are more abundant in the medium and so are adsorbed more quickly. This leads to the requirement to generate a more advanced model of corona formation, which is currently under development.
To validate our model of binding free energies, we compared the binding energy with available experimental data. The binding energies of proteins to gold NPs have been previously experimentally measured for insulin (PDB-ID: 4INS) and liver alcohol dehydrogenase (LAHD, PDB-ID: 1HET) [26]. The calculated energy of adsorption from UA for insulin on the smallest NP in [26] and employing a Boltzmann average over orientations is −411 kJ/mol, which is higher than the experimental value reported −51.6 kJ/mol. For LAHD, the calculated adsorption energy is −428 kJ/mol and the experimental value recorded is −54.0 kJ/mol. The binding energies obtained from the UA model are very large, which is a result of very strong contact attractions we see for most AA PMFs at the surface and high van der Waals attractions. Although the approximations made in the UA model can be partly blamed for the exaggeration of the attraction, we find that the atomistic force field is responsible for most of the effect. Nevertheless, the UA model correctly predicts a stronger binding for LAHD in agreement with the experiment. We note, furthermore, that given the strength of the interaction, the adsorption must be irreversible on experimental timescales for most proteins on particles of sufficiently large size (R > 5 nm), which implies that this discrepancy could be caused by the inability of most experimental techniques to accurately assign an adsorption energy when the binding is essentially irreversible. If we instead employ a simple average over orientations, more closely approximating the binding of proteins before an equilibrium is reached, we obtain binding energies of −157 kJ/mol (insulin) and −153 kJ/mol (LAHD), which are closer to the experimental values but still more strongly binding.
With regards to the prediction of adsorption energy from first principles, we cannot yet apply the method to an arbitrary material. A key limitation of our neural network model is that, at present, we do not yet have a well-defined means to fully distinguish between the two materials considered. The difference in their long-range van der Waals interaction is accounted for by scaling the binding energies by the average Hamaker constant, such that the output from the model need only be multiplied by this value to obtain a material-specific binding energy. However, the short-range interactions of these materials with AAs differ significantly in terms of their relative hydrophilicities, in that gold NPs are relatively insensitive to the exact type of AA to which they bind, whereas TiO 2 NPs are much more selective. In the current model, this is simply encoded in terms of a variable set to 0 for gold and 1 for TiO 2 . A clear goal for future work would be to perform calculations for a greater range of materials and build these results into the network along with a more physically meaningful interpretation of this variable. For example, it is possible that this variable can be replaced by the surface charge or some measure of the immersion enthalpy. With only two possible materials, such a choice is arbitrary as any such measure can be rescaled to fall in the range of 0-1 and so reduces to the variable employed here. Including more materials, however, would enable a more meaningful choice to be made, and ideally the network could then be employed to predict binding energies of NPs of a range of materials outside the training set, provided that these can be adequately represented by a Hamaker constant and the unknown surface-dependent variable. This remains outside the scope of the present paper, but the calculation of binding energies for a wide range of nanomaterials is presently underway. Likewise, the model as presented here is limited to spherical NPs, whereas there is clear interest in other geometries e.g., cylindrical NPs as models for rod-like nanomaterials and carbon nanotubes. We have extended the UnitedAtom methodology to predict binding energies of proteins on cylindrical NPs, but the modifications required are extensive and will be reported elsewhere. Generally, we find that cylindrical geometry facilitates the protein binding as it allows more contacts with the NP for the same protein. As an example, the adhesion of the I-TASSER structure for the strongest-binding protein 6NCO onto gold rods is considerably stronger for a rod of radius 5 nm (−321k B T) than for a sphere of the same radius (−210k B T). At larger radii R = 200 nm, we observe binding energies of −298k B T for a sphere and −306k B T for a rod, suggesting that the main effect is due to the difference in surface curvature for the two geometries, which is less pronounced at greater radii.
Despite the above limitations, we find that the set of predictors used here is capable of training a neural network to successfully predict the binding energy of proteins on NPs of two materials and a range of sizes and appears to allow for the interpolation between these sizes, thus enabling the rapid evaluation of the protein affinity of NPs of a known material but differing size for this set of proteins. An additional goal for further work would be to refine the set of predictors and the network architecture to enable the accurate prediction of the binding energies of proteins from outside the set used for the training of the network. This, however, will likely take a much larger set of proteins for training due to the large number of predictors required to describe a protein. We stress also that although the binding energies used here have been calculated using the UA method, in principle the ANN approach could be applied equally well to predict binding energies calculated using more in-depth simulations or obtained experimentally.

Conclusions
In this work, we have calculated the binding energy of proteins to Au and TiO 2 NPs of various sizes using a multiscale interaction model and protein structures obtained both experimentally and computationally from AA sequences. We find that the calculation based on predicted protein 3D structures provides a good approximation to the binding energy, thus it can be reliably used for cases where a protein 3D structure is not known. The binding energy range tends to be higher on Au NP surface as compared to TiO 2 due to stronger van der Waals core interaction and closer approach of the AA to the surface atoms. The weaker attraction of the proteins to the titania NPs can be attributed to strong hydrophilicity of the surfaces, which prefer to bind water rather than AAs. For both materials, we observe that binding energies are typically greater for larger proteins and NPs. We introduced several novel protein descriptors based on their 3D structure and have developed a neural network model that predicts the protein adsorption energy from basic nanomaterial and protein descriptors, furthering the ability to estimate the affinity of a protein to a given NP when only an approximate structure can be calculated. The binding energies calculated here are of use in further studies, e.g., for the prediction of the corona content of NPs immersed in biological media, and themselves represent important quantities parameterizing the interactions of nanomaterials with biomaterials and serving as predictors of their biological activity. For example, we are presently exploring the use of binding energies calculated using this method as input for quantitative structure-activity relationship models to predict the inflammation response caused by nanomaterials. Previous examples of predictive models for NP uptake and association with live cells demonstrate that this may be indeed possible [24,25].
Supplementary Materials: The following are available online at http://www.mdpi.com/2079-4991/10/10/1967/s1, Figure S1: Potentials of mean force for amino acid-TiO 2 interaction, Figure S2: Potentials of mean force for amino acid -gold interaction, Table S1: Adsorption free energies and ranking for 59 proteins on Au and TiO 2 NPs of four different sizes calculated for PDB structures and I-TASSER structures with and without mask, Mathematica notebook for the training of the neural network "ProteinBindingEnergyNN-59.nb" and tabulated predictors and adsorption energies used as input "ProteinPredictorSet.csv".

Conflicts of Interest:
The authors declare no conflict of interest.

Protein Descriptors
Below, we summarise the predictors calculated from protein structures for use in the metamodel in addition to those calculated using Emboss pepstats [18]: • Solvent-accessible surface area: This the surface area of a protein that can be accessed by solvents. We will assume that the quantity represents the geometric surface area of the protein. It makes sense that proteins with a larger surface area should be more reactive relative to equally massive proteins with a smaller surface area, as chemical reactions take place on the surface of objects.
This predictor is calculated using the Shrake-Rupley algorithm. Place a sphere at the C α atom of every AA in the PDB. The radii of these spheres are the same as the radii derived in the Appendix section.

1.
Place N i points uniformly on the surface of each sphere i (a golden spiral approximation was used to achieve this uniform distribution in our implementation).

2.
For each sphere i, check all of its N i points to see if any of them lie within the volume of another sphere. If it does, remove that point from the surface of sphere i. After this process, there will be n i points remaining on the surface of each sphere, such that n i /N i gives the fraction of the exposed area of the sphere relative to the total area of that sphere. 3.
The total surface area of the protein is given by the sum of the fractional surface areas contributed by each sphere to the total surface, Here, r i is the radius of the sphere of type i. The final value for surface area is given in nm 2 . Volume: A brute force check and iteration algorithm were employed to calculate the volume of a protein: 1. Place a sphere at the C α of every AA in the PDB. The radii of these spheres are the same as the radii derived in Appendix B.
2. Place this protein into a box, which itself is divided up into n small volume elements of volume ∆V.
3. Iterate through every volume element and count the number of volume elements with centres which lie within the volume of any sphere from the protein.
4. The total volume of the protein is the number of volume elements counted times the volume of a single volume element.
The surface area and volume of a protein in and of itself may not be a predictive property, but they can be combined with and enhance other properties in a way that will be discussed in the next sections. The final value for volume is given in nm 3 . This method was used to calculate the volume of all of the proteins considered in the paper when trying to determine the AA radii.
• Sphericity: How close a particle resembles a sphere. A highly spherical particle will have a value close to 1. The value is given by where V and A are the protein's volume and area as previously defined.
where δ ij is equal to 1 for amino acids of the same type, and 0 otherwise. This methodology is used to calculate the remaining surface properties. • Amino Acid percentage on the surface: The percentage of each AA species that appears on the surface. These two methods provided very different values for the radii of each AA. The ratio of the radii between these two different methods is mostly constant, meaning that the relative size of each AA is correct, but we need to adjust the scale. This relationship can be seen in Figure A1. This consistency satisfies our first criterion that the AAs should have radii reflective of their relative size. The only outlier is Histidine, which is likely because Histidine is the only charged AA that does not have a long polymer tail. Histidine's charge would explain its large van der Waals radius as it would strongly repel the nearest neighbour's, while the lack of a tail would account for the small etch radius.
The next question is to determine which of these two series best represent our model. The second criterion for the choice of radii was that the resulting protein density would match real world values. For this, we refer to [27]. In this paper, the authors measured the mass and density of various proteins and found the following relationship between protein mass and density  These two methods provided very different values for the radii of each AA. The ratio of the radii between these two different methods is mostly constant, meaning that the relative size of each AA is correct, but we need to adjust the scale. This relationship can be seen in Figure A1. This consistency satisfies our first criterion that the AAs should have radii reflective of their relative size. The only outlier is Histidine, which is likely because Histidine is the only charged AA that does not have a long polymer tail. Histidine's charge would explain its large van der Waals radius as it would strongly repel the nearest neighbour's, while the lack of a tail would account for the small etch radius.
The next question is to determine which of these two series best represent our model. The second criterion for the choice of radii was that the resulting protein density would match real world values. For this, we refer to [27]. In this paper, the authors measured the mass and density of various proteins and found the following relationship between protein mass and density These two methods provided very different values for the radii of each AA. The ratio of the radii between these two different methods is mostly constant, meaning that the relative size of each AA is correct, but we need to adjust the scale. This relationship can be seen in Figure A1. This consistency satisfies our first criterion that the AAs should have radii reflective of their relative size. The only outlier is Histidine, which is likely because Histidine is the only charged AA that does not have a long polymer tail. Histidine's charge would explain its large van der Waals radius as it would strongly repel the nearest neighbour's, while the lack of a tail would account for the small etch radius.
The next question is to determine which of these two series best represent our model. The second criterion for the choice of radii was that the resulting protein density would match real world values. For this, we refer to [27]. In this paper, the authors measured the mass and density of various proteins and found the following relationship between protein mass and density  where M is the mass of the protein. We expect that the density profile of our model proteins will follow this general trend. Using collected protein structure and mass data, it was possible to calculate the density profile for various proteins over a spectrum of different masses. The predicted density profiles from both models are plotted in Figure A2. It is clear from this figure that while both models show the same trend as predicted by Equation (A7), they are off-scale: The etch model is overestimating the density (meaning the radii must be an underestimate), while the opposite is true for r W . where M is the mass of the protein. We expect that the density profile of our model proteins will follow this general trend. Using collected protein structure and mass data, it was possible to calculate the density profile for various proteins over a spectrum of different masses. The predicted density profiles from both models are plotted in Figure A2. It is clear from this figure that while both models show the same trend as predicted by Equation (A7), they are off-scale: The etch model is overestimating the density (meaning the radii must be an underestimate), while the opposite is true for . Figure A3. A distribution of density versus mass for a range of proteins of various sizes. Blue: van der Waals radius. Green: Etch radius Red: Experimentally desired result Black: Optimised linear combination of van der Waals radius and etch radius.
To fit the experimental data, we use a linear combination of both data series that gives the desired density profile. This involves minimizing the following function * = where a is the mixing factor. Carrying out this operation gives a value of α = 0.49. A plot of the new density profile derived using this method can be seen in Figure A3. The final set of radii denoted * , which most accurately capture the density and structure, was then calculated as * = 1 − + / (A9) The AA radii following from this procedure are shown in Table A1.
The parameters required to calculate the interaction energy between an AA and a NP's core are the Hamaker constants. This constant of interaction depends on the polarizability of each material and absorption capability. In addition to this, the induced electric field generated by a dipole would also effect, and be affected by, the polarizability of the intervening media between the two materials. As such, the constant that determines the strength of the core potential will depend on the polarizability of three different materials. We have previously calculated these values [10,21] and for convenience we list these values in Table A1. To fit the experimental data, we use a linear combination of both data series that gives the desired density profile. This involves minimizing the following function where a is the mixing factor. Carrying out this operation gives a value of α = 0.49. A plot of the new density profile derived using this method can be seen in Figure A3. The final set of radii denoted r * , which most accurately capture the density and structure, was then calculated as The AA radii following from this procedure are shown in Table A1.
The parameters required to calculate the interaction energy between an AA and a NP's core are the Hamaker constants. This constant of interaction depends on the polarizability of each material and absorption capability. In addition to this, the induced electric field generated by a dipole would also effect, and be affected by, the polarizability of the intervening media between the two materials. As such, the constant that determines the strength of the core potential will depend on the polarizability of three different materials. We have previously calculated these values [10,21] and for convenience we list these values in Table A1.